Re: Asynchronous Append on postgres_fdw nodes.
On Fri, Feb 12, 2021 at 5:30 PM Kyotaro Horiguchi wrote:
> It seems too specific to async Append so I look at it as a PoC of the
> mechanism.

Are you saying that the patch only reconsiders async ForeignScans?

> It creates a hash table keyed by connection umid to record planids run
> on the connection, triggered by the core planner via a dedicated API
> function. It seems to me that ConnCacheEntry.state can hold that and
> the hash is not needed at all.

I think a good thing about the hash table is that it can be used by other
FDWs that support async execution in a similar way to postgres_fdw, so
they don't need to create their own hash tables. But I'd like to know
about the idea of using ConnCacheEntry. Could you elaborate a bit more
about that?

> | postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
> | {
> | ...
> |     /*
> |      * If the connection used for the ForeignScan node is used in other parts
> |      * of the query plan tree except async subplans of the parent Append node,
> |      * disable async execution of the ForeignScan node.
> |      */
> |     if (!bms_is_subset(fsplanids, asyncplanids))
> |         return false;
>
> This would be a reasonable restriction.

Cool!

> |     /*
> |      * If the subplans of the Append node are all async-capable, and use the
> |      * same connection, then we won't execute them asynchronously.
> |      */
> |     if (requestor->as_nasyncplans == requestor->as_nplans &&
> |         !bms_nonempty_difference(asyncplanids, fsplanids))
> |         return false;
>
> Is it the correct restriction? I understand that the currently intended
> restriction is that one connection accepts at most one FDW-scan node.
> This looks something different...

People put multiple partitions in a remote PostgreSQL server in sharding,
so the patch allows multiple postgres_fdw ForeignScans beneath an Append
that use the same connection to be executed asynchronously, like this:

postgres=# create table t1 (a int, b int, c text);
postgres=# create table t2 (a int, b int, c text);
postgres=# create table t3 (a int, b int, c text);
postgres=# create foreign table p1 (a int, b int, c text) server server1 options (table_name 't1');
postgres=# create foreign table p2 (a int, b int, c text) server server2 options (table_name 't2');
postgres=# create foreign table p3 (a int, b int, c text) server server2 options (table_name 't3');
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# alter table pt attach partition p1 for values from (10) to (20);
postgres=# alter table pt attach partition p2 for values from (20) to (30);
postgres=# alter table pt attach partition p3 for values from (30) to (40);
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM') from generate_series(0, 99) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM') from generate_series(0, 99) i;
postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM') from generate_series(0, 99) i;
postgres=# analyze pt;
postgres=# explain verbose select count(*) from pt;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=314.25..314.26 rows=1 width=8)
   Output: count(*)
   ->  Append  (cost=100.00..313.50 rows=300 width=0)
         ->  Async Foreign Scan on public.p1 pt_1  (cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t1
         ->  Async Foreign Scan on public.p2 pt_2  (cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t2
         ->  Async Foreign Scan on public.p3 pt_3  (cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t3
(9 rows)

For this query, p2 and p3, which use the same connection, are scanned
asynchronously! But if all the subplans of an Append are async
postgres_fdw ForeignScans that use the same connection, they won't be
parallelized at all, and the overhead of async execution may cause a
performance degradation. So the patch disables async execution of them
in that case, using the code bit quoted above.

Thanks for the review!

Best regards,
Etsuro Fujita
Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)
On Sat, Feb 13, 2021 at 09:33:48PM -0300, Ranier Vilela wrote:
> Em sáb., 13 de fev. de 2021 às 20:32, Michael Paquier
> escreveu:
>
>> You are missing the point here. What you are proposing here would not
>> be backpatched. However, reusing the same words as upthread, this has
>> a cost in terms of *future* maintenance. In short, any *future*
>> potential bug fix that would require to be backpatched in need of
>> using this function or touching its area would result in a conflict.
>
> Ok. +1 for back-patching.

Please take the time to read again my previous email. And also, please
take the time to actually test patches you send, because the list of
things getting broken is impressive. At least you make sure that the
internals of cryptohash.c generate an error as they should because of
those incorrect sizes :)

git diff --check complains, in various places.

@@ -330,7 +330,8 @@ SendBackupManifest(backup_manifest_info *manifest)
 	 * twice.
 	 */
 	manifest->still_checksumming = false;
-	if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf) < 0)
+	if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf,
+							sizeof(checksumbuf) - 1) < 0)
 		elog(ERROR, "failed to finalize checksum of backup manifest");

This breaks backup manifests, due to an incorrect size calculation.

@@ -78,7 +78,8 @@ pg_md5_hash(const void *buff, size_t len, char *hexsum)
 	if (pg_cryptohash_init(ctx) < 0 ||
 		pg_cryptohash_update(ctx, buff, len) < 0 ||
-		pg_cryptohash_final(ctx, sum) < 0)
+		pg_cryptohash_final(ctx, sum,
+							sizeof(sum) - 1) < 0)

This one breaks MD5 hashing, due to an incorrect size calculation, again.

@@ -51,7 +51,8 @@ scram_HMAC_init(scram_HMAC_ctx *ctx, const uint8 *key, int keylen)
 		return -1;
 	if (pg_cryptohash_init(sha256_ctx) < 0 ||
 		pg_cryptohash_update(sha256_ctx, key, keylen) < 0 ||
-		pg_cryptohash_final(sha256_ctx, keybuf) < 0)
+		pg_cryptohash_final(sha256_ctx, keybuf,
+							sizeof(keybuf) - 1) < 0)
[...]
-	if (pg_cryptohash_final(ctx->sha256ctx, h) < 0)
+	if (pg_cryptohash_final(ctx->sha256ctx, h,
+							sizeof(h) - 1) < 0)

This breaks SCRAM authentication, for the same reason. In three places.

I think that in pg_checksum_final() we had better save the digest length
in "retval" before calling pg_cryptohash_final(), and use it for the size
passed down.

contrib/uuid-ossp/ fails to compile. The contrib/pgcrypto/ fix is
incorrect, requiring h->result_size(h) in three places.

I think that as a whole we should try to minimize the number of times we
use any DIGEST_LENGTH variable, relying as much as possible on sizeof().
This v4 is a mixed bag in that regard. Once you switch to that, there is
an interesting result with uuid-ossp, where you can notice that there is
a match between the size of dce_uuid_t and MD5_DIGEST_LENGTH.

No need to send a new patch; the attached takes care of those issues, and
it is correctly indented. I'll just look at that again tomorrow, it is
already late here.
--
Michael

diff --git a/src/include/common/cryptohash.h b/src/include/common/cryptohash.h
index 32d7784ca5..541dc844c8 100644
--- a/src/include/common/cryptohash.h
+++ b/src/include/common/cryptohash.h
@@ -32,7 +32,7 @@ typedef struct pg_cryptohash_ctx pg_cryptohash_ctx;
 extern pg_cryptohash_ctx *pg_cryptohash_create(pg_cryptohash_type type);
 extern int	pg_cryptohash_init(pg_cryptohash_ctx *ctx);
 extern int	pg_cryptohash_update(pg_cryptohash_ctx *ctx, const uint8 *data, size_t len);
-extern int	pg_cryptohash_final(pg_cryptohash_ctx *ctx, uint8 *dest);
+extern int	pg_cryptohash_final(pg_cryptohash_ctx *ctx, uint8 *dest, size_t len);
 extern void pg_cryptohash_free(pg_cryptohash_ctx *ctx);

 #endif							/* PG_CRYPTOHASH_H */
diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index 8d857f39df..b9b6d464a0 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -1429,7 +1429,7 @@ scram_mock_salt(const char *username)
 	if (pg_cryptohash_init(ctx) < 0 ||
 		pg_cryptohash_update(ctx, (uint8 *) username, strlen(username)) < 0 ||
 		pg_cryptohash_update(ctx, (uint8 *) mock_auth_nonce, MOCK_AUTH_NONCE_LEN) < 0 ||
-		pg_cryptohash_final(ctx, sha_digest) < 0)
+		pg_cryptohash_final(ctx, sha_digest, sizeof(sha_digest)) < 0)
 	{
 		pg_cryptohash_free(ctx);
 		return NULL;
diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 0cefd181b5..32bb0efb3d 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -330,12 +330,13 @@ SendBackupManifest(backup_manifest_info *manifest)
 	 * twice.
 	 */
 	manifest->still_checksumming = false;
-	if (pg
Re: How to get Relation tuples in C function
On Sun, Feb 14, 2021 at 09:29:08AM +0800, Andy Fan wrote:
> Thank you tom for the reply. What would be the difference between the
> SPI and "write a pure SQL UDF" and call it with DirectFunctionCall1? I
> just ran into a similar situation some days before. Currently I think
> DirectFunctionCall1 doesn't need to maintain a connection but SPI has to
> do that.

Hard to say without knowing your use case. A PL function is simpler to
maintain than a C function, though usually less performant from the pure
point of view of its operations. A SQL function could finish by being
inlined, allowing the planner to apply optimizations as it would know the
function body. Going with SPI has the advantage of giving you code that
works without any changes across major versions, which is a no-brainer
when it comes to long-term maintenance.
--
Michael
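For illustration, a minimal sketch of the two calling styles discussed
above (the wrapper names and the query are assumptions, not from this
thread):

#include "postgres.h"
#include "fmgr.h"
#include "utils/fmgrprotos.h"
#include "executor/spi.h"

/*
 * Direct call: invokes the C implementation behind SQL upper()
 * in-process, with no SPI connection and no parser/planner involvement.
 */
static Datum
call_upper_directly(Datum text_datum)
{
	return DirectFunctionCall1(upper, text_datum);
}

/*
 * SPI call: runs a full SQL query.  Slower per call, but the SQL text
 * keeps working unchanged across major versions.
 */
static void
count_relations_via_spi(void)
{
	if (SPI_connect() != SPI_OK_CONNECT)
		elog(ERROR, "SPI_connect failed");
	if (SPI_execute("SELECT count(*) FROM pg_class", true, 0) != SPI_OK_SELECT)
		elog(ERROR, "SPI_execute failed");
	elog(NOTICE, "query returned %llu row(s)",
		 (unsigned long long) SPI_processed);
	SPI_finish();
}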
Re: Some regular-expression performance hacking
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
> 0001-invent-rainbow-arcs-2.patch
> 0002-recognize-matchall-NFAs-2.patch

I've successfully tested both patches against the 1.5M regexes-in-the-wild
dataset. Out of the 1489489 (pattern, text string) pairs tested, there was
only one single deviation: this 100577-byte regex (pattern_id = 207811)...

\.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mil\.ac| ... |wmflabs\.org|yolasite\.com|za\.net|za\.org)$

...previously raised...

error    => invalid regular expression: regular expression is too complex

...but now goes through:

is_match => t
captured => {de}
error    =>

Nice. The patched regex engine is apparently capable of handling even more
complex regexes than before.

The test that found the deviation tests each (pattern, text string) pair
individually, to catch errors. But I've also made a separate query to test
just the regexes known not to cause errors, which allows testing all
regexes in one big query; this fully utilizes the CPU cores and also runs
quicker. Below is the result of the performance test query:

\timing
SELECT
  tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
  tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, patterns.pattern),
  COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:

 ?column? | ?column? |  count
----------+----------+---------
 t        | t        | 1448212
(1 row)

Time: 592196.145 ms (09:52.196)

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:

 ?column? | ?column? |  count
----------+----------+---------
 t        | t        | 1448212
(1 row)

Time: 461739.364 ms (07:41.739)

That's an impressive 22% speed-up!

/Joel
Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)
Em dom., 14 de fev. de 2021 às 08:22, Michael Paquier escreveu:
> On Sat, Feb 13, 2021 at 09:33:48PM -0300, Ranier Vilela wrote:
> > Em sáb., 13 de fev. de 2021 às 20:32, Michael Paquier <mich...@paquier.xyz>
> > escreveu:
> >
> >> You are missing the point here. What you are proposing here would not
> >> be backpatched. However, reusing the same words as upthread, this has
> >> a cost in terms of *future* maintenance. In short, any *future*
> >> potential bug fix that would require to be backpatched in need of
> >> using this function or touching its area would result in a conflict.
> >
> > Ok. +1 for back-patching.
>
> Please take the time to read again my previous email.
>
> And also, please take the time to actually test patches you send,
> because the list of things getting broken is impressive. At least
> you make sure that the internals of cryptohash.c generate an error as
> they should because of those incorrect sizes :)
>
> git diff --check complains, in various places.
>
> @@ -330,7 +330,8 @@ SendBackupManifest(backup_manifest_info *manifest)
>  	 * twice.
>  	 */
>  	manifest->still_checksumming = false;
> -	if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf) < 0)
> +	if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf,
> +							sizeof(checksumbuf) - 1) < 0)
>  		elog(ERROR, "failed to finalize checksum of backup manifest");
>
> This breaks backup manifests, due to an incorrect calculation.

Bad habit: sizeof() - 1 is what I use with strings.

> I think that in pg_checksum_final() we had better save the digest
> length in "retval" before calling pg_cryptohash_final(), and use it
> for the size passed down.

I would like to see pg_checksum_final() look like this:

	case CHECKSUM_TYPE_SHA224:
		retval = PG_SHA224_DIGEST_LENGTH;
		break;
	case CHECKSUM_TYPE_SHA256:
		retval = PG_SHA256_DIGEST_LENGTH;
		break;
	case CHECKSUM_TYPE_SHA384:
		retval = PG_SHA384_DIGEST_LENGTH;
		break;
	case CHECKSUM_TYPE_SHA512:
		retval = PG_SHA512_DIGEST_LENGTH;
		break;
	default:
		return -1;
	}

	if (pg_cryptohash_final(context->raw_context.c_sha2, output, retval) < 0)
		return -1;
	pg_cryptohash_free(context->raw_context.c_sha2);

What do you think?

regards,
Ranier Vilela
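For reference, a minimal sketch of the call pattern the new
three-argument signature is meant to enforce (the buffer name is an
assumption):

	uint8		digest[PG_SHA256_DIGEST_LENGTH];

	/*
	 * Pass the full buffer size via sizeof(), with no "- 1": a digest is
	 * raw bytes, not a NUL-terminated string.  With the extra length
	 * argument, cryptohash.c can fail cleanly when the destination buffer
	 * is smaller than the digest instead of writing past its end.
	 */
	if (pg_cryptohash_final(ctx, digest, sizeof(digest)) < 0)
		elog(ERROR, "failed to finalize cryptohash");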
Re: Some regular-expression performance hacking
"Joel Jacobson" writes:
> I've successfully tested both patches against the 1.5M regexes-in-the-wild
> dataset.
> Out of the 1489489 (pattern, text string) pairs tested,
> there was only one single deviation:
> This 100577-byte regex (pattern_id = 207811)...
> ...
> ...previously raised...
> error => invalid regular expression: regular expression is too complex
> ...but now goes through:
> Nice. The patched regex engine is apparently capable of handling even more
> complex regexes than before.

Yeah. There are various limitations that can lead to REG_ETOOBIG, but the
main ones are "too many states" and "too many arcs". The RAINBOW change
directly reduces the number of arcs and thus makes larger regexes
feasible. I'm sure it's coincidental that the one such example you
captured happens to be fixed by this change, but hey, I'll take it.

regards, tom lane
Re: Extensibility of the PostgreSQL wire protocol
On Thu, 11 Feb 2021 at 09:28, Robert Haas wrote:
> On Wed, Feb 10, 2021 at 2:04 PM Tom Lane wrote:
>> That is a spot-on definition of where I do NOT want to end up. Hooks
>> everywhere and enormous extensions that break anytime we change anything
>> in the core. It's not really clear that anybody is going to find that
>> more maintainable than a straight fork, except to the extent that it
>> enables the erstwhile forkers to shove some of their work onto the PG
>> community.
>
> +1.
>
> Making the lexer and parser extensible seems desirable to me. It would
> be beneficial not only for companies like EDB and Amazon that might
> want to extend the grammar in various ways, but also for extension
> authors. However, it's vastly harder than Jan's proposal to make the
> wire protocol pluggable. The wire protocol is pretty well-isolated
> from the rest of the system. As long as you can get queries out of the
> packets the client sends and package up the results to send back, it's
> all good.

I would have to disagree that the wire protocol is well-isolated. Sending
and receiving are not handled in a single file, and the message type
codes are not even named constants, so trying to find a specific one is
difficult. Anything that would clean this up would be a benefit.

> That being said, I'm not in favor of transferring maintenance work to
> the community for this set of hooks any more than I am for something
> on the parsing side. In general, I'm in favor of as much extensibility
> as we can reasonably create, but with a complicated proposal like this
> one, the community should expect to be able to get something out of
> it. And so far what I hear Jan saying is that these hooks could in
> theory be used for things other than Amazon's proprietary efforts and
> those things could in theory bring benefits to the community, but
> there are no actual plans to do anything with this that would benefit
> anyone other than Amazon. Which seems to bring us right back to
> expecting the community to maintain things for the benefit of
> third-party forks.

If this proposal brought us the ability to stream results, that would be
a huge plus!

Dave Cramer
www.postgres.rocks
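As a sketch of what named message-type constants could look like here
(the names are hypothetical; the character values are the actual wire
bytes):

/* Hypothetical frontend message-type constants. */
#define PqMsg_Query			'Q'
#define PqMsg_Parse			'P'
#define PqMsg_Bind			'B'
#define PqMsg_Execute		'E'
#define PqMsg_Sync			'S'
#define PqMsg_Terminate		'X'

/* Hypothetical backend message-type constant. */
#define PqMsg_ReadyForQuery	'Z'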
Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()
On Sat, Feb 6, 2021 at 7:40 PM Andres Freund wrote:
> Looks like a mistake on my part... Probably a rename regex that somehow
> went wrong - I went back and forth on those names way too many times.
> Want me to push the fix?

Spotted another one: shouldn't ReadNextFullTransactionId() actually be
called ReadNewFullTransactionId()? Actually, the inverse approach looks
like it produces fewer inconsistencies: you could instead rename
ReadNewTransactionId() to ReadNextTransactionId().
--
Peter Geoghegan
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On Tue, Feb 09, 2021 at 04:35:03AM +0000, tsunakawa.ta...@fujitsu.com wrote:
> Rebased to HEAD with the following modifications. It passes make check in
> the top directory and contrib/postgres_fdw.
> That said, with the reviews from some people and good performance results,
> I think this can be ready for committer.

This is crashing during the fdw check.
http://cfbot.cputube.org/andrey-lepikhov.html

Maybe it's related to this patch:

|commit 6214e2b2280462cbc3aa1986e350e167651b3905
|Fix permission checks on constraint violation errors on partitions.
|Security: CVE-2021-3393

TRAP: FailedAssertion("n >= 0 && n < list->length", File: "../../src/include/nodes/pg_list.h", Line: 259, PID: 19780)

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7fd33a557801 in __GI_abort () at abort.c:79
#2  0x55f7f53bbc88 in ExceptionalCondition (conditionName=conditionName@entry=0x7fd33b81bc40 "n >= 0 && n < list->length", errorType=errorType@entry=0x7fd33b81b698 "FailedAssertion", fileName=fileName@entry=0x7fd33b81be70 "../../src/include/nodes/pg_list.h", lineNumber=lineNumber@entry=259) at assert.c:69
#3  0x7fd33b816b54 in list_nth_cell (n=, list=) at ../../src/include/nodes/pg_list.h:259
#4  list_nth (n=, list=) at ../../src/include/nodes/pg_list.h:281
#5  exec_rt_fetch (estate=, rti=) at ../../src/include/executor/executor.h:558
#6  postgresBeginForeignCopy (mtstate=, resultRelInfo=) at postgres_fdw.c:2208
#7  0x55f7f5114bb4 in ExecInitRoutingInfo (mtstate=mtstate@entry=0x55f7f710a508, estate=estate@entry=0x55f7f71a7d50, proute=proute@entry=0x55f7f710a720, dispatch=dispatch@entry=0x55f7f710a778, partRelInfo=partRelInfo@entry=0x55f7f710eb20, partidx=partidx@entry=0) at execPartition.c:1004
#8  0x55f7f511618d in ExecInitPartitionInfo (partidx=0, rootResultRelInfo=0x55f7f710a278, dispatch=0x55f7f710a778, proute=0x55f7f710a720, estate=0x55f7f71a7d50, mtstate=0x55f7f710a508) at execPartition.c:742
#9  ExecFindPartition () at execPartition.c:400
#10 0x55f7f50a2718 in CopyFrom () at copyfrom.c:857
#11 0x55f7f50a1b06 in DoCopy () at copy.c:299

(gdb) up
#7  0x55f7f5114bb4 in ExecInitRoutingInfo (mtstate=mtstate@entry=0x55f7f710a508, estate=estate@entry=0x55f7f71a7d50, proute=proute@entry=0x55f7f710a720, dispatch=dispatch@entry=0x55f7f710a778, partRelInfo=partRelInfo@entry=0x55f7f710eb20, partidx=partidx@entry=0) at execPartition.c:1004
1004			partRelInfo->ri_FdwRoutine->BeginForeignCopy(mtstate, partRelInfo);

(gdb) p partRelInfo->ri_RangeTableIndex
$7 = 0
(gdb) p *estate->es_range_table
$9 = {type = T_List, length = 1, max_length = 5, elements = 0x55f7f717a2c0, initial_elements = 0x55f7f717a2c0}

--
Justin
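For context on frame #5: exec_rt_fetch() is essentially the following
inline helper (paraphrased from src/include/executor/executor.h).
Range-table indexes are 1-based, so the ri_RangeTableIndex of 0 printed
above makes it fetch list element -1, which is what trips the assertion
in list_nth_cell().

static inline RangeTblEntry *
exec_rt_fetch(Index rti, EState *estate)
{
	/* rti is 1-based; rti == 0 indexes the range table at -1 */
	return (RangeTblEntry *) list_nth(estate->es_range_table, rti - 1);
}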
Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()
On Mon, Feb 15, 2021 at 10:02 AM Peter Geoghegan wrote:
> On Sat, Feb 6, 2021 at 7:40 PM Andres Freund wrote:
>> Looks like a mistake on my part... Probably a rename regex that somehow
>> went wrong - I went back and forth on those names way too many times.
>> Want me to push the fix?
>
> Spotted another one: Shouldn't ReadNextFullTransactionId() actually be
> called ReadNewFullTransactionId()? Actually, the inverse approach
> looks like it produces fewer inconsistencies: you could instead rename
> ReadNewTransactionId() to ReadNextTransactionId().

I prefer "next", because that's in the name of the variable it reads,
and the variable name seemed to me to have a more obvious meaning.
That's why I went for that name in commit 2fc7af5e966. I do agree that
it's slightly strange that the 32 and 64 bit versions differ here. I'd
vote for renaming the 32 bit version to match...
Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()
On Sun, Feb 14, 2021 at 2:08 PM Thomas Munro wrote:
> I prefer "next", because that's in the name of the variable it reads,
> and the variable name seemed to me to have a more obvious meaning.
> That's why I went for that name in commit 2fc7af5e966. I do agree
> that it's slightly strange that the 32 and 64 bit versions differ
> here. I'd vote for renaming the 32 bit version to match...

I was just going to say the same thing myself.

Please do the honors if you have time...
--
Peter Geoghegan
Re: WIP: WAL prefetch (another approach)
On 2/13/21 10:39 PM, Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
>>> Yeah, that's a good point. I think it'd make sense to keep track of
>>> recent FPIs and skip prefetching such blocks. But how exactly should we
>>> implement that, how many blocks do we need to track? If you get an FPI,
>>> how long should we skip prefetching of that block?
>>>
>>> I don't think the history needs to be very long, for two reasons.
>>> Firstly, the usual pattern is that we have FPI + several changes for
>>> that block shortly after it. Secondly, maintenance_io_concurrency
>>> limits this naturally - after crossing that, redo should place the FPI
>>> into shared buffers, allowing us to skip the prefetch.
>>>
>>> So I think using maintenance_io_concurrency is sufficient. We might
>>> track more buffers to allow skipping prefetches of blocks that were
>>> evicted from shared buffers, but that seems like an overkill.
>>>
>>> However, maintenance_io_concurrency can be quite high, so just a simple
>>> queue is not very suitable - searching it linearly for each block would
>>> be too expensive. But I think we can use a simple hash table, tracking
>>> (relfilenode, block, LSN), over-sized to minimize collisions.
>>>
>>> Imagine it's a simple array with (2 * maintenance_io_concurrency)
>>> elements, and whenever we prefetch a block or find an FPI, we simply
>>> add the block to the array as determined by hash(relfilenode, block)
>>>
>>>     hashtable[hash(...)] = {relfilenode, block, LSN}
>>>
>>> and then when deciding whether to prefetch a block, we look at that one
>>> position. If the (relfilenode, block) match, we check the LSN and skip
>>> the prefetch if it's sufficiently recent. Otherwise we prefetch.
>>
>> I'm a bit doubtful this is really needed at this point. Yes, the
>> prefetching will do a buffer table lookup - but it's a lookup that
>> already happens today. And the patch already avoids doing a second
>> lookup after prefetching (by optimistically caching the last Buffer id,
>> and re-checking).
>
> I agree that when a page is looked up, and found, in the buffer table
> that the subsequent cacheing of the buffer id in the WAL records does a
> good job of avoiding having to re-do that lookup. However, that isn't
> the case which was being discussed here or what Tomas's suggestion was
> intended to address.
>
> What I pointed out up-thread and what's being discussed here is what
> happens when the WAL contains a few FPIs and a few regular WAL records
> which are mixed up and not in ideal order. When that happens, with this
> patch, the FPIs will be ignored, the regular WAL records will reference
> blocks which aren't found in shared buffers (yet) and then we'll both
> issue pre-fetches for those and end up having spent effort doing a
> buffer lookup that we'll later re-do.

The question is how common this pattern actually is - I don't know. As
noted, the non-FPI would have to be fairly close to the FPI, i.e. within
the wal_decode_buffer_size, to actually cause measurable harm.

> To address the unnecessary syscalls we really just need to keep track of
> any FPIs that we've seen between the point where the prefetching is
> happening and the point where the replay is being done - once replay has
> replayed an FPI, our buffer lookup will succeed and we'll cache the
> buffer that the FPI is at - in other words, only wal_decode_buffer_size
> amount of WAL needs to be considered.

Yeah, that's essentially what I proposed.
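For concreteness, a minimal sketch of that filter could look like this
(all names are hypothetical, not from the patch):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "common/hashfn.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

typedef struct RecentFpi
{
	RelFileNode rnode;
	BlockNumber blkno;
	XLogRecPtr	lsn;
} RecentFpi;

/* Sized to 2 * maintenance_io_concurrency at startup. */
static RecentFpi *fpi_filter;
static int	fpi_filter_size;

static inline RecentFpi *
fpi_slot(RelFileNode rnode, BlockNumber blkno)
{
	uint32		h = hash_bytes((const unsigned char *) &rnode, sizeof(rnode));

	h = hash_combine(h, hash_uint32(blkno));
	return &fpi_filter[h % fpi_filter_size];
}

/* Called when the decoder sees an FPI (or issues a prefetch) for a block. */
static void
fpi_filter_note(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
{
	RecentFpi  *slot = fpi_slot(rnode, blkno);

	slot->rnode = rnode;
	slot->blkno = blkno;
	slot->lsn = lsn;
}

/* Skip the prefetch if an FPI for this block is sufficiently recent. */
static bool
fpi_filter_skip(RelFileNode rnode, BlockNumber blkno, XLogRecPtr cutoff_lsn)
{
	RecentFpi  *slot = fpi_slot(rnode, blkno);

	return RelFileNodeEquals(slot->rnode, rnode) &&
		slot->blkno == blkno &&
		slot->lsn >= cutoff_lsn;
}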
> We could further leverage this tracking of FPIs, to skip the prefetch
> syscalls, by cacheing what later records address the blocks that have
> FPIs earlier in the queue with the FPI record and then when replay hits
> the FPI and loads it into shared_buffers, it could update the other WAL
> records in the queue with the buffer id of the page, allowing us to very
> likely avoid having to do another lookup later on.

This seems like over-engineering, at least for v1.

>> I think there's potential for some significant optimization going
>> forward, but I think it's basically optimization over what we're doing
>> today. As this is already a nontrivial patch, I'd argue for doing so
>> separately.
>
> This seems like a great optimization, albeit a fair bit of code, for a
> relatively uncommon use-case, specifically where full page writes are
> disabled or checkpoints are very large. As that's the case though, I
> would think it's reasonable to ask that it go out of its way to avoid
> slowing down the more common configurations, particularly since it's
> proposed to have it on by default (which I agree with, provided it ends
> up improving the common cases, which I think the suggestions above
> would certainly make it more likely to do).

I'm OK to do some benchmarking, but it's not quite clear to me why it
matters whether the checkpoints are smaller than shared buffers. IMO what
matters is how "localized" the updates are, i.e. how likely it is to hit
the same page repeatedly (in a short amount of time). Regular pgbench is
not very suitable for that, but some non-uniform distribution should do
the t
Re: WIP: WAL prefetch (another approach)
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 2/13/21 10:39 PM, Stephen Frost wrote:
> >* Andres Freund (and...@anarazel.de) wrote:
> >>On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> >>>Yeah, that's a good point. I think it'd make sense to keep track of recent
> >>>FPIs and skip prefetching such blocks. But how exactly should we implement
> >>>that, how many blocks do we need to track? If you get an FPI, how long
> >>>should we skip prefetching of that block?
> >>>
> >>>I don't think the history needs to be very long, for two reasons. Firstly,
> >>>the usual pattern is that we have FPI + several changes for that block
> >>>shortly after it. Secondly, maintenance_io_concurrency limits this naturally
> >>>- after crossing that, redo should place the FPI into shared buffers,
> >>>allowing us to skip the prefetch.
> >>>
> >>>So I think using maintenance_io_concurrency is sufficient. We might track
> >>>more buffers to allow skipping prefetches of blocks that were evicted from
> >>>shared buffers, but that seems like an overkill.
> >>>
> >>>However, maintenance_io_concurrency can be quite high, so just a simple
> >>>queue is not very suitable - searching it linearly for each block would be
> >>>too expensive. But I think we can use a simple hash table, tracking
> >>>(relfilenode, block, LSN), over-sized to minimize collisions.
> >>>
> >>>Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
> >>>and whenever we prefetch a block or find an FPI, we simply add the block to
> >>>the array as determined by hash(relfilenode, block)
> >>>
> >>>   hashtable[hash(...)] = {relfilenode, block, LSN}
> >>>
> >>>and then when deciding whether to prefetch a block, we look at that one
> >>>position. If the (relfilenode, block) match, we check the LSN and skip the
> >>>prefetch if it's sufficiently recent. Otherwise we prefetch.
> >>
> >>I'm a bit doubtful this is really needed at this point. Yes, the
> >>prefetching will do a buffer table lookup - but it's a lookup that
> >>already happens today. And the patch already avoids doing a second
> >>lookup after prefetching (by optimistically caching the last Buffer id,
> >>and re-checking).
> >
> >I agree that when a page is looked up, and found, in the buffer table
> >that the subsequent cacheing of the buffer id in the WAL records does a
> >good job of avoiding having to re-do that lookup. However, that isn't
> >the case which was being discussed here or what Tomas's suggestion was
> >intended to address.
> >
> >What I pointed out up-thread and what's being discussed here is what
> >happens when the WAL contains a few FPIs and a few regular WAL records
> >which are mixed up and not in ideal order. When that happens, with this
> >patch, the FPIs will be ignored, the regular WAL records will reference
> >blocks which aren't found in shared buffers (yet) and then we'll both
> >issue pre-fetches for those and end up having spent effort doing a
> >buffer lookup that we'll later re-do.
>
> The question is how common this pattern actually is - I don't know. As
> noted, the non-FPI would have to be fairly close to the FPI, i.e. within
> the wal_decode_buffer_size, to actually cause measurable harm.

Yeah, so it'll depend on how big wal_decode_buffer_size is. Increasing
that would certainly help to show if there ends up being a degradation
with this patch due to the extra prefetching being done.

> >To address the unnecessary syscalls we really just need to keep track of
> >any FPIs that we've seen between the point where the prefetching
> >is happening and the point where the replay is being done- once replay
> >has replayed an FPI, our buffer lookup will succeed and we'll cache the
> >buffer that the FPI is at- in other words, only wal_decode_buffer_size
> >amount of WAL needs to be considered.
>
> Yeah, that's essentially what I proposed.

Glad I captured it correctly.

> >We could further leverage this tracking of FPIs, to skip the prefetch
> >syscalls, by cacheing what later records address the blocks that have
> >FPIs earlier in the queue with the FPI record and then when replay hits
> >the FPI and loads it into shared_buffers, it could update the other WAL
> >records in the queue with the buffer id of the page, allowing us to very
> >likely avoid having to do another lookup later on.
>
> This seems like an over-engineering, at least for v1.

Perhaps, though it didn't seem like it'd be very hard to do with the
already proposed changes to stash the buffer id in the WAL records.

> >>I think there's potential for some significant optimization going
> >>forward, but I think it's basically optimization over what we're doing
> >>today. As this is already a nontrivial patch, I'd argue for doing so
> >>separately.
> >
> >This seems like a great optimization, albeit a fair bit of code, for a
> >relatively uncommon use-case, specifically where full page writes are
> >disabled or very large checkpoints. As that's the cas
Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()
On Mon, Feb 15, 2021 at 11:33 AM Peter Geoghegan wrote:
> On Sun, Feb 14, 2021 at 2:08 PM Thomas Munro wrote:
>> I prefer "next", because that's in the name of the variable it reads,
>> and the variable name seemed to me to have a more obvious meaning.
>> That's why I went for that name in commit 2fc7af5e966. I do agree
>> that it's slightly strange that the 32 and 64 bit versions differ
>> here. I'd vote for renaming the 32 bit version to match...
>
> I was just going to say the same thing myself.
>
> Please do the honors if you have time...

Done.
Re: Preventing free space from being reused
> I would suggest to take a look at the BRIN opclass multi-minmax
> currently in development.

Thank you, this does look like it could help a lot with BRIN performance
in this situation! But again, if index performance alone were the only
issue, then I would simply accept the space overhead and switch to btree.

However, the disk fragmentation issue still remains and is significant.
It is also amplified in my use case due to using ZFS, mostly for
compression. But it is worth it: I am currently observing a 13x
compression ratio (when comparing disk space reported by du and select
sum(octet_length(x)), so this does not include the false gains from
compressing padding). But in general, any variable-sized append-only
workload suffers from this fragmentation problem. It's just that with
filesystem compression, there is no longer a good reason to fill up
those holes and accept the fragmentation.

To be clear, the main reason why I even brought my questions to this
mailing list is that I don't know how to (correctly) get past the check
in heap_getnext (see my first email) when implementing the workaround as
a custom table access method; see the sketch after the quoted message
below. A reloption could theoretically disable free space maps entirely
for some added efficiency, but I'm inclined to agree that this is not
really needed.

On Sat, Feb 13, 2021 at 1:36 PM John Naylor wrote:
> On Fri, Feb 12, 2021 at 6:21 PM Noah Bergbauer wrote:
> >
> > A btree index on the same column is 700x the size of BRIN, or 10% of
> > the relation itself. It does not perform significantly better than
> > BRIN. The issue here is twofold: not only does slotting these tuples
> > into older pages significantly reduce the effectiveness of BRIN, it
> > also causes fragmentation on disk. Ultimately, this is why CLUSTER
> > exists. One way to look at this situation is that my data is inserted
> > exactly in index order, but Postgres keeps un-clustering it for
> > reasons that are valid in general (don't waste disk space) but don't
> > apply at all in this case (the file system uses compression, no space
> > is wasted).
> >
> > Any alternative ideas would of course be much appreciated! But at the
> > moment HEAP_INSERT_SKIP_FSM seems like the most practical solution to me.
>
> I would suggest to take a look at the BRIN opclass multi-minmax
> currently in development. It's designed to address that exact
> situation, and more review would be welcome:
>
> https://commitfest.postgresql.org/32/2523/
>
> --
> John Naylor
> EDB: http://www.enterprisedb.com
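For reference, a minimal sketch of the heapam-level call that the
HEAP_INSERT_SKIP_FSM workaround relies on (the wrapper name is an
assumption; a real table access method would route this through its
tuple_insert callback):

#include "postgres.h"
#include "access/heapam.h"

/*
 * Insert a tuple while bypassing the free space map, so it always lands
 * on the relation's last page (or a newly extended one) instead of being
 * slotted into an older page with leftover free space.
 */
static void
append_only_insert(Relation rel, HeapTuple tup, CommandId cid)
{
	heap_insert(rel, tup, cid, HEAP_INSERT_SKIP_FSM, NULL);
}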
Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()
On Sun, Feb 14, 2021 at 4:21 PM Thomas Munro wrote:
> Done.

Thanks.
--
Peter Geoghegan
GCC warning in back branches
Hi,

guc.c: In function ‘RestoreGUCState’:
guc.c:9455:4: error: ‘varsourceline’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 9455 |    set_config_sourcefile(varname, varsourcefile, varsourceline);
      |    ^~~~

I propose the attached.

From d0ce66b0a24c99ed94f8caac32084b44990863ce Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 15 Feb 2021 14:04:58 +1300
Subject: [PATCH 1/2] Fix compiler warning in back branches.

Back-patch a tiny bit of commit fbb2e9a0 into 9.6 and 10, to silence an
uninitialized variable warning from GCC 10.2.
---
 src/backend/utils/misc/guc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9f47bd0aea..9e26cbd18e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -9437,6 +9437,8 @@ RestoreGUCState(void *gucstate)
 		if (varsourcefile[0])
 			read_gucstate_binary(&srcptr, srcend,
 								 &varsourceline, sizeof(varsourceline));
+		else
+			varsourceline = 0;
 		read_gucstate_binary(&srcptr, srcend,
 							 &varsource, sizeof(varsource));
 		read_gucstate_binary(&srcptr, srcend,
--
2.30.0
RE: [PoC] Non-volatile WAL buffer
From: Masahiko Sawada
> I've done some performance benchmarks with the master and NTT v4
> patch. Let me share the results.
> ...
>         master   NTT     master-unlogged
> 32      113209   67107   154298
> 64      144880   54289   178883
> 96      151405   50562   180018
>
> "master-unlogged" is the same setup as "master" except for using
> unlogged tables (using the --unlogged-tables pgbench option). The TPS
> increased by about 20% compared to the "master" case (i.e., the logged
> table case). The reason why I experimented with the unlogged table case
> as well is that we can think of these results as an ideal performance,
> as if we were able to write WAL records in 0 sec. IOW, even if the PMEM
> patch would significantly improve WAL logging performance, I think it
> could not exceed this performance. But the hope is that if we currently
> have a performance bottleneck in WAL logging (e.g., locking and writing
> WAL), removing or minimizing WAL logging would bring a chance to
> further improve performance by eliminating the newly emerging
> bottleneck.

Could you tell us the specifics of the storage for WAL, e.g., SSD/HDD,
whether the interface is NVMe/SAS/SATA, the read-write throughput and
latency (from the product catalog), and the product model? Was the WAL
stored on a storage device separate from the other files? I want to know
if the comparison is as fair as possible. I guess that in the NTT (PMEM)
case, the WAL traffic is not affected by the I/Os of the other files.

What would the comparison look like between master and unlogged-master
if you placed the WAL on a DAX-aware filesystem like xfs or ext4 on
PMEM, which Oracle recommends as redo log storage? That is, if we place
the WAL on the fastest storage configuration possible, what would be
the difference between logged and unlogged? I'm asking these to know if
we consider it worthwhile to make further efforts in special code for
WAL on PMEM.

> Besides, I've checked the main wait events on each experiment using
> pg_wait_sampling. Here are the top 5 wait events on the "master" case,
> excluding wait events on the main function of auxiliary processes:
>
>  event_type |        event         |  sum
> ------------+----------------------+-------
>  Client     | ClientRead           | 46902
>  LWLock     | WALWrite             | 33405
>  IPC        | ProcArrayGroupUpdate |  8855
>  LWLock     | WALInsert            |  3215
>  LWLock     | ProcArray            |  3022
>
> We can see the wait event on WALWrite lwlock acquisition happened many
> times and it was the primary wait event.
>
> The result of the "ntt" case is:
>
>  event_type |        event         |  sum
> ------------+----------------------+--------
>  LWLock     | WALInsert            | 126487
>  Client     | ClientRead           |  12173
>  LWLock     | BufferContent        |   4480
>  Lock       | transactionid        |   2017
>  IPC        | ProcArrayGroupUpdate |    924
>
> The wait event on WALWrite lwlock disappeared. Instead, there were
> many wait events on WALInsert lwlock. I've not investigated this
> result yet. This could be because the v4 patch acquires the WALInsert
> lock more than necessary, or writing WAL records to PMEM took more
> time than writing to DRAM, as Tomas mentioned before.

Increasing NUM_XLOGINSERT_LOCKS might improve the result, but I don't
have much hope because PMEM appears to have limited concurrency...

Regards
Takayuki Tsunakawa
Re: doing something about the broken dynloader.h symlink
On Fri, Jun 19, 2020 at 11:38 PM Julien Rouhaud wrote:
> On Fri, Jun 19, 2020 at 12:08 PM Thomas Munro wrote:
>> On Fri, Jun 19, 2020 at 8:02 PM Peter Eisentraut wrote:
>>> +[# Remove links created by old versions of configure, so that there
>>> +# are no broken symlinks in the tree
>>> +rm -f src/include/dynloader.h])
>>
>> +1
>
> +1

+1
Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)
On Sun, Feb 14, 2021 at 11:39:47AM -0300, Ranier Vilela wrote:
> What do you think?

That's not a good idea, for two reasons:
1) There is CRC32 to worry about, which relies on a different logic.
2) It would become easier to miss the new option, as compilation would
   not warn anymore if a new checksum type is added.

I have reviewed my patch this morning, tweaked a comment, and applied it.
--
Michael
Re: shared tempfile was not removed on statement_timeout
On Fri, Feb 5, 2021 at 5:47 PM Thomas Munro wrote:
> On Sun, Jan 31, 2021 at 6:07 PM Tom Lane wrote:
>> Thomas Munro writes:
>>> So that gives a very simple back-patchable patch.
>>
>> Hmm, so is the *rest* of that function perfectly okay with being
>> interrupted?
>
> It looks OK to me. There aren't any CFI()s in there.

Pushed. That closes CF #2657.
Re: GCC warning in back branches
On Mon, Feb 15, 2021 at 02:15:51PM +1300, Thomas Munro wrote:
> guc.c: In function ‘RestoreGUCState’:
> guc.c:9455:4: error: ‘varsourceline’ may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>  9455 |    set_config_sourcefile(varname, varsourcefile, varsourceline);
>       |    ^~~~
>
> I propose the attached.

We usually don't bother much about compilation warnings in stable
branches as long as they are not real bugs, and these are the oldest
stable ones. So why here? I would have patched the top of the function
if it were me, btw.
--
Michael
Re: GCC warning in back branches
Michael Paquier writes:
> On Mon, Feb 15, 2021 at 02:15:51PM +1300, Thomas Munro wrote:
>> I propose the attached.
>
> We usually don't bother much about compilation warnings in stable
> branches as long as they are not real bugs, and these are the oldest
> stable ones. So why here? I would have patched the top of the
> function if it were me, btw.

If somebody were running a buildfarm member with recent gcc and -Werror,
we'd pretty much have to fix it.

I'd say the real policy is that we don't worry about
uninitialized-variable warnings from old compiler versions, on the
theory that they're probably compiler shortcomings. But I'd be inclined
to fix anything from a current gcc version.

regards, tom lane
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Thu, Feb 04, 2021 at 03:38:39PM +0900, Michael Paquier wrote:
> On Wed, Feb 03, 2021 at 07:54:42PM +0900, Michael Paquier wrote:
> > Not sure I like that. Here is a proposal:
> > "it is recommended to separately use ALTER TABLE ONLY on them so as
> > any new partitions attached inherit the new tablespace value."
>
> So, I have done more work on this stuff today, and applied that as of
> c5b2860.
> A second thing I have come back to is allow_system_table_mods for
> toast relations, and decided to just forbid TABLESPACE if attempting
> to use it directly on a system table even if allow_system_table_mods
> is true. This was leading to inconsistent behaviors and weirdness in
> the concurrent case because all the indexes are processed in series
> after building a list. As we want to ignore the move of toast indexes
> when moving the indexes of the parent table, this was leading to extra
> conditions that are not really worth supporting after thinking about
> it. One other issue was the lack of consistency when using pg_global
> that was a no-op for the concurrent case but failed in the
> non-concurrent case. I have put in place more regression tests for
> all that.

Isn't this dead code?

postgres=# REINDEX (CONCURRENTLY, TABLESPACE pg_global) TABLE pg_class;
ERROR:  0A000: cannot reindex system catalogs concurrently
LOCATION:  ReindexRelationConcurrently, indexcmds.c:3276

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 127ba7835d..c77a9b2563 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3260,73 +3260,66 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 		{
 			if (IsCatalogRelationOid(relationOid))
 				ereport(ERROR,
 						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 						 errmsg("cannot reindex system catalogs concurrently")));
...
-			if (OidIsValid(params->tablespaceOid) &&
-				IsSystemRelation(heapRelation))
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("cannot move system relation \"%s\"",
-								RelationGetRelationName(heapRelation))));
-
@@ -3404,73 +3397,66 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 				if (IsCatalogRelationOid(heapId))
 					ereport(ERROR,
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex system catalogs concurrently")));
...
-				if (OidIsValid(params->tablespaceOid) &&
-					IsSystemRelation(heapRelation))
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot move system relation \"%s\"",
-									get_rel_name(relationOid))));
-

From b4347c18bc732d30295c065ef71edaac65e68fe6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sun, 14 Feb 2021 19:49:52 -0600
Subject: [PATCH] Dead code: REINDEX (CONCURRENTLY, TABLESPACE ..):
 c5b286047cd698021e57a527215b48865fd4ad4e

---
 src/backend/commands/indexcmds.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 127ba7835d..c77a9b2563 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3260,73 +3260,66 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 		{
 			/*
 			 * In the case of a relation, find all its indexes including
 			 * toast indexes.
 			 */
 			Relation	heapRelation;

 			/* Save the list of relation OIDs in private context */
 			oldcontext = MemoryContextSwitchTo(private_context);

 			/* Track this relation for session locks */
 			heapRelationIds = lappend_oid(heapRelationIds, relationOid);

 			MemoryContextSwitchTo(oldcontext);

 			if (IsCatalogRelationOid(relationOid))
 				ereport(ERROR,
 						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 						 errmsg("cannot reindex system catalogs concurrently")));

 			/* Open relation to get its indexes */
 			if ((params->options & REINDEXOPT_MISSING_OK) != 0)
 			{
 				heapRelation = try_table_open(relationOid,
 											  ShareUpdateExclusiveLock);
 				/* leave if relation does not exist */
 				if
Re: Use pg_pwrite() in pg_test_fsync
On Sun, Jan 24, 2021 at 1:50 PM Thomas Munro wrote:
> On Sun, Jan 10, 2021 at 9:21 AM Thomas Munro wrote:
>> I left the fsync-after-closing and non-sync'd tests using write(),
>> because they weren't using lseek(). The latter case is arguably a bit
>> odd because it's not overwriting pre-allocated blocks, unlike the
>> earlier tests.
>
> On closer inspection, the weird thing about that final test is that
> it's opening and closing the file every time. That doesn't seem to
> make any sense. Perhaps it's a copy and paste error from the previous
> test? In v2 I changed it to pg_pwrite(), and moved the open and close
> calls out of the loop.

Pushed.
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-02-08 13:01, Fujii Masao wrote:
> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>> I pgindented the patches.
>
> Thanks for updating the patches!

Thanks for checking the patches.

> +        XLogWrite, which nomally called by an
> +        issue_xlog_fsync, which nomally called by an
>
> Typo: "nomally" should be "normally"?

Yes, I'll fix it.

> +        XLogFlush request(see <xref linkend="wal-configuration"/>)
> +        XLogFlush request(see <xref linkend="wal-configuration"/>),
>
> Isn't it better to add a space character just after "request"?

Thanks, I'll fix it.

> +			INSTR_TIME_SET_CURRENT(duration);
> +			INSTR_TIME_SUBTRACT(duration, start);
> +			WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
>
> If several cycles happen in the do-while loop, m_wal_write_time should
> be updated with the sum of "duration" in those cycles instead of
> "duration" in the last cycle? If yes, "+=" should be used instead of
> "=" when updating m_wal_write_time?
>
> +		INSTR_TIME_SET_CURRENT(duration);
> +		INSTR_TIME_SUBTRACT(duration, start);
> +		WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>
> Also "=" should be "+=" in the above?

Yes, they are my mistake when changing the unit from milliseconds to
microseconds.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
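For clarity, a minimal sketch of the corrected accumulation discussed
above (the surrounding write loop is elided):

	instr_time	start;
	instr_time	duration;

	INSTR_TIME_SET_CURRENT(start);
	/* ... one write cycle of the do-while loop runs here ... */
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);

	/*
	 * Accumulate with "+=": the loop may run several cycles, and plain
	 * "=" would keep only the duration of the last one.
	 */
	WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);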
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-02-08 14:26, Fujii Masao wrote:
> On 2021/02/08 13:01, Fujii Masao wrote:
>> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>>> I pgindented the patches.
>>
>> Thanks for updating the patches!
>>
>> +        XLogWrite, which nomally called by an
>> +        issue_xlog_fsync, which nomally called by an
>>
>> Typo: "nomally" should be "normally"?
>>
>> +        XLogFlush request(see <xref linkend="wal-configuration"/>)
>> +        XLogFlush request(see <xref linkend="wal-configuration"/>),
>>
>> Isn't it better to add a space character just after "request"?
>>
>> +			INSTR_TIME_SET_CURRENT(duration);
>> +			INSTR_TIME_SUBTRACT(duration, start);
>> +			WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
>>
>> If several cycles happen in the do-while loop, m_wal_write_time should
>> be updated with the sum of "duration" in those cycles instead of
>> "duration" in the last cycle? If yes, "+=" should be used instead of
>> "=" when updating m_wal_write_time?
>>
>> +		INSTR_TIME_SET_CURRENT(duration);
>> +		INSTR_TIME_SUBTRACT(duration, start);
>> +		WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>>
>> Also "=" should be "+=" in the above?
>
> +	/* Send WAL statistics */
> +	pgstat_send_wal();
>
> This may cause overhead in WAL-writing by walwriter because it's called
> every cycle, even when walwriter needs to write more WAL in the next
> cycle (and doesn't need to sleep on WaitLatch)? If this is right,
> pgstat_send_wal() should be called only when WaitLatch() returns with
> WL_TIMEOUT?

Thanks, I didn't notice that. I'll fix it.

> -        XLogFlush request(see <xref linkend="wal-configuration"/>)
> +        XLogFlush request(see <xref linkend="wal-configuration"/>),
> +        or WAL data written out to disk by WAL receiver.
>
> So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
> wal_sync_time are updated even while the other values are not. Isn't
> this confusing to users? If so, what about reporting those walreceiver
> stats in pg_stat_wal_receiver?

OK, I'll add new infrastructure code to interact with the WAL receiver
and the stats collector, and show the stats in pg_stat_wal_receiver.

> 	if (endofwal)
> +	{
> +		/* Send WAL statistics to the stats collector */
> +		pgstat_send_wal();
> 		break;
>
> You added pgstat_send_wal() so that it's called in some cases where
> walreceiver exits. But ISTM that there are other walreceiver-exit
> cases. For example, in the case where SIGTERM is received. Instead,
> pgstat_send_wal() should be called in WalRcvDie() for those all cases?

Thanks, I forgot the case. I'll fix it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
Snapshot scalability patch issue
The call to heap_page_prune() within lazy_scan_heap() passes a bool
literal ('false') as its fourth argument. But the fourth argument is of
type TransactionId, not bool. This has been the case since the snapshot
scalability work performed by commit dc7420c2c92. Surely something is
amiss here.

I also notice some inconsistencies in the heap_page_prune() prototype
parameter names vs. the corresponding definition names. Might be worth
doing something about in passing.
--
Peter Geoghegan
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Sat, Feb 13, 2021 at 11:41 AM japin wrote:
>> IIUC, with the current patch, the new ALTER SUBSCRIPTION ... ADD/DROP
>> errors out on the first publication that already exists/that doesn't
>> exist, right? What if there are multiple publications given in the
>> ADD/DROP list, and a few of them exist/don't exist? Isn't it good if we
>> loop over the subscription's publication list and show all the already
>> existing/not existing publications in the error message, instead of
>> just erroring out for the first existing/not existing publication?
>
> Yes, you are right. Agree with you, I modified it. Please consider v5
> for further review.

Thanks for the updated patches. I have a comment about the code that
reports the existing/not existing publications: how about something like
the attached delta patch on v5-0002? Sorry for attaching.

I also think that we could merge 0002 into 0001 and have only 4 patches
in the patch set.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v1-0001-delta-patch-for-reporting-error.patch
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Sun, Feb 14, 2021 at 08:10:50PM -0600, Justin Pryzby wrote:
> Isn't this dead code?

Nope, it's not dead. Those two code paths can be hit when attempting a
reindex with a tablespace move directly on toast tables and indexes, see:

=# create table aa (a text);
CREATE TABLE
=# select relname from pg_class where oid > 16000;
       relname
----------------------
 aa
 pg_toast_16385
 pg_toast_16385_index
(3 rows)

=# reindex (concurrently, tablespace pg_default) table pg_toast.pg_toast_16385;
ERROR:  0A000: cannot move system relation "pg_toast_16385"
LOCATION:  ReindexRelationConcurrently, indexcmds.c:3295

=# reindex (concurrently, tablespace pg_default) index pg_toast.pg_toast_16385_index;
ERROR:  0A000: cannot move system relation "pg_toast_16385_index"
LOCATION:  ReindexRelationConcurrently, indexcmds.c:3439

It is easy to save the relation name using \gset in a regression test,
but we had better keep a reference to the relation name in the error
message, so this would not be really portable. Using a PL function to do
that with a CATCH block would not work either, as CONCURRENTLY cannot be
run in a transaction block. This leaves 090_reindexdb.pl, but I was not
really convinced that this was worth the extra test cycles (I am aware
of the --tablespace option missing in reindexdb; someone I know was
trying to get that done for the next CF).
--
Michael
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-02-10 00:51, David G. Johnston wrote:
> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda wrote:
>> I pgindented the patches.
>
> ... XLogWrite, which is invoked during an XLogFlush request (see ...).
> This is also incremented by the WAL receiver during replication.
>
> ("which normally called" should be "which is normally called" or "which
> normally is called" if you want to keep true to the original)
>
> You missed adding the space before an opening parenthesis here and
> elsewhere (probably copy-paste).
>
> is ether -> is either
>
> "This parameter is off by default as it will repeatedly query the
> operating system..."
> ", because" -> "as"

Thanks, I fixed them.

> wal_write_time and the sync items also need the note: "This is also
> incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
into pg_stat_wal_receiver.

> "The number of times it happened..." -> "(the tally of this event is
> reported in wal_buffers_full in) This is undesirable because ..."

Thanks, I fixed it.

> I notice that the patch for WAL receiver doesn't require explicitly
> computing the sync statistics but does require computing the write
> statistics. This is because of the presence of issue_xlog_fsync but
> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
> receiver path does not. It seems technically straight-forward to
> refactor here to avoid the almost-duplicated logic in the two places,
> though I suspect there may be a trade-off for not adding another
> function call to the stack given the importance of WAL processing
> (though that seems marginalized compared to the cost of actually
> writing the WAL). Or, as Fujii noted, go the other way and don't have
> any shared code between the two but instead implement the WAL receiver
> one to use pg_stat_wal_receiver instead. In either case, this
> half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch). I added
the infrastructure code to communicate the WAL receiver stats messages
between the WAL receiver and the stats collector, and the stats for the
WAL receiver are counted in pg_stat_wal_receiver.

What do you think?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 4ecbc467f88a2f923b3bc2eb6fe2c4f7725c02be Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda
Date: Fri, 12 Feb 2021 11:19:59 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds the following statistics to the pg_stat_wal view to
track WAL I/O activity by backends and background processes except the
WAL receiver:

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since tracking I/O timing may lead to significant overhead, the GUC
parameter "track_wal_io_timing" is introduced. Only if this is on is
the I/O timing measured.

The statistics related to sync are zero when "wal_sync_method" is
"open_datasync" or "open_sync", because it doesn't call each sync
method.
(This requires a catversion bump, as well as an update to
PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston, Fujii Masao
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75...@oss.nttdata.co
---
 doc/src/sgml/config.sgml                      | 23 +++-
 doc/src/sgml/monitoring.sgml                  | 50
 doc/src/sgml/wal.sgml                         | 12 +++-
 src/backend/access/transam/xlog.c             | 57 +++
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            | 16 --
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               |  6 +-
 src/include/pgstat.h                          | 10
 src/test/regress/expected/rules.out           |  6 +-
 14 files changed, 208 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..0bb03d0717 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7404,7 +7404,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the tool to measure the overhead of timing
         on your system.
@@ -7418,6 +7418,27 @@ COPY postgres_log FROM '/full/path/to/logfil
Re: Some regular-expression performance hacking
"Joel Jacobson" writes: > Below is the result of the performance test query: > -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a: > Time: 592196.145 ms (09:52.196) > -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches: > Time: 461739.364 ms (07:41.739) > That's an impressive 22% speed-up! I've been doing some more hacking over the weekend, and have a couple of additional improvements to show. The point of these two additional patches is to reduce the number of "struct subre" sub-regexps that the regex parser creates. The subre's themselves aren't that large, so this might seem like it would have only small benefit. However, each subre requires its own NFA for the portion of the RE that it matches. That adds space, and it also adds compilation time because we run the "optimize()" pass separately for each such NFA. Maybe there'd be a way to share some of that work, but I'm not very clear how. In any case, not having a subre at all is clearly better where we can manage it. 0003 is a small patch that fixes up parseqatom() so that it doesn't emit no-op subre's for empty portions of a regexp branch that are adjacent to a "messy" regexp atom (that is, a capture node, a backref, or an atom with greediness different from what preceded it). 0004 is a rather larger patch whose result is to get rid of extra subre's associated with alternation subre's. If we have a|b|c and any of those alternation branches are messy, we end up with * / \ a * / \ b * / \ c NULL where each "*" is an alternation subre node, and all those "*"'s have identical NFAs that match the whole a|b|c construct. This means that for an N-way alternation we're going to need something like O(N^2) work to optimize all those NFAs. That's embarrassing (and I think it's my fault --- if memory serves, I put in this representation of messy alternations years ago). We can improve matters by having just one parent node for an alternation: * \ a -> b -> c That requires replacing the binary-tree structure of subre's with a child-and-sibling arrangement, which is not terribly difficult but accounts for most of the bulk of the patch. (I'd wanted to do that for years, but up till now I did not think it would have any real material benefit.) There might be more that can be done in this line, but that's as far as I got so far. I did some testing on this using your dataset (thanks for giving me a copy) and this query: SELECT pattern, subject, is_match AS is_match_head, captured AS captured_head, subject ~ pattern AS is_match_patch, regexp_match(subject, pattern) AS captured_patch FROM subjects WHERE error IS NULL AND (is_match <> (subject ~ pattern) OR captured IS DISTINCT FROM regexp_match(subject, pattern)); I got these runtimes (non-cassert builds): HEAD313661.149 ms (05:13.661) +0001 297397.293 ms (04:57.397) 5% better than HEAD +0002 151995.803 ms (02:31.996) 51% better than HEAD +0003 139843.934 ms (02:19.844) 55% better than HEAD +0004 95034.611 ms (01:35.035)69% better than HEAD Since I don't have all the tables used in your query, I can't try to reproduce your results exactly. I suspect the reason I'm getting a better percentage improvement than you did is that the joining/grouping/ordering involved in your query creates a higher baseline query cost. Anyway, I'm feeling pretty pleased with these results ... 
regards, tom lane diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c index 1e4f0121f3..fcf03de32d 100644 --- a/contrib/pg_trgm/trgm_regexp.c +++ b/contrib/pg_trgm/trgm_regexp.c @@ -282,8 +282,8 @@ typedef struct typedef int TrgmColor; /* We assume that colors returned by the regexp engine cannot be these: */ -#define COLOR_UNKNOWN (-1) -#define COLOR_BLANK (-2) +#define COLOR_UNKNOWN (-3) +#define COLOR_BLANK (-4) typedef struct { @@ -780,7 +780,8 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA) palloc0(colorsCount * sizeof(TrgmColorInfo)); /* - * Loop over colors, filling TrgmColorInfo about each. + * Loop over colors, filling TrgmColorInfo about each. Note we include + * WHITE (0) even though we know it'll be reported as non-expandable. */ for (i = 0; i < colorsCount; i++) { @@ -1098,9 +1099,9 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key) /* Add enter key to this state */ addKeyToQueue(trgmNFA, &destKey); } - else + else if (arc->co >= 0) { - /* Regular color */ + /* Regular color (including WHITE) */ TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co]; if (colorInfo->expandable) @@ -1156,6 +1157,14 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key) addKeyToQueue(trgmNFA, &destKey); } } + else + { + /* RAINBOW: treat as unexpandable color */ + destKey.prefix.colors[0] = COLOR_UNKNOWN; + destKey.prefix.colors[1] = C
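As a side note for readers not steeped in the regex code, the shape change made by 0004 can be sketched like this. It is illustrative only; the real struct subre in src/include/regex/regguts.h carries op codes, flags, and NFA numbers besides the links, and the names below are invented for the comparison.

    /* Before: binary tree.  a|b|c chains one "*" node per branch, and
     * every "*" owns an NFA matching the whole alternation, hence the
     * roughly O(N^2) optimize() work for an N-way alternation. */
    struct subre_binary
    {
        struct subre_binary *left;   /* this branch's subexpression */
        struct subre_binary *right;  /* rest of the alternation */
    };

    /* After: child-and-sibling.  A single parent "*" node keeps the
     * whole-alternation NFA; branches hang off it as a sibling list. */
    struct subre_child_sibling
    {
        struct subre_child_sibling *child;    /* first branch: a */
        struct subre_child_sibling *sibling;  /* next branch: b, then c */
    };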
Re: Default wal_sync_method on FreeBSD
On Fri, Jan 8, 2021 at 2:19 PM Thomas Munro wrote: > The system header change has one interesting consequence for existing > releases of PostgreSQL, though: xlogdefs.h now sees that there is an > O_DSYNC macro that is distinct from O_SYNC, and defaults to > wal_sync_method=open_datasync. That's not a great default setting, > because it gets you O_DIRECT | O_DSYNC, which performs terribly when > you're writing 8KB blocks on UFS's default 32KB logical block size (it > triggers read-before-write, quite visibly destroying performance with > eg pg_test_fsync), and for all I know, it might even not work at all > on some other file systems. I suspect it might come out very slightly > ahead on a UFS filesystem created with 8KB blocks, but in any case, > that seems like something you should have to opt in to, as you do on > Linux. Done.
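For anyone wondering where that default comes from, the selection logic in src/include/access/xlogdefs.h boils down to something like the following condensed paraphrase; the real header spreads the conditions over several #ifdef blocks, so treat the exact macro spellings here as approximate.

    /* If O_DSYNC exists and is distinct from O_SYNC, open_datasync
     * becomes available ... */
    #if defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
    #define OPEN_DATASYNC_FLAG O_DSYNC
    #endif

    /* ... and, when available, it wins as the platform default. */
    #ifdef OPEN_DATASYNC_FLAG
    #define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
    #elif defined(HAVE_FDATASYNC)
    #define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
    #else
    #define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
    #endif

So once FreeBSD's headers started defining a distinct O_DSYNC, the first branch began to fire and the default silently changed, which is what the commit described above opts out of.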
Re: repeated decoding of prepared transactions
On Sat, Feb 13, 2021 at 10:23 PM Andres Freund wrote:
>
> On 2021-02-13 17:37:29 +0100, Petr Jelinek wrote:
> > AFAIK this is exactly why origins are WAL-logged along with the
> > transaction; it allows us to guarantee never getting anything that has
> > been durably written.
>
> I think you'd need something like origins in that case, because
> something could still go wrong before the other side has received the
> flush (network disconnect, primary crash, ...).
>

We are already using origins in the apply worker to guarantee that, and with each commit, the origin's LSN location is also WAL-logged. That helps us to send the start location for a slot after a restart. As far as I understand, this is how it works from the apply-worker side. I am not sure if I am missing something here?

--
With Regards,
Amit Kapila.
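To make the mechanism concrete, here is a hedged sketch of how an apply worker gets its origin progress WAL-logged with each commit, loosely modeled on src/backend/replication/logical/worker.c. The function itself is invented for illustration; the replorigin_session_* variables and CommitTransactionCommand() are the real interfaces.

    #include "access/xact.h"
    #include "replication/origin.h"

    static void
    apply_commit(XLogRecPtr remote_commit_lsn, TimestampTz remote_commit_time)
    {
        /* Setting these before committing makes the origin LSN part of
         * the commit's WAL record, so the slot's restart position is
         * recorded durably along with the transaction itself. */
        replorigin_session_origin_lsn = remote_commit_lsn;
        replorigin_session_origin_timestamp = remote_commit_time;

        CommitTransactionCommand();
    }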
Re: 64-bit XIDs in deleted nbtree pages
On Sat, Feb 13, 2021 at 10:47 PM Peter Geoghegan wrote:
> It will be rare. But more importantly, the fact that this scenario is now
> an extreme case justifies treating it as an extreme case. We can teach
> _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
> particular, I think that it will now be okay to increase the threshold
> applied when considering deleted pages inside
> _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
> patch. But in v4, which has the new recycling enhancement, I think
> that it would be sensible to make it 5%, or maybe even 10%. This
> naturally makes Masahiko's problem scenario unlikely to actually
> result in a truly wasted call to btvacuumscan().

Attached is v4, which has the "recycle pages that we ourselves deleted during this same VACUUM operation" enhancement. It also doubles the _bt_vacuum_needs_cleanup() threshold applied to deleted pages -- it goes from 2.5% to 5%.

The new patch in the patch series (v4-0002-*) certainly needs more polishing. I'm posting what I have now because v3 has bitrot.

Benchmarking has shown that the enhancement in v4-0002-* can significantly reduce the amount of index bloat in two of the BenchmarkSQL/TPC-C indexes.

--
Peter Geoghegan

v4-0002-Recycle-pages-deleted-during-same-VACUUM.patch
Description: Binary data

v4-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch
Description: Binary data

v4-0003-Add-pages_newly_deleted-to-VACUUM-VERBOSE.patch
Description: Binary data
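A hedged sketch of the heuristic change being described; the name and surrounding logic are simplified, since the real test lives inside _bt_vacuum_needs_cleanup() and consults the index metapage.

    /* Trigger a btvacuumscan() solely for page recycling only when the
     * proportion of deleted-but-unrecycled pages is large enough.  v3
     * used 2.5%; v4 doubles that to 5% on the strength of the new
     * same-VACUUM recycling enhancement. */
    static bool
    deleted_pages_exceed_threshold(BlockNumber total_pages,
                                   BlockNumber deleted_pages)
    {
        return (double) deleted_pages > (double) total_pages * 0.05;
    }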
RE: [POC] Fast COPY FROM command for the table with foreign partitions
From: Justin Pryzby > This is crashing during fdw check. > http://cfbot.cputube.org/andrey-lepikhov.html > > Maybe it's related to this patch: > |commit 6214e2b2280462cbc3aa1986e350e167651b3905 > |Fix permission checks on constraint violation errors on partitions. > |Security: CVE-2021-3393 Thank you for your kind detailed investigation. The rebased version is attached. Regards Takayuki Tsunakawa v15-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch Description: v15-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Re: GCC warning in back branches
On Mon, Feb 15, 2021 at 2:35 PM Michael Paquier wrote: > ... I would have patched the top of the > function if it were me, btw. I just copied the way it is coded in master (due to commit fbb2e9a0 which fixed this warning in 11+).
Re: PG vs LLVM 12 on seawasp, next round
On Sat, Dec 12, 2020 at 8:45 AM Fabien COELHO wrote:
> > +ERROR: could not load library
> > "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so":
> > libLLVMOrcJIT.so.12git: cannot open shared object file: No such file or
> > directory

Bonjour Fabien,

Here is the creation of llvmjit.so:

g++ ... -o llvmjit.so ... -L/home/fabien/clgtk/lib ... -lLLVMOrcJIT ...

That'd be from llvm-config --ldflags or similar, from this binary:

checking for llvm-config... (cached) /home/fabien/clgtk/bin/llvm-config

So what does ls -slap /home/fabien/clgtk/lib show? What does ldd /home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so show? What do you have in seawasp's LD_LIBRARY_PATH, and is the problem that you need to add /home/fabien/clgtk/lib to it (or is it supposed to be making it into the rpath, or is it in your ld.so.conf)?

PS Could you try blowing away the accache directory so we can rule out bad cached configure stuff?
Re: adding wait_start column to pg_locks
On 2021/02/10 10:43, Fujii Masao wrote: On 2021/02/09 23:31, torikoshia wrote: On 2021-02-09 22:54, Fujii Masao wrote: On 2021/02/09 19:11, Fujii Masao wrote: On 2021/02/09 18:13, Fujii Masao wrote: On 2021/02/09 17:48, torikoshia wrote: On 2021-02-05 18:49, Fujii Masao wrote: On 2021/02/05 0:03, torikoshia wrote: On 2021-02-03 11:23, Fujii Masao wrote:

64-bit fetches are not atomic on some platforms. So is a spinlock necessary when updating "waitStart" without holding the partition lock? Also, does GetLockStatusData() need a spinlock when reading "waitStart"? Also, it might be worth considering the use of 64-bit atomic operations like pg_atomic_read_u64() for that.

Thanks for your suggestion and advice! In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64(). waitStart is TimestampTz, i.e., int64, but it seems pg_atomic_read_xxx and pg_atomic_write_xxx only support unsigned int, so I cast the type. I may not be using these functions correctly, so if something is wrong, I would appreciate any comments. About the documentation, since your suggestion seems better than v6, I used it as is.

Thanks for updating the patch!

+ if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+     pg_atomic_write_u64(&MyProc->waitStart,
+                         pg_atomic_read_u64((pg_atomic_uint64 *) &now));

Is pg_atomic_read_u64() really necessary? I think that "pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+ deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+ pg_atomic_write_u64(&MyProc->waitStart,
+                     pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+ /*
+  * Record waitStart reusing the deadlock timeout timer.
+  *
+  * It would be ideal if this could be done synchronously with updating
+  * the lock information. However, since holding partitionLock longer
+  * would impact performance, we do it here asynchronously.
+  */

IMO it's better to comment why we reuse the deadlock timeout timer.

  proc->waitStatus = waitStatus;
+ pg_atomic_init_u64(&MyProc->waitStart, 0);

Should pg_atomic_write_u64() be used instead? Because waitStart can be accessed concurrently there.

I updated the patch and addressed the above review comments. Patch attached. Barring any objection, I will commit this version.

Thanks for modifying the patch! I agree with your comments. BTW, I ran pgbench several times before and after applying this patch. The environment is a virtual machine (CentOS 8), so this is just for reference, but there was no significant difference in latency or tps (both are below 1%).

Thanks for the test! I pushed the patch.

But I reverted the patch because buildfarm members rorqual and prion don't like the patch. I'm trying to investigate the cause of these failures. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10

- relation | locktype | mode
--+--+-
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR: invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built with --disable-atomics --disable-spinlocks when reading pg_locks after the transaction was prepared. The cause of this issue is that the "waitStart" atomic variable in the dummy proc created at the end of prepare transaction was not initialized. I updated the patch so that pg_atomic_init_u64() is called for the "waitStart" in the dummy proc for prepared transactions. Patch attached. I confirmed that the patched server built with --disable-atomics --disable-spinlocks passed all the regression tests.
Thanks for fixing the bug. I also tested the v9 patch configured with --disable-atomics --disable-spinlocks on my environment and confirmed that all tests passed.

Thanks for the test! I found another bug in the patch. InitProcess() initializes "waitStart", but previously InitAuxiliaryProcess() did not. This could cause an "invalid spinlock number" error when reading pg_locks on the standby server. I fixed that. Attached is the updated version of the patch.

I pushed this version. Thanks!

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
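Condensing what this sub-thread converged on into one hedged sketch (simplified; see the committed code in proc.c for the real thing):

    /* When a backend starts waiting -- the partition lock has already
     * been released, so a plain atomic write is used, not a spinlock: */
    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
        pg_atomic_write_u64(&MyProc->waitStart, (uint64) now);

    /* Every PGPROC slot must run the init exactly once -- including
     * the dummy procs for prepared transactions and the auxiliary
     * processes.  A missed init only shows up in --disable-atomics
     * --disable-spinlocks builds, as "invalid spinlock number: 0"
     * when pg_locks reads the field: */
    pg_atomic_init_u64(&proc->waitStart, 0);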
Re: logical replication seems broken
On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers wrote:
>
> > On 02/13/2021 11:49 AM Amit Kapila wrote:
> >
> > On Fri, Feb 12, 2021 at 10:00 PM wrote:
> > >
> > > > On 02/12/2021 1:51 PM Amit Kapila wrote:
> > > >
> > > > On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers wrote:
> > > > >
> > > > > I am seeing errors in replication in a test program that I've been
> > > > > running for years with very little change (since 2017, really [1]).
> > >
> > > Hi,
> > >
> > > Here is a test program. Careful, it deletes stuff. And it will need
> > > some changes:
> > >
> >
> > Thanks for sharing the test. I think I have found the problem.
> > Actually, it was an existing code problem exposed by the commit
> > ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the
> > prepare_write ('w') message but then the actual message was not being
> > sent. This was the case when we didn't find the origin of a txn. This
> > can happen after that commit because we have now started using origins
> > for tablesync workers as well, and those origins are removed once the
> > tablesync workers are finished. We might want to change the behavior
> > related to the origin messages as indicated in the comments, but for
> > now, I'm fixing the existing code.
> >
> > Can you please test if the attached fixes the problem at your end as well?
> > [fix_origin_message_1.patch]
>
> I compiled just now a binary from HEAD, and a binary from HEAD+patch.
>
> HEAD is still broken; your patch rescues it, so yes, fixed.
>
> Maybe a test (check or check-world) should be added to run a second replica?
> (Assuming that would have caught this bug)
>

+1 for the idea of having a test for this. I have written a test for this. Thanks for the fix, Amit; I could reproduce the issue without your fix and verified that the issue gets fixed with the patch you shared. Attached is a patch for the same. Thoughts?

Regards,
Vignesh

From eff13970ae34bcdc8590a9602eddce5eb6200195 Mon Sep 17 00:00:00 2001
From: vignesh
Date: Mon, 15 Feb 2021 11:41:55 +0530
Subject: [PATCH v1] Test for verifying data is replicated in cascaded setup.

Test for verifying data is replicated in cascaded setup.
---
 src/test/subscription/t/100_bugs.pl | 65 +
 1 file changed, 65 insertions(+)

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index d1e407a..afb2d08 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -153,3 +153,68 @@ is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t"),
 	$rows * 2, "2x$rows rows in t");
 is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t2"),
 	$rows * 2, "2x$rows rows in t2");
+
+# Verify table data is synced with cascaded replication setup.
+my $node_pub = get_new_node('testpublisher1');
+$node_pub->init(allows_streaming => 'logical');
+$node_pub->start;
+
+my $node_pub_sub = get_new_node('testpublisher_subscriber');
+$node_pub_sub->init(allows_streaming => 'logical');
+$node_pub_sub->start;
+
+my $node_sub = get_new_node('testsubscriber1');
+$node_sub->init(allows_streaming => 'logical');
+$node_sub->start;
+
+# Create the tables in all nodes.
+$node_pub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_pub_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+
+# Create a cascaded replication setup like:
+# N1 - Create publication testpub1.
+# N2 - Create publication testpub2 and also include a subscriber which
+#      subscribes to testpub1.
+# N3 - Create subscription testsub2 which subscribes to testpub2.
+$node_pub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub1 FOR TABLE tab1");
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub2 FOR TABLE tab1");
+
+my $publisher1_connstr = $node_pub->connstr . ' dbname=postgres';
+my $publisher2_connstr = $node_pub_sub->connstr . ' dbname=postgres';
+
+$node_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub2 CONNECTION '$publisher2_connstr' PUBLICATION testpub2"
+);
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub1 CONNECTION '$publisher1_connstr' PUBLICATION testpub1"
+);
+
+$node_pub->safe_psql('postgres',
+	"INSERT INTO tab1 values(generate_series(1,10))");
+
+# Verify that the data is cascaded from testpub1 to testsub1 and further from
+# testpub2 (which had testsub1) to testsub2.
+$node_pub->wait_for_catchup('testsub1');
+$node_pub_sub->wait_for_catchup('testsub2');
+
+# Drop subscriptions as we don't need them anymore
+$node_pub_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub1");
+$node_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub2");
+
+# Drop publications as we don't need them anymore
+$node_pub->safe_psql('postgres', "DROP PUBLICATION testpub1");
+$node_pub_sub->safe_psql('postgres', "DROP PUBLICATION testpub2");
+
+# Clean up the tables on both publisher and subscriber as we don't need them
+$node_pub->safe_psql('postgres', "DROP TABLE
Re: [DOC] add missing "[ NO ]" to various "DEPENDS ON" synopses
On Sat, Feb 13, 2021 at 11:52, Michael Paquier wrote:
> On Fri, Feb 12, 2021 at 10:32:14AM +0900, Ian Lawrence Barwick wrote:
> > In the documentation, the "[ NO ]" option is listed in the synopsis for
> > ALTER TRIGGER and ALTER FUNCTION, but not the others.
> > Trivial patch attached.
>
> There are two flavors to cover for 6 commands per gram.y, and you are
> covering all of them. So this looks good to me. I'll apply and
> backpatch in a bit. It is worth noting that tab-complete.c does a bad
> job in completing those clauses.
> --
> Michael
>

--
EnterpriseDB: https://www.enterprisedb.com
Re: [DOC] add missing "[ NO ]" to various "DEPENDS ON" synopses
On Sat, Feb 13, 2021 at 11:52, Michael Paquier wrote:
> On Fri, Feb 12, 2021 at 10:32:14AM +0900, Ian Lawrence Barwick wrote:
> > In the documentation, the "[ NO ]" option is listed in the synopsis for
> > ALTER TRIGGER and ALTER FUNCTION, but not the others.
> > Trivial patch attached.
>
> There are two flavors to cover for 6 commands per gram.y, and you are
> covering all of them. So this looks good to me. I'll apply and
> backpatch in a bit.

Thanks! (Apologies for the preceding blank mail.)

> It is worth noting that tab-complete.c does a bad
> job in completing those clauses.

Indeed it does. Not the most exciting of use cases, though I imagine it might come in handy for anyone developing an extension, and the existing implementation is inconsistent (in place for ALTER INDEX, and partially for ALTER MATERIALIZED VIEW, but not the others). Patch suggestion attached.

Regards

Ian Barwick

--
EnterpriseDB: https://www.enterprisedb.com

commit 5ffe7228c6789cbf29084daae6aa8f4638996a98
Author: Ian Barwick
Date:   Mon Feb 15 15:27:25 2021 +0900

    psql: improve [ NO ] DEPENDS tab completion support

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 1e1c315bae..f99564937d 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1614,14 +1614,24 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER AGGREGATE,FUNCTION,PROCEDURE,ROUTINE */
 	else if (Matches("ALTER", "AGGREGATE|FUNCTION|PROCEDURE|ROUTINE", MatchAny))
 		COMPLETE_WITH("(");
-	/* ALTER AGGREGATE,FUNCTION,PROCEDURE,ROUTINE (...) */
-	else if (Matches("ALTER", "AGGREGATE|FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny))
+	/* ALTER AGGREGATE (...) */
+	else if (Matches("ALTER", "AGGREGATE", MatchAny, MatchAny))
 	{
 		if (ends_with(prev_wd, ')'))
 			COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA");
 		else
 			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
 	}
+	/* ALTER FUNCTION,PROCEDURE,ROUTINE (...) */
+	else if (Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny))
+	{
+		if (ends_with(prev_wd, ')'))
+			COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA",
+						  "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION");
+		else
+			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
+	}
+
 	/* ALTER PUBLICATION */
 	else if (Matches("ALTER", "PUBLICATION", MatchAny))
 		COMPLETE_WITH("ADD TABLE", "DROP TABLE", "OWNER TO", "RENAME TO", "SET");
@@ -1735,7 +1745,8 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX */
 	else if (Matches("ALTER", "INDEX", MatchAny))
 		COMPLETE_WITH("ALTER COLUMN", "OWNER TO", "RENAME TO", "SET",
-					  "RESET", "ATTACH PARTITION", "DEPENDS", "NO DEPENDS",
+					  "RESET", "ATTACH PARTITION",
+					  "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION",
 					  "ALTER COLLATION");
 	else if (Matches("ALTER", "INDEX", MatchAny, "ATTACH"))
 		COMPLETE_WITH("PARTITION");
@@ -1782,10 +1793,6 @@ psql_completion(const char *text, int start, int end)
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize =" /* BRIN */
 		);
-	else if (Matches("ALTER", "INDEX", MatchAny, "NO", "DEPENDS"))
-		COMPLETE_WITH("ON EXTENSION");
-	else if (Matches("ALTER", "INDEX", MatchAny, "DEPENDS"))
-		COMPLETE_WITH("ON EXTENSION");
 	/* ALTER INDEX ALTER COLLATION */
 	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLLATION"))
 	{
@@ -1919,7 +1926,7 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER MATERIALIZED VIEW */
 	else if (Matches("ALTER", "MATERIALIZED", "VIEW", MatchAny))
-		COMPLETE_WITH("ALTER COLUMN", "CLUSTER ON", "DEPENDS ON EXTENSION",
+		COMPLETE_WITH("ALTER COLUMN", "CLUSTER ON", "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION",
 					  "OWNER TO", "RENAME", "RESET (", "SET");
 	/* ALTER MATERIALIZED VIEW xxx RENAME */
 	else if (Matches("ALTER", "MATERIALIZED", "VIEW", MatchAny, "RENAME"))
@@ -1997,7 +2004,7 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER TRIGGER ON */
 	else if (Matches("ALTER", "TRIGGER", MatchAny, "ON", MatchAny))
-		COMPLETE_WITH("RENAME TO");
+		COMPLETE_WITH("RENAME TO", "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION");
 
 	/*
 	 * If we detect ALTER TABLE , suggest sub commands
Re: a misbehavior of partition row movement (?)
On Wed, Jan 20, 2021 at 7:04 PM Amit Langote wrote: > > On Wed, Jan 20, 2021 at 4:13 PM Peter Eisentraut > wrote: > > On 2021-01-08 09:54, Amit Langote wrote: > > >>> I don't quite recall if the decision to implement it like this was > > >>> based on assuming that this is what users would like to see happen in > > >>> this case or the perceived difficulty of implementing it the other way > > >>> around, that is, of firing AFTER UPDATE triggers in this case. > > >> I tried to look that up, but I couldn't find any discussion about this. > > >> Do you have any ideas in which thread that was handled? > > > It was discussed here: > > > > > > https://www.postgresql.org/message-id/flat/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com > > > > > > It's a huge discussion, so you'll have to ctrl+f "trigger" to spot > > > relevant emails. You might notice that the developers who > > > participated in that discussion gave various opinions and what we have > > > today got there as a result of a majority of them voting for the > > > current approach. Someone also said this during the discussion: > > > "Regarding the trigger issue, I can't claim to have a terribly strong > > > opinion on this. I think that practically anything we do here might > > > upset somebody, but probably any halfway-reasonable thing we choose to > > > do will be OK for most people." So what we've got is that > > > "halfway-reasonable" thing, YMMV. :) > > > > Could you summarize here what you are trying to do with respect to what > > was decided before? I'm a bit confused, looking through the patches you > > have posted. The first patch you posted hard-coded FK trigger OIDs > > specifically, other patches talk about foreign key triggers in general > > or special case internal triggers or talk about all triggers. > > The original problem statement is this: the way we generally fire > row-level triggers of a partitioned table can lead to some unexpected > behaviors of the foreign keys pointing to that partitioned table > during its cross-partition updates. > > Let's start with an example that shows how triggers are fired during a > cross-partition update: > > create table p (a numeric primary key) partition by list (a); > create table p1 partition of p for values in (1); > create table p2 partition of p for values in (2); > create or replace function report_trig_details() returns trigger as $$ > begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op = > 'DELETE' then return old; end if; return new; end; $$ language > plpgsql; > create trigger trig1 before update on p for each row execute function > report_trig_details(); > create trigger trig2 after update on p for each row execute function > report_trig_details(); > create trigger trig3 before delete on p for each row execute function > report_trig_details(); > create trigger trig4 after delete on p for each row execute function > report_trig_details(); > create trigger trig5 before insert on p for each row execute function > report_trig_details(); > create trigger trig6 after insert on p for each row execute function > report_trig_details(); > > insert into p values (1); > update p set a = 2; > NOTICE: BEFORE UPDATE on p1 > NOTICE: BEFORE DELETE on p1 > NOTICE: BEFORE INSERT on p2 > NOTICE: AFTER DELETE on p1 > NOTICE: AFTER INSERT on p2 > UPDATE 1 > > (AR update triggers are not fired.) 
> > For contrast, here is an intra-partition update: > > update p set a = a; > NOTICE: BEFORE UPDATE on p2 > NOTICE: AFTER UPDATE on p2 > UPDATE 1 > > Now, the trigger machinery makes no distinction between user-defined > and internal triggers, which has implications for the foreign key > enforcing triggers on partitions. Consider the following example: > > create table q (a bigint references p); > insert into q values (2); > update p set a = 1; > NOTICE: BEFORE UPDATE on p2 > NOTICE: BEFORE DELETE on p2 > NOTICE: BEFORE INSERT on p1 > ERROR: update or delete on table "p2" violates foreign key constraint > "q_a_fkey2" on table "q" > DETAIL: Key (a)=(2) is still referenced from table "q". > > So the RI delete trigger (NOT update) on p2 prevents the DELETE that > occurs as part of the row movement. One might make the updates > cascade and expect that to prevent the error: > > drop table q; > create table q (a bigint references p on update cascade); > insert into q values (2); > update p set a = 1; > NOTICE: BEFORE UPDATE on p2 > NOTICE: BEFORE DELETE on p2 > NOTICE: BEFORE INSERT on p1 > ERROR: update or delete on table "p2" violates foreign key constraint > "q_a_fkey2" on table "q" > DETAIL: Key (a)=(2) is still referenced from table "q". > > No luck, because again it's the RI delete trigger on p2 that gets > fired. If you make deletes cascade too, an even worse thing happens: > > drop table q; > create table q (a bigint references p on update cascade on delete cascade); > insert into q values (2); > update p set a = 1; > NOTICE: BEFORE UPDATE on p2 > N
Re: Fallback table AM for relkinds without storage
On Tue, Feb 09, 2021 at 04:27:34PM +0900, Michael Paquier wrote: > Putting sanity checks within all the table_* functions of tableam.h > would not be a good idea, as nothing prevents the call of what's > stored in rel->rd_tableam. I have been playing with this idea, and finished with the attached, which is not the sexiest patch around. The table AM used as fallback for tables without storage is called no_storage (this could be called virtual_am?). Reverting e786be5 or dd705a0 leads to an error coming from no_storage instead of a crash. One thing to note is that this simplifies a bit slot_callbacks as views, foreign tables and partitioned tables can grab their slot type directly from this new table AM. -- Michael From 0611ef60b39907a3bd405ece990d661d4d63900d Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 15 Feb 2021 16:21:01 +0900 Subject: [PATCH] Add no_storage, fallback table AM for relations without storage --- src/include/access/tableam.h | 6 +- src/include/catalog/pg_am.dat | 4 + src/include/catalog/pg_proc.dat | 4 + src/backend/access/Makefile | 5 +- src/backend/access/no_storage/Makefile| 18 + src/backend/access/no_storage/README | 9 + .../access/no_storage/no_storage_handler.c| 518 ++ src/backend/access/table/tableam.c| 33 -- src/backend/utils/cache/relcache.c| 37 +- src/test/regress/expected/create_am.out | 11 +- src/test/regress/expected/psql.out| 96 ++-- 11 files changed, 639 insertions(+), 102 deletions(-) create mode 100644 src/backend/access/no_storage/Makefile create mode 100644 src/backend/access/no_storage/README create mode 100644 src/backend/access/no_storage/no_storage_handler.c diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 33bffb6815..999e05bb89 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -832,7 +832,11 @@ typedef struct TableAmRoutine * for the relation. Works for tables, views, foreign tables and partitioned * tables. 
*/ -extern const TupleTableSlotOps *table_slot_callbacks(Relation rel); +static inline const TupleTableSlotOps * +table_slot_callbacks(Relation rel) +{ + return rel->rd_tableam->slot_callbacks(rel); +} /* * Returns slot using the callbacks returned by table_slot_callbacks(), and diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat index 6082f0e6a8..e0e32a681f 100644 --- a/src/include/catalog/pg_am.dat +++ b/src/include/catalog/pg_am.dat @@ -15,6 +15,10 @@ { oid => '2', oid_symbol => 'HEAP_TABLE_AM_OID', descr => 'heap table access method', amname => 'heap', amhandler => 'heap_tableam_handler', amtype => 't' }, +{ oid => '8381', oid_symbol => 'NO_STORAGE_TABLE_AM_OID', + descr => 'no_storage table access method', + amname => 'no_storage', amhandler => 'no_storage_tableam_handler', + amtype => 't' }, { oid => '403', oid_symbol => 'BTREE_AM_OID', descr => 'b-tree index access method', amname => 'btree', amhandler => 'bthandler', amtype => 'i' }, diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 4e0c9be58c..5a796324a5 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -905,6 +905,10 @@ proname => 'heap_tableam_handler', provolatile => 'v', prorettype => 'table_am_handler', proargtypes => 'internal', prosrc => 'heap_tableam_handler' }, +{ oid => '8382', descr => 'no_storage access method handler', + proname => 'no_storage_tableam_handler', provolatile => 'v', + prorettype => 'table_am_handler', proargtypes => 'internal', + prosrc => 'no_storage_tableam_handler' }, # Index access method handlers { oid => '330', descr => 'btree index access method handler', diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile index 0880e0a8bb..29d3937ada 100644 --- a/src/backend/access/Makefile +++ b/src/backend/access/Makefile @@ -8,7 +8,8 @@ subdir = src/backend/access top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \ - table tablesample transam +SUBDIRS = brin common gin gist hash heap index nbtree no_storage \ + rmgrdesc spgist \ + table tablesample transam include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/no_storage/Makefile b/src/backend/access/no_storage/Makefile new file mode 100644 index 00..8d1227b428 --- /dev/null +++ b/src/backend/access/no_storage/Makefile @@ -0,0 +1,18 @@ +#- +# +# Makefile-- +#Makefile for access/no_storage +# +# IDENTIFICATION +#src/backend/access/no_storage/Makefile +# +#- + +subdir = src/backend/access/no_storage +top_builddir = ../../../.. +include $(top_bui
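As one concrete illustration of why this simplifies slot_callbacks: a fallback AM can answer the slot-type question uniformly for all storage-less relkinds. The sketch below is a guess at the shape of the corresponding callback in no_storage_handler.c, not a quote of the patch.

    #include "executor/tuptable.h"
    #include "utils/rel.h"

    static const TupleTableSlotOps *
    no_storage_slot_callbacks(Relation relation)
    {
        /* Views, foreign tables and partitioned tables never produce
         * on-disk heap tuples, so a virtual slot fits all of them. */
        return &TTSOpsVirtual;
    }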
Re: Parallel INSERT (INTO ... SELECT ...)
On Sat, Feb 13, 2021 at 12:17 AM Amit Langote wrote: > > On Thu, Feb 11, 2021 at 4:43 PM Greg Nancarrow wrote: > > On Thu, Feb 11, 2021 at 5:33 PM Greg Nancarrow wrote: > > > On Tue, Feb 9, 2021 at 1:04 AM Amit Langote > > > wrote: > > > > > > > > * I think that the concerns raised by Tsunakawa-san in: > > > > > > > > https://www.postgresql.org/message-id/TYAPR01MB2990CCB6E24B10D35D28B949FEA30%40TYAPR01MB2990.jpnprd01.prod.outlook.com > > > > > > > > regarding how this interacts with plancache.c deserve a look. > > > > Specifically, a plan that uses parallel insert may fail to be > > > > invalidated when partitions are altered directly (that is without > > > > altering their root parent). That would be because we are not adding > > > > partition OIDs to PlannerGlobal.invalItems despite making a plan > > > > that's based on checking their properties. See this (tested with all > > > > patches applied!): > > > > > > > > > > Does any current Postgres code add partition OIDs to > > > PlannerGlobal.invalItems for a similar reason? > > Currently, the planner opens partitions only for SELECT queries and > also adds them to the query's range table. And because they are added > to the range table, their OIDs do get added to > PlannerGlobal.relationOids (not invalItems, sorry!) by way of > CompleteCachedPlan() calling extract_query_dependencies(), which looks > at Query.rtable to decide which tables/partitions to add. > > > > I would have thought that, for example, partitions with a default > > > column expression, using a function that is changed from SAFE to > > > UNSAFE, would suffer the same plancache issue (for current parallel > > > SELECT functionality) as we're talking about here - but so far I > > > haven't seen any code handling this. > > AFAIK, default column expressions don't affect plans for SELECT > queries. OTOH, consider a check constraint expression as an example. > The planner may use one to exclude a partition from the plan with its > constraint exclusion algorithm (separate from "partition pruning"). > If the check constraint is dropped, any cached plans that used it will > be invalidated. > Sorry, I got that wrong, default column expressions are relevant to INSERT, not SELECT. > > > > Actually, I tried adding the following in the loop that checks the > > parallel-safety of each partition and it seemed to work: > > > > glob->relationOids = > > lappend_oid(glob->relationOids, pdesc->oids[i]); > > > > Can you confirm, is that what you were referring to? > > Right. I had mistakenly mentioned PlannerGlobal.invalItems, sorry. > > Although it gets the job done, I'm not sure if manipulating > relationOids from max_parallel_hazard() or its subroutines is okay, > but I will let the committer decide that. As I mentioned above, the > person who designed this decided for some reason that it is > extract_query_dependencies()'s job to populate > PlannerGlobal.relationOids/invalItems. > Yes, it doesn't really seem right doing it within max_parallel_hazard(). I tried doing it in extract_query_dependencies() instead - see attached patch - and it seems to work, but I'm not sure if there might be any unintended side-effects. Regards, Greg Nancarrow Fujitsu Australia setrefs.patch Description: Binary data
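For readers skimming the thread, the idea in the attached setrefs.patch amounts to something like the following fragment inside extract_query_dependencies()'s walker. This is hedged: the field names follow PartitionDesc and PlannerGlobal, but the exact placement and surrounding code are illustrative, not the patch itself.

    /* Record every partition whose parallel-safety was examined as a
     * plan dependency, so that ALTERing a partition directly invalidates
     * cached plans that relied on its properties. */
    for (int i = 0; i < pdesc->nparts; i++)
        context->glob->relationOids =
            lappend_oid(context->glob->relationOids, pdesc->oids[i]);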