Re: Replication slot stats misgivings
On Fri, Apr 16, 2021 at 2:58 PM Amit Kapila wrote:
>
> On Thu, Apr 15, 2021 at 4:35 PM Masahiko Sawada wrote:
> >
> > Thank you for the update! The patch looks good to me.
>

BTW, regarding the commit f5fc2f5b23 that added total_txns and total_bytes: we add the reorder buffer size (i.e., rb->size) to rb->totalBytes, but I think we should use the transaction size (i.e., txn->size) instead:

@@ -1363,6 +1365,11 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
    dlist_delete(&change->node);
    dlist_push_tail(&state->old_change, &change->node);

+   /*
+    * Update the total bytes processed before releasing the current set
+    * of changes and restoring the new set of changes.
+    */
+   rb->totalBytes += rb->size;
    if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
                                    &state->entries[off].segno))
    {

@@ -2363,6 +2370,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
    ReorderBufferIterTXNFinish(rb, iterstate);
    iterstate = NULL;

+   /*
+    * Update total transaction count and total transaction bytes
+    * processed. Ensure to not count the streamed transaction multiple
+    * times.
+    *
+    * Note that the statistics computation has to be done after
+    * ReorderBufferIterTXNFinish as it releases the serialized change
+    * which we have already accounted in ReorderBufferIterTXNNext.
+    */
+   if (!rbtxn_is_streamed(txn))
+       rb->totalTxns++;
+
+   rb->totalBytes += rb->size;
+

IIUC rb->size could include multiple decoded transactions, so it's not appropriate to add that value to the counter as the transaction size passed to the logical decoding plugin. If the reorder buffer processes a transaction while a large transaction is also being decoded, we could end up increasing txn_bytes more than necessary.

Please review the attached patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c27f710053..30d9ffa0da 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1369,7 +1369,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
     * Update the total bytes processed before releasing the current set
     * of changes and restoring the new set of changes.
     */
-   rb->totalBytes += rb->size;
+   rb->totalBytes += entry->txn->size;
    if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
                                    &state->entries[off].segno))
    {
@@ -2382,7 +2382,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
    if (!rbtxn_is_streamed(txn))
        rb->totalTxns++;

-   rb->totalBytes += rb->size;
+   rb->totalBytes += txn->size;

    /*
     * Done with current changes, send the last message for this set of
Re: [BUG] "FailedAssertion" reported when streaming in logical replication
On Wed, Apr 28, 2021 at 12:25 PM tanghy.f...@fujitsu.com wrote:
>
> > I have modified the patch based on the above comments.
>
> Thanks for your patch.
> I tested again after applying your patch and the problem is fixed.

Thanks for confirming.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 00:39, Amit Khandekar wrote:
> If planned parallel workers do not get launched, the Result Cache plan
> node shows all-0 stats for each of those workers:

Thanks for reporting this and for the patch. You're right that there is a problem here.

I did in fact have code to skip workers that didn't have any cache misses right up until v18 of the patch [1], but I removed it because I was getting some test instability in the partition_prune regression tests. That was highlighted by the CFbot machines. I mentioned that in the final paragraph of [2]. I didn't mention the exact test there, but I was talking about the test in partition_prune.sql.

By the time it came to b6002a796, I did end up changing the partition_prune tests. However, I had to back that version out again because of some problems with force_parallel_mode = regress buildfarm animals. By the time I re-committed Result Cache in 9eacee2e6, I had changed the partition_prune tests so they did SET enable_resultcache = 0 before running that parallel test. I'd basically decided that the test was never going to be stable on the buildfarm. The problem there was that the results would vary depending on whether the parallel worker managed to do anything before the main process did all the work. Because the tests are pretty small scale, many machines wouldn't manage to get their worker(s) in gear and running before the main process had finished the test. This was the reason I removed the cache_misses == 0 test in explain.c. I'd thought that I could make that test stable by just always outputting the cache stats line from workers, even if they didn't assist during execution.

So, given that I removed the parallel test in partition_prune.sql, and don't have any EXPLAIN ANALYZE output for parallel tests in resultcache.sql, it should be safe enough to put that cache_misses == 0 test back into explain.c.

I've attached a patch to do this. The explain.c part is pretty similar to your patch; I just took my original code and comment. The patch also removes the SET force_parallel_mode = off in resultcache.sql. That's no longer needed due to adding this check in explain.c again. I also removed the changes I made to the explain_parallel_append function in partition_prune.sql. I shouldn't have included those in 9eacee2e6.

I plan to push this in the next 24 hours or so.

David

[1] https://postgr.es/m/caaphdvoomttnpof-+q1daoma8vwfsfbyqb_a0iukts5nf2d...@mail.gmail.com
[2] https://postgr.es/m/CAApHDvrz4f+i1wu-8hyqJ=pxydroga5okgo5rwpoj47rz6q...@mail.gmail.com

resultcache_explain_fix.patch
Description: Binary data
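For anyone following along without opening the attachment, the re-added check amounts to skipping any worker that recorded no cache misses when printing the per-worker stats. A rough sketch of the shape of that code (written from memory here, so treat the exact struct and field names as assumptions rather than the committed text):

    int         n;

    for (n = 0; n < rcstate->shared_info->num_workers; n++)
    {
        ResultCacheInstrumentation *si = &rcstate->shared_info->sinstrument[n];

        /*
         * Skip workers that didn't do anything.  A worker that was planned
         * but never launched cannot have had a cache miss, so there is
         * nothing useful to report for it.
         */
        if (si->cache_misses == 0)
            continue;

        /* ... print the per-worker Hits/Misses/Evictions line as before ... */
    }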
Some dubious error messages and comments
Hello.

0001: I found some typos in an error message and a comment.

multirangetypes.c: 1420
> errmsg("range_intersect_agg must be called with a multirange")));

This "range_intersect_agg" looks like a typo of "multirange_..".

operatorcmds.c:303
> * Look up a join estimator function ny name, and verify that it has the

"ny" looks like a typo of "by".

0002: The following messages are substantially the same and are uselessly split into separate messages. I'm not sure whether any compiler complains about using %zu for an int; explicit casting would work in that case.

be-secure-gssapi.c:351
> (errmsg("oversize GSSAPI packet sent by the client (%zu > %zu)",
>         (size_t) input.length,
>         PQ_GSS_RECV_BUFFER_SIZE - sizeof(uint32))));

be-secure-gssapi.c:570
> (errmsg("oversize GSSAPI packet sent by the client (%zu > %d)",
>         (size_t) input.length,
>         PQ_GSS_RECV_BUFFER_SIZE)));

0003: The messages below seem to be a bit unclear. I'm not sure they are worth doing.

conversioncmds.c: 130
> errmsg("encoding conversion function %s returned incorrect result for empty input",

This is not wrong at all, but another message just above it says "encoding conversion function %s must return type %s". Why aren't we explicit here, like this?

"encoding conversion function %s must return zero for empty input"

typecmds.c:4294
> if (requireSuper)
>     if (!superuser())
>         ereport(ERROR,
>                  errmsg("must be superuser to alter a type")));

Here, requireSuper varies depending on the set of operations, but the description reads as if it were describing general behavior. I'm not sure, but something like the following might be better?

+errmsg("must be superuser to perform all operations")));
+errmsg("some of the operations require superuser privilege")));

Any opinions or suggestions?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 1634d8dd87b1474f60cfd6689c1e58acc87c1cfc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Wed, 28 Apr 2021 17:23:52 +0900
Subject: [PATCH 1/3] Fix typos

---
 src/backend/commands/operatorcmds.c     | 2 +-
 src/backend/utils/adt/multirangetypes.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/operatorcmds.c b/src/backend/commands/operatorcmds.c
index 809043c5d1..ba456c4dd1 100644
--- a/src/backend/commands/operatorcmds.c
+++ b/src/backend/commands/operatorcmds.c
@@ -300,7 +300,7 @@ ValidateRestrictionEstimator(List *restrictionName)
 }

 /*
- * Look up a join estimator function ny name, and verify that it has the
+ * Look up a join estimator function by name, and verify that it has the
  * correct signature and we have the permissions to attach it to an
  * operator.
  */
diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index 0b81649779..2583ddeedf 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -1417,7 +1417,7 @@ multirange_intersect_agg_transfn(PG_FUNCTION_ARGS)
 	if (!type_is_multirange(mltrngtypoid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATATYPE_MISMATCH),
-				 errmsg("range_intersect_agg must be called with a multirange")));
+				 errmsg("multirange_intersect_agg must be called with a multirange")));

 	typcache = multirange_get_typcache(fcinfo, mltrngtypoid);
--
2.27.0

>From 14a6b4b81d3b4d35bf690e1c240975bd888531f3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Wed, 28 Apr 2021 17:24:31 +0900
Subject: [PATCH 2/3] Consolidate substantially same messages

---
 src/backend/libpq/be-secure-gssapi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
index 316ca65db5..a335f78d47 100644
--- a/src/backend/libpq/be-secure-gssapi.c
+++ b/src/backend/libpq/be-secure-gssapi.c
@@ -567,9 +567,9 @@ secure_open_gssapi(Port *port)
 	if (input.length > PQ_GSS_RECV_BUFFER_SIZE)
 	{
 		ereport(COMMERROR,
-				(errmsg("oversize GSSAPI packet sent by the client (%zu > %d)",
+				(errmsg("oversize GSSAPI packet sent by the client (%zu > %zu)",
 						(size_t) input.length,
-						PQ_GSS_RECV_BUFFER_SIZE)));
+						(size_t) PQ_GSS_RECV_BUFFER_SIZE)));
 		return -1;
 	}
--
2.27.0

>From 0db041334e0fc4df46f1d98e75e984b04777befe Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Wed, 28 Apr 2021 17:25:36 +0900
Subject: [PATCH 3/3] Clarify merror messages

---
 src/backend/commands/conversioncmds.c | 2 +-
 src/backend/commands/typecmds.c       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/conversioncmds.c b/src/backend/commands/conversioncmds.c
index 5fed97a2f9..307afb909d 100644
--- a/src/backend/commands/conversioncmds.c
+++ b/src/backend/commands/conversioncmds.c
@@ -127,7 +127,7 @@ Create
Re: Better sanity checking for message length words
Hi Tom,

> [ scratches head ... ] The patch still applies perfectly cleanly
> for me, using either "patch" or "git apply".

Sorry, my bad. It was a matter of line endings differing between platforms. The patch is fine and passes installcheck-world on macOS.

--
Best regards,
Aleksander Alekseev
"Multiple table synchronizations are processed serially" still happens
Hi,

One of my customers has an issue with logical replication. As $SUBJECT says, multiple table synchronization happens serially. To be honest, it doesn't do this every time. It happens when the tables are big enough.

This issue was already described on this thread (from 2017):
https://www.postgresql.org/message-id/flat/CAD21AoC2KJdavS7MFffmSsRc1dn3Vg_0xmuc=upbrz-_mux...@mail.gmail.com

This thread was closed by a commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6c2003f8a1bbc7c192a2e83ec51581c018aa162f)
which apparently fixed the issue for the OP.

Attached is a small test case where it still happens for me on 12.6, 11.11, and 10.16. I can't make it happen on 13.2. I don't know why. It may imply bigger tables for 13.2, but why? I simply don't know.

Anyway, the issue at the end of the test case is that synchronizations sometimes happen serially. You can see on the process list that one walsender process is waiting in "idle in transaction" state:

guillau+  486822    2227  0 10:44 ?  00:00:00 /opt/postgresql/12/bin/postgres
guillau+  486824  486822  0 10:44 ?  00:00:01 postgres: testcase: checkpointer
guillau+  486825  486822  0 10:44 ?  00:00:04 postgres: testcase: background writer
guillau+  486826  486822  1 10:44 ?  00:00:06 postgres: testcase: walwriter
guillau+  486827  486822  0 10:44 ?  00:00:00 postgres: testcase: autovacuum launcher
guillau+  486828  486822  0 10:44 ?  00:00:00 postgres: testcase: stats collector
guillau+  486829  486822  0 10:44 ?  00:00:00 postgres: testcase: logical replication launcher
guillau+  489822  486822  0 10:55 ?  00:00:00 postgres: testcase: logical replication worker for subscription 16436
guillau+  489824  486822 10 10:55 ?  00:00:01 postgres: testcase: walsender repuser ::1(38770) idle
guillau+  489825  486822 22 10:55 ?  00:00:02 postgres: testcase: logical replication worker for subscription 16436 sync 16416
guillau+  489826  486822  8 10:55 ?  00:00:00 postgres: testcase: walsender repuser ::1(38772) COPY
guillau+  489827  486822  0 10:55 ?  00:00:00 postgres: testcase: logical replication worker for subscription 16436 sync 16427
guillau+  489828  486822  0 10:55 ?  00:00:00 postgres: testcase: walsender repuser ::1(38774) idle in transaction waiting

And the log says (from the start of the subscription):

2021-04-28 10:55:32.337 CEST [489822] LOG: logical replication apply worker for subscription "sub" has started
2021-04-28 10:55:32.342 CEST [489824] LOG: duration: 0.426 ms statement: SELECT pg_catalog.set_config('search_path', '', false);
2021-04-28 10:55:32.342 CEST [489824] LOG: received replication command: IDENTIFY_SYSTEM
2021-04-28 10:55:32.342 CEST [489824] LOG: received replication command: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '1', publication_names '"pub"')
2021-04-28 10:55:32.342 CEST [489824] LOG: starting logical decoding for slot "sub"
2021-04-28 10:55:32.342 CEST [489824] DETAIL: Streaming transactions committing after 1/FF5D8130, reading WAL from 1/FF5D80F8.
2021-04-28 10:55:32.342 CEST [489824] LOG: logical decoding found consistent point at 1/FF5D80F8
2021-04-28 10:55:32.342 CEST [489824] DETAIL: There are no running transactions.
2021-04-28 10:55:32.345 CEST [489825] LOG: logical replication table synchronization worker for subscription "sub", table "foo" has started
2021-04-28 10:55:32.348 CEST [489826] LOG: duration: 0.315 ms statement: SELECT pg_catalog.set_config('search_path', '', false);
2021-04-28 10:55:32.349 CEST [489826] LOG: duration: 0.041 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ
2021-04-28 10:55:32.349 CEST [489826] LOG: received replication command: CREATE_REPLICATION_SLOT "sub_16436_sync_16416" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2021-04-28 10:55:32.355 CEST [489827] LOG: logical replication table synchronization worker for subscription "sub", table "bar" has started
2021-04-28 10:55:32.359 CEST [489828] LOG: duration: 0.431 ms statement: SELECT pg_catalog.set_config('search_path', '', false);
2021-04-28 10:55:32.359 CEST [489828] LOG: duration: 0.048 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ
2021-04-28 10:55:32.360 CEST [489828] LOG: received replication command: CREATE_REPLICATION_SLOT "sub_16436_sync_16427" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2021-04-28 10:55:32.407 CEST [489826] LOG: logical decoding found consistent point at 1/FF602880
2021-04-28 10:55:32.407 CEST [489826] DETAIL: There are no running transactions.
2021-04-28 10:55:32.409 CEST [489826] LOG: duration: 1.262 ms statement: SELECT c.oid, c.relreplident FROM pg_catalog.pg_class c INNER JOIN pg_catalog.pg_namespace n ON (c.relnamespace = n.oid) WHERE n.nspname = 's01' AND c.relname = 'foo' AND c.relkind = 'r'
2021-04-28 10:55:32.410 CEST [489826] LOG: duration: 1.347 ms statement: SELECT a.attname, a.atttypid, a.atttypmod, a.a
Re: "Multiple table synchronizations are processed serially" still happens
On Wed, Apr 28, 2021 at 11:12, Guillaume Lelarge wrote:
> Hi,
>
> One of my customers has an issue with logical replication. As $SUBJECT
> says, multiple table synchronization happens serially. To be honest, it
> doesn't do this every time. It happens when the tables are big enough.
>
> This issue was already described on this thread (from 2017):
> https://www.postgresql.org/message-id/flat/CAD21AoC2KJdavS7MFffmSsRc1dn3Vg_0xmuc=upbrz-_mux...@mail.gmail.com
>
> This thread was closed by a commit
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6c2003f8a1bbc7c192a2e83ec51581c018aa162f)
> which apparently fixed the issue for the OP.
>
> Attached is a small test case where it still happens for me on 12.6,
> 11.11, and 10.16. I can't make it happen on 13.2. I don't know why. It may
> imply bigger tables for 13.2, but why? I simply don't know.
>

Actually, it also happens on 13.2.

--
Guillaume.
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, Apr 28, 2021 at 1:54 PM David Rowley wrote:
> I plan to push this in the next 24 hours or so.

I happened to see explain_resultcache in resultcache.sql, and it seems like two of the tests still have numbers for cache hits and misses (Hits: 980 Misses: 20). Won't these make the tests unstable? Will these numbers be the same across machines? Or is it that no buildfarm member has caught these? The comment below says that the hits and misses are not the same across machines:

-- Ensure we get some evictions. We're unable to validate the hits and misses
-- here as the number of entries that fit in the cache at once will vary
-- between different machines.

Should we remove the hide_hitmiss parameter in explain_resultcache and always print N for non-zero and Zero for 0 hits and misses? That clearly shows whether we have zero or non-zero hits or misses.

Am I missing something?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Replication slot stats misgivings
On Wed, Apr 28, 2021 at 12:49 PM Masahiko Sawada wrote:
>
> BTW, regarding the commit f5fc2f5b23 that added total_txns and
> total_bytes: we add the reorder buffer size (i.e., rb->size) to
> rb->totalBytes, but I think we should use the transaction size (i.e.,
> txn->size) instead:
>

You are right about the problem, but I think your proposed fix also won't work, because txn->size always has the current transaction's size, which will be the top-transaction in the case where a transaction has multiple subtransactions. It won't include the subtxn->size. For example, you can try to decode the below kind of transaction:

Begin;
insert into t1 values(1);
savepoint s1;
insert into t1 values(2);
savepoint s2;
insert into t1 values(3);
commit;

I think we can fix it by keeping track of total_size in the toptxn, as we are doing for the streaming case in ReorderBufferChangeMemoryUpdate. We can probably do it for non-streaming cases as well.

--
With Regards,
Amit Kapila.
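To illustrate the idea (just a sketch, not a tested patch; the exact placement inside ReorderBufferChangeMemoryUpdate() is an assumption), each change's size could also be accumulated into the top-level transaction, so that the total covers subtransactions too:

    ReorderBufferTXN *toptxn = txn->toptxn ? txn->toptxn : txn;
    Size        sz = ReorderBufferChangeSize(change);

    if (addition)
    {
        txn->size += sz;
        rb->size += sz;
        /* also accumulate into the top-level transaction */
        toptxn->total_size += sz;
    }
    else
    {
        Assert((rb->size >= sz) && (txn->size >= sz));
        txn->size -= sz;
        rb->size -= sz;
        toptxn->total_size -= sz;
    }

ReorderBufferProcessTXN() could then add toptxn->total_size to rb->totalBytes instead of rb->size.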
Re: [PATCH] We install pg_regress and isolationtester but not pg_isolation_regress
Hi hackers,

> Any opinions on backpatching?

> Other than src/test/modules/brin, the ISOLATION users don't look
> much like real extensions (rather than test scaffolding), either.
> Out of core, the only thing I can see with isolation tests is rum, but
> it uses a workaround to have an access to the necessary binaries.

I just wanted to let you know that TimescaleDB uses pg_isolation_regress and occasionally there are reports from some suffering/puzzled users/developers, e.g. [1]. Not 100% sure if it makes investing the time into backpatching worth it. However, if someone could do it, it would be nice.

[1]: https://github.com/timescale/timescaledb/issues/1655

--
Best regards,
Aleksander Alekseev
Re: Built-in connection pooler
Konstantin Knizhnik wrote:

> People asked me to resubmit built-in connection pooler patch to commitfest.
> Rebased version of connection pooler is attached.

I've reviewed the patch but haven't read the entire thread thoroughly. I hope that I don't duplicate many comments posted earlier by others.

(Please note that the patch does not apply to the current master; I had to reset the head of my repository to the appropriate commit.)

Documentation / user interface
------------------------------

* session_pool_size (config.sgml)

I wonder if "The default value is 10, so up to 10 backends will serve each database," should rather be "The default value is 10, so up to 10 backends will serve each database/user combination."

However, when I read the code, I think that each proxy counts the size of the pool separately, so the total number of backends used for a particular database/user combination seems to be

    session_pool_size * connection_proxies

Since the feature uses shared memory statistics anyway, shouldn't it only count the total number of backends per database/user? It would need some locking, but the actual pools (hash tables) could still be local to the proxy processes.

* connection_proxies

(I've noticed that Ryan Lambert questioned this variable upthread.)

I think this variable makes the configuration less straightforward from the user perspective. Cannot the server launch additional proxies dynamically, as needed, e.g. based on the shared memory statistics that the patch introduces? I see that postmaster would have to send the sockets in a different way. How about adding a "proxy launcher" process that would take care of the scheduling and launching of new proxies?

* multitenant_proxy

I thought the purpose of this setting is to reduce the number of backends needed, but could not find evidence in the code. In particular, client_attach() always retrieves the backend from the appropriate pool, and backend_reschedule() does so as well. Thus the role of both client and backend should always match. What piece of information am I missing?

* typo (2 occurrences in config.sgml)

"stanalone" -> "standalone"

Design / coding
---------------

* proxy.c:backend_start() does not change the value of the "host" parameter to the socket directory, so I assume the proxy connects to the backend via the TCP protocol. I think the unix socket should be preferred for this connection if the platform has it, however:

* is libpq necessary for the proxy to connect to the backend at all? Maybe postgres.c:ReadCommand() can be adjusted so that the backend can communicate with the proxy just via the plain socket. I don't like the idea of server components communicating via libpq (do we need anything else of the libpq connection than the socket?) as such, but especially these includes in proxy.c look weird:

    #include "../interfaces/libpq/libpq-fe.h"
    #include "../interfaces/libpq/libpq-int.h"

* How does the proxy recognize connections to the walsender? I haven't tested that, but it's obvious that these connections should not be proxied.

* ConnectionProxyState is in shared memory, so access to its fields should be synchronized.

* StartConnectionProxies() is only called from PostmasterMain(), so I'm not sure the proxies get restarted after a crash. Perhaps PostmasterStateMachine() needs to call it too, after calling StartupDataBase().

* Why do you need the Channel.magic integer field? Wouldn't a boolean field "active" be sufficient?

* In proxy_loop(), I've noticed (chan->magic == ACTIVE_CHANNEL_MAGIC) tests inside the branch

    else if (chan->magic == ACTIVE_CHANNEL_MAGIC)

Since neither channel_write() nor channel_read() seem to change the value, I think those tests are not necessary.

* Comment lines are often too long.

* pgindent should be applied to the patch at some point.

I can spend more time reviewing the patch during the next CF.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Re: Some oversights in query_id calculation
Hi Julien,

> I'm attaching a patch that fixes those, with regression tests to reproduce each
> problem.

I believe something could be not quite right with the patch. Here is what I did:

$ git apply ...
# revert the changes in the code but keep the new tests
$ git checkout src/backend/utils/misc/queryjumble.c
$ ./full-build.sh && single-install.sh && make installcheck-world

... where the named .sh scripts are something I use to quickly check a patch [1].

I was expecting that several tests would fail, but they didn't. Maybe I missed something?

[1]: https://github.com/afiskon/pgscripts

--
Best regards,
Aleksander Alekseev
Re: Parallel INSERT SELECT take 2
On Wed, Apr 28, 2021 at 12:44 PM houzj.f...@fujitsu.com wrote:
>
> 0003:
> 1) Temporarily, add the check of built-in functions by adding a member for
> proparallel in FmgrBuiltin.
> To avoid enlarging the FmgrBuiltin struct, change the existing bool members
> strict and retset into one member of type char, and represent the original
> values with some bit flags.
>

I was thinking that it would be better to replace the two bool members with one "unsigned char" member for the bitflags for strict and retset, and another "char" member for parallel. The struct would still remain the same size as it originally was, and you wouldn't need to convert between bit settings and char ('u'/'r'/'s') each time a built-in function was checked for parallel-safety in fmgr_info().

> Note: this will lock down the parallel property of built-in functions, but I
> think the parallel safety of a built-in function is related to its C code;
> users should not change the property of it unless they change its code. So,
> I think it might be better to disallow changing parallel safety for built-in
> functions. Thoughts?
>

I'd vote for disallowing it (unless someone can justify why it currently is allowed).

> I have not added the parallel safety check in the ALTER/CREATE TABLE PARALLEL
> DML SAFE command. I think it seems better to add it after some more discussion.
>

I'd vote for not adding such a check (as this is a declaration).

Some additional comments:

1) In the patch 0002 comment, it says:

    This property is recorded in pg_class's relparallel column as 'u', 'r', or 's',
    just like pg_proc's proparallel. The default is UNSAFE.

It should say "relparalleldml" column.

2) With the patches applied, I seem to be getting a couple of errors when running "make installcheck-world" with force_parallel_mode=regress in effect. Can you please try it?

Regards,
Greg Nancarrow
Fujitsu Australia
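For illustration, the layout being suggested would look roughly like this (a sketch only; the flag macro names here are invented, not taken from the patch):

    typedef struct
    {
        Oid         foid;           /* OID of the function */
        short       nargs;          /* 0..FUNC_MAX_ARGS, or -1 if variable count */
        unsigned char flags;        /* bit flags: FMGRBUILTIN_STRICT | FMGRBUILTIN_RETSET */
        char        parallel;       /* 'u', 'r' or 's', as in pg_proc.proparallel */
        const char *funcName;       /* C name of the function */
        PGFunction  func;           /* pointer to compiled function */
    } FmgrBuiltin;

    #define FMGRBUILTIN_STRICT  0x01
    #define FMGRBUILTIN_RETSET  0x02

The point being that the proparallel value can then be read directly as a char at lookup time, with no bit-flag decoding needed.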
Re: Some oversights in query_id calculation
Hi Aleksander,

On Wed, Apr 28, 2021 at 01:19:36PM +0300, Aleksander Alekseev wrote:
> Hi Julien,
>
> > I'm attaching a patch that fixes those, with regression tests to reproduce each
> > problem.
>
> I believe something could be not quite right with the patch. Here is what I did:
>
> $ git apply ...
> # revert the changes in the code but keep the new tests
> $ git checkout src/backend/utils/misc/queryjumble.c
> $ ./full-build.sh && single-install.sh && make installcheck-world
>
> ... where the named .sh scripts are something I use to quickly check a patch [1].
>
> I was expecting that several tests would fail, but they didn't. Maybe I
> missed something?

I think it's because installcheck-* don't run pg_stat_statements' tests, see its Makefile:

> # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
> # which typical installcheck users do not have (e.g. buildfarm clients).
> NO_INSTALLCHECK = 1

You should see failures doing a check-world or simply a make -C contrib/pg_stat_statements check
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 15:08, Bharath Rupireddy wrote:
>
> On Wed, Apr 28, 2021 at 1:54 PM David Rowley wrote:
> > I plan to push this in the next 24 hours or so.
>
> I happened to see explain_resultcache in resultcache.sql, and it seems like two
> of the tests still have numbers for cache hits and misses (Hits: 980
> Misses: 20). Won't these make the tests unstable? Will these numbers be
> the same across machines? Or is it that no buildfarm member has caught these?
> The comment below says that the hits and misses are not the same across
> machines:
> -- Ensure we get some evictions. We're unable to validate the hits and misses
> -- here as the number of entries that fit in the cache at once will vary
> -- between different machines.
>
> Should we remove the hide_hitmiss parameter in explain_resultcache and
> always print N for non-zero and Zero for 0 hits and misses? That
> clearly shows whether we have zero or non-zero hits or misses.
>
> Am I missing something?

I believe the assumption here is that, with no workers involved, the cache misses and hits are guaranteed to be exactly the same everywhere.
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 21:38, Bharath Rupireddy wrote:
>
> On Wed, Apr 28, 2021 at 1:54 PM David Rowley wrote:
> > I plan to push this in the next 24 hours or so.
>
> I happened to see explain_resultcache in resultcache.sql, and it seems like two
> of the tests still have numbers for cache hits and misses (Hits: 980
> Misses: 20). Won't these make the tests unstable? Will these numbers be
> the same across machines? Or is it that no buildfarm member has caught these?
> The comment below says that the hits and misses are not the same across
> machines:
> -- Ensure we get some evictions. We're unable to validate the hits and misses
> -- here as the number of entries that fit in the cache at once will vary
> -- between different machines.

The only reason it would be unstable is if there are cache evictions. Evictions will only happen if the cache fills up and we need to make way for new entries. A 32-bit machine, for example, will use slightly less memory for caching items, so the number of evictions is going to be a bit lower on those machines. Having an unstable number of evictions will cause the hits and misses to be unstable too.

Otherwise, the number of misses is predictable: it'll be the number of distinct sets of parameters that we look up in the cache. Any repeats will be a hit. So hits plus misses should just add up to the number of times that a normal parameterized nested loop would execute the inner side, and that's predictable too. It would only change if you change the query or the data in the table.

> Should we remove the hide_hitmiss parameter in explain_resultcache and
> always print N for non-zero and Zero for 0 hits and misses? That
> clearly shows whether we have zero or non-zero hits or misses.

I added that because if there are no evictions then the hits and misses should be perfectly stable, providing the test is small enough not to exceed work_mem and fill the cache. If someone were to run the tests with a small work_mem, then there would be no shortage of other tests that would fail due to plan changes. These tests were designed to be small enough that there's no danger of getting close to work_mem and filling the cache.

However, I did add 1 test that sets work_mem down to 64kB to ensure the eviction code does get some exercise. You'll notice that I pass "true" to explain_resultcache() to hide the hits and misses there. We can't test the exact number of hits/misses/evictions there, but we can at least tell apart the zero and non-zero cases by how I coded explain_resultcache() to replace the number with Zero or N depending on whether it was zero or above zero.

David
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 16:14, David Rowley wrote:
> However, I did add 1 test that sets work_mem down to 64kB to ensure
> the eviction code does get some exercise. You'll notice that I pass
> "true" to explain_resultcache() to hide the hits and misses there. We
> can't test the exact number of hits/misses/evictions there, but we can
> at least tell apart the zero and non-zero cases by how I coded
> explain_resultcache() to replace the number with Zero or N depending on
> whether it was zero or above zero.

Thanks for the explanation. I did realize after replying to Bharath upthread that I was wrong in assuming that the cache misses and cache hits are always stable for non-parallel scans.
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 13:54, David Rowley wrote:
> So, given that I removed the parallel test in partition_prune.sql, and
> don't have any EXPLAIN ANALYZE output for parallel tests in
> resultcache.sql, it should be safe enough to put that cache_misses ==
> 0 test back into explain.c
>
> I've attached a patch to do this. The explain.c part is pretty similar
> to your patch; I just took my original code and comment.

Sounds good. And thanks for the cleanup patch, and the brief history. Patch looks ok to me.
Re: Replication slot stats misgivings
On Wed, Apr 28, 2021 at 6:39 PM Amit Kapila wrote:
>
> On Wed, Apr 28, 2021 at 12:49 PM Masahiko Sawada wrote:
> >
> > BTW, regarding the commit f5fc2f5b23 that added total_txns and
> > total_bytes: we add the reorder buffer size (i.e., rb->size) to
> > rb->totalBytes, but I think we should use the transaction size (i.e.,
> > txn->size) instead:
> >
>
> You are right about the problem, but I think your proposed fix also
> won't work, because txn->size always has the current transaction's size,
> which will be the top-transaction in the case where a transaction has
> multiple subtransactions. It won't include the subtxn->size.

Right. I missed the point that ReorderBufferProcessTXN() processes subtransactions as well.

> I think we can fix it by keeping track of total_size in the toptxn, as we
> are doing for the streaming case in ReorderBufferChangeMemoryUpdate.
> We can probably do it for non-streaming cases as well.

Agreed.

I've updated the patch. What do you think?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

use_total_size_v2.patch
Description: Binary data
Re: Replication slot stats misgivings
On Wed, Apr 28, 2021 at 4:51 PM Masahiko Sawada wrote:
>
> On Wed, Apr 28, 2021 at 6:39 PM Amit Kapila wrote:
> >
> > I think we can fix it by keeping track of total_size in the toptxn, as we
> > are doing for the streaming case in ReorderBufferChangeMemoryUpdate.
> > We can probably do it for non-streaming cases as well.
>
> Agreed.
>
> I've updated the patch. What do you think?
>

@@ -1369,7 +1369,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
  * Update the total bytes processed before releasing the current set
  * of changes and restoring the new set of changes.
  */
- rb->totalBytes += rb->size;
+ rb->totalBytes += entry->txn->total_size;
  if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
                                  &state->entries[off].segno))

I have not tested this, but in the above change won't you need to check txn->toptxn for subtxns?

--
With Regards,
Amit Kapila.
Re: Error in libpq docs for target_session_attrs, prefer-standby mode
On Wed, 2021-04-28 at 12:55 +1000, Greg Nancarrow wrote:
> I spotted an error in the development version documentation for
> libpq's connection parameter "target_session_attrs" (34.1.2 Parameter
> Key Words).
> In the description for the "prefer-standby" mode, it says "... but if
> none of the listed hosts is a standby server, try again in all mode".
> There is no such "all" mode. It should instead say "any" mode.
> Patch is attached.

You are right, and the patch is good.

Yours,
Laurenz Albe
RE: Forget close an open relation in ReorderBufferProcessTXN()
On Monday, April 26, 2021 2:05 PM Amit Kapila wrote:
> On Fri, Apr 23, 2021 at 8:03 PM osumi.takami...@fujitsu.com wrote:
> >
> > On Saturday, April 17, 2021 4:13 PM Amit Kapila wrote:
> > > > I also don't find a test for this. It is introduced in
> > > > 5dfd1e5a6696, wrote by Simon Riggs, Marco Nenciarini and Peter
> > > > Eisentraut. Maybe they can explain when we can enter this condition?
> > >
> > > My guess is that this has been copied from the code a few lines
> > > above to handle insert/update/delete where it is required to handle
> > > some DDL ops like Alter Table but I think we don't need it here (for
> > > Truncate op). If that understanding turns out to be true then we
> > > should either have an Assert for this or an elog message.
> >
> > In this thread, we are discussing 3 topics below...
> >
> > (1) necessity of the check for REORDER_BUFFER_CHANGE_TRUNCATE in ReorderBufferProcessTXN()
> > (2) discussion of whether we disallow decoding of operations on user catalog tables or not
> > (3) memory leak of maybe_send_schema() (patch already provided)
> >
> > Let's address those one by one.
> > In terms of (1), which was close to the motivation of this thread,
> >
>
> I think (1) and (2) are related because if we need (2) then the check removed
> by (1) needs to be replaced with another check. So, I am not sure how to
> make this decision.

Yeah, you are right.

On Monday, April 19, 2021 9:23 PM Amit Kapila wrote:
> On Sat, Apr 17, 2021 at 12:01 PM Amit Kapila wrote:
> > On Fri, Apr 16, 2021 at 11:24 PM Andres Freund wrote:
> > >
> > > > I think it is also important to *not* acquire any lock on relation
> > > > otherwise it can lead to some sort of deadlock or infinite wait in
> > > > the decoding process. Consider a case for operations like Truncate
> > > > (or if the user has acquired an exclusive lock on the relation in
> > > > some other way say via Lock command) which acquires an exclusive
> > > > lock on relation, it won't get replicated in synchronous mode
> > > > (when synchronous_standby_name is configured). The truncate
> > > > operation will wait for the transaction to be replicated to the
> > > > subscriber and the decoding process will wait for the Truncate
> > > > operation to finish.
> > >
> > > However, this cannot be really relied upon for catalog tables. An
> > > output function might acquire locks or such. But for those we do not
> > > need to decode contents...
> > >
> >
> > I see that if we define a user_catalog_table (create table t1_cat(c1
> > int) WITH(user_catalog_table = true);), we are able to decode
> > operations like (insert, truncate) on such a table. What do you mean
> > by "But for those we do not need to decode contents"?
> >
>
> I think we are allowed to decode the operations on user catalog tables
> because we are using RelationIsLogicallyLogged() instead of
> RelationIsAccessibleInLogicalDecoding() in ReorderBufferProcessTXN().
> Based on this discussion, I think we should not be allowing decoding of
> operations on user catalog tables, so we should use
> RelationIsAccessibleInLogicalDecoding to skip such ops in
> ReorderBufferProcessTXN(). Am, I missing something?
>
> Can you please clarify?

I don't understand that point, either.

I read the context where the user_catalog_table was introduced - [1]. There, I couldn't find any discussion of whether we should skip decoding operations on that kind of table or not. So I suppose we simply never settled it.

What surprised me a bit is that decoding operations on system catalog tables was apparently being considered at the time, as in [2]. I cannot find any concrete description of such use cases in the thread, though.

Anyway, I feel that disallowing decoding of operations on user catalog tables doesn't spoil the feature's purpose. So I'm OK with doing so. What do you think?

[1] - https://www.postgresql.org/message-id/flat/20130914204913.GA4071%40awork2.anarazel.de
Note that in this discussion, user_catalog_table was renamed from treat_as_catalog_table in the middle of the thread. Searching for it might help you shorten the time needed to look through it.

[2] - https://www.postgresql.org/message-id/CA%2BTgmobhDCHuckL_86wRDWJ31Gw3Y1HrQ4yUKEn7U1_hTbeVqQ%40mail.gmail.com

Best Regards,
	Takamichi Osumi
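To make the option being discussed concrete: the relevant check in ReorderBufferProcessTXN() would essentially become something like the following (a sketch only; whether this is the desirable behavior is exactly what is being debated here):

    /*
     * Skip changes on relations that output plugins are not allowed to
     * see.  Using the stricter macro here (instead of
     * RelationIsLogicallyLogged) would also filter out tables marked
     * with user_catalog_table = true.
     */
    if (!RelationIsAccessibleInLogicalDecoding(relation))
        goto change_done;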
Re: Some oversights in query_id calculation
Hi Julien,

> You should see failures doing a check-world or simply a make -C
> contrib/pg_stat_statements check

Sorry, my bad. I was running make check-world, but did it with the -j4 flag, which was a mistake. The patch is OK.

On Wed, Apr 28, 2021 at 1:27 PM Julien Rouhaud wrote:
> Hi Aleksander,
>
> On Wed, Apr 28, 2021 at 01:19:36PM +0300, Aleksander Alekseev wrote:
> > Hi Julien,
> >
> > > I'm attaching a patch that fixes those, with regression tests to reproduce each
> > > problem.
> >
> > I believe something could be not quite right with the patch. Here is what I did:
> >
> > $ git apply ...
> > # revert the changes in the code but keep the new tests
> > $ git checkout src/backend/utils/misc/queryjumble.c
> > $ ./full-build.sh && single-install.sh && make installcheck-world
> >
> > ... where the named .sh scripts are something I use to quickly check a patch [1].
> >
> > I was expecting that several tests would fail, but they didn't. Maybe I
> > missed something?
>
> I think it's because installcheck-* don't run pg_stat_statements' tests, see
> its Makefile:
>
> > # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
> > # which typical installcheck users do not have (e.g. buildfarm clients).
> > NO_INSTALLCHECK = 1
>
> You should see failures doing a check-world or simply a make -C
> contrib/pg_stat_statements check

--
Best regards,
Aleksander Alekseev
Re: Use simplehash.h instead of dynahash in SMgr
David Rowley писал 2021-04-26 09:43:
> I tried that and it got a median result of 113.795 seconds over 14 runs
> with this recovery benchmark test.
>
> LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 1014, max chain: 6, avg chain: 0.499016, total_collisions: 428, max_collisions: 3, avg_collisions: 0.210630
>
> I also tried the following hash function just to see how much performance
> might be left from speeding it up:
>
> static inline uint32
> relfilenodebackend_hash(RelFileNodeBackend *rnode)
> {
>     uint32 h;
>
>     h = pg_rotate_right32((uint32) rnode->node.relNode, 16) ^ ((uint32) rnode->node.dbNode);
>     return murmurhash32(h);
> }
>
> I got a median of 112.685 seconds over 14 runs with:
>
> LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 1044, max chain: 7, avg chain: 0.513780, total_collisions: 438, max_collisions: 3, avg_collisions: 0.215551

The best result is with just:

    return (uint32) rnode->node.relNode;

i.e., relNode could be taken without any mixing at all. relNode is unique within a single database, and almost unique across the whole cluster, since it is an Oid.

>> I'd like to see the benchmark code. It's quite interesting that this place
>> became measurable at all.
>
> Sure.
>
> $ cat recoverybench_insert_hash.sh
>
> David.

So, I've repeated the benchmark with different numbers of partitions (I tried to catch a higher fillfactor) and a smaller amount of inserted data (since I don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
 3500/0.43  | 3.73s    | 3.54s       | 3.58s      | 3.34s        |
 3200/0.78  | 3.64s    | 3.46s       | 3.47s      | 3.25s        |
 1500/0.74  | 3.18s    | 2.97s       | 3.03s      | 2.79s        |

The fillfactor is the effective fillfactor in simplehash with that number of partitions. I wasn't able to measure with a fillfactor close to 0.9, since it looks like simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.

The "simple func" is a hash function that returns only rnode->node.relNode. I've tested it both with simplehash and dynahash. For dynahash, a custom match function was made as well.

Conclusions:
- the trivial hash function gives better results for both simplehash and dynahash,
- simplehash improves performance for both the complex and the trivial hash function,
- simplehash + trivial function performs best.

I'd like to hear other people's comments on the trivial hash function. But since the generation of relation Oids is not subject to human intervention, I'd recommend sticking with the trivial one.

Since the patch is simple, harmless, and gives a measurable improvement, I think it is ready for the commitfest.

regards,
Yura Sokolov.
Postgres Professional
https://www.postgrespro.com

PS. David, please send the patch once again, since my mail client reattached the files in previous messages and the commitfest robot could think I'm the author.
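For readers who haven't opened the patch itself, the "simplehash + trivial function" variant being benchmarked here boils down to something like the following. The SH_* parameters are the standard simplehash.h knobs, but the entry struct and prefix names below are hypothetical, not necessarily what the patch uses:

#include "postgres.h"
#include <string.h>
#include "storage/relfilenode.h"
#include "storage/smgr.h"

typedef struct SMgrEntry
{
	char		status;			/* status flag required by simplehash */
	RelFileNodeBackend key;		/* hash key */
	SMgrRelation srel;			/* the cached SMgrRelation */
} SMgrEntry;

/* The "trivial" hash: relNode is an Oid and is already nearly unique */
static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
	return (uint32) rnode->node.relNode;
}

#define SH_PREFIX		smgrtable
#define SH_ELEMENT_TYPE	SMgrEntry
#define SH_KEY_TYPE		RelFileNodeBackend
#define SH_KEY			key
#define SH_HASH_KEY(tb, key)	relfilenodebackend_hash(&(key))
#define SH_EQUAL(tb, a, b)		(memcmp(&(a), &(b), sizeof(RelFileNodeBackend)) == 0)
#define SH_SCOPE		static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"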
Re: Some dubious error messages and comments
On Wed, Apr 28, 2021 at 05:36:33PM +0900, Kyotaro Horiguchi wrote:
> 0001: I found some typos in an error message and a comment.
>
> multirangetypes.c: 1420
> > errmsg("range_intersect_agg must be called with a multirange")));
>
> This "range_intersect_agg" looks like a typo of "multirange_..".
>
> operatorcmds.c:303
> > * Look up a join estimator function ny name, and verify that it has the
>
> "ny" looks like a typo of "by".

"ny name" shows up a 2nd time. I have another "comment typos" patch - maybe someone will want to push them together.

commit 32e979c652c68ca5e3a7f308d677058e0c08547b
Author: Kyotaro Horiguchi
Date:   Wed Apr 28 17:23:52 2021 +0900

    Fix typos

    ny name: 321eed5f0f7563a0cabb3d7a98132856287c1ad1
    multirange: 6df7a9698bb036610c1e8c6d375e1be38cb26d5f

diff --git a/src/backend/commands/operatorcmds.c b/src/backend/commands/operatorcmds.c
index 809043c5d1..fbd7d8d062 100644
--- a/src/backend/commands/operatorcmds.c
+++ b/src/backend/commands/operatorcmds.c
@@ -265,7 +265,7 @@ DefineOperator(List *names, List *parameters)
 }

 /*
- * Look up a restriction estimator function ny name, and verify that it has
+ * Look up a restriction estimator function by name, and verify that it has
  * the correct signature and we have the permissions to attach it to an
  * operator.
  */
@@ -300,7 +300,7 @@ ValidateRestrictionEstimator(List *restrictionName)
 }

 /*
- * Look up a join estimator function ny name, and verify that it has the
+ * Look up a join estimator function by name, and verify that it has the
  * correct signature and we have the permissions to attach it to an
  * operator.
  */
diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index 0b81649779..2583ddeedf 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -1417,7 +1417,7 @@ multirange_intersect_agg_transfn(PG_FUNCTION_ARGS)
 	if (!type_is_multirange(mltrngtypoid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATATYPE_MISMATCH),
-				 errmsg("range_intersect_agg must be called with a multirange")));
+				 errmsg("multirange_intersect_agg must be called with a multirange")));

 	typcache = multirange_get_typcache(fcinfo, mltrngtypoid);

commit 8247b4034ed4c68241be9fbdec249bc967ceafd4
Author: Justin Pryzby
Date:   Tue Apr 27 07:57:50 2021 -0500

    Comment typos: extended stats a4d75c86b and 518442c7f

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 9dd30370da..eb9e63f4a8 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1943,7 +1943,7 @@ generateClonedExtStatsStmt(RangeVar *heapRel, Oid heapRelid,
 	 * simply append them after simple column references.
 	 *
 	 * XXX Some places during build/estimation treat expressions as if they
-	 * are before atttibutes, but for the CREATE command that's entirely
+	 * are before attributes, but for the CREATE command that's entirely
 	 * irrelevant.
 	 */
 	datum = SysCacheGetAttr(STATEXTOID, ht_stats,
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 7e11cb9d5f..5e53783ea6 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1796,7 +1796,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 			continue;

 		/*
-		 * Now we know the clause is compatible (we have either atttnums
+		 * Now we know the clause is compatible (we have either attnums
 		 * or expressions extracted from it), and was not estimated yet.
 		 */
Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch
Thomas Munro writes:
> On Wed, Apr 28, 2021 at 3:56 AM Tom Lane wrote:
>> Of course Wikipedia has been known to contain errors, but now
>> I'm inclined to think I blew this. Anyone want to check my work?

> I tried a couple of examples not from Wikipedia. ...

Thanks for checking! I'll go adjust the documentation.

			regards, tom lane
Re: Unresolved repliaction hang and stop problem.
On 2021-Apr-28, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera wrote:
> > Hmm ... On what does it depend (other than plain git conflicts, which
> > are aplenty)? On a quick look to the commit, it's clear that we need to
> > be careful in order not to cause an ABI break, but that doesn't seem
> > impossible to solve, but I'm wondering if there is more to it than that.
>
> As mentioned in the commit message, we need another commit [1] change
> to make this work.
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL message type, but we have added such on a minor release before.)

... It's strange that replication worked for them on pg10 though and broke on 13. Did we change anything to make it so? (I don't have any fish to fry on this topic at present, but it seems a bit concerning.)

--
Álvaro Herrera       Valdivia, Chile
Re: pg_hba.conf.sample wording improvement
On 2021-Apr-28, Peter Eisentraut wrote:

> I propose the attached patch to shake up the wording in the connection type
> section of pg_hba.conf.sample a bit. After the hostgssenc part was added
> on, the whole thing became a bit wordy, and it's also a bit inaccurate for
> example in that the current wording for "host" appears to say that it does
> not apply to GSS-encrypted connections.

Yeah, that's a clear improvement.

Looking at it now, I wonder how well the "hostno" options work. If I say "hostnogssenc", is an SSL-encrypted socket good? If I say "hostnossl", is a GSS-encrypted socket good? If so, how does that make sense?

--
Álvaro Herrera       39°49'30"S 73°17'W
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)
Re: Replication slot stats misgivings
On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila wrote:
>
> On Wed, Apr 28, 2021 at 9:37 AM Masahiko Sawada wrote:
> >
> > On Wed, Apr 28, 2021 at 12:29 PM Amit Kapila wrote:
> > >
> > > On Tue, Apr 27, 2021 at 11:02 AM vignesh C wrote:
> > > >
> > > > On Tue, Apr 27, 2021 at 9:48 AM vignesh C wrote:
> > > > >
> > > >
> > > > Attached patch has the changes to update statistics during
> > > > spill/stream which prevents the statistics from being lost during
> > > > interrupt.
> > > >
> > >
> > > void
> > > -UpdateDecodingStats(LogicalDecodingContext *ctx)
> > > +UpdateDecodingStats(ReorderBuffer *rb)
> > >
> > > I don't think you need to change this interface because
> > > reorderbuffer->private_data points to LogicalDecodingContext. See
> > > StartupDecodingContext.
> >
> > +1
> >
> > With this approach, we could still miss the totalTxns and totalBytes
> > updates if decoding a transaction that is large but smaller than
> > logical_decoding_work_mem is interrupted, right?
> >
>
> Right, but is there some simple way to avoid that? I see two
> possibilities (a) store stats in ReplicationSlot and then send them at
> ReplicationSlotRelease but that will lead to an increase in shared
> memory usage and as per the discussion above, we don't want that, (b)
> send intermediate stats after decoding say N changes but for that, we
> need to additionally compute the size of each change which might add
> some overhead.

Right.

> I am not sure if any of these alternatives are a good idea. What do
> you think? Do you have any other ideas for this?

I've been considering some ideas but haven't come up with a good one yet. It's just an idea and not tested, but how about having CreateDecodingContext() register a before_shmem_exit() callback with the decoding context, to ensure that we send slot stats even on interruption, and having FreeDecodingContext() cancel the callback?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
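A very rough sketch of that idea, just to make it concrete (untested, and where exactly the callback would be registered is an assumption):

static void
send_slot_stats_on_exit(int code, Datum arg)
{
    LogicalDecodingContext *ctx = (LogicalDecodingContext *) DatumGetPointer(arg);

    /* flush whatever has been accumulated so far */
    UpdateDecodingStats(ctx);
}

/* in CreateDecodingContext(), once the context is fully set up */
before_shmem_exit(send_slot_stats_on_exit, PointerGetDatum(ctx));

/* in FreeDecodingContext(), before freeing the context */
cancel_before_shmem_exit(send_slot_stats_on_exit, PointerGetDatum(ctx));

One caveat: cancel_before_shmem_exit() only removes the most recently registered callback, so the registration ordering would need some care.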
Re: Skip temporary table schema name from explain-verbose output.
> On Tue, Apr 27, 2021 at 7:08 PM Bharath Rupireddy
> Make sense, we would lose the ability to differentiate temporary
> tables from the auto_explain logs.

There's no useful differentiation that can be done with the temp schema name. They're assigned on connection start, randomly, from the pool of temp schemas. The names you find in the log won't be useful, and as new connections are made the same schema names will be reused for different connections.

I would say it makes sense to remove them -- except perhaps it makes it harder to parse explain output. If explain verbose always includes the schema, then it's easier for a parser to make sense of the explain plan output without having to be prepared to sometimes see a schema and sometimes not. That's probably a pretty hypothetical concern, however, since all the explain plan parsers that actually exist are prepared to deal with non-verbose plans anyway. And we have actual machine-readable formats too anyways.

--
greg
Re: pg_hba.conf.sample wording improvement
Peter Eisentraut writes:
> I propose the attached patch to shake up the wording in the connection
> type section of pg_hba.conf.sample a bit. After the hostgssenc part was
> added on, the whole thing became a bit wordy, and it's also a bit
> inaccurate for example in that the current wording for "host" appears to
> say that it does not apply to GSS-encrypted connections.

+1 for revising it in this general way. I notice you omitted "TCP/IP" from the last line though:

+# - "hostnogssenc" is a not GSSAPI-encrypted socket

which doesn't seem consistent. Another thought is to switch the phrase order:

+# - "local" is a Unix-domain socket
+# - "host" is a TCP/IP socket (encrypted or not)
+# - "hostssl" is a TCP/IP socket that is SSL-encrypted
+# - "hostnossl" is a TCP/IP socket that is not SSL-encrypted
+# - "hostgssenc" is a TCP/IP socket that is GSSAPI-encrypted
+# - "hostnogssenc" is a TCP/IP socket that is not GSSAPI-encrypted

I'm not wedded to that idea, but it seems to help reduce random variations between the wordings of these lines.

			regards, tom lane
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On Wed, Apr 28, 2021 at 8:32 AM Alvaro Herrera wrote:
> On 2021-Apr-27, Alvaro Herrera wrote:
>
> > This v3 handles things as you suggested and works correctly AFAICT. I'm
> > going to add some more tests cases to verify the behavior in the
> > scenarios you showed, and get them to run under cache-clobber options to
> > make sure it's good.
>
> Yep, it seems to work. Strangely, the new isolation case doesn't
> actually crash before the fix -- it merely throws a memory allocation
> error.

Thanks. Yeah, it does seem to work.

I noticed that rd_partdesc_nodetached_xmin can sometimes end up with value 0. While you seem to be already aware of that, because otherwise you wouldn't have added TransactionIdIsValid(...) to the condition in RelationGetPartitionDesc(), the comments nearby don't mention why such a thing might happen. Also, I guess it can't be helped that the partdesc_nodetached will have to be leaked when the xmin is 0, but that shouldn't be as problematic as the case we discussed earlier.

+	/*
+	 * But first, a kluge: if there's an old context for this type of
+	 * descriptor, it contains an old partition descriptor that may still be
+	 * referenced somewhere.  Preserve it, while not leaking it, by
+	 * reattaching it as a child context of the new one.  Eventually it will
+	 * get dropped by either RelationClose or RelationClearRelation.
+	 *
+	 * We keep the regular partdesc in rd_pdcxt, and the partdesc-excluding-
+	 * detached-partitions in rd_pddcxt.
+	 */
+	context = is_omit ? &rel->rd_pddcxt : &rel->rd_pdcxt;
+	if (*context != NULL)
+		MemoryContextSetParent(*context, new_pdcxt);
+	*context = new_pdcxt;

Would it be a bit more readable to just duplicate this stanza in the blocks that assign to rd_partdesc_nodetached and rd_partdesc, respectively? That's not much code to duplicate, and it'd be easier to see which context is for which partdesc.

+	TransactionId rd_partdesc_nodetached_xmin;	/* xmin for the above */

Could you please expand this description a bit?

--
Amit Langote
EDB: http://www.enterprisedb.com
Re: Skip temporary table schema name from explain-verbose output.
Greg Stark writes:
>> On Tue, Apr 27, 2021 at 7:08 PM Bharath Rupireddy
>> > Make sense, we would lose the ability to differentiate temporary
>> > tables from the auto_explain logs.

> There's no useful differentiation that can be done with the temp
> schema name.

Agreed.

> I would say it makes sense to remove them -- except perhaps it makes
> it harder to parse explain output.

I don't think we should remove them. However, it could make sense to print the "pg_temp" alias instead of the real schema name when we are talking about myTempNamespace. Basically, try to make that alias a bit less leaky.

			regards, tom lane
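A rough sketch of that idea, at the point where EXPLAIN VERBOSE resolves a relation's schema name (illustrative only; the exact function and variable names are assumptions):

    char	   *namespace;
    Oid			nspid = get_rel_namespace(rte->relid);

    /* Hide the backend-specific pg_temp_NNN schema behind the generic alias */
    if (isTempNamespace(nspid))
        namespace = pstrdup("pg_temp");
    else
        namespace = get_namespace_name(nspid);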
Re: pg_hba.conf.sample wording improvement
Alvaro Herrera writes:
> Looking at it now, I wonder how well the "hostno" options work. If I
> say "hostnogssenc", is an SSL-encrypted socket good? If I say
> "hostnossl", is a GSS-encrypted socket good? If so, how does that make
> sense?

Kind of off-topic for this thread, but I wonder if we should introduce "hostenc" and "hostnoenc" to mean "encrypted (or not), and I don't care by which method". The addition of GSS has made it painful to express those concepts.

			regards, tom lane
Re: pg_upgrade fails to detect unsupported arrays and ranges
[ blast-from-the-past department ]

I wrote:
> Daniel Gustafsson writes:
>> I can see the appeal of
>> including it before the wrap, even though I personally would've held off.

> Nah, I'm not gonna risk it at this stage. I concur with your point
> that this is an ancient bug, and one that is unlikely to bite many
> people. I'll push it Wednesday or so.

I happened across a couple of further pg_upgrade oversights in the same vein as 29aeda6e4 et al:

* Those commits fixed the bugs in pg_upgrade/version.c about not checking the contents of arrays/ranges/etc, but there are two similar functions in pg_upgrade/check.c that I failed to notice (probably due to the haste with which that patch was prepared).

* We really need to also reject user tables that contain instances of system-defined composite types (i.e. catalog rowtypes), because except for a few bootstrap catalogs, those type OIDs are assigned by genbki.pl not by hand, so they aren't stable across major versions. For example, in HEAD I get

regression=# select 'pg_enum'::regtype::oid;
  oid
-------
 13045
(1 row)

but the same OID was 12022 in v13, 11551 in v11, etc. So if you had a column of type pg_enum, you'd likely get no-such-type-OID failures when reading the values after an upgrade. I don't see much use-case for doing such a thing, so it seems saner to just block off the possibility rather than trying to support it. (We'd have little choice in the back branches anyway, as their OID values are locked down now.)

The attached proposed patch fixes these cases too. I generalized the recursive query a little more so that it could start from an arbitrary query yielding pg_type OIDs, rather than just one type name; otherwise it's pretty straightforward.

Barring objections I'll apply and back-patch this soon.

			regards, tom lane

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1c1c908664..ce821a46b3 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -24,6 +24,7 @@ static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_user_defined_postfix_ops(ClusterInfo *cluster);
 static void check_for_tables_with_oids(ClusterInfo *cluster);
+static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
@@ -100,6 +101,7 @@ check_and_dump_old_cluster(bool live_check)
 	check_is_install_user(&old_cluster);
 	check_proper_datallowconn(&old_cluster);
 	check_for_prepared_transactions(&old_cluster);
+	check_for_composite_data_type_usage(&old_cluster);
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);

@@ -1043,6 +1045,63 @@ check_for_tables_with_oids(ClusterInfo *cluster)
 }


+/*
+ * check_for_composite_data_type_usage()
+ *	Check for system-defined composite types used in user tables.
+ *
+ * The OIDs of rowtypes of system catalogs and information_schema views
+ * can change across major versions; unlike user-defined types, we have
+ * no mechanism for forcing them to be the same in the new cluster.
+ * Hence, if any user table uses one, that's problematic for pg_upgrade.
+ */
+static void
+check_for_composite_data_type_usage(ClusterInfo *cluster)
+{
+	bool		found;
+	Oid			firstUserOid;
+	char		output_path[MAXPGPATH];
+	char	   *base_query;
+
+	prep_status("Checking for system-defined composite types in user tables");
+
+	snprintf(output_path, sizeof(output_path), "tables_using_composite.txt");
+
+	/*
+	 * Look for composite types that were made during initdb *or* belong to
+	 * information_schema; that's important in case information_schema was
+	 * dropped and reloaded.
+	 *
+	 * The cutoff OID here should match the source cluster's value of
+	 * FirstNormalObjectId.  We hardcode it rather than using that C #define
+	 * because, if that #define is ever changed, our own version's value is
+	 * NOT what to use.  Eventually we may need a test on the source cluster's
+	 * version to select the correct value.
+	 */
+	firstUserOid = 16384;
+
+	base_query = psprintf("SELECT t.oid FROM pg_catalog.pg_type t "
+						  "LEFT JOIN pg_catalog.pg_namespace n ON t.typnamespace = n.oid "
+						  " WHERE typtype = 'c' AND (t.oid < %u OR nspname = 'information_schema')",
+						  firstUserOid);
+
+	found = check_for_data_types_usage(cluster, base_query, output_path);
+
+	free(base_query);
+
+	if (found)
+	{
+		pg_log(PG_REPORT, "fatal\n");
+		pg_fatal("Your installation contains system-defined composite type(s) in user tables.\n"
+				 "These type OIDs are not stable across PostgreSQL versions,\n"
+				 "so this cluster cannot currently be upgraded.  You can\n"
+				 "remove the problem tables and restart the upgrade.\n"
+				 "A list of the p
Re: Attach to shared memory after fork()
> Windows has CreateProcess, which isn't available elsewhere.

Yes, we still need fork() on *nix, so the solution is to reduce the overhead of fork(). Attaching to shared memory after fork() might be a form of "better shared memory management".

> This is one of the reasons for using a connection pooler like pgbouncer,
> which can vastly reduce the number of new process creations Postgres has to do.

Yes, that's another way, which I forgot to mention. But I think there should be a cleaner way that doesn't require an extra component.

> This proposal seems moderately insane. In the first place, it
> introduces failure modes we could do without, and in the second place,
> how is it not strictly *more* expensive than what happens now? You
> still have to end up with all those TLB entries mapped in the child.

Yes, the idea is radical, but it's practical.

1. I don't quite catch that; can you explain it?
2. Yes, the overall cost is still the same, but the cost can be spread across multiple processes and thus CPUs, instead of falling 100% on the postmaster.

> (If your kernel is unable to pass down shared-memory TLBs effectively,
> ISTM that's a kernel shortcoming not a Postgres architectural problem.)

Indeed, it's a kernel/CPU-architecture shortcoming, but it is also a Postgres architectural problem: MySQL and Oracle have no such problem. IMHO Postgres should manage itself well (e.g. I/O, buffer pool, connections, ...) and not rely so much on the OS kernel. fork() used to be a genius hack, but now it's a burden and it will get worse and worse. All I want to do is remove fork() or reduce its overhead. Maybe *nix will have CreateProcess someday (and I think it will), but should we wait for it?
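To make the mechanics concrete, here is a minimal, self-contained sketch of the general "attach after fork()" pattern using System V shared memory. It is only an illustration of the idea under discussion, not anything resembling PostgreSQL's actual shared-memory setup:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    /* Parent creates the segment but deliberately does not map it. */
    int     shmid = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
    pid_t   pid;

    if (shmid < 0)
    {
        perror("shmget");
        return 1;
    }

    pid = fork();
    if (pid == 0)
    {
        /*
         * Child: the mapping (and its page-table cost) is paid here, after
         * fork(), instead of being copied from the parent at fork() time.
         */
        char   *base = shmat(shmid, NULL, 0);

        if (base == (char *) -1)
        {
            perror("shmat");
            _exit(1);
        }
        strcpy(base, "hello from the child");
        shmdt(base);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    shmctl(shmid, IPC_RMID, NULL);  /* release the segment */
    return 0;
}

Whether this really reduces cost in practice depends entirely on how the kernel shares page tables for such mappings, which is exactly the point being debated above.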
Re: [PATCH] expand the units that pg_size_pretty supports on output
A second patch to teach the same units to pg_size_bytes(). Best, David On Wed, Apr 14, 2021 at 11:13 AM David Christensen < david.christen...@crunchydata.com> wrote: > Hi folks, > > Enclosed is a patch that expands the unit output for > pg_size_pretty(numeric) going up to Yottabytes; I reworked the existing > numeric output code to account for the larger number of units we're using > rather than just adding nesting levels. > > There are also a few other places that could benefit from expanded units, > but this is a standalone starting point. > > Best, > > David > 0001-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch Description: Binary data
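For readers who have not opened the patches, the rework described above is essentially a loop over a table of unit names instead of nested if/else levels. The following is only a rough standalone sketch of that idea, with a made-up threshold and plain doubles; the real patch operates on numeric and keeps pg_size_pretty's existing rounding rules:

#include <stddef.h>
#include <stdio.h>

#define lengthof(array) (sizeof(array) / sizeof((array)[0]))

static const char *const units[] = {
    "bytes", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
};

/* Toy rule: move to the next unit while the value is still "large". */
static void
size_pretty(double size, char *buf, size_t buflen)
{
    size_t  i = 0;

    while (size >= 10 * 1024.0 && i < lengthof(units) - 1)
    {
        size /= 1024.0;
        i++;
    }
    snprintf(buf, buflen, "%.0f %s", size, units[i]);
}

int
main(void)
{
    char    buf[64];

    size_pretty(5e15, buf, sizeof(buf));
    printf("%s\n", buf);    /* prints "4547 TB" with this toy threshold */
    return 0;
}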
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Thanks for re-reviewing! This one I hope is the last version. On Wed, Apr 28, 2021, at 10:21 AM, Amit Langote wrote: > I noticed that rd_partdesc_nodetached_xmin can sometimes end up with > value 0. While you seem to be already aware of that, because otherwise > you wouldn't have added TransactionIdIsValid(...) in condition in > RelationGetPartitionDesc(), the comments nearby don't mention why such > a thing might happen. Also, I guess it can't be helped that the > partdesc_nodetached will have to be leaked when the xmin is 0, but > that shouldn't be as problematic as the case we discussed earlier. The only case I am aware where that can happen is if the pg_inherits tuple is frozen. (That's exactly what the affected test case was testing, note the "VACUUM FREEZE pg_inherits" there). So that test case blew up immediately; but I think the real-world chances that people are going to be doing that are pretty low, so I'm not really concerned about the leak. > Would it be a bit more readable to just duplicate this stanza in the > blocks that assign to rd_partdesc_nodetached and rd_partdesc, > respectively? That's not much code to duplicate and it'd be easier to > see which context is for which partdesc. Sure .. that's how I first wrote this code. We don't use that style much, so I'm OK with backing out of it. > + TransactionId rd_partdesc_nodetached_xmin; /* xmin for the above */ > > Could you please expand this description a bit? Done.From 8b1489a17cb5cfcd30d4dfd18020c64e606c5042 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 27 Apr 2021 19:04:37 -0400 Subject: [PATCH v5] Allow a partdesc-omitting-partitions to be cached Makes partition descriptor acquisition faster during the transient period in which a partition is in the process of being detached. (This incidentally fixes a bug in 8aba9322511 whereby a memory context holding a transient partdesc is reparented to NULL, leading to permanent leak of that memory. Reported by Amit Langote.) Per gripe from Amit Langote Discussion: https://postgr.es/m/ca+hiwqfgpp1lxjzobygt9rpvtjxxkg5qg2+xch2z1q7krqz...@mail.gmail.com --- src/backend/catalog/pg_inherits.c | 52 - src/backend/commands/tablecmds.c | 66 ++- src/backend/commands/trigger.c| 3 +- src/backend/partitioning/partdesc.c | 101 +++- src/backend/utils/cache/relcache.c| 33 +- src/include/catalog/pg_inherits.h | 6 +- src/include/utils/rel.h | 15 +++ .../detach-partition-concurrently-3.out | 110 +- .../detach-partition-concurrently-3.spec | 14 ++- 9 files changed, 327 insertions(+), 73 deletions(-) diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c index 6447b52854..c373faf2d6 100644 --- a/src/backend/catalog/pg_inherits.c +++ b/src/backend/catalog/pg_inherits.c @@ -52,6 +52,22 @@ typedef struct SeenRelsEntry * then no locks are acquired, but caller must beware of race conditions * against possible DROPs of child relations. * + * Partitions marked as being detached are omitted; see + * find_inheritance_children_extended for details. + */ +List * +find_inheritance_children(Oid parentrelId, LOCKMODE lockmode) +{ + return find_inheritance_children_extended(parentrelId, true, lockmode, + NULL, NULL); +} + +/* + * find_inheritance_children_extended + * + * As find_inheritance_children, with more options regarding detached + * partitions. + * * If a partition's pg_inherits row is marked "detach pending", * *detached_exist (if not null) is set true. 
* @@ -60,11 +76,13 @@ typedef struct SeenRelsEntry * marked "detach pending" is visible to that snapshot, then that partition is * omitted from the output list. This makes partitions invisible depending on * whether the transaction that marked those partitions as detached appears - * committed to the active snapshot. + * committed to the active snapshot. In addition, *detached_xmin (if not null) + * is set to the xmin of the row of the detached partition. */ List * -find_inheritance_children(Oid parentrelId, bool omit_detached, - LOCKMODE lockmode, bool *detached_exist) +find_inheritance_children_extended(Oid parentrelId, bool omit_detached, + LOCKMODE lockmode, bool *detached_exist, + TransactionId *detached_xmin) { List *list = NIL; Relation relation; @@ -132,7 +150,32 @@ find_inheritance_children(Oid parentrelId, bool omit_detached, snap = GetActiveSnapshot(); if (!XidInMVCCSnapshot(xmin, snap)) +{ + if (detached_xmin) + { + /* + * Two detached partitions should not occur (see + * checks in MarkInheritDetached), but if they do, + * track the newer of the two. Make sure to warn the + * user, so that they can clean up. Since this is + * just a cross-check against potentially corrupt + * catalogs, we don't
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-22 13:59:58 +1200, Thomas Munro wrote: > On Thu, Apr 22, 2021 at 1:21 PM Tom Lane wrote: > > I've also tried to reproduce on 32-bit and 64-bit Intel, without > > success. So if this is real, maybe it's related to being big-endian > > hardware? But it's also quite sensitive to $dunno-what, maybe the > > history of WAL records that have already been replayed. > > Ah, that's interesting. There are a couple of sparc64 failures and a > ppc64 failure in the build farm, but I couldn't immediately spot what > was wrong with them or whether it might be related to this stuff. > > Thanks for the clues. I'll see what unusual systems I can find to try > this on FWIW, I've run 32 and 64 bit x86 through several hundred regression cycles, without hitting an issue. For a lot of them I set checkpoint_timeout to a lower value as I thought that might make it more likely to reproduce an issue. Tom, any chance you could check if your machine repros the issue before these commits? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > Tom, any chance you could check if your machine repros the issue before > these commits? Wilco, but it'll likely take a little while to get results ... regards, tom lane
strange case of "if ((a & b))"
These look strange to me - the inner parens don't do anything. I wouldn't write it with 2x parens for the same reason I wouldn't write it with 8x parens. diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index 9f159eb3db..3bbc13c443 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -693,7 +693,7 @@ check_tuple_header(HeapCheckContext *ctx) report_corruption(ctx, psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, has nulls)", expected_hoff, ctx->tuphdr->t_hoff)); - else if ((infomask & HEAP_HASNULL)) + else if ((infomask & HEAP_HASNULL) != 0) report_corruption(ctx, psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, has nulls)", expected_hoff, ctx->tuphdr->t_hoff, ctx->natts)); diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c index 15115cb29f..0dd2838f8b 100644 --- a/contrib/ltree/ltree_io.c +++ b/contrib/ltree/ltree_io.c @@ -661,17 +661,17 @@ deparse_lquery(const lquery *in) } memcpy(ptr, curtlevel->name, curtlevel->len); ptr += curtlevel->len; - if ((curtlevel->flag & LVAR_SUBLEXEME)) + if ((curtlevel->flag & LVAR_SUBLEXEME) != 0) { *ptr = '%'; ptr++; } - if ((curtlevel->flag & LVAR_INCASE)) + if ((curtlevel->flag & LVAR_INCASE) != 0) { *ptr = '@'; ptr++; } - if ((curtlevel->flag & LVAR_ANYEND)) + if ((curtlevel->flag & LVAR_ANYEND) != 0) { *ptr = '*'; ptr++; diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 13396eb7f2..f5a4db5c57 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2107,7 +2107,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, vmstatus = visibilitymap_get_status(relation, BufferGetBlockNumber(buffer), &vmbuffer); - if ((starting_with_empty_page || vmstatus & VISIBILITYMAP_ALL_FROZEN)) + if (starting_with_empty_page || + (vmstatus & VISIBILITYMAP_ALL_FROZEN) != 0) all_frozen_set = true; } diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 441445927e..28fdd2943b 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2417,7 +2417,7 @@ PrepareTransaction(void) * cases, such as a temp table created and dropped all within the * transaction. That seems to require much more bookkeeping though. */ - if ((MyXactFlags & XACT_FLAGS_ACCESSEDTEMPNAMESPACE)) + if ((MyXactFlags & XACT_FLAGS_ACCESSEDTEMPNAMESPACE) != 0) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot PREPARE a transaction that has operated on temporary objects"))); @@ -5530,7 +5530,7 @@ XactLogCommitRecord(TimestampTz commit_time, xl_xinfo.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE; if (forceSyncCommit) xl_xinfo.xinfo |= XACT_COMPLETION_FORCE_SYNC_COMMIT; - if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK)) + if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK) != 0) xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS; /* @@ -5681,7 +5681,7 @@ XactLogAbortRecord(TimestampTz abort_time, xlrec.xact_time = abort_time; - if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK)) + if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK) != 0) xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS; if (nsubxacts > 0) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 8e717ada28..f341e6d143 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -16029,7 +16029,7 @@ PreCommit_on_commit_actions(void)
Re: strange case of "if ((a & b))"
Justin Pryzby writes: > These look strange to me - the inner parens don't do anything. > I wouldn't write it with 2x parens for the same reason I wouldn't write it > with > 8x parens. Agreed, but shouldn't we just drop the excess parens rather than doubling down on useless notation? regards, tom lane
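For anyone skimming the thread, the three spellings under discussion are easy to compare in a trivial standalone example (the flag name is borrowed from the diff above purely for illustration). All three are equivalent to the compiler; the argument is only about which one reads least ambiguously:

#include <stdio.h>

#define HEAP_HASNULL 0x0001

int
main(void)
{
    unsigned int infomask = 0x0101;

    if ((infomask & HEAP_HASNULL))      /* current style being questioned */
        printf("doubled parens\n");

    if ((infomask & HEAP_HASNULL) != 0) /* the patch's explicit comparison */
        printf("explicit comparison\n");

    if (infomask & HEAP_HASNULL)        /* dropping the excess parens instead */
        printf("plain bit test\n");

    return 0;
}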
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Pushed that now, with a one-line addition to the docs that only one partition can be marked detached. -- Álvaro Herrera39°49'30"S 73°17'W "That sort of implies that there are Emacs keystrokes which aren't obscure. I've been using it daily for 2 years now and have yet to discover any key sequence which makes any sense."(Paul Thomas)
Re: Better sanity checking for message length words
Aleksander Alekseev writes: > Sorry, my bad. It was about lines separating on different platforms. The > patch is fine and passes installcheck-world on MacOS. Pushed, thanks for looking at it! regards, tom lane
Re: pg_upgrade fails to detect unsupported arrays and ranges
> On 28 Apr 2021, at 17:09, Tom Lane wrote: > > [ blast-from-the-past department ] > > I wrote: >> Daniel Gustafsson writes: >>> I can see the appeal of >>> including it before the wrap, even though I personally would've held off. > >> Nah, I'm not gonna risk it at this stage. I concur with your point >> that this is an ancient bug, and one that is unlikely to bite many >> people. I'll push it Wednesday or so. > > I happened across a couple of further pg_upgrade oversights in the > same vein as 29aeda6e4 et al: Nice find, this makes a lot of sense. > ..the same OID was 12022 in v13, 11551 in v11, etc. So if you > had a column of type pg_enum, you'd likely get no-such-type-OID > failures when reading the values after an upgrade. I don't see > much use-case for doing such a thing, so it seems saner to just > block off the possibility rather than trying to support it. Agreed. Having implemented basically this for Greenplum I think it’s wise to avoid it unless we really have to, it gets very complicated once the layers of worms are peeled back. > The attached proposed patch fixes these cases too. I generalized > the recursive query a little more so that it could start from an > arbitrary query yielding pg_type OIDs, rather than just one type > name; otherwise it's pretty straightforward. > > Barring objections I'll apply and back-patch this soon. Patch LGTM on reading, +1 on applying. Being on parental leave I don’t have my dev env ready to go so I didn’t perform testing; sorry about that. > + pg_fatal("Your installation contains system-defined composite > type(s) in user tables.\n" > + "These type OIDs are not stable across > PostgreSQL versions,\n" > + "so this cluster cannot currently be upgraded. > You can\n" > + "remove the problem tables and restart the > upgrade.\n" > + "A list of the problem columns is in the > file:\n" Would it be helpful to inform the user that they can alter/drop just the problematic columns as a potentially less scary alternative to dropping the entire table? > - * The type of interest might be wrapped in a domain, array, > + * The types of interest might be wrapped in a domain, array, Shouldn't this be "type(s)” as in the other changes here? -- Daniel Gustafsson https://vmware.com/
Re: Replication slot stats misgivings
It seems that the test case added by f5fc2f5b2 is still a bit unstable, even after c64dcc7fe: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2021-04-23%2006%3A20%3A12 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2021-04-24%2018%3A20%3A10 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-04-28%2017%3A53%3A14 (The snapper run fails to show regression.diffs, so it's not certain that it's the same failure as peripatus, but ...) regards, tom lane
Re: pg_upgrade fails to detect unsupported arrays and ranges
Daniel Gustafsson writes: >> On 28 Apr 2021, at 17:09, Tom Lane wrote: >> +pg_fatal("Your installation contains system-defined composite >> type(s) in user tables.\n" >> + "These type OIDs are not stable across >> PostgreSQL versions,\n" >> + "so this cluster cannot currently be upgraded. >> You can\n" >> + "remove the problem tables and restart the >> upgrade.\n" >> + "A list of the problem columns is in the >> file:\n" > Would it be helpful to inform the user that they can alter/drop just the > problematic columns as a potentially less scary alternative to dropping the > entire table? This wording is copied-and-pasted from the other similar tests. I agree that it's advocating a solution that might be overkill, but if we change it we should also change the existing messages. I don't mind doing that in HEAD; less sure about the back branches, as (I think) these are translatable strings. Thoughts? >> - * The type of interest might be wrapped in a domain, array, >> + * The types of interest might be wrapped in a domain, array, > Shouldn't this be "type(s)” as in the other changes here? Fair enough. regards, tom lane
Re: pg_upgrade fails to detect unsupported arrays and ranges
> On 28 Apr 2021, at 22:47, Tom Lane wrote: > > Daniel Gustafsson writes: >>> On 28 Apr 2021, at 17:09, Tom Lane wrote: >>> + pg_fatal("Your installation contains system-defined composite >>> type(s) in user tables.\n" >>> +"These type OIDs are not stable across >>> PostgreSQL versions,\n" >>> +"so this cluster cannot currently be upgraded. >>> You can\n" >>> +"remove the problem tables and restart the >>> upgrade.\n" >>> +"A list of the problem columns is in the >>> file:\n" > >> Would it be helpful to inform the user that they can alter/drop just the >> problematic columns as a potentially less scary alternative to dropping the >> entire table? > > This wording is copied-and-pasted from the other similar tests. I agree > that it's advocating a solution that might be overkill, but if we change > it we should also change the existing messages. Good point. > I don't mind doing that in HEAD; less sure about the back branches, as I think it would be helpful for users to try and give slightly more expanded advice while (obviously) still always being safe. I’m happy to take a crack at that once back unless someone beats me to it. > (I think) these are translatable strings. If they aren't I think we should try and make them so to as far as we can reduce language barrier problems in such important messages. -- Daniel Gustafsson https://vmware.com/
Re: pg_upgrade fails to detect unsupported arrays and ranges
Daniel Gustafsson writes: > On 28 Apr 2021, at 22:47, Tom Lane wrote: >> This wording is copied-and-pasted from the other similar tests. I agree >> that it's advocating a solution that might be overkill, but if we change >> it we should also change the existing messages. > Good point. >> I don't mind doing that in HEAD; less sure about the back branches, as > I think it would be helpful for users to try and give slightly more expanded > advice while (obviously) still always being safe. I’m happy to take a crack > at > that once back unless someone beats me to it. Seems like s/remove the problem tables/drop the problem columns/ is easy and sufficient. >> (I think) these are translatable strings. > If they aren't I think we should try and make them so to as far as we can > reduce language barrier problems in such important messages. Checking, I see they do appear in pg_upgrade's po files. So I propose that we change the existing messages in HEAD but not the back branches. regards, tom lane
Re: SQL/JSON: functions
On Fri, Mar 26, 2021 at 9:14 PM Nikita Glukhov wrote: > Attached 54th version of the patches rebased onto current master. > > > On 27.03.2021 01:30, Andrew Dunstan wrote: > > Specifically, patch 4 (SQL-JSON-query-functions) fails with this when > built with LLVM: > > > There is also a bug that results in a warning in gram.y, but fixing it > doesn't affect this issue. Nikita, please look into this ASAP. > > LLVM issues and gram.y are fixed. > > > > It's apparently bitrotted again. See < http://cfbot.cputube.org/patch_33_2901.log> cheers andrew
gcc 11.1.0 warnings in llvmjit_expr.c
Hello, gcc 11.1.0 produces quite a litany of second thoughts when compiling llvmjit_expr.c; is there something in it? (all 'warning' or 'note') llvmjit_expr.c: In function ‘llvm_compile_expr’: llvmjit_expr.c:69:9: warning: ‘build_EvalXFuncInt’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=] 69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \ | ^~~ 70 | lengthof(((LLVMValueRef[]){__VA_ARGS__})), \ | 71 | ((LLVMValueRef[]){__VA_ARGS__})) | llvmjit_expr.c:1553:33: note: in expansion of macro ‘build_EvalXFunc’ 1553 | build_EvalXFunc(b, mod, "ExecEvalSQLValueFunction", | ^~~ llvmjit_expr.c:69:9: note: referencing argument 7 of type ‘struct LLVMOpaqueValue **’ 69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \ | ^~~ 70 | lengthof(((LLVMValueRef[]){__VA_ARGS__})), \ | 71 | ((LLVMValueRef[]){__VA_ARGS__})) | llvmjit_expr.c:1553:33: note: in expansion of macro ‘build_EvalXFunc’ 1553 | build_EvalXFunc(b, mod, "ExecEvalSQLValueFunction", | ^~~ llvmjit_expr.c:2460:1: note: in a call to function ‘build_EvalXFuncInt’ 2460 | build_EvalXFuncInt(LLVMBuilderRef b, LLVMModuleRef mod, const char *funcname, | ^~ llvmjit_expr.c:69:9: warning: ‘build_EvalXFuncInt’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=] 69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \ | ^~~ 70 | lengthof(((LLVMValueRef[]){__VA_ARGS__})), \ | 71 | ((LLVMValueRef[]){__VA_ARGS__})) | llvmjit_expr.c:1559:33: note: in expansion of macro ‘build_EvalXFunc’ 1559 | build_EvalXFunc(b, mod, "ExecEvalCurrentOfExpr", | ^~~ llvmjit_expr.c:69:9: note: referencing argument 7 of type ‘struct LLVMOpaqueValue **’ 69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \ | ^~~ 70 | lengthof(((LLVMValueRef[]){__VA_ARGS__})), \ | 71 | ((LLVMValueRef[]){__VA_ARGS__})) | llvmjit_expr.c:1559:33: note: in expansion of macro ‘build_EvalXFunc’ 1559 | build_EvalXFunc(b, mod, "ExecEvalCurrentOfExpr", | ^~~ llvmjit_expr.c:2460:1: note: in a call to function ‘build_EvalXFuncInt’ 2460 | build_EvalXFuncInt(LLVMBuilderRef b, LLVMModuleRef mod, const char *funcname, | ^~ llvmjit_expr.c:69:9: warning: ‘build_EvalXFuncInt’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=] 69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \ | ^~~ 70 | lengthof(((LLVMValueRef[]){__VA_ARGS__})), \ | 71 | ((LLVMValueRef[]){__VA_ARGS__})) | llvmjit_expr.c:1565:33: note: in expansion of macro ‘build_EvalXFunc’ 1565 | build_EvalXFunc(b, mod, "ExecEvalNextValueExpr", | ^~~ llvmjit_expr.c:69:9: note: referencing argument 7 of type ‘struct LLVMOpaqueValue **’ 69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \ | ^~~ 70 | lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 4:45 AM Tom Lane wrote: > Andres Freund writes: > > Tom, any chance you could check if your machine repros the issue before > > these commits? > > Wilco, but it'll likely take a little while to get results ... FWIW I also chewed through many megawatts trying to reproduce this on a PowerPC system in 64 bit big endian mode, with an emulator. No cigar. However, it's so slow that I didn't make it to 10 runs...
Re: WIP: WAL prefetch (another approach)
Thomas Munro writes: > FWIW I also chewed through many megawatts trying to reproduce this on > a PowerPC system in 64 bit big endian mode, with an emulator. No > cigar. However, it's so slow that I didn't make it to 10 runs... Speaking of megawatts ... my G4 has now finished about ten cycles of installcheck-parallel without a failure, which isn't really enough to draw any conclusions yet. But I happened to notice the accumulated CPU time for the background processes: USER PID %CPU %MEM VSZRSS TT STAT STARTED TIME COMMAND tgl 19048 0.0 4.4 229952 92196 ?? Ss3:19PM 19:59.19 postgres: startup recovering 000100140022 tgl 19051 0.0 0.1 229656 1696 ?? Ss3:19PM 27:09.14 postgres: walreceiver streaming 14/227D8F14 tgl 19052 0.0 0.1 229904 2516 ?? Ss3:19PM 17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14 IOW, we've spent over twice as many CPU cycles shipping data to the standby as we did in applying the WAL on the standby. Is this expected? I've got wal_consistency_checking = all, which is bloating the WAL volume quite a bit, but still it seems like the walsender and walreceiver have little excuse for spending more cycles per byte than the startup process. (This is testing b3ee4c503, so if Thomas' WAL changes improved efficiency of the replay process at all, the discrepancy could be even worse in HEAD.) regards, tom lane
Re: Replication slot stats misgivings
On Thu, Apr 29, 2021 at 5:41 AM Tom Lane wrote:
>
> It seems that the test case added by f5fc2f5b2 is still a bit
> unstable, even after c64dcc7fe:

Hmm, I don't see the exact cause yet, but there are two possibilities: some transactions were really spilled, or it showed the old stats due to losing the drop (and create) slot messages. For the former case, it seems better to create the slot just before the insertion and to set logical_decoding_work_mem to the default (64MB). For the latter case, maybe we can use a different slot name than the one used in other tests?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Use simplehash.h instead of dynahash in SMgr
On Thu, 29 Apr 2021 at 00:28, Yura Sokolov wrote:
> The best result is with just:
>
> return (uint32)rnode->node.relNode;
>
> ie, relNode could be taken without mixing at all.
> relNode is unique inside single database, and almost unique among whole
> cluster since it is Oid.

I admit to having tried that too just to almost eliminate the cost of hashing. I just didn't consider it something we'd actually do. The system catalogues are quite likely to all have the same relfilenode in all databases, so for workloads that have a large number of databases, looking up WAL records that touch the catalogues is going to be pretty terrible.

The simplified hash function I wrote with just the relNode and dbNode would be the least I'd be willing to entertain. However, I just wouldn't be surprised if there was a good reason for that being bad too.

> So, I've repeated benchmark with different number of partitons (I tried
> to catch higher fillfactor) and less amount of inserted data (since I
> don't want to stress my SSD).
>
> partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
> fillfactor  |          | simple func |            | simple func  |
> ------------+----------+-------------+------------+--------------+
> 3500/0.43   | 3.73s    | 3.54s       | 3.58s      | 3.34s        |
> 3200/0.78   | 3.64s    | 3.46s       | 3.47s      | 3.25s        |
> 1500/0.74   | 3.18s    | 2.97s       | 3.03s      | 2.79s        |
>
> Fillfactor is effective fillfactor in simplehash with than number of
> partitions.
> I wasn't able to measure with fillfactor close to 0.9 since looks like
> simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.

Thanks for testing that.

I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to see if it made a difference. The test was similar to last time, but I trimmed the number of rows inserted from 100 million down to 25 million. Last time I tested with 1000 partitions, this time with each of: 100 200 300 400 500 600 700 800 900 1000 partitions. There didn't seem to be any point of testing lower than that as the minimum hash table size is 512.

The averages over 10 runs were:

nparts  ff75    ff90
100     21.898  22.226
200     23.105  25.493
300     25.274  24.251
400     25.139  25.611
500     25.738  25.454
600     26.656  26.82
700     27.577  27.102
800     27.608  27.546
900     27.284  28.186
1000    29      28.153

Or to summarise a bit, we could just look at the sum of all the results per fillfactor:

sum ff75  2592.79
sum ff90  2608.42  100.6%

fillfactor 75 did come out slightly faster, but only just. It seems close enough that it might be better just to keep the 90 to save a little memory and improve caching elsewhere. Also, from below, you can see that for 300, 500, 600, 700, 1000 tables tests, the hash tables ended up the same size, yet there's a bit of variability in the timing result. So the 0.6% gain might just be noise. I don't think it's worth making the fillfactor 75.
drowley@amd3990x:~/recoverylogs$ grep -rH -m 1 "collisions" ff75_tb100.log:LOG: size: 1024, members: 231, filled: 0.225586, total chain: 33, max chain: 2, avg chain: 0.142857, total_collisions: 20, max_collisions: 2, avg_collisions: 0.086580 ff90_tb100.log:LOG: size: 512, members: 231, filled: 0.451172, total chain: 66, max chain: 2, avg chain: 0.285714, total_collisions: 36, max_collisions: 2, avg_collisions: 0.155844 ff75_tb200.log:LOG: size: 1024, members: 431, filled: 0.420898, total chain: 160, max chain: 4, avg chain: 0.371230, total_collisions: 81, max_collisions: 3, avg_collisions: 0.187935 ff90_tb200.log:LOG: size: 512, members: 431, filled: 0.841797, total chain: 942, max chain: 9, avg chain: 2.185615, total_collisions: 134, max_collisions: 3, avg_collisions: 0.310905 ff90_tb300.log:LOG: size: 1024, members: 631, filled: 0.616211, total chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158, max_collisions: 4, avg_collisions: 0.250396 ff75_tb300.log:LOG: size: 1024, members: 631, filled: 0.616211, total chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158, max_collisions: 4, avg_collisions: 0.250396 ff75_tb400.log:LOG: size: 2048, members: 831, filled: 0.405762, total chain: 341, max chain: 4, avg chain: 0.410349, total_collisions: 162, max_collisions: 3, avg_collisions: 0.194946 ff90_tb400.log:LOG: size: 1024, members: 831, filled: 0.811523, total chain: 1747, max chain: 15, avg chain: 2.102286, total_collisions: 269, max_collisions: 3, avg_collisions: 0.323706 ff75_tb500.log:LOG: size: 2048, members: 1031, filled: 0.503418, total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions: 219, max_collisions: 4, avg_collisions: 0.212415 ff90_tb500.log:LOG: size: 2048, members: 1031, filled: 0.503418, total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions: 219, max_collisions: 4, avg_collisions: 0.212415 ff75_tb600.log:LOG: size: 2048, members: 1231, filled: 0.601074, total chain: 9
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 19:24:53 -0400, Tom Lane wrote: > But I happened to notice the accumulated CPU time for the background > processes: > > USER PID %CPU %MEM VSZRSS TT STAT STARTED TIME COMMAND > tgl 19048 0.0 4.4 229952 92196 ?? Ss3:19PM 19:59.19 > postgres: startup recovering 000100140022 > tgl 19051 0.0 0.1 229656 1696 ?? Ss3:19PM 27:09.14 > postgres: walreceiver streaming 14/227D8F14 > tgl 19052 0.0 0.1 229904 2516 ?? Ss3:19PM 17:38.17 > postgres: walsender tgl [local] streaming 14/227D8F14 > > IOW, we've spent over twice as many CPU cycles shipping data to the > standby as we did in applying the WAL on the standby. Is this > expected? I've got wal_consistency_checking = all, which is bloating > the WAL volume quite a bit, but still it seems like the walsender and > walreceiver have little excuse for spending more cycles per byte > than the startup process. I don't really know how the time calculation works on mac. Is there a chance it includes time spent doing IO? On the primary the WAL IO is done by a lot of backends, but on the standby it's all going to be the walreceiver. And the walreceiver does fsyncs in a not particularly efficient manner. FWIW, on my linux workstation no such difference is visible: USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND andres 2910540 9.4 0.0 2237252 126680 ? Ss 16:55 0:20 postgres: dev assert standby: startup recovering 00010002003F andres 2910544 5.2 0.0 2236724 9260 ?Ss 16:55 0:11 postgres: dev assert standby: walreceiver streaming 2/3FDCF118 andres 2910545 2.1 0.0 2237036 10672 ? Ss 16:55 0:04 postgres: dev assert: walsender andres [local] streaming 2/3FDCF118 > (This is testing b3ee4c503, so if Thomas' WAL changes improved > efficiency of the replay process at all, the discrepancy could be > even worse in HEAD.) The prefetching isn't enabled by default, so I'd not expect meaningful differences... And even with the prefetching enabled, our normal regression tests largely are resident in s_b, so there shouldn't be much prefetching. Oh! I was about to ask how much shared buffers your primary / standby have. And I think I may actually have reproduce a variant of the issue! I previously had played around with different settings that I thought might increase the likelihood of reproducing the problem. But this time I set shared_buffers lower than before, and got: 2021-04-28 17:03:22.174 PDT [2913840][] LOG: database system was shut down in recovery at 2021-04-28 17:03:11 PDT 2021-04-28 17:03:22.174 PDT [2913840][] LOG: entering standby mode 2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG: redo starts at 2/416C6278 2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG: consistent recovery state reached at 4/7F5C3200 2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL: invalid memory alloc request size 3053455757 2021-04-28 17:03:37.628 PDT [2913839][] LOG: database system is ready to accept read only connections 2021-04-28 17:03:37.636 PDT [2913839][] LOG: startup process (PID 2913840) exited with exit code 1 This reproduces across restarts. Yay, I guess. Isn't it off that we get a "database system is ready to accept read only connections"? Greetings, Andres Freund
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > On 2021-04-28 19:24:53 -0400, Tom Lane wrote: >> IOW, we've spent over twice as many CPU cycles shipping data to the >> standby as we did in applying the WAL on the standby. > I don't really know how the time calculation works on mac. Is there a > chance it includes time spent doing IO? I'd be pretty astonished if it did. This is basically a NetBSD system remember (in fact, this ancient macOS release is a good deal closer to those roots than modern versions). BSDen have never accounted for time that way AFAIK. Also, the "ps" man page says specifically that that column is CPU time. > Oh! I was about to ask how much shared buffers your primary / standby > have. And I think I may actually have reproduce a variant of the issue! Default configurations, so 128MB each. regards, tom lane
Re: Use simplehash.h instead of dynahash in SMgr
David Rowley wrote 2021-04-29 02:51:
> On Thu, 29 Apr 2021 at 00:28, Yura Sokolov wrote:
> > The best result is with just:
> >
> > return (uint32)rnode->node.relNode;
> >
> > ie, relNode could be taken without mixing at all.
> > relNode is unique inside single database, and almost unique among whole
> > cluster since it is Oid.
>
> I admit to having tried that too just to almost eliminate the cost of
> hashing. I just didn't consider it something we'd actually do. The system
> catalogues are quite likely to all have the same relfilenode in all
> databases, so for workloads that have a large number of databases, looking
> up WAL records that touch the catalogues is going to be pretty terrible.

The only workload that could touch system catalogues in different databases is recovery (and autovacuum?); client backends can't be connected to more than one database. But nevertheless, you're right. I oversimplified it intentionally. I originally wrote:

hashcode = (uint32)rnode->node.dbNode * 0xc2b2ae35;
hashcode ^= (uint32)rnode->node.relNode;
return hashcode;

but before sending the mail I cut the dbNode part. Still, the main point is that relNode can be put unmixed into the final value; that way fewer collisions happen.

> The simplified hash function I wrote with just the relNode and dbNode
> would be the least I'd be willing to entertain. However, I just wouldn't
> be surprised if there was a good reason for that being bad too.
>
> > So, I've repeated benchmark with different number of partitons (I tried
> > to catch higher fillfactor) and less amount of inserted data (since I
> > don't want to stress my SSD).
> >
> > partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
> > fillfactor  |          | simple func |            | simple func  |
> > ------------+----------+-------------+------------+--------------+
> > 3500/0.43   | 3.73s    | 3.54s       | 3.58s      | 3.34s        |
> > 3200/0.78   | 3.64s    | 3.46s       | 3.47s      | 3.25s        |
> > 1500/0.74   | 3.18s    | 2.97s       | 3.03s      | 2.79s        |
> >
> > Fillfactor is effective fillfactor in simplehash with than number of
> > partitions.
> > I wasn't able to measure with fillfactor close to 0.9 since looks like
> > simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.
>
> Thanks for testing that.
>
> I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
> see if it made a difference. The test was similar to last time, but I
> trimmed the number of rows inserted from 100 million down to 25 million.
> Last time I tested with 1000 partitions, this time with each of: 100 200
> 300 400 500 600 700 800 900 1000 partitions. There didn't seem to be any
> point of testing lower than that as the minimum hash table size is 512.
>
> The averages over 10 runs were:
>
> nparts  ff75    ff90
> 100     21.898  22.226
> 200     23.105  25.493
> 300     25.274  24.251
> 400     25.139  25.611
> 500     25.738  25.454
> 600     26.656  26.82
> 700     27.577  27.102
> 800     27.608  27.546
> 900     27.284  28.186
> 1000    29      28.153
>
> Or to summarise a bit, we could just look at the sum of all the results
> per fillfactor:
>
> sum ff75  2592.79
> sum ff90  2608.42  100.6%
>
> fillfactor 75 did come out slightly faster, but only just. It seems close
> enough that it might be better just to keep the 90 to save a little memory
> and improve caching elsewhere. Also, from below, you can see that for 300,
> 500, 600, 700, 1000 tables tests, the hash tables ended up the same size,
> yet there's a bit of variability in the timing result. So the 0.6% gain
> might just be noise. I don't think it's worth making the fillfactor 75.

To be clear: I didn't change SH_FILLFACTOR; it was equal to 0.9. I just was not able to catch a table with a fill factor of more than 0.78.
Looks like you've got it with 900 partitions :-)

> Another line of thought for making it go faster would be to do something
> like get rid of the hash status field from SMgrEntry. That could be either
> coded into a single bit we'd borrow from the hash value, or it could be
> coded into the least significant bit of the data field. A pointer to
> palloc'd memory should always be MAXALIGNed, which means at least the
> lower two bits are always zero. We'd just need to make sure and do
> something like "data & ~((uintptr_t) 3)" to trim off the hash status bits
> before dereferencing the pointer. That would make the SMgrEntry 12 bytes
> on a 64-bit machine. However, it would also mean that some entries would
> span 2 cache lines, which might affect performance a bit.

Then the data pointer itself will be unaligned to 8 bytes. While x86 is mostly indifferent to this, I doubt this memory economy will pay off.

regards,
Yura Sokolov.
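As an aside for readers, the bit-stealing trick being discussed is easy to demonstrate in isolation. The sketch below only illustrates packing a 2-bit status into the low bits of a suitably aligned pointer; it is not the actual SMgrEntry layout, and the names are invented:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define STATUS_MASK ((uintptr_t) 3)

typedef struct TaggedEntry
{
    uint32_t    hash;
    uintptr_t   data;       /* pointer with a 2-bit status in the low bits */
} TaggedEntry;

static void
entry_set(TaggedEntry *e, void *ptr, unsigned status)
{
    /* Relies on the allocator returning memory aligned to at least 4 bytes. */
    assert(((uintptr_t) ptr & STATUS_MASK) == 0);
    e->data = (uintptr_t) ptr | ((uintptr_t) status & STATUS_MASK);
}

static void *
entry_ptr(const TaggedEntry *e)
{
    return (void *) (e->data & ~STATUS_MASK);   /* trim off the status bits */
}

static unsigned
entry_status(const TaggedEntry *e)
{
    return (unsigned) (e->data & STATUS_MASK);
}

int
main(void)
{
    TaggedEntry e;
    int        *payload = malloc(sizeof(int));

    *payload = 42;
    entry_set(&e, payload, 1);
    printf("status = %u, value = %d\n", entry_status(&e), *(int *) entry_ptr(&e));
    free(payload);
    return 0;
}

Whether saving those few bytes is worth the alignment and cache-line tradeoffs raised above is exactly the open question in this subthread.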
Patch to allow pg_filedump to support reading of pg_filenode.map
Hello Hackers, I recently had to work on a case where some catalog files were corrupt and/or missing. One of the things we sought to inspect was pg_filenode.map, but there was no tooling available to do so. With the help of Álvaro H.. I've put together a patch to allow pg_filedump to do some rudimentary decoding of pg_filenode.map, so that it's at least human-readable. I had the idea to try to map the OIDs to relnames, but that might get hairy, with TOAST table OIDs possibly changing (for pg_proc, etc.) It seems that Christoph Berg is the primary committer for the pg_filedump project these days so he is cc'd here, but if there's some other place I should be sending this kind of contribution to, please let me know. Hopefully this will be helpful to others in the future. Much appreciated, --Richard add_filenode_support.patch Description: Binary data
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 20:24:43 -0400, Tom Lane wrote: > Andres Freund writes: > > Oh! I was about to ask how much shared buffers your primary / standby > > have. > Default configurations, so 128MB each. I thought that possibly initdb would detect less or something... I assume this is 32bit? I did notice that a 32bit test took a lot longer than a 64bit test. But didn't investigate so far. > And I think I may actually have reproduce a variant of the issue! Unfortunately I had not set up things in a way that the primary retains the WAL, making it harder to compare whether it's the WAL that got corrupted or whether it's a decoding bug. I can however say that pg_waldump on the standby's pg_wal does also fail. The failure as part of the backend is "invalid memory alloc request size", whereas in pg_waldump I get the much more helpful: pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF00 at 4/7F5C3200 In frontend code that allocation actually succeeds, because there is no size check. But in backend code we run into the size check, and thus don't even display a useful error. In 13 the header is validated before allocating space for the record(except if header is spread across pages) - it seems inadvisable to turn that around? Greetings, Andres Freund
RE: [bug?] Missed parallel safety checks, and wrong parallel safety
> >>> I'm curious. The FmgrBuiltin struct includes the "strict" flag, so
> >>> that would "lock down the value" of the strict flag, wouldn't it?
>
> >> It does, but that's much more directly a property of the function's C
> >> code than parallel-safety is.
>
> > I'm not sure I agree with that, but I think having the "strict" flag
> > in FmgrBuiltin isn't that nice either.
>
> Yeah, if we could readily do without it, we probably would. But the function
> call mechanism itself is responsible for implementing strictness, so it *has* to
> have that flag available.

So, if we do not want to lock down the parallel safety of built-in functions, it seems we can try to fetch proparallel from pg_proc for built-in functions in fmgr_info_cxt_security too. To avoid a recursive safety check when fetching proparallel from pg_proc, we can add a global variable that marks whether we are in a recursive state, and skip the safety check while in that state. With this approach, parallel safety is not locked down, and there are no new members in FmgrBuiltin. Attaching a patch for this approach [0001-approach-1]. Thoughts?

I also attached a patch for another approach [0001-approach-2], which adds parallel safety to FmgrBuiltin, because that approach seems faster and we can combine some bool members into a bitflag to avoid enlarging the FmgrBuiltin array, though it will lock down the parallel safety of built-in functions.

Best regards,
houzj

0002-fix-testcase-with-wrong-parallel-safety-flag.patch Description: 0002-fix-testcase-with-wrong-parallel-safety-flag.patch
0001-approach-1-check-parallel-safety-in-fmgr_info_cxt_se.patch Description: 0001-approach-1-check-parallel-safety-in-fmgr_info_cxt_se.patch
0001-approach-2-check-parallel-safety-in-fmgr_info_cxt_se.patch Description: 0001-approach-2-check-parallel-safety-in-fmgr_info_cxt_se.patch
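To illustrate the shape of approach 1, here is a small standalone sketch of the recursion-guard idea: a global flag that makes the catalog-based parallel-safety lookup fall back to a default while such a lookup is already in progress. All names here are invented for illustration and are not taken from the attached patches; real code would also need to reset the flag on error (e.g. with PG_TRY/PG_FINALLY):

#include <stdbool.h>
#include <stdio.h>

/* Set while we are already inside a catalog-based safety lookup. */
static bool in_parallel_safety_lookup = false;

/*
 * Stand-in for looking up pg_proc.proparallel; in the real system this can
 * itself invoke functions and so could re-enter the check below.
 */
static char
fetch_proparallel_from_catalog(unsigned funcid)
{
    (void) funcid;
    return 's';                 /* pretend the function is parallel safe */
}

static char
get_builtin_parallel_safety(unsigned funcid)
{
    char        result;

    if (in_parallel_safety_lookup)
        return 's';             /* recursing: skip the check, assume safe */

    in_parallel_safety_lookup = true;
    result = fetch_proparallel_from_catalog(funcid);
    in_parallel_safety_lookup = false;

    return result;
}

int
main(void)
{
    printf("proparallel = %c\n", get_builtin_parallel_safety(42));
    return 0;
}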
Re: [HACKERS] logical decoding of two-phase transactions
Modified pgbench's "tpcb-like" builtin query as below to do two-phase commits and then ran a 4 cascade replication setup. "BEGIN;\n" "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n" "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n" "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" "PREPARE TRANSACTION ':aid:';\n" "COMMIT PREPARED ':aid:';\n" The tests ran fine and all 4 cascaded servers replicated the changes correctly. All the subscriptions were configured with two_phase enabled. regards, Ajin Cherian Fujitsu Australia
Re: WIP: WAL prefetch (another approach)
Hi, On 2021-04-28 17:59:22 -0700, Andres Freund wrote: > I can however say that pg_waldump on the standby's pg_wal does also > fail. The failure as part of the backend is "invalid memory alloc > request size", whereas in pg_waldump I get the much more helpful: > pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect > prev-link 416200FF/FF00 at 4/7F5C3200 There's definitely something broken around continuation records, in XLogFindNextRecord(). Which means that it's not the cause for the server side issue, but obviously still not good. The conversion of XLogFindNextRecord() to be state machine based basically only works in a narrow set of circumstances. Whenever the end of the first record read is on a different page than the start of the record, we'll endlessly loop. We'll go into XLogFindNextRecord(), and return until we've successfully read the page header. Then we'll enter the second loop. Which will try to read until the end of the first record. But after returning the first loop will again ask for page header. Even if that's fixed, the second loop alone has the same problem: As XLogBeginRead() is called unconditionally we'll start read the start of the record, discover that it needs data on a second page, return, and do the same thing again. I think it needs something roughly like the attached. Greetings, Andres Freund diff --git i/src/include/access/xlogreader.h w/src/include/access/xlogreader.h index 3b8af31a8fe..82a80cf2bf5 100644 --- i/src/include/access/xlogreader.h +++ w/src/include/access/xlogreader.h @@ -297,6 +297,7 @@ struct XLogFindNextRecordState XLogReaderState *reader_state; XLogRecPtr targetRecPtr; XLogRecPtr currRecPtr; + bool found_start; }; /* Report that data is available for decoding. */ diff --git i/src/backend/access/transam/xlogreader.c w/src/backend/access/transam/xlogreader.c index 4277e92d7c9..935c841347f 100644 --- i/src/backend/access/transam/xlogreader.c +++ w/src/backend/access/transam/xlogreader.c @@ -868,7 +868,7 @@ XLogDecodeOneRecord(XLogReaderState *state, bool allow_oversized) /* validate record header if not yet */ if (!state->record_verified && record_len >= SizeOfXLogRecord) { -if (!ValidXLogRecordHeader(state, state->DecodeRecPtr, + if (!ValidXLogRecordHeader(state, state->DecodeRecPtr, state->PrevRecPtr, prec)) goto err; @@ -1516,6 +1516,7 @@ InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr) state->reader_state = reader_state; state->targetRecPtr = start_ptr; state->currRecPtr = start_ptr; + state->found_start = false; return state; } @@ -1545,7 +1546,7 @@ XLogFindNextRecord(XLogFindNextRecordState *state) * skip over potential continuation data, keeping in mind that it may span * multiple pages */ - while (true) + while (!state->found_start) { XLogRecPtr targetPagePtr; int targetRecOff; @@ -1616,7 +1617,12 @@ XLogFindNextRecord(XLogFindNextRecordState *state) * because either we're at the first record after the beginning of a page * or we just jumped over the remaining data of a continuation. */ - XLogBeginRead(state->reader_state, state->currRecPtr); + if (!state->found_start) + { + XLogBeginRead(state->reader_state, state->currRecPtr); + state->found_start = true; + } + while ((result = XLogReadRecord(state->reader_state, &record, &errormsg)) != XLREAD_FAIL) {
Re: [PATCH] We install pg_regress and isolationtester but not pg_isolation_regress
On Wed, Apr 28, 2021 at 12:44:45PM +0300, Aleksander Alekseev wrote: > I just wanted to let you know that TimescaleDB uses > pg_isolation_regress and occasionally there are reports from some > suffering/puzzled users/developers, e.g. [1]. Not 100% sure if it > makes investing the time into backpatching worth it. However if > someone could do it, it would be nice. FWIW, I am not really sure that this is important enough to justify a back-patch, even it is true that there have been cases in the past where extra binaries got added in minor releases. -- Michael signature.asc Description: PGP signature
[BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication
Hi,

I met an assertion failure at the publisher in lazy_scan_heap() when running logical replication in synchronous mode. Could someone please take a look at it?

Here's what I did to produce the problem. First, build PostgreSQL with './configure --enable-cassert'. Then I created multiple publications at the publisher and multiple subscriptions at the subscriber. Then I set synchronous_standby_names and reloaded, putting them in synchronous commit mode. After that, an assertion failed at the publisher when I ran COMMIT and ROLLBACK in concurrent transactions:

>TRAP: FailedAssertion("!all_visible_according_to_vm ||
>prunestate.all_visible", File: "vacuumlazy.c", Line: 1347, PID: 1274675)

BTW, in asynchronous mode the same problem can also happen, but at a low frequency. (I tried many times, but the problem happened only 2 times.)

As for synchronous mode, I found it seems easier to reproduce the problem with "autovacuum_naptime = 1" set, but it still can't be reproduced 100% of the time. (I tested it 5 times; 3 of them reproduced it.)

The script and the log are attached. It took about 6 minutes to run it (without hitting the problem) on my machine, and it could take less than that if the server crashes.

Regards,
Tang

<> pub.log Description: pub.log
Re: [PATCH] We install pg_regress and isolationtester but not pg_isolation_regress
Michael Paquier writes: > On Wed, Apr 28, 2021 at 12:44:45PM +0300, Aleksander Alekseev wrote: >> I just wanted to let you know that TimescaleDB uses >> pg_isolation_regress and occasionally there are reports from some >> suffering/puzzled users/developers, e.g. [1]. Not 100% sure if it >> makes investing the time into backpatching worth it. However if >> someone could do it, it would be nice. > FWIW, I am not really sure that this is important enough to justify a > back-patch, even it is true that there have been cases in the past > where extra binaries got added in minor releases. Yeah, I think adding a binary in a minor release is a Big Deal to packagers. I doubt that the case here justifies that. regards, tom lane
Re: [BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication
On Thu, Apr 29, 2021 at 02:34:21AM +, tanghy.f...@fujitsu.com wrote: > I met an assertion failure at the publisher in lazy_scan_heap() when > synchronous running logical replication. Could someone please take a > look at it? This assertion is new as of 7136bf34. Peter? -- Michael signature.asc Description: PGP signature
Re: Error in libpq docs for target_session_attrs, prefer-standby mode
On Wed, Apr 28, 2021 at 01:32:05PM +0200, Laurenz Albe wrote: > You are right, and the patch is good. Thanks, fixed. -- Michael signature.asc Description: PGP signature
Re: Replication slot stats misgivings
On Thu, Apr 29, 2021 at 4:58 AM Masahiko Sawada wrote:
>
> On Thu, Apr 29, 2021 at 5:41 AM Tom Lane wrote:
> >
> > It seems that the test case added by f5fc2f5b2 is still a bit
> > unstable, even after c64dcc7fe:
>
> Hmm, I don't see the exact cause yet, but there are two possibilities:
> some transactions were really spilled,
>

This is the first test and it inserts just one small record, so how can it lead to a spill of data? Do you mean that maybe some background process has written some transaction which leads to a spill of data?

> or it showed the old stats due
> to losing the drop (and create) slot messages.
>

Yeah, something like this could happen. Another possibility here could be that before the stats collector has processed the drop and create messages, we have enquired about the stats, which leads to it giving us the old stats. Note that we don't wait for the 'drop' or 'create' message to be delivered, so there is a possibility of that. What do you think?

> For the former case, it
> seems better to create the slot just before the insertion and
> to set logical_decoding_work_mem to the default (64MB). For the
> latter case, maybe we can use a different slot name than the one used
> in other tests?
>

How about doing both of the above suggestions? Alternatively, we can wait for both the 'drop' and 'create' messages to be delivered, but that might be overkill.

--
With Regards,
Amit Kapila.
Re: WIP: WAL prefetch (another approach)
Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:
> I can however say that pg_waldump on the standby's pg_wal does also
> fail. The failure as part of the backend is "invalid memory alloc
> request size", whereas in pg_waldump I get the much more helpful:
> pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect
> prev-link 416200FF/FF00 at 4/7F5C3200
>
> In frontend code that allocation actually succeeds, because there is no
> size check. But in backend code we run into the size check, and thus
> don't even display a useful error.
>
> In 13 the header is validated before allocating space for the
> record (except if the header is spread across pages); it seems inadvisable
> to turn that around?

I was now able to reproduce the problem again, and I'm afraid that the bug I hit is likely separate from Tom's. The allocation thing above is the issue in my case: The walsender connection ended (I restarted the primary), thus the startup process switches to replaying locally. For some reason the end of the WAL contains non-zero data (I think it's because walreceiver doesn't zero out pages; that's bad!). Because the allocation happens before the header is validated, we reproducibly end up in the mcxt.c ERROR path, failing recovery.

To me it looks like a smaller version of the problem is present in < 14, albeit only when the page header is at a record boundary. In that case we don't validate the page header immediately, only once it's completely read. But we do believe the total size, and try to allocate that. There's a really crufty escape hatch (from 70b4f82a4b) to that:

/*
 * Note that in much unlucky circumstances, the random data read from a
 * recycled segment can cause this routine to be called with a size
 * causing a hard failure at allocation. For a standby, this would cause
 * the instance to stop suddenly with a hard failure, preventing it to
 * retry fetching WAL from one of its sources which could allow it to move
 * on with replay without a manual restart. If the data comes from a past
 * recycled segment and is still valid, then the allocation may succeed
 * but record checks are going to fail so this would be short-lived. If
 * the allocation fails because of a memory shortage, then this is not a
 * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
 */
if (!AllocSizeIsValid(newSize))
    return false;

but it looks to me like that's pretty much the wrong fix, at least in the case where we've not yet validated the rest of the header. We don't need to allocate all that data before we've read the rest of the *fixed-size* header.

It also seems to me that 70b4f82a4b should also have changed walsender to pad out the received data to an 8KB boundary?

Greetings,

Andres Freund
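The ordering principle being argued for here, sanity-checking the small fixed-size header before trusting its length field enough to allocate, can be shown with a generic standalone analogue. This is not xlogreader.c code; the record format and names below are made up purely to illustrate the idea:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_MAGIC    0xD07EC0DEu
#define MAX_RECORD_LEN  (1024 * 1024)

typedef struct RecordHeader
{
    uint32_t    magic;
    uint32_t    total_len;      /* includes the header itself */
} RecordHeader;

/* Returns a malloc'd record, or NULL on EOF/garbage, without ever letting
 * an unvalidated length drive the allocation size. */
static unsigned char *
read_record(FILE *fp)
{
    RecordHeader hdr;
    unsigned char *buf;

    if (fread(&hdr, sizeof(hdr), 1, fp) != 1)
        return NULL;

    /* Validate the fixed-size header first; junk from recycled space must
     * not turn into a huge (or "invalid") allocation request. */
    if (hdr.magic != RECORD_MAGIC ||
        hdr.total_len < sizeof(hdr) ||
        hdr.total_len > MAX_RECORD_LEN)
        return NULL;

    buf = malloc(hdr.total_len);
    if (buf == NULL)
        return NULL;

    memcpy(buf, &hdr, sizeof(hdr));
    if (hdr.total_len > sizeof(hdr) &&
        fread(buf + sizeof(hdr), hdr.total_len - sizeof(hdr), 1, fp) != 1)
    {
        free(buf);
        return NULL;
    }
    return buf;
}

int
main(void)
{
    FILE       *fp = tmpfile();
    RecordHeader hdr = {RECORD_MAGIC, sizeof(RecordHeader) + 5};
    unsigned char *rec;

    if (fp == NULL)
        return 1;
    fwrite(&hdr, sizeof(hdr), 1, fp);
    fwrite("hello", 5, 1, fp);
    rewind(fp);

    rec = read_record(fp);
    printf("record read: %s\n", rec ? "ok" : "rejected");
    free(rec);
    fclose(fp);
    return 0;
}

In the WAL reader the analogue of the magic/length check is the header validation, which is why doing the allocation first and then relying on the AllocSizeIsValid() escape hatch feels backwards.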
Re: Replication slot stats misgivings
Amit Kapila writes: > This is the first test and inserts just one small record, so how it > can lead to spill of data. Do you mean to say that may be some > background process has written some transaction which leads to a spill > of data? autovacuum, say? > Yeah, something like this could happen. Another possibility here could > be that before the stats collector has processed drop and create > messages, we have enquired about the stats which lead to it giving us > the old stats. Note, that we don't wait for 'drop' or 'create' message > to be delivered. So, there is a possibility of the same. What do you > think? You should take a close look at the stats test in the main regression tests. We had to jump through *high* hoops to get that to be stable, and yet it still fails semi-regularly. This looks like pretty much the same thing, and so I'm pessimistically inclined to guess that it will never be entirely stable. (At least not before the fabled stats collector rewrite, which may well introduce some entirely new set of failure modes.) Do we really need this test in this form? Perhaps it could be converted to a TAP test that's a bit more forgiving. regards, tom lane
Re: WIP: WAL prefetch (another approach)
Andres Freund writes: > I was now able to reproduce the problem again, and I'm afraid that the > bug I hit is likely separate from Tom's. Yeah, I think so --- the symptoms seem quite distinct. My score so far today on the G4 is: 12 error-free regression test cycles on b3ee4c503 (plus one more with shared_buffers set to 16MB, on the strength of your previous hunch --- didn't fail for me though) HEAD failed on the second run with the same symptom as before: 2021-04-28 22:57:17.048 EDT [50479] FATAL: inconsistent page found, rel 1663/58183/69545, forknum 0, blkno 696 2021-04-28 22:57:17.048 EDT [50479] CONTEXT: WAL redo at 4/B72D408 for Heap/INSERT: off 77 flags 0x00; blkref #0: rel 1663/58183/69545, blk 696 FPW This seems to me to be pretty strong evidence that I'm seeing *something* real. I'm currently trying to isolate a specific commit to pin it on. A straight "git bisect" isn't going to work because so many people had broken so many different things right around that date :-(, so it may take awhile to get a good answer. regards, tom lane
Re: Replication slot stats misgivings
On 2021-04-28 23:20:00 -0400, Tom Lane wrote: > (At least not before the fabled stats collector rewrite, which may well > introduce some entirely new set of failure modes.) FWIW, I added a function that forces a flush there. That can be done synchronously and the underlying functionality needs to exist anyway to deal with backend exit. Makes it a *lot* easier to write tests for stats related things...
Re: Small issues with CREATE TABLE COMPRESSION
On Tue, Apr 27, 2021 at 03:22:25PM +0900, Michael Paquier wrote: > Hi all, > > I have been looking at and testing the patch set for CREATE TABLE > COMPRESSION, and spotted a couple of things in parallel of some work > done by Jacob (added in CC). > > The behavior around CREATE TABLE AS and matviews is a bit confusing, > and not documented. First, at the grammar level, it is not possible > to specify which compression option is used per column when creating > the relation. So, all the relation columns would just set a column's > compression to be default_toast_compression for all the toastable > columns of the relation. That's not enforceable at column level when > the relation is created, except with a follow-up ALTER TABLE. That's > similar to STORAGE when it comes to matviews, but these are at least > documented. > > And so, ALTER MATERIALIZED VIEW supports SET COMPRESSION but this is > not mentioned in its docs: > https://www.postgresql.org/docs/devel/sql-altermaterializedview.html > > psql could have tab completion support for that. Actually ALTER matview ALTER col has no tab completion at all, right ? > Now, we don't really document any of that, and the per-column > compression value would be set to default_toast_compression while the > stored values may use a mix of the compression methods, depending on > where the toasted values come from. If this behavior is intended, this > makes me wonder in what the possibility to set the compression for a > materialized view column is useful for except for a logical > dump/restore? As of HEAD we'd just insert the toasted value from the > origin as-is so the compression of the column has no effect at all. That may be true if the mat view is trivial, but not true if it has expressions. The mat view column may be built on multiple table columns, or be of a different type than the columns it's built on top of, so the relationship may not be so direct. > Another thing here is the inconsistency that this brings with pg_dump. > For example, as the dumped values are decompressed, we could have > values compressed with pglz at the origin, with a column using lz4 > within its definition that would make everything compressed with lz4 > once the values are restored. This choice may be fine, but I think > that it would be good to document all that. That would be less > surprising to the user. Can you suggest what or where we'd say it? I think this is just the behavior that pg_dump shouldn't lose the user's compression setting. The setting itself is for "future" data, and the only way to guarantee what compression types are in use are by vacuum full/cluster or pg_dump restore. > Similarly, we may want to document that COMPRESSION does not trigger a > table rewrite, but that it is effective only for the new toast values > inserted if a tuple is rebuilt and rewritten? Good point. I started with this. diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 39927be41e..8cceea41d0 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -391,7 +391,21 @@ WITH ( MODULUS numeric_literal, REM - This sets the compression method for a column. The supported compression + This sets the compression method to be used for data inserted into a column. + + This does not cause the table to be rewritten, so existing data may still + be compressed with other compression methods. 
If the table is rewritten with + VACUUM FULL or CLUSTER, or restored + with pg_restore, then all tuples are rewritten + with the configured compression methods. + + Also, note that when data is inserted from another relation (for example, + by INSERT ... SELECT), tuples from the source data are + not necessarily detoasted, and any previously compressed data is retained + with its existing compression method, rather than recompressing with the + compression methods of the target columns. + + The supported compression methods are pglz and lz4. lz4 is available only if --with-lz4 was used when building PostgreSQL.
Re: Use simplehash.h instead of dynahash in SMgr
On Thu, 29 Apr 2021 at 12:30, Yura Sokolov wrote: > > David Rowley писал 2021-04-29 02:51: > > Another line of thought for making it go faster would be to do > > something like get rid of the hash status field from SMgrEntry. That > > could be either coded into a single bit we'd borrow from the hash > > value, or it could be coded into the least significant bit of the data > > field. A pointer to palloc'd memory should always be MAXALIGNed, > > which means at least the lower two bits are always zero. We'd just > > need to make sure and do something like "data & ~((uintptr_t) 3)" to > > trim off the hash status bits before dereferencing the pointer. That > > would make the SMgrEntry 12 bytes on a 64-bit machine. However, it > > would also mean that some entries would span 2 cache lines, which > > might affect performance a bit. > > Then data pointer will be itself unaligned to 8 bytes. While x86 is > mostly indifferent to this, doubtfully this memory economy will pay > off. Actually, I didn't think very hard about that. The struct would still be 16 bytes and just have padding so the data pointer was aligned to 8 bytes (assuming a 64-bit machine). David
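To illustrate the idea David describes (borrowing the low bits of a MAXALIGNed pointer for a status flag), here is a small standalone sketch. The names are made up and this is not the proposed SMgr patch; it only shows the general technique, under the assumption that the stored pointers are at least 4-byte aligned.

#include <assert.h>
#include <stdint.h>

/* The low two bits of a suitably aligned pointer are always zero. */
#define ENTRY_STATUS_MASK ((uintptr_t) 3)

typedef struct DemoEntry
{
    uint32_t    hash;   /* hash value */
    uintptr_t   data;   /* pointer with status encoded in the low bits */
} DemoEntry;

static void
demo_entry_set(DemoEntry *entry, void *ptr, unsigned status)
{
    assert(((uintptr_t) ptr & ENTRY_STATUS_MASK) == 0); /* must be aligned */
    assert(status <= ENTRY_STATUS_MASK);
    entry->data = (uintptr_t) ptr | status;
}

static void *
demo_entry_pointer(const DemoEntry *entry)
{
    /* mask off the status bits before dereferencing */
    return (void *) (entry->data & ~ENTRY_STATUS_MASK);
}

static unsigned
demo_entry_status(const DemoEntry *entry)
{
    return (unsigned) (entry->data & ENTRY_STATUS_MASK);
}

Note that, as David says above, a struct like this still pads out to 16 bytes on a 64-bit machine: a uint32 hash next to an 8-byte-aligned pointer leaves 4 bytes of padding either way.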
Re: [BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication
On Wed, Apr 28, 2021 at 7:34 PM tanghy.f...@fujitsu.com wrote: > >TRAP: FailedAssertion("!all_visible_according_to_vm || > >prunestate.all_visible", File: "vacuumlazy.c", Line: 1347, PID: 1274675) > > BTW, in asynchronous mode, the same problem can also happen but in a low > frequency.(I tried many times, but the problem happened only 2 times) > As for synchronous mode, I found it seems easier to reproduce the problem > with setting "autovacuum_naptime = 1". > But it still can't be 100% to reproduced it. (I tested it 5 times, 3 of them > reproduced it.) Is setting all_visible_according_to_vm false as below enough to avoid the assertion failure? diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index c3fc12d76c..76c17e063e 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -1146,6 +1146,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) { ReleaseBuffer(vmbuffer); vmbuffer = InvalidBuffer; +all_visible_according_to_vm = false; } /* Remove the collected garbage tuples from table and indexes */ -- Peter Geoghegan
Re: [PATCH] force_parallel_mode and GUC categories
On Fri, Apr 09, 2021 at 10:50:53AM +0900, Michael Paquier wrote: > - {"track_commit_timestamp", PGC_POSTMASTER, REPLICATION, > + {"track_commit_timestamp", PGC_POSTMASTER, REPLICATION_SENDING, > I can get behind this change for clarity where it gets actively used. I'm not sure what you meant? ...but, I realized just now that *zero* other GUCs use "REPLICATION". And the documentation puts it in 20.6.1. Sending Servers, so it still seems to me that this is correct to move this, too. https://www.postgresql.org/docs/devel/runtime-config-replication.html Then, I wonder if REPLICATION should be removed from guc_tables.h... -- Justin
Re: Replication slot stats misgivings
On Thu, Apr 29, 2021 at 8:50 AM Tom Lane wrote: > > Amit Kapila writes: > > This is the first test and inserts just one small record, so how it > > can lead to spill of data. Do you mean to say that may be some > > background process has written some transaction which leads to a spill > > of data? > > autovacuum, say? > > > Yeah, something like this could happen. Another possibility here could > > be that before the stats collector has processed drop and create > > messages, we have enquired about the stats which lead to it giving us > > the old stats. Note, that we don't wait for 'drop' or 'create' message > > to be delivered. So, there is a possibility of the same. What do you > > think? > > You should take a close look at the stats test in the main regression > tests. We had to jump through *high* hoops to get that to be stable, > and yet it still fails semi-regularly. This looks like pretty much the > same thing, and so I'm pessimistically inclined to guess that it will > never be entirely stable. > True, it is possible that we can't make it entirely stable but I would like to try some more before giving up on this. Otherwise, I guess the other possibility is to remove some of the latest tests added or probably change them to be more forgiving. For example, we can change the currently failing test to not check 'spill*' count and rely on just 'total*' count which will work even in scenarios we discussed for this failure but it will reduce the efficiency/completeness of the test case. > (At least not before the fabled stats collector rewrite, which may well > introduce some entirely new set of failure modes.) > > Do we really need this test in this form? Perhaps it could be converted > to a TAP test that's a bit more forgiving. > We have a TAP test for slot stats but there we are checking some scenarios across the restart. We can surely move these tests also there but it is not apparent to me how it can create a difference? -- With Regards, Amit Kapila.
Re: WIP: WAL prefetch (another approach)
On Thu, Apr 29, 2021 at 3:14 PM Andres Freund wrote: > To me it looks like a smaller version of the problem is present in < 14, > albeit only when the page header is at a record boundary. In that case > we don't validate the page header immediately, only once it's completely > read. But we do believe the total size, and try to allocate > that. > > There's a really crufty escape hatch (from 70b4f82a4b) to that: Right, I made that problem worse, and that could probably be changed to be no worse than 13 by reordering those operations. PS Sorry for my intermittent/slow responses on this thread this week, as I'm mostly away from the keyboard due to personal commitments. I'll be back in the saddle next week to tidy this up, most likely by reverting. The main thought I've been having about this whole area is that, aside from the lack of general testing of recovery, which we should definitely address[1], what it really needs is a decent test harness to drive it through all interesting scenarios and states at a lower level, independently. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com
Re: Patch to allow pg_filedump to support reading of pg_filenode.map
This is separate from the postgresql server repo. https://git.postgresql.org/gitweb/?p=pg_filedump.git +#define RELMAPPER_FILEMAGIC 0x592717 +char magic_buffer[8]; ... + if ( (int) magic_buffer & RELMAPPER_FILEMAGIC ) { This is doing bitwise arithmetic on a pointer, which seems badly wrong. I think it breaks normal use of pg_filedump - unless you happen to get a magic_buffer without those bits set. The segfault seems to confirm that, as does gcc: pg_filedump.c:2041:8: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] 2041 | if ( (int) magic_buffer & RELMAPPER_FILEMAGIC ) { I think it probably means to do memcmp, instead ?? -- Justin
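For illustration, one way the check could be written so that it compares the bytes the pointer refers to rather than the pointer value itself (a sketch of the intent, not the submitted patch; it assumes the file begins with an int32 magic field written in the local byte order):

#include <stdint.h>
#include <string.h>

#define RELMAPPER_FILEMAGIC 0x592717

/* magic_buffer holds the first bytes read from pg_filenode.map */
static int
looks_like_relmap_file(const char *magic_buffer, size_t len)
{
    int32_t     magic;

    if (len < sizeof(magic))
        return 0;
    memcpy(&magic, magic_buffer, sizeof(magic));
    return magic == RELMAPPER_FILEMAGIC;
}

A memcmp() against the expected bytes, as suggested above, amounts to the same thing; either way the cast-and-AND on the pointer goes away.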
Re: pg_hba.conf.sample wording improvement
On 28.04.21 16:20, Tom Lane wrote: Another thought is to switch the phrase order: +# - "local" is a Unix-domain socket +# - "host" is a TCP/IP socket (encrypted or not) +# - "hostssl" is a TCP/IP socket that is SSL-encrypted +# - "hostnossl" is a TCP/IP socket that is not SSL-encrypted +# - "hostgssenc" is a TCP/IP socket that is GSSAPI-encrypted +# - "hostnogssenc" is a TCP/IP socket that is not GSSAPI-encrypted I'm not wedded to that idea, but it seems to help reduce random variations between the wordings of these lines. done that way
Re: Replication slot stats misgivings
On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada wrote: > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila wrote: > > > > > I am not sure if any of these alternatives are a good idea. What do > > you think? Do you have any other ideas for this? > > I've been considering some ideas but don't come up with a good one > yet. It’s just an idea and not tested but how about having > CreateDecodingContext() register before_shmem_exit() callback with the > decoding context to ensure that we send slot stats even on > interruption. And FreeDecodingContext() cancels the callback. > Is it a good idea to send stats while exiting and rely on the same? I think before_shmem_exit is mostly used for the cleanup purpose so not sure if we can rely on it for this purpose. I think we can't be sure that in all cases we will send all stats, so maybe Vignesh's patch is sufficient to cover the cases where we avoid losing it in cases where we would have sent a large amount of data. -- With Regards, Amit Kapila.
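For reference, the register/cancel pattern being discussed (quoted above) would presumably look roughly like this. The callback and wrapper names are assumptions, not an actual patch, and Amit's reservation about relying on exit callbacks for non-cleanup work still applies.

#include "postgres.h"
#include "storage/ipc.h"

/* Hypothetical callback: flush any still-buffered slot stats before exit. */
static void
flush_slot_stats_at_exit(int code, Datum arg)
{
    /* send whatever slot statistics this backend has not reported yet */
}

static void
demo_create_decoding_context(void)
{
    /* ... existing context setup ... */
    before_shmem_exit(flush_slot_stats_at_exit, (Datum) 0);
}

static void
demo_free_decoding_context(void)
{
    /* normal teardown already sent the stats, so drop the callback */
    cancel_before_shmem_exit(flush_slot_stats_at_exit, (Datum) 0);
    /* ... existing context teardown ... */
}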
Re: pg_hba.conf.sample wording improvement
On 28.04.21 16:09, Alvaro Herrera wrote: Looking at it now, I wonder how well do the "hostno" options work. If I say "hostnogssenc", is an SSL-encrypted socket good? If I say "hostnossl", is a GSS-encrypted socket good? If so, how does that make sense? I think for example if you want to enforce SSL connections, then writing "hostnossl ... reject" would be sensible. That would also reject GSS-encrypted connections, but that would be what you want in that scenario.
Re: Unresolved repliaction hang and stop problem.
On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera wrote: > > On 2021-Apr-28, Amit Kapila wrote: > > > On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera > > wrote: > > > > Hmm ... On what does it depend (other than plain git conflicts, which > > > are aplenty)? On a quick look to the commit, it's clear that we need to > > > be careful in order not to cause an ABI break, but that doesn't seem > > > impossible to solve, but I'm wondering if there is more to it than that. > > > > As mentioned in the commit message, we need another commit [1] change > > to make this work. > > > > [1] - > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0 > > Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL > message type, but we have added such on a minor release before.) > Yeah, we can try to make it possible if it is really a pressing issue but I guess even in that case it is better to do it after we release PG14 so that it can get some more testing. > ... It's strange that replication worked for them on pg10 though and > broke on 13. What did we change anything to make it so? > No idea but probably if the other person can share the exact test case which he sees working fine on PG10 but not on PG13 then it might be a bit easier to investigate. -- With Regards, Amit Kapila.
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Wed, Apr 28, 2021 at 5:36 PM osumi.takami...@fujitsu.com wrote: > > On Monday, April 26, 2021 2:05 PM Amit Kapila wrote: > > On Fri, Apr 23, 2021 at 8:03 PM osumi.takami...@fujitsu.com > > wrote: > > I think we are allowed to decode the operations on user catalog tables > > because we are using RelationIsLogicallyLogged() instead of > > RelationIsAccessibleInLogicalDecoding() in ReorderBufferProcessTXN(). > > Based on this discussion, I think we should not be allowing decoding of > > operations on user catalog tables, so we should use > > RelationIsAccessibleInLogicalDecoding to skip such ops in > > ReorderBufferProcessTXN(). Am, I missing something? > > > > Can you please clarify? > I don't understand that point, either. > > I read the context where the user_catalog_table was introduced - [1]. > There, I couldn't find any discussion if we should skip decode operations > on that kind of tables or not. Accordingly, we just did not conclude it, I > suppose. > > What surprised me a bit is to decode operations of system catalog table are > considered like [2] > somehow at the time. I cannot find any concrete description of such use cases > in the thread, though. > > Anyway, I felt disallowing decoding of operations on user catalog tables > doesn't spoil the feature's purpose. So, I'm OK to do so. What do you think ? > I am not so sure about it because I think we don't have any example of user_catalog_tables in the core code. This is the reason I was kind of looking towards Andres to clarify this. Right now, if the user performs TRUNCATE on user_catalog_table in synchronous mode then it will hang in case the decoding plugin takes even share lock on it. The main reason is that we allow decoding of TRUNCATE operation for user_catalog_tables. I think even if we want to allow decoding of other operations on user_catalog_table, the decoding of TRUNCATE should be prohibited but maybe we shouldn't allow decoding of any operation on such tables as we don't do it for system catalog tables. -- With Regards, Amit Kapila.
Re: Replication slot stats misgivings
On Thu, Apr 29, 2021 at 11:55 AM Amit Kapila wrote: > > On Thu, Apr 29, 2021 at 4:58 AM Masahiko Sawada wrote: > > > > On Thu, Apr 29, 2021 at 5:41 AM Tom Lane wrote: > > > > > > It seems that the test case added by f5fc2f5b2 is still a bit > > > unstable, even after c64dcc7fe: > > > > Hmm, I don't see the exact cause yet but there are two possibilities: > > some transactions were really spilled, > > > > This is the first test and inserts just one small record, so how it > can lead to spill of data. Do you mean to say that may be some > background process has written some transaction which leads to a spill > of data? Not sure but I thought that the logical decoding started to decode from a relatively old point for some reason and decoded incomplete transactions that weren’t shown in the result. > > > and it showed the old stats due > > to losing the drop (and create) slot messages. > > > > Yeah, something like this could happen. Another possibility here could > be that before the stats collector has processed drop and create > messages, we have enquired about the stats which lead to it giving us > the old stats. Note, that we don't wait for 'drop' or 'create' message > to be delivered. So, there is a possibility of the same. What do you > think? Yeah, that could happen even if no message was dropped. > > > For the former case, it > > seems to better to create the slot just before the insertion and > > setting logical_decoding_work_mem to the default (64MB). For the > > latter case, maybe we can use a different name slot than the name used > > in other tests? > > > > How about doing both of the above suggestions? Alternatively, we can > wait for both 'drop' and 'create' message to be delivered but that > might be overkill. Agreed. Attached the patch doing both things. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ fix_stats_test.patch Description: Binary data
RE: [BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication
On Thursday, April 29, 2021 1:22 PM, Peter Geoghegan wrote >Is setting all_visible_according_to_vm false as below enough to avoid >the assertion failure? > >diff --git a/src/backend/access/heap/vacuumlazy.c >b/src/backend/access/heap/vacuumlazy.c >index c3fc12d76c..76c17e063e 100644 >--- a/src/backend/access/heap/vacuumlazy.c >+++ b/src/backend/access/heap/vacuumlazy.c >@@ -1146,6 +1146,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams >*params, bool aggressive) > { > ReleaseBuffer(vmbuffer); > vmbuffer = InvalidBuffer; >+all_visible_according_to_vm = false; > } > > /* Remove the collected garbage tuples from table and indexes */ Thanks for your reply. I tried your patch, but the problem does not seem to be fixed. Regards, Tang
Re: [BUG] "FailedAssertion" reported when streaming in logical replication
On Wed, Apr 28, 2021 at 1:02 PM Dilip Kumar wrote: > > On Wed, Apr 28, 2021 at 12:25 PM tanghy.f...@fujitsu.com > wrote: > > > > > I have modified the patch based on the above comments. > > > > Thanks for your patch. > > I tested again after applying your patch and the problem is fixed. > > Thanks for confirming. I tried to think about how to write a test case for this scenario, but I think it will not be possible to generate an automated test case for this. Basically, we need 2 concurrent transactions, and out of those, we need one transaction which has processed only one change, i.e. XLOG_HEAP2_NEW_CID, and another transaction which should overflow the logical decoding work mem, so that we select the wrong transaction which doesn't have the base snapshot. But how to control that the transaction which is performing the DDL just writes the XLOG_HEAP2_NEW_CID WAL, and that before it writes any other WAL we get the WAL from the other transaction which overflows the buffer? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Replication slot stats misgivings
On Thu, Apr 29, 2021 at 11:14 AM Masahiko Sawada wrote: > > > > > How about doing both of the above suggestions? Alternatively, we can > > wait for both 'drop' and 'create' message to be delivered but that > > might be overkill. > > Agreed. Attached the patch doing both things. > Thanks, the patch LGTM. I'll wait for a day before committing to see if anyone has better ideas. -- With Regards, Amit Kapila.
RE: [BUG] "FailedAssertion" reported when streaming in logical replication
On Thursday, April 29, 2021 3:18 PM, Dilip Kumar wrote >I tried to think about how to write a test case for this scenario, but >I think it will not be possible to generate an automated test case for this. >Basically, we need 2 concurrent transactions, and out of those, >we need one transaction which has processed only one change, i.e. >XLOG_HEAP2_NEW_CID, and another transaction which should overflow the logical >decoding work mem, so that we select the wrong transaction which >doesn't have the base snapshot. But how to control that the >transaction which is performing the DDL just writes the >XLOG_HEAP2_NEW_CID WAL, and that before it writes any other WAL we >get the WAL from the other transaction which overflows the buffer? Thanks for the update. Actually, I tried to write an automated test for the problem too, but made no progress on it. Agreed with your opinion that it will "not be possible to generate an automated test case for this". If anyone figures out a good way to automate a test for this case, please be kind to share it with us. Thanks. Regards, Tang