Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Fri, Jul 23, 2021 at 6:55 PM Robert Haas wrote:
> On Wed, Jul 21, 2021 at 11:55 PM Amit Kapila wrote:
> > I see here we have a mix of opinions from various people. Dilip seems
> > to be favoring the approach where we provide some option to the user
> > for partitioned tables and automatic behavior for non-partitioned
> > tables, but he also seems to have mild concerns about this behavior.
> > OTOH, Greg and Hou-San seem to favor an approach where we can provide
> > an option to the user for both partitioned and non-partitioned tables.
> > I am also in favor of providing an option to the user for the sake of
> > consistency in behavior, and of not trying to introduce a special kind
> > of invalidation, as it doesn't serve the purpose for partitioned tables.
> > Robert seems to be in favor of automatic behavior, but it is not very
> > clear to me whether he is fine with handling partitioned and
> > non-partitioned relations differently. Robert, can you please share your
> > opinion on what you think is the best way to move forward here?
>
> I thought we had agreed on handling partitioned and unpartitioned
> tables differently, but maybe I misunderstood the discussion.

For the sake of consistency, how about allowing users to specify a parallel-safety option for both partitioned and non-partitioned relations, but computing it automatically for non-partitioned relations when the user doesn't specify it? If the user has specified a parallel-safety option for a non-partitioned relation, we would use that instead of computing the value ourselves.

Another reason for hesitating to do this automatically for non-partitioned relations was the new invalidation, which would invalidate the cached parallel-safety of all relations in the relcache for a particular database. As mentioned by Hou-San [1], it seems we would need to do this whenever any function's parallel-safety is changed. OTOH, changing a function's parallel-safety probably happens rarely enough not to matter in practice, which is why I think you seem to be fine with this idea. So, on that premise, I think it is okay to go ahead with different handling for partitioned and non-partitioned relations here.

[1] - https://www.postgresql.org/message-id/OS0PR01MB5716EC1D07ACCA24373C2557941B9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
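To make the proposal concrete, here is a purely illustrative sketch; the thread has not settled on syntax, and "parallel_dml_safety" is a hypothetical option name, not an existing one:

-- Partitioned table: the user must declare parallel safety explicitly,
-- since computing it would require walking every partition.
ALTER TABLE orders_parted SET (parallel_dml_safety = 'safe');

-- Non-partitioned table: if nothing is declared, safety would be derived
-- automatically from triggers, index expressions, constraints, and column
-- defaults; an explicit setting would override that computation.
ALTER TABLE orders_plain SET (parallel_dml_safety = 'restricted');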
Re: Logical replication error "no record found" /* shouldn't happen */
Hello. I have seen this error multiple times while trying to replicate a 2-3 TB server (version 11 to version 12). I was unable to find any explanation for this error. Thanks, Michail.
Re: Incorrect usage of strtol, atoi for non-numeric junk inputs
On Thu, Jul 22, 2021 at 02:32:35PM +0900, Michael Paquier wrote:
> Okay, done those parts as per the attached. While on it, I noticed an
> extra one for pg_dump --rows-per-insert. I am counting 25 translated
> strings cut in total.
>
> Any objections to this first step?

I have looked at that over the last couple of days and applied it after some small fixes, including an indentation fix. The int64 and float parts are extra types we could handle, but I have not looked yet at how much benefit we'd get in those cases.
--
Michael
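For context, this is the kind of strict integer parsing being discussed - a generic sketch, not the helper actually added to the tree, whose name and signature may differ. Unlike atoi(), it rejects empty strings, trailing junk such as "123abc", and out-of-range values:

#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

static bool
parse_int_strict(const char *str, int min, int max, int *result)
{
	char	   *endptr;
	long		val;

	errno = 0;
	val = strtol(str, &endptr, 10);
	if (errno != 0 || endptr == str || *endptr != '\0')
		return false;			/* empty string, overflow, or trailing junk */
	if (val < min || val > max)
		return false;			/* outside the caller's accepted range */
	*result = (int) val;
	return true;
}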
Re: Avoiding data loss with synchronous replication
> On 23 July 2021, at 22:54, Bossart, Nathan wrote:
>
> On 7/23/21, 4:33 AM, "Andrey Borodin" wrote:
>> Thanks for your interest in the topic. I think in the thread [0] we almost
>> agreed on the general design.
>> The only remaining question is whether we want to treat pg_ctl stop and
>> kill SIGTERM differently from pg_terminate_backend().
>
> I didn't get the idea that there was a tremendous amount of support
> for the approach to block canceling waits for synchronous replication.
> FWIW this was my initial approach as well, but I've been trying to
> think of alternatives.
>
> If we can gather support for some variation of the block-cancels
> approach, I think that would be preferred over my proposal from a
> complexity standpoint.

Let's clearly enumerate the problems of blocking. It's been mentioned that the backend is not responsive when cancelation is blocked. On the contrary, it's very responsive.

postgres=# alter system set synchronous_standby_names to 'bogus';
ALTER SYSTEM
postgres=# alter system set synchronous_commit_cancelation TO off ;
ALTER SYSTEM
postgres=# select pg_reload_conf();
2021-07-24 15:35:03.054 +05 [10452] LOG: received SIGHUP, reloading configuration files
 l
---
 t
(1 row)

postgres=# begin;
BEGIN
postgres=*# insert into t1 values(0);
INSERT 0 1
postgres=*# commit ;
^CCancel request sent
WARNING: canceling wait for synchronous replication requested, but cancelation is not allowed
DETAIL: The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.
^CCancel request sent
WARNING: canceling wait for synchronous replication requested, but cancelation is not allowed
DETAIL: The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.

It clearly tells what's wrong. If that's still not enough, let's add a hint about synchronous_standby_names. Are there any other problems with blocking cancels?

> Robert's idea to provide a way to understand
> the intent of the cancellation/termination request [0] could improve
> matters. Perhaps adding an argument to pg_cancel/terminate_backend()
> and using different signals to indicate that we want to cancel the
> wait would be something that folks could get on board with.

The semantics of cancelation assume correct query interruption. That is no longer possible once we have committed locally; there cannot be any correct cancelation, and I don't think it is worth adding an incorrect one.

Interestingly, converting the transaction to 2PC is a neat idea when the backend is terminated. It provides a stronger guarantee that the transaction will commit correctly even after a restart. But we may be short of max_prepared_xacts slots...

Anyway, backend termination bothers me a lot less than cancelation - drivers do not terminate queries on their own, but they do cancel queries by default.

Thanks!

Best regards, Andrey Borodin.
Maintain the pathkesy for subquery from outer side information
While working on the UniqueKey stuff, I found the following cases.

SELECT * FROM (SELECT * FROM t offset 0) v ORDER BY a;
// root->query_keys = A. root->order_pathkeys = A
// Current: subroot->query_pathkeys = NIL.
// Expected: subroot->_pathkeys = [A].

SELECT * FROM (SELECT * FROM t offset 0) v, t2 WHERE t2.a = t.a;
// root->query_keys = NIL
// Current: subroot->query_keys = NIL
// Expected: subroot->xxx_pathkeys = A

To resolve this issue, I want to add a root->outer_pathkeys, which holds the interesting order from the outer side for a subquery. To cover cases like the one below

// root->query_keys = root->order_keys = b.
// Expected: subroot->xxx_pathkeys = (a)? (b)?
SELECT * FROM (SELECT * FROM t offset 0) v, t2 WHERE t2.a = t.a order by v.b;

the root->outer_pathkeys should be a list of lists; in the above case subroot->outer_pathkeys should be [ [a], [b] ]. This value may need to be checked in many places, like pathkeys_useful_for_ordering, get_useful_pathkeys_for_relation, build_index_paths and more. My list might be incomplete, but once we have a new place to check and the data is already maintained, it would be easy to improve. My thinking is to maintain root->outer_pathkeys first, and then improve the well-known functions as a first step. What do you think?

--
Best Regards
Andy Fan (https://www.aliyun.com/)
Re: visibility map corruption
On Fri, Jul 23, 2021 at 09:01:18PM -0400, Bruce Momjian wrote: > On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote: > > > I could perhaps see corruption happening if pg_control's oldest xid > > > value was closer to the current xid value than it should be, but I can't > > > see how having it 2-billion away could cause harm, unless perhaps > > > pg_upgrade itself used enough xids to cause the counter to wrap more > > > than 2^31 away from the oldest xid recorded in pg_control. > > > > > > What I am basically asking is how to document this and what it fixes. > > > > ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe > > take a look at those? > > Agreed. I just wanted to make sure I wasn't missing an important aspect > of this patch. Thanks. Another question --- with the previous code, the oldest xid was always set to a reasonable value, -2 billion less than the current xid. With the new code, the oldest xid might be slightly higher than the current xid if they use -x but not -u. Is that acceptable? I think we agreed it was. pg_upgrade will always set both. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
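To make the -x/-u interaction concrete, here is an illustrative invocation; the XID values are made up:

# Setting only the next XID: with the new code, the recorded oldest XID is
# computed rather than derived by subtracting 2 billion, and can end up
# slightly higher than the next XID given here.
pg_resetwal -x 5000000 $PGDATA

# Setting both keeps the two values consistent, which is what pg_upgrade
# always does.
pg_resetwal -x 5000000 -u 3000000 $PGDATA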
Re: Maintain the pathkesy for subquery from outer side information
Andy Fan writes: > When I am working on the UnqiueKey stuff, I find the following cases. > SELECT * FROM (SELECT * FROM t offset 0) v ORDER BY a; > // root->query_keys = A. root->order_pathkeys = A > // Current: subroot->query_pathkeys = NIL. > // Expected: subroot->_pathkeys = [A]. Why do you "expect" that? I think pushing the outer ORDER BY past a LIMIT is an unacceptable semantics change. regards, tom lane
Re: Maintain the pathkesy for subquery from outer side information
On Sat, Jul 24, 2021 at 10:14 PM Tom Lane wrote:
>
> Andy Fan writes:
> > When I am working on the UniqueKey stuff, I find the following cases.
> > SELECT * FROM (SELECT * FROM t offset 0) v ORDER BY a;
> > // root->query_keys = A. root->order_pathkeys = A
> > // Current: subroot->query_pathkeys = NIL.
> > // Expected: subroot->_pathkeys = [A].
>
> Why do you "expect" that? I think pushing the outer ORDER BY past a
> LIMIT is an unacceptable semantics change.
>
> regards, tom lane

I don't mean pushing down an "ORDER BY" clause to the subquery; I mean pushing down an "interesting order" to the subquery. For example, suppose we have an index on t(a); then for

SELECT * FROM (SELECT a FROM t OFFSET 0) v ORDER BY a;

in the current implementation, when we are planning the subquery, the planner thinks the pathkey on "a" is not interesting, but it should be, IIUC.

--
Best Regards
Andy Fan (https://www.aliyun.com/)
Re: Maintain the pathkesy for subquery from outer side information
Andy Fan writes: > On Sat, Jul 24, 2021 at 10:14 PM Tom Lane wrote: >> Why do you "expect" that? I think pushing the outer ORDER BY past a >> LIMIT is an unacceptable semantics change. > I don't mean push down a "ORDER BY" clause to subquery, I mean push > down an "interesting order" to subquery. for example we have index t(a); > then SELECT * FROM (SELECT a FROM t OFFSET 0) v ORDER BY a; > In the current implementation, when we are planning the subuqery, planners > think the "pathkey a" is not interesting, but it should be IIUC. No, it should not be. (1) We have long treated "OFFSET 0" as an optimization fence. That means that the outer query shouldn't affect what plan you get for the subquery. (2) If you ignore point (1), you could argue that choosing a different result order doesn't matter for this subquery. However, it potentially *does* matter for a large fraction of the cases in which we'd not have flattened the subquery into the outer query. In subqueries involving things like volatile functions, aggregates, window functions, etc, encouraging the sub-planner to pick a different result ordering could lead to visibly different output. I think that in cases where there's not a semantic hazard involved, we'd usually have pulled up the subquery so that this is all moot anyway. regards, tom lane
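A small illustration of point (2), assuming a table t(a int) - this is not from the thread, just a sketch of the hazard described above. The subquery below cannot be pulled up because of the window function, so the order in which its rows are produced is user-visible through row_number(); letting the outer ORDER BY steer the sub-plan's ordering could silently change which row gets which number:

SELECT v.a, v.rn
FROM (SELECT a, row_number() OVER () AS rn
      FROM t OFFSET 0) v
ORDER BY v.a;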
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
On Fri, Jul 23, 2021 at 4:40 AM Bossart, Nathan wrote:
> I would suggest changing "attach from a debugger" to "attaching with a
> debugger."

Thanks. IMO, the following looks better:

Waiting on connection startup before authentication to allow attaching a debugger to the process.
Waiting on connection startup after authentication to allow attaching a debugger to the process.

> IIUC you want to use the same set of flags as PostAuthDelay for
> PreAuthDelay, but the stated reason in this comment for leaving out
> WL_LATCH_SET suggests otherwise. It's not clear to me why the latch
> possibly pointing to a shared latch in the future is an issue. Should
> this instead say that we leave out WL_LATCH_SET for consistency with
> PostAuthDelay?

If WL_LATCH_SET is used for PostAuthDelay, the waiting doesn't happen, because MyLatch (which is a shared latch at that point) would already have been set by SwitchToSharedLatch. More details at [1].

If WL_LATCH_SET is used for PreAuthDelay, there's actually no problem, because MyLatch is not yet initialized properly in BackendInitialize when waiting for PreAuthDelay: it still points to the local latch, and only later gets pointed to the shared latch and set by SwitchToSharedLatch. But relying on MyLatch there seems to me somewhat like relying on an uninitialized variable. More details at [1].

With the PreAuthDelay comment, I wanted to say that MyLatch is not the latch we would actually want to wait on. Since there is no problem in using it there, I changed the comment to the following:

/*
 * Let's not use WL_LATCH_SET for PreAuthDelay to be consistent with
 * PostAuthDelay.
 */

[1] - https://www.postgresql.org/message-id/flat/CALj2ACVF8AZi1bK8oH-Qoz3tYVpqFuzxcDRPdF-3p5BvF6GTxA%40mail.gmail.com

Regards,
Bharath Rupireddy.

v2-0001-Use-a-WaitLatch-for-pre-post-_auth_delay.patch Description: Binary data
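For readers following along, this is roughly the shape of the change under discussion - a minimal sketch, not the attached patch. WAIT_EVENT_POST_AUTH_DELAY is an assumed wait-event name, and WL_LATCH_SET is deliberately omitted, so the latch argument can be NULL:

/* Sketch only: replaces the existing pg_usleep(PostAuthDelay * 1000000L). */
if (PostAuthDelay > 0)
    (void) WaitLatch(NULL,
                     WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     PostAuthDelay * 1000L,        /* GUC is in seconds */
                     WAIT_EVENT_POST_AUTH_DELAY);  /* assumed name */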
Re: Configure with thread sanitizer fails the thread test
I wrote: > After a bit more thought, HEAD+v14 is also my preference. I'm not > eager to change this in stable branches, but it doesn't seem too > late for v14. Perusing the commit log, I realized that there's another reason why v14 is a good cutoff: thread_test.c was in a different place with an allegedly different raison d'etre before 8a2121185. So done that way. regards, tom lane
Re: CREATE SEQUENCE with RESTART option
On Sat, Jul 24, 2021 at 3:20 AM Cary Huang wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: tested, passed > Documentation:tested, passed > > Hi > > I have applied and run your patch, which works fine in my environment. > Regarding your comments in the patch: Thanks for the review. > /* > * Restarting a sequence while defining it doesn't make any sense > * and it may override the START value. Allowing both START and > * RESTART option for CREATE SEQUENCE may cause confusion to user. > * Hence, we throw error for CREATE SEQUENCE if RESTART option is > * specified. However, it can be used with ALTER SEQUENCE. > */ > > I would remove the first sentence, because it seems like a personal opinion > to me. I am sure someone, somewhere may think it makes total sense :). Agree. > I would rephrase like this: > > /* > * Allowing both START and RESTART option for CREATE SEQUENCE > * could override the START value and cause confusion to user. Hence, > * we throw an error for CREATE SEQUENCE if RESTART option is > * specified; it can only be used with ALTER SEQUENCE. > */ > > just a thought. LGTM. PSA v2 patch. Regards, Bharath Rupireddy. From e726e40d4969f56eb4246a7016e3e2780aa2fbf1 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Sat, 24 Jul 2021 16:23:42 + Subject: [PATCH v2] Disallow RESTART option for CREATE SEQUENCE Allowing both START and RESTART option for CREATE SEQUENCE could override the START value and cause confusion to user. Hence, we throw an error for CREATE SEQUENCE if RESTART option is specified; it can only be used with ALTER SEQUENCE. If users want their sequences to be starting from a value different than START value, then they can specify RESTART value with ALTER SEQUENCE. --- src/backend/commands/sequence.c| 12 src/test/regress/expected/sequence.out | 4 src/test/regress/sql/sequence.sql | 1 + 3 files changed, 17 insertions(+) diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c index 72bfdc07a4..8c50f396e6 100644 --- a/src/backend/commands/sequence.c +++ b/src/backend/commands/sequence.c @@ -1281,6 +1281,18 @@ init_params(ParseState *pstate, List *options, bool for_identity, } else if (strcmp(defel->defname, "restart") == 0) { + /* + * Allowing both START and RESTART option for CREATE SEQUENCE + * could override the START value and cause confusion to user. + * Hence, we throw an error for CREATE SEQUENCE if RESTART option + * is specified; it can only be used with ALTER SEQUENCE. 
+ */ + if (isInit) +ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("RESTART option is not allowed while creating sequence"), + parser_errposition(pstate, defel->location))); + if (restart_value) errorConflictingDefElem(defel, pstate); restart_value = defel; diff --git a/src/test/regress/expected/sequence.out b/src/test/regress/expected/sequence.out index 71c2b0f1df..f60725a9e5 100644 --- a/src/test/regress/expected/sequence.out +++ b/src/test/regress/expected/sequence.out @@ -16,6 +16,10 @@ CREATE SEQUENCE sequence_testx INCREMENT BY 1 START -10; ERROR: START value (-10) cannot be less than MINVALUE (1) CREATE SEQUENCE sequence_testx CACHE 0; ERROR: CACHE (0) must be greater than zero +CREATE SEQUENCE sequence_testx RESTART 5; +ERROR: RESTART option is not allowed while creating sequence +LINE 1: CREATE SEQUENCE sequence_testx RESTART 5; + ^ -- OWNED BY errors CREATE SEQUENCE sequence_testx OWNED BY nobody; -- nonsense word ERROR: invalid OWNED BY option diff --git a/src/test/regress/sql/sequence.sql b/src/test/regress/sql/sequence.sql index 7928ee23ee..367157fe2b 100644 --- a/src/test/regress/sql/sequence.sql +++ b/src/test/regress/sql/sequence.sql @@ -10,6 +10,7 @@ CREATE SEQUENCE sequence_testx INCREMENT BY 1 MAXVALUE -20; CREATE SEQUENCE sequence_testx INCREMENT BY -1 START 10; CREATE SEQUENCE sequence_testx INCREMENT BY 1 START -10; CREATE SEQUENCE sequence_testx CACHE 0; +CREATE SEQUENCE sequence_testx RESTART 5; -- OWNED BY errors CREATE SEQUENCE sequence_testx OWNED BY nobody; -- nonsense word -- 2.25.1
Re: DROP INDEX CONCURRENTLY on partitioned index
On Wed, Jul 14, 2021 at 09:18:12PM +0530, vignesh C wrote: > The patch does not apply on Head anymore, could you rebase and post a > patch. I'm changing the status to "Waiting for Author". I'm withdrawing this patch at least until the corresponding patch for CIC is progressed. https://commitfest.postgresql.org/33/2815/ -- Justin
log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
Hi,

I've been repeatedly confused by the number of WAL files supposedly added. Even when hundreds of new WAL files are created, the relevant portion of log_checkpoints will only ever list zero or one added WAL file.

The reason for that is that CheckpointStats.ckpt_segs_added is only incremented in PreallocXlogFiles(), which has the following comment:

 * XXX this is currently extremely conservative, since it forces only one
 * future log segment to exist, and even that only if we are 75% done with
 * the current one. This is only appropriate for very low-WAL-volume systems.

Whereas in real workloads WAL files are almost exclusively created via XLogWrite()->XLogFileInit().

I think we should consider just removing that field. Or, even better, show something accurate instead.

As an example, here's the log output of a workload that has a replication slot preventing WAL files from being recycled (and too small max_wal_size):

2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB
2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB
2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB
2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB
2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB

So it's 3 new WAL files in that timeframe, one might think? A probe instrumenting xlog file creation shows something very different:

perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a XLogFileInitInternal:39

(39 is the O_CREAT BasicOpenFile(), not the recycle path).

perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
  1.001030943    9    probe_postgres:XLogFileInitInternal_L39
  2.002998896    8    probe_postgres:XLogFileInitInternal_L39
  3.005098857    8    probe_postgres:XLogFileInitInternal_L39
  4.007000426    6    probe_postgres:XLogFileInitInternal_L39
  5.008423119    9    probe_postgres:XLogFileInitInternal_L39
  6.013427568    8    probe_postgres:XLogFileInitInternal_L39
  7.015087613    8    probe_postgres:XLogFileInitInternal_L39
  8.017393339    8    probe_postgres:XLogFileInitInternal_L39
  9.018425526    7    probe_postgres:XLogFileInitInternal_L39
 10.020398520    5    probe_postgres:XLogFileInitInternal_L39

Greetings,

Andres Freund
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
Hi, On 2021-07-24 15:50:36 -0700, Andres Freund wrote: > As an example, here's the log output of a workload that has a replication slot > preventing WAL files from being recycled (and too small max_wal_size): > > 2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG: checkpoint > complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 > recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, > longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB > 2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG: checkpoint > complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 > recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, > longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB > 2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG: checkpoint > complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 > recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, > longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB > 2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG: checkpoint > complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 > recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, > longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB > 2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG: checkpoint > complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 > recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, > longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB > > So it's 3 new WAL files in that timeframe, one might think? > > A probe instrumenting xlog file creation shows something very different: > perf probe -x > /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a > XLogFileInitInternal:39 > (39 is the O_CREAT BasicOpenFile(), not the recycle path). > > perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000 > 1.001030943 9 > probe_postgres:XLogFileInitInternal_L39 > 2.002998896 8 > probe_postgres:XLogFileInitInternal_L39 > 3.005098857 8 > probe_postgres:XLogFileInitInternal_L39 > 4.007000426 6 > probe_postgres:XLogFileInitInternal_L39 > 5.008423119 9 > probe_postgres:XLogFileInitInternal_L39 > 6.013427568 8 > probe_postgres:XLogFileInitInternal_L39 > 7.015087613 8 > probe_postgres:XLogFileInitInternal_L39 > 8.017393339 8 > probe_postgres:XLogFileInitInternal_L39 > 9.018425526 7 > probe_postgres:XLogFileInitInternal_L39 > 10.020398520 5 > probe_postgres:XLogFileInitInternal_L39 And even more extreme, the logs can end up suggestiong pg_wal is shrinking, when the opposite is the case. 
Here's the checkpoint output from a parallel copy data load (without a replication slot holding things back): 2021-07-24 15:59:03.215 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22291 buffers (17.0%); 0 WAL file(s) added, 27 removed, 141 recycled; write=9.737 s, sync=4.153 s, total=14.884 s; sync files=108, longest=0.116 s, average=0.039 s; distance=2756904 kB, estimate=2756904 kB 2021-07-24 15:59:12.978 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 21840 buffers (16.7%); 0 WAL file(s) added, 53 removed, 149 recycled; write=5.531 s, sync=3.008 s, total=9.763 s; sync files=81, longest=0.201 s, average=0.037 s; distance=3313885 kB, estimate=3313885 kB 2021-07-24 15:59:23.421 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22326 buffers (17.0%); 0 WAL file(s) added, 56 removed, 149 recycled; write=5.787 s, sync=3.230 s, total=10.436 s; sync files=81, longest=0.099 s, average=0.040 s; distance=3346125 kB, estimate=3346125 kB 2021-07-24 15:59:34.424 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22155 buffers (16.9%); 0 WAL file(s) added, 60 removed, 148 recycled; write=6.096 s, sync=3.432 s, total=10.995 s; sync files=81, longest=0.101 s, average=0.043 s; distance=3409281 kB, estimate=3409281 kB It does look like WAL space usage is shrinking, but the opposite is true - we're creating so much WAL that the checkpointer can't checkpoint fast enough to keep WAL usage below max_wal_size. So WAL files are constantly created that then need to be removed (hence the non-zero removed counts). # time counts unit events 277.087990275 15 probe_postgres:XLogFileInitInternal_L39 278.098549960 15 probe_postgres:XLogFileInitInternal_L39 279.104787575 6 probe_postgres:XLogFileInitInternal_L39 280.108980690 5 probe_postgres:XL
Re: something is wonky with pgbench pipelining
Hi, Adding RMT. On 2021-07-21 16:55:08 -0700, Andres Freund wrote: > On 2021-07-20 14:57:15 -0400, Alvaro Herrera wrote: > > On 2021-Jul-20, Andres Freund wrote: > > > > > I think what's happening is that the first recvfrom() actually gets all 7 > > > connection results. The server doesn't have any queries to process at that > > > point. But we ask the kernel whether there is new network input over and > > > over > > > again, despite having results to process! > > > > Hmm, yeah, that seems a missed opportunity. > > > > with-isbusy: > > > ... > > > tps = 3990.424742 (without initial connection time) > > > ... > > > 1,013.71 msec task-clock#0.202 CPUs utilized > > > 80,203 raw_syscalls:sys_enter# 79.119 K/sec > > > 19,947 context-switches # 19.677 K/sec > > > 2,943,676,361 cycles:u #2.904 GHz > > >346,607,769 cycles:k #0.342 GHz > > > 8,464,188,379 instructions:u#2.88 insn per > > > cycle > > >226,665,530 instructions:k#0.65 insn per > > > cycle > > > > This is quite compelling. > > > > If you don't mind I can get this pushed soon in the next couple of days > > -- or do you want to do it yourself? > > I was thinking of pushing the attached, to both 14 and master, thinking > that was what you meant, but then I wasn't quite sure: It's a relatively > minor performance improvement, after all? OTOH, it arguably also just is > a bit of an API misuse... > > I'm inclined to push it to 14 and master, but ... RMT: ^ Greetings, Andres Freund
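For context, this is the general libpq pattern the change relies on - a generic sketch, not the pgbench patch itself: results already buffered from an earlier read are drained without touching the socket again, and the socket is only polled when PQisBusy() reports that a complete result is not yet available. wait_until_readable(), report_error(), and handle_result() are hypothetical helpers.

#include <libpq-fe.h>

static void
drain_results(PGconn *conn)
{
    for (;;)
    {
        PGresult   *res;

        /* Only go back to the kernel when libpq really needs more input. */
        while (PQisBusy(conn))
        {
            wait_until_readable(PQsocket(conn));    /* hypothetical helper */
            if (!PQconsumeInput(conn))
            {
                report_error(conn);                 /* hypothetical helper */
                return;
            }
        }

        res = PQgetResult(conn);
        if (res == NULL)
            break;              /* no more results for this query */
        handle_result(res);     /* hypothetical helper */
        PQclear(res);
    }
}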
Re: Avoiding data loss with synchronous replication
Hi,

On 2021-07-22 21:17:56 +, Bossart, Nathan wrote:
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
> 1. Server restarts: As noted in the docs [2], "waiting transactions
>    will be marked fully committed once the primary database
>    recovers." I think there are a few options for handling this,
>    but the simplest would be to simply failover anytime the primary
>    server shut down. My proposal may offer other ways of helping
>    with this.
> 2. Backend crashes: If a backend crashes, the postmaster process
>    will restart everything, leading to the same problem described in
>    1. However, this behavior can be prevented with the
>    restart_after_crash parameter [3].
> 3. Client disconnections: During waits for synchronous replication,
>    interrupt processing is turned off, so disconnected clients
>    actually don't seem to cause a problem. The server will still
>    wait for synchronous replication to complete prior to making the
>    transaction visible on the primary.
> 4. Query cancellations and backend terminations: This appears to be
>    the only gap where there is no way to avoid potential data loss,
>    and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary. Once replication has caught up to the
> necessary point, the transactions can be marked completed, and
> they would finally become visible.

I think there are two aspects making this proposal problematic:

First, from the user-experience side of things, the issue is that this seems to propose violating read-your-own-writes within a single connection to a single node. Which imo is *far* worse than seeing writes that haven't yet been acknowledged as replicated after a query cancel.

Second, on the implementation side, I think this proposal practically amounts to internally converting plain transaction commits into 2PC prepare/commit, with all the associated overhead (two WAL entries/flushes per commit, needing a separate set of procarray entries to hold the resources for the prepared-but-not-committed transactions, potential for running out of the extra procarray slots). What if a user rapidly commits-cancels in a loop? You'll almost immediately run out of procarray slots to represent all those "not really committed" transactions.

I think there's benefit in optionally turning all transactions into 2PC ones, but I don't see it being fast enough to be the only option.

Greetings,

Andres Freund
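At the SQL level, the conversion described above amounts to the following, using the existing two-phase commit syntax (the GID is just an example):

BEGIN;
INSERT INTO t1 VALUES (0);
-- First WAL record + flush; holds the transaction's locks and consumes
-- one of the max_prepared_transactions slots until resolved.
PREPARE TRANSACTION 'sync_rep_wait_example';
-- ... only once the synchronous standby has confirmed the flush ...
COMMIT PREPARED 'sync_rep_wait_example';   -- second WAL record + flush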
Re: Avoiding data loss with synchronous replication
Hi,

On 2021-07-24 15:53:15 +0500, Andrey Borodin wrote:
> Are there any other problems with blocking cancels?

Unless you have command-line access to the server, it's not hard to get into a situation where you can't change the configuration setting, because all connections are hanging and you can't even log in to run ALTER SYSTEM etc. You can't kill applications to kill the connection, because they will just continue to hang.

Greetings,

Andres Freund
Re: Removing "long int"-related limit on hash table sizes
Hi,

On 2021-07-23 17:15:24 -0400, Tom Lane wrote:
> Per the discussion at [1], users on Windows are seeing nasty performance
> losses in v13/v14 (compared to prior releases) for hash aggregations that
> required somewhat more than 2GB in the prior releases.

Ugh :(.

> That's because they spill to disk where they did not before. The easy
> answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> thanks to the ancient decision to do a lot of memory-space-related
> calculations in "long int", which is only 32 bits on Win64.

We really ought to just remove every single use of long. As Thomas quipped on twitter at some point, "long is the asbestos of C". I think we've incurred far more cost due to weird workarounds to deal with the difference in long width between Windows and everything else than just removing all use of it outright would incur.

And perhaps once we've done that, we should experiment with putting __attribute__((deprecated)) on long, but conditionalize it so it's only used for building PG internal stuff, and doesn't leak into pg_config output. Perhaps it'll be too painful due to external headers, but it seems worth trying.

But obviously that doesn't help with the issue in the release branches.

> While I don't personally have the interest to fix that altogether,
> it does seem like we've got a performance regression that we ought
> to do something about immediately. So I took a look at getting rid of
> this restriction for calculations associated with hash_mem_multiplier,
> and it doesn't seem to be too bad. I propose the attached patch.
> (This is against HEAD; there are minor conflicts in v13 and v14.)

Hm. I wonder if we would avoid some overflow dangers on 32bit systems if we made get_hash_memory_limit() and the relevant variables 64 bit, rather than 32bit / size_t. E.g.

> @@ -700,9 +697,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
> 	inner_rel_bytes = ntuples * tupsize;
>
> 	/*
> -	 * Target in-memory hashtable size is hash_mem kilobytes.
> +	 * Compute in-memory hashtable size limit from GUCs.
> 	 */
> -	hash_table_bytes = hash_mem * 1024L;
> +	hash_table_bytes = get_hash_memory_limit();
>
> 	/*
> 	 * Parallel Hash tries to use the combined hash_mem of all workers to
> @@ -710,7 +707,14 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
> 	 * per worker and tries to process batches in parallel.
> 	 */
> 	if (try_combined_hash_mem)
> -		hash_table_bytes += hash_table_bytes * parallel_workers;
> +	{
> +		/* Careful, this could overflow size_t */
> +		double		newlimit;
> +
> +		newlimit = (double) hash_table_bytes * (double) (parallel_workers + 1);
> +		newlimit = Min(newlimit, (double) SIZE_MAX);
> +		hash_table_bytes = (size_t) newlimit;
> +	}

Wouldn't need to be as careful, I think?
> @@ -740,12 +747,26 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
> 	 *	size of skew bucket struct itself
> 	 *--
> 	 */
> -	*num_skew_mcvs = skew_table_bytes / (tupsize +
> -										 (8 * sizeof(HashSkewBucket *)) +
> -										 sizeof(int) +
> -										 SKEW_BUCKET_OVERHEAD);
> -	if (*num_skew_mcvs > 0)
> -		hash_table_bytes -= skew_table_bytes;
> +	bytes_per_mcv = tupsize +
> +		(8 * sizeof(HashSkewBucket *)) +
> +		sizeof(int) +
> +		SKEW_BUCKET_OVERHEAD;
> +	skew_mcvs = hash_table_bytes / bytes_per_mcv;
> +
> +	/*
> +	 * Now scale by SKEW_HASH_MEM_PERCENT (we do it in this order so as
> +	 * not to worry about size_t overflow in the multiplication)
> +	 */
> +	skew_mcvs = skew_mcvs * SKEW_HASH_MEM_PERCENT / 100;

I always have to think about the evaluation order of things like this (it's left to right for these), so I'd prefer parens around the multiplication. I did wonder briefly whether the SKEW_HASH_MEM_PERCENT / 100 just evaluates to 0...

Looks like a good idea to me.

Greetings,

Andres Freund
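To spell out the evaluation-order point, assuming SKEW_HASH_MEM_PERCENT is 2 as in the current source: C's * and / associate left to right, so the unparenthesized expression multiplies first; parenthesizing the division instead would truncate it to zero. A standalone sketch:

#include <stddef.h>
#include <stdio.h>

#define SKEW_HASH_MEM_PERCENT 2		/* assumed to match nodeHash.c */

int
main(void)
{
	size_t		skew_mcvs = 1000;

	/* Left to right: (1000 * 2) / 100 == 20 */
	printf("%zu\n", skew_mcvs * SKEW_HASH_MEM_PERCENT / 100);

	/* Parenthesizing the division: 1000 * (2 / 100) == 1000 * 0 == 0 */
	printf("%zu\n", skew_mcvs * (SKEW_HASH_MEM_PERCENT / 100));
	return 0;
}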
Re: Removing "long int"-related limit on hash table sizes
On Sat, Jul 24, 2021 at 06:25:53PM -0700, Andres Freund wrote:
> > That's because they spill to disk where they did not before. The easy
> > answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> > the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> > thanks to the ancient decision to do a lot of memory-space-related
> > calculations in "long int", which is only 32 bits on Win64.
>
> We really ought to just remove every single use of long. As Thomas
> quipped on twitter at some point, "long is the asbestos of C". I think
> we've incurred far more cost due to weird workarounds to deal with the
> difference in long width between windows and everything else, than just
> removing all use of it outright would incur.

+1

As I understand it, making long of undetermined length was meant to let someone choose a data type that _might_ be longer than int if the compiler/OS/CPU was optimized for that, but at this point such optimizations just don't seem to make sense, and we know every(?) CPU supports long-long, so why not go for something concrete? Do we really want our feature limits to be determined by whether we have an optimized type longer than int?

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
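The data-model difference behind this, as a standalone snippet (not from the thread): LP64 platforms such as 64-bit Linux and macOS make "long" 8 bytes, while LLP64 64-bit Windows keeps it at 4 bytes, so any size arithmetic done in "long" silently clamps near 2GB there; "long long" and int64_t are 8 bytes on both.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* Prints 8 / 8 / 8 on LP64; 4 / 8 / 8 on 64-bit Windows (LLP64). */
	printf("long: %zu, long long: %zu, int64_t: %zu\n",
		   sizeof(long), sizeof(long long), sizeof(int64_t));
	return 0;
}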
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 2021/07/25 7:50, Andres Freund wrote:
> Hi,
>
> I've been repeatedly confused by the number of WAL files supposedly added. Even when 100s of new WAL files are created, the relevant portion of log_checkpoints will only ever list zero or one added WAL file.
>
> The reason for that is that CheckpointStats.ckpt_segs_added is only incremented in PreallocXlogFiles(), which has the following comment:
>
>  * XXX this is currently extremely conservative, since it forces only one
>  * future log segment to exist, and even that only if we are 75% done with
>  * the current one. This is only appropriate for very low-WAL-volume systems.
>
> Whereas in real workloads WAL files are almost exclusively created via XLogWrite()->XLogFileInit().
>
> I think we should consider just removing that field. Or, even better, show something accurate instead.

+1 to show something accurate instead. It might also be worth showing them in a monitoring stats view like pg_stat_wal?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
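For reference, pg_stat_wal already exists in v14, so a cumulative count of newly created WAL segments would sit naturally alongside its existing counters; "wal_segments_created" here is a hypothetical column name, not one that exists today:

-- Existing counters, for reference; a hypothetical wal_segments_created
-- column could live in the same view.
SELECT wal_records, wal_fpi, wal_bytes, wal_buffers_full FROM pg_stat_wal;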
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Hi, On 2021-06-21 05:29:19 -0700, Andres Freund wrote: > On 2021-06-16 12:12:23 -0700, Andres Freund wrote: > > Could you share your testcase? I've been working on a series of patches > > to address this (I'll share in a bit), and I've run quite a few tests, > > and didn't hit any infinite loops. > > Sorry for not yet doing that. Unfortunately I have an ongoing family > health issue (& associated travel) claiming time and energy :(. > > I've pushed the minimal fix due to beta 2. > > Beyond beta 2 I am thinking of the below to unify the horizon > determination: > > static inline GlobalVisHorizonKind > GlobalVisHorizonKindForRel(Relation rel) I finally pushed this cleanup. Greetings, Andres Freund