Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-07-24 Thread Amit Kapila
On Fri, Jul 23, 2021 at 6:55 PM Robert Haas  wrote:
>
> On Wed, Jul 21, 2021 at 11:55 PM Amit Kapila  wrote:
> > I see here we have a mix of opinions from various people. Dilip seems
> > to be favoring the approach where we provide some option to the user
> > for partitioned tables and automatic behavior for non-partitioned
> > tables but he also seems to have mild concerns about this behavior.
> > OTOH, Greg and Hou-San seem to favor an approach where we can provide
> > an option to the user for both partitioned and non-partitioned tables.
> > I am also in favor of providing an option to the user for the sake of
> > consistency in behavior and not trying to introduce a special kind of
> > invalidation as it doesn't serve the purpose for partitioned tables.
> > Robert seems to be in favor of automatic behavior but it is not very
> > clear to me if he is fine with dealing differently for partitioned and
> > non-partitioned relations. Robert, can you please provide your opinion
> > on what do you think is the best way to move forward here?
>
> I thought we had agreed on handling partitioned and unpartitioned
> tables differently, but maybe I misunderstood the discussion.
>

For the sake of consistency, how about allowing users to specify a
parallel-safety option for both partitioned and non-partitioned
relations, but computing it automatically for non-partitioned relations
when the user didn't specify it? If the user has specified the
parallel-safety option for a non-partitioned relation, we would use
that instead of computing the value ourselves.

Another reason for hesitating to do this automatically for
non-partitioned relations was the new invalidation, which would
invalidate the cached parallel-safety for all relations in the relcache
for a particular database. As mentioned by Hou-San [1], it seems we
need to do this whenever any function's parallel-safety is changed.
OTOH, changing the parallel-safety of a function is probably rare
enough not to matter in practice, which I think is why you seem to be
fine with this idea. On that premise, I think it is okay to go ahead
with different handling for partitioned and non-partitioned relations
here.

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB5716EC1D07ACCA24373C2557941B9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: Logical replication error "no record found" /* shouldn't happen */

2021-07-24 Thread Michail Nikolaev
Hello.

I have seen this error multiple times while trying to replicate a 2-3 TB
server (version 11 to version 12). I was unable to find any explanation
for this error.

Thanks,
Michail.




Re: Incorrect usage of strtol, atoi for non-numeric junk inputs

2021-07-24 Thread Michael Paquier
On Thu, Jul 22, 2021 at 02:32:35PM +0900, Michael Paquier wrote:
> Okay, done those parts as per the attached.  While on it, I noticed an
> extra one for pg_dump --rows-per-insert.  I am counting 25 translated
> strings cut in total.
> 
> Any objections to this first step?

I have looked at that over the last couple of days, and applied it
after some small fixes, including an indentation fix.  The int64 and float
parts are extra types we could handle, but I have not yet looked at
how much benefit we'd get in those cases.
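
For context, the kind of strict parsing such a helper enforces looks roughly
like this (a generic standalone sketch, not necessarily the committed code;
parse_int_strict is a made-up name for illustration):

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse str as a base-10 int, rejecting empty strings, trailing junk, and
 * out-of-range values instead of silently accepting them the way atoi() does.
 */
static bool
parse_int_strict(const char *str, int *result)
{
	char	   *endptr;
	long		val;

	errno = 0;
	val = strtol(str, &endptr, 10);

	if (endptr == str || *endptr != '\0' ||		/* no digits, or trailing junk */
		errno == ERANGE ||						/* out of range for long */
		val < INT_MIN || val > INT_MAX)			/* out of range for int */
		return false;

	*result = (int) val;
	return true;
}

int
main(void)
{
	int			v;

	printf("\"123\" -> %s\n", parse_int_strict("123", &v) ? "ok" : "error");
	printf("\"12x\" -> %s\n", parse_int_strict("12x", &v) ? "ok" : "error");
	printf("\"\"    -> %s\n", parse_int_strict("", &v) ? "ok" : "error");
	return 0;
}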
--
Michael




Re: Avoiding data loss with synchronous replication

2021-07-24 Thread Andrey Borodin



> On 23 July 2021, at 22:54, Bossart, Nathan wrote:
> 
> On 7/23/21, 4:33 AM, "Andrey Borodin"  wrote:
>> Thanks for you interest in the topic. I think in the thread [0] we almost 
>> agreed on general design.
>> The only left question is that we want to threat pg_ctl stop and kill 
>> SIGTERM differently to pg_terminate_backend().
> 
> I didn't get the idea that there was a tremendous amount of support
> for the approach to block canceling waits for synchronous replication.
> FWIW this was my initial approach as well, but I've been trying to
> think of alternatives.
> 
> If we can gather support for some variation of the block-cancels
> approach, I think that would be preferred over my proposal from a
> complexity standpoint.  
Let's clearly enumerate the problems with blocking.
It's been mentioned that the backend is not responsive when cancelation is blocked.
But on the contrary, it's very responsive.

postgres=# alter system set synchronous_standby_names to 'bogus';
ALTER SYSTEM
postgres=# alter system set synchronous_commit_cancelation TO off ;
ALTER SYSTEM
postgres=# select pg_reload_conf();
2021-07-24 15:35:03.054 +05 [10452] LOG:  received SIGHUP, reloading 
configuration files
 pg_reload_conf 
----------------
 t
(1 row)
postgres=# begin;
BEGIN
postgres=*# insert into t1 values(0);
INSERT 0 1
postgres=*# commit ;
^CCancel request sent
WARNING:  canceling wait for synchronous replication requested, but cancelation 
is not allowed
DETAIL:  The COMMIT record has already flushed to WAL locally and might not 
have been replicated to the standby. We must wait here.
^CCancel request sent
WARNING:  canceling wait for synchronous replication requested, but cancelation 
is not allowed
DETAIL:  The COMMIT record has already flushed to WAL locally and might not 
have been replicated to the standby. We must wait here.

It tells clearly what's wrong. If that's still not enough, let's add a hint about 
synchronous standby names.

Are there any other problems with blocking cancels?


> Robert's idea to provide a way to understand
> the intent of the cancellation/termination request [0] could improve
> matters.  Perhaps adding an argument to pg_cancel/terminate_backend()
> and using different signals to indicate that we want to cancel the
> wait would be something that folks could get on board with.

The semantics of cancelation assume correct query interruption. That is already 
impossible once we have committed locally; there cannot be any correct 
cancelation. And I don't think it is worth adding incorrect cancelation.


Interestingly, converting the transaction to 2PC when the backend is terminated 
is a neat idea. It provides stronger guarantees that the transaction will commit 
correctly even after a restart. But we may run short of max_prepared_xacts slots...
Anyway, backend termination bothers me a lot less than cancelation - drivers do 
not terminate queries on their own, but they do cancel queries by default.


Thanks!

Best regards, Andrey Borodin.



Maintain the pathkesy for subquery from outer side information

2021-07-24 Thread Andy Fan
While working on the UniqueKey stuff, I found the following cases.

SELECT * FROM (SELECT * FROM t offset 0) v ORDER BY a;
// root->query_keys = A.  root->order_pathkeys = A
// Current: subroot->query_pathkeys = NIL.
// Expected:  subroot->_pathkeys = [A].

SELECT * FROM (SELECT * FROM t offset 0) v,  t2 WHERE t2.a = v.a;
// root->query_keys = NIL
// Current: subroot->query_keys = NIL
// Expected: subroot->xxx_pathkeys = A

To resolve this issue, I want to add a root->outer_pathkeys field, which
represents the interesting order coming from the outer side for a
subquery. To cover cases like the one below:

// root->query_keys = root->order_keys = b.
// Expected: subroot->xxx_pathkeys = (a)?  (b)?
SELECT * FROM (SELECT * FROM t offset 0) v,  t2
WHERE t2.a = v.a order by v.b;

root->outer_pathkeys should be a list of lists.  In the above case
subroot->outer_pathkeys should be [ [a], [b] ].  This value could be checked in
many places, like pathkeys_useful_for_ordering, get_useful_pathkeys_for_relation,
build_index_paths and more.  My list might be incomplete, but once the data is
maintained, adding a new place to check would be easy.  My thinking is that we
maintain root->outer_pathkeys first, and then improve the well-known functions
as the first step.  What do you think?


-- 
Best Regards
Andy Fan (https://www.aliyun.com/)




Re: visibility map corruption

2021-07-24 Thread Bruce Momjian
On Fri, Jul 23, 2021 at 09:01:18PM -0400, Bruce Momjian wrote:
> On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote:
> > > I could perhaps see corruption happening if pg_control's oldest xid
> > > value was closer to the current xid value than it should be, but I can't
> > > see how having it 2-billion away could cause harm, unless perhaps
> > > pg_upgrade itself used enough xids to cause the counter to wrap more
> > > than 2^31 away from the oldest xid recorded in pg_control.
> > >
> > > What I am basically asking is how to document this and what it fixes.
> > 
> > ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe
> > take a look at those?
> 
> Agreed.  I just wanted to make sure I wasn't missing an important aspect
> of this patch.  Thanks.

Another question --- with the previous code, the oldest xid was always
set to a reasonable value, 2 billion less than the current xid.  With
the new code, the oldest xid might be slightly higher than the current
xid if they use -x but not -u.  Is that acceptable?  I think we agreed it
was.  pg_upgrade will always set both.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Maintain the pathkesy for subquery from outer side information

2021-07-24 Thread Tom Lane
Andy Fan  writes:
> When I am working on the UnqiueKey stuff, I find the following cases.
> SELECT * FROM (SELECT * FROM t offset 0) v ORDER BY a;
> // root->query_keys = A.  root->order_pathkeys = A
> // Current: subroot->query_pathkeys = NIL.
> // Expected:  subroot->_pathkeys = [A].

Why do you "expect" that?  I think pushing the outer ORDER BY past a
LIMIT is an unacceptable semantics change.

regards, tom lane




Re: Maintain the pathkesy for subquery from outer side information

2021-07-24 Thread Andy Fan
On Sat, Jul 24, 2021 at 10:14 PM Tom Lane  wrote:
>
> Andy Fan  writes:
> > When I am working on the UnqiueKey stuff, I find the following cases.
> > SELECT * FROM (SELECT * FROM t offset 0) v ORDER BY a;
> > // root->query_keys = A.  root->order_pathkeys = A
> > // Current: subroot->query_pathkeys = NIL.
> > // Expected:  subroot->_pathkeys = [A].
>
> Why do you "expect" that?  I think pushing the outer ORDER BY past a
> LIMIT is an unacceptable semantics change.
>
> regards, tom lane

I don't mean pushing down an "ORDER BY" clause to the subquery; I mean pushing
down an "interesting order" to the subquery.  For example, suppose we have an
index on t(a); then for SELECT * FROM (SELECT a FROM t OFFSET 0) v ORDER BY a;
in the current implementation, when we are planning the subquery, the planner
thinks the "pathkey a" is not interesting, but it should be, IIUC.



-- 
Best Regards
Andy Fan (https://www.aliyun.com/)




Re: Maintain the pathkesy for subquery from outer side information

2021-07-24 Thread Tom Lane
Andy Fan  writes:
> On Sat, Jul 24, 2021 at 10:14 PM Tom Lane  wrote:
>> Why do you "expect" that?  I think pushing the outer ORDER BY past a
>> LIMIT is an unacceptable semantics change.

> I don't mean push down a "ORDER BY" clause to subquery,  I mean push
> down an "interesting order" to subquery.   for example we have index t(a);
> then SELECT * FROM (SELECT a FROM t OFFSET 0) v ORDER BY a;
> In the current implementation, when we are planning the subuqery, planners
> think the "pathkey a" is not interesting,  but it should be IIUC.

No, it should not be.

(1) We have long treated "OFFSET 0" as an optimization fence.  That means
that the outer query shouldn't affect what plan you get for the subquery.

(2) If you ignore point (1), you could argue that choosing a different
result order doesn't matter for this subquery.  However, it potentially
*does* matter for a large fraction of the cases in which we'd not have
flattened the subquery into the outer query.  In subqueries involving
things like volatile functions, aggregates, window functions, etc,
encouraging the sub-planner to pick a different result ordering could
lead to visibly different output.

I think that in cases where there's not a semantic hazard involved,
we'd usually have pulled up the subquery so that this is all moot
anyway.

regards, tom lane




Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep

2021-07-24 Thread Bharath Rupireddy
On Fri, Jul 23, 2021 at 4:40 AM Bossart, Nathan  wrote:
> I would suggest changing "attach from a debugger" to "attaching with a
> debugger."

Thanks. IMO, the following looks better:
Waiting on connection startup before authentication to allow
  attaching a debugger to the process.
Waiting on connection startup after authentication to allow
  attaching a debugger to the process.

> IIUC you want to use the same set of flags as PostAuthDelay for
> PreAuthDelay, but the stated reason in this comment for leaving out
> WL_LATCH_SET suggests otherwise.  It's not clear to me why the latch
> possibly pointing to a shared latch in the future is an issue.  Should
> this instead say that we leave out WL_LATCH_SET for consistency with
> PostAuthDelay?

If WL_LATCH_SET is used for PostAuthDelay, the waiting doesn't happen
because MyLatch, which is a shared latch by that point, would be set by
SwitchToSharedLatch. More details at [1].
If WL_LATCH_SET is used for PreAuthDelay, there's actually no problem,
because MyLatch is not yet initialized properly in BackendInitialize
when waiting for PreAuthDelay: it still points to the local latch, and
only later gets pointed to the shared latch and set by
SwitchToSharedLatch. But relying on MyLatch there seems to me somewhat
like relying on an uninitialized variable. More details at [1].

For PreAuthDelay, the comment was meant to say that MyLatch is not the
latch we would actually want to wait on. Since there is no problem in
using it there, I changed the comment to the following:
/*
 * Let's not use WL_LATCH_SET for PreAuthDelay to be consistent with
 * PostAuthDelay.
 */
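
For reference, a minimal sketch of the WaitLatch() call without WL_LATCH_SET
(the wait event name here is made up for illustration; the actual patch may
use a different one):

/* PostAuthDelay is in seconds; WaitLatch() takes milliseconds.  The latch
 * argument is ignored because WL_LATCH_SET is not requested. */
if (PostAuthDelay > 0)
	(void) WaitLatch(NULL,
					 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 PostAuthDelay * 1000L,
					 WAIT_EVENT_POST_AUTH_DELAY);	/* hypothetical name */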

[1] - 
https://www.postgresql.org/message-id/flat/CALj2ACVF8AZi1bK8oH-Qoz3tYVpqFuzxcDRPdF-3p5BvF6GTxA%40mail.gmail.com

Regards,
Bharath Rupireddy.


v2-0001-Use-a-WaitLatch-for-pre-post-_auth_delay.patch
Description: Binary data


Re: Configure with thread sanitizer fails the thread test

2021-07-24 Thread Tom Lane
I wrote:
> After a bit more thought, HEAD+v14 is also my preference.  I'm not
> eager to change this in stable branches, but it doesn't seem too
> late for v14.

Perusing the commit log, I realized that there's another reason why
v14 is a good cutoff: thread_test.c was in a different place with
an allegedly different raison d'etre before 8a2121185.  So done
that way.

regards, tom lane




Re: CREATE SEQUENCE with RESTART option

2021-07-24 Thread Bharath Rupireddy
On Sat, Jul 24, 2021 at 3:20 AM Cary Huang  wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
>
> Hi
>
> I have applied and run your patch, which works fine in my environment. 
> Regarding your comments in the patch:

Thanks for the review.

> /*
>  * Restarting a sequence while defining it doesn't make any sense
>  * and it may override the START value. Allowing both START and
>  * RESTART option for CREATE SEQUENCE may cause confusion to user.
>  * Hence, we throw error for CREATE SEQUENCE if RESTART option is
>  * specified. However, it can be used with ALTER SEQUENCE.
>  */
>
> I would remove the first sentence, because it seems like a personal opinion 
> to me. I am sure someone, somewhere may think it makes total sense :).

Agree.

> I would rephrase like this:
>
> /*
>  * Allowing both START and RESTART option for CREATE SEQUENCE
>  * could override the START value and cause confusion to user. Hence,
>  * we throw an error for CREATE SEQUENCE if RESTART option is
>  * specified; it can only be used with ALTER SEQUENCE.
>  */
>
> just a thought.

LGTM. PSA v2 patch.

Regards,
Bharath Rupireddy.
From e726e40d4969f56eb4246a7016e3e2780aa2fbf1 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 24 Jul 2021 16:23:42 +
Subject: [PATCH v2] Disallow RESTART option for CREATE SEQUENCE

Allowing both START and RESTART option for CREATE SEQUENCE
could override the START value and cause confusion to user.
Hence, we throw an error for CREATE SEQUENCE if RESTART option
is specified; it can only be used with ALTER SEQUENCE.

If users want their sequences to be starting from a value different
than START value, then they can specify RESTART value with ALTER
SEQUENCE.
---
 src/backend/commands/sequence.c| 12 
 src/test/regress/expected/sequence.out |  4 
 src/test/regress/sql/sequence.sql  |  1 +
 3 files changed, 17 insertions(+)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a4..8c50f396e6 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1281,6 +1281,18 @@ init_params(ParseState *pstate, List *options, bool for_identity,
 		}
 		else if (strcmp(defel->defname, "restart") == 0)
 		{
+			/*
+			 * Allowing both START and RESTART option for CREATE SEQUENCE
+			 * could override the START value and cause confusion to user.
+			 * Hence, we throw an error for CREATE SEQUENCE if RESTART option
+			 * is specified; it can only be used with ALTER SEQUENCE.
+			 */
+			if (isInit)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("RESTART option is not allowed while creating sequence"),
+		 parser_errposition(pstate, defel->location)));
+
 			if (restart_value)
 errorConflictingDefElem(defel, pstate);
 			restart_value = defel;
diff --git a/src/test/regress/expected/sequence.out b/src/test/regress/expected/sequence.out
index 71c2b0f1df..f60725a9e5 100644
--- a/src/test/regress/expected/sequence.out
+++ b/src/test/regress/expected/sequence.out
@@ -16,6 +16,10 @@ CREATE SEQUENCE sequence_testx INCREMENT BY 1 START -10;
 ERROR:  START value (-10) cannot be less than MINVALUE (1)
 CREATE SEQUENCE sequence_testx CACHE 0;
 ERROR:  CACHE (0) must be greater than zero
+CREATE SEQUENCE sequence_testx RESTART 5;
+ERROR:  RESTART option is not allowed while creating sequence
+LINE 1: CREATE SEQUENCE sequence_testx RESTART 5;
+   ^
 -- OWNED BY errors
 CREATE SEQUENCE sequence_testx OWNED BY nobody;  -- nonsense word
 ERROR:  invalid OWNED BY option
diff --git a/src/test/regress/sql/sequence.sql b/src/test/regress/sql/sequence.sql
index 7928ee23ee..367157fe2b 100644
--- a/src/test/regress/sql/sequence.sql
+++ b/src/test/regress/sql/sequence.sql
@@ -10,6 +10,7 @@ CREATE SEQUENCE sequence_testx INCREMENT BY 1 MAXVALUE -20;
 CREATE SEQUENCE sequence_testx INCREMENT BY -1 START 10;
 CREATE SEQUENCE sequence_testx INCREMENT BY 1 START -10;
 CREATE SEQUENCE sequence_testx CACHE 0;
+CREATE SEQUENCE sequence_testx RESTART 5;
 
 -- OWNED BY errors
 CREATE SEQUENCE sequence_testx OWNED BY nobody;  -- nonsense word
-- 
2.25.1



Re: DROP INDEX CONCURRENTLY on partitioned index

2021-07-24 Thread Justin Pryzby
On Wed, Jul 14, 2021 at 09:18:12PM +0530, vignesh C wrote:
> The patch does not apply on Head anymore, could you rebase and post a
> patch. I'm changing the status to "Waiting for Author".

I'm withdrawing this patch at least until the corresponding patch for CIC is
progressed.

https://commitfest.postgresql.org/33/2815/

-- 
Justin




log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness

2021-07-24 Thread Andres Freund
Hi,

I've been repeatedly confused by the number of WAL files supposedly
added. Even when 100s of new WAL files are created, the relevant portion
of log_checkpoints will only ever list zero or one added WAL file.

The reason for that is that CheckpointStats.ckpt_segs_added is only
incremented in PreallocXlogFiles(). Which has the following comment:
 * XXX this is currently extremely conservative, since it forces only one
 * future log segment to exist, and even that only if we are 75% done with
 * the current one.  This is only appropriate for very low-WAL-volume systems.

Whereas in real workloads WAL files are almost exclusively created via
XLogWrite()->XLogFileInit().

I think we should consider just removing that field. Or, even better, show
something accurate instead.

As an example, here's the log output of a workload that has a replication slot
preventing WAL files from being recycled (and too small max_wal_size):

2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 
recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, 
longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB
2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 
recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, 
longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB
2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 
recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, 
longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB
2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 
recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, 
longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB
2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 
recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, 
longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB

So it's 3 new WAL files in that timeframe, one might think?

A probe instrumenting xlog file creation shows something very different:
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres 
-a XLogFileInitInternal:39
(39 is the O_CREAT BasicOpenFile(), not the recycle path).

perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
 1.001030943  9  probe_postgres:XLogFileInitInternal_L39
 2.002998896  8  probe_postgres:XLogFileInitInternal_L39
 3.005098857  8  probe_postgres:XLogFileInitInternal_L39
 4.007000426  6  probe_postgres:XLogFileInitInternal_L39
 5.008423119  9  probe_postgres:XLogFileInitInternal_L39
 6.013427568  8  probe_postgres:XLogFileInitInternal_L39
 7.015087613  8  probe_postgres:XLogFileInitInternal_L39
 8.017393339  8  probe_postgres:XLogFileInitInternal_L39
 9.018425526  7  probe_postgres:XLogFileInitInternal_L39
10.020398520  5  probe_postgres:XLogFileInitInternal_L39

Greetings,

Andres Freund




Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness

2021-07-24 Thread Andres Freund
Hi,

On 2021-07-24 15:50:36 -0700, Andres Freund wrote:
> As an example, here's the log output of a workload that has a replication slot
> preventing WAL files from being recycled (and too small max_wal_size):
>
> 2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
> complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 
> recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, 
> longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB
> 2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
> complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 
> recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, 
> longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB
> 2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
> complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 
> recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, 
> longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB
> 2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
> complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 
> recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, 
> longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB
> 2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG:  checkpoint 
> complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 
> recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, 
> longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB
>
> So it's 3 new WAL files in that timeframe, one might think?
>
> A probe instrumenting xlog file creation shows something very different:
> perf probe -x 
> /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a 
> XLogFileInitInternal:39
> (39 is the O_CREAT BasicOpenFile(), not the recycle path).
>
> perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
>  1.001030943  9  
> probe_postgres:XLogFileInitInternal_L39
>  2.002998896  8  
> probe_postgres:XLogFileInitInternal_L39
>  3.005098857  8  
> probe_postgres:XLogFileInitInternal_L39
>  4.007000426  6  
> probe_postgres:XLogFileInitInternal_L39
>  5.008423119  9  
> probe_postgres:XLogFileInitInternal_L39
>  6.013427568  8  
> probe_postgres:XLogFileInitInternal_L39
>  7.015087613  8  
> probe_postgres:XLogFileInitInternal_L39
>  8.017393339  8  
> probe_postgres:XLogFileInitInternal_L39
>  9.018425526  7  
> probe_postgres:XLogFileInitInternal_L39
> 10.020398520  5  
> probe_postgres:XLogFileInitInternal_L39

And even more extreme, the logs can end up suggesting pg_wal is shrinking,
when the opposite is the case. Here's the checkpoint output from a parallel
copy data load (without a replication slot holding things back):

2021-07-24 15:59:03.215 PDT [2253324][checkpointer][:0] LOG:  checkpoint 
complete: wrote 22291 buffers (17.0%); 0 WAL file(s) added, 27 removed, 141 
recycled; write=9.737 s, sync=4.153 s, total=14.884 s; sync files=108, 
longest=0.116 s, average=0.039 s; distance=2756904 kB, estimate=2756904 kB
2021-07-24 15:59:12.978 PDT [2253324][checkpointer][:0] LOG:  checkpoint 
complete: wrote 21840 buffers (16.7%); 0 WAL file(s) added, 53 removed, 149 
recycled; write=5.531 s, sync=3.008 s, total=9.763 s; sync files=81, 
longest=0.201 s, average=0.037 s; distance=3313885 kB, estimate=3313885 kB
2021-07-24 15:59:23.421 PDT [2253324][checkpointer][:0] LOG:  checkpoint 
complete: wrote 22326 buffers (17.0%); 0 WAL file(s) added, 56 removed, 149 
recycled; write=5.787 s, sync=3.230 s, total=10.436 s; sync files=81, 
longest=0.099 s, average=0.040 s; distance=3346125 kB, estimate=3346125 kB
2021-07-24 15:59:34.424 PDT [2253324][checkpointer][:0] LOG:  checkpoint 
complete: wrote 22155 buffers (16.9%); 0 WAL file(s) added, 60 removed, 148 
recycled; write=6.096 s, sync=3.432 s, total=10.995 s; sync files=81, 
longest=0.101 s, average=0.043 s; distance=3409281 kB, estimate=3409281 kB

It does look like WAL space usage is shrinking, but the opposite is true -
we're creating so much WAL that the checkpointer can't checkpoint fast enough
to keep WAL usage below max_wal_size. So WAL files are constantly created that
then need to be removed (hence the non-zero removed counts).

#   time counts unit events
   277.087990275 15  probe_postgres:XLogFileInitInternal_L39
   278.098549960 15  probe_postgres:XLogFileInitInternal_L39
   279.104787575  6  probe_postgres:XLogFileInitInternal_L39
   280.108980690  5  probe_postgres:XL

Re: something is wonky with pgbench pipelining

2021-07-24 Thread Andres Freund
Hi,

Adding RMT.

On 2021-07-21 16:55:08 -0700, Andres Freund wrote:
> On 2021-07-20 14:57:15 -0400, Alvaro Herrera wrote:
> > On 2021-Jul-20, Andres Freund wrote:
> > 
> > > I think what's happening is that the first recvfrom() actually gets all 7
> > > connection results. The server doesn't have any queries to process at that
> > > point. But we ask the kernel whether there is new network input over and 
> > > over
> > > again, despite having results to process!
> > 
> > Hmm, yeah, that seems a missed opportunity.
> 
> > > with-isbusy:
> > > ...
> > > tps = 3990.424742 (without initial connection time)
> > > ...
> > >   1,013.71 msec task-clock#0.202 CPUs utilized
> > > 80,203  raw_syscalls:sys_enter#   79.119 K/sec
> > > 19,947  context-switches  #   19.677 K/sec
> > >  2,943,676,361  cycles:u  #2.904 GHz
> > >346,607,769  cycles:k  #0.342 GHz
> > >  8,464,188,379  instructions:u#2.88  insn per 
> > > cycle
> > >226,665,530  instructions:k#0.65  insn per 
> > > cycle
> > 
> > This is quite compelling.
> > 
> > If you don't mind I can get this pushed soon in the next couple of days
> > -- or do you want to do it yourself?
> 
> I was thinking of pushing the attached, to both 14 and master, thinking
> that was what you meant, but then I wasn't quite sure: It's a relatively
> minor performance improvement, after all? OTOH, it arguably also just is
> a bit of an API misuse...
> 
> I'm inclined to push it to 14 and master, but ...

RMT: ^

Greetings,

Andres Freund




Re: Avoiding data loss with synchronous replication

2021-07-24 Thread Andres Freund
Hi,

On 2021-07-22 21:17:56 +, Bossart, Nathan wrote:
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>   1. Server restarts: As noted in the docs [2], "waiting transactions
>  will be marked fully committed once the primary database
>  recovers."  I think there are a few options for handling this,
>  but the simplest would be to simply failover anytime the primary
>  server shut down.  My proposal may offer other ways of helping
>  with this.
>   2. Backend crashes: If a backend crashes, the postmaster process
>  will restart everything, leading to the same problem described in
>  1.  However, this behavior can be prevented with the
>  restart_after_crash parameter [3].
>   3. Client disconnections: During waits for synchronous replication,
>  interrupt processing is turned off, so disconnected clients
>  actually don't seem to cause a problem.  The server will still
>  wait for synchronous replication to complete prior to making the
>  transaction visible on the primary.
>   4. Query cancellations and backend terminations: This appears to be
>  the only gap where there is no way to avoid potential data loss,
>  and it is the main target of my proposal.
> 
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.  Once replication has caught up to the
> the necessary point, the transactions can be marked completed, and
> they would finally become visible.

I think there's two aspects making this proposal problematic:

First, from the user experience side of things, the issue is that this seems
to propose violating read-your-own-writes. Within a single connection to a
single node. Which imo is *far* worse than seeing writes that haven't yet been
acknowledged as replicated after a query cancel.

Second, on the implementation side, I think this proposal practically amounts
to internally converting plain transaction commits into 2PC
prepare/commit. With all the associated overhead (two WAL entries/flushes per
commit, needing a separate set of procarray entries to hold the resources for
the prepared-but-not-committed transactions, potential for running out of
the extra procarray slots). What if a user rapidly commits-cancels in a loop?
You'll almost immediately run out of procarray slots to represent all those
"not really committed" transactions.

I think there's benefit in optionally turning all transactions into 2PC ones,
but I don't see it being fast enough to be the only option.

Greetings,

Andres Freund




Re: Avoiding data loss with synchronous replication

2021-07-24 Thread Andres Freund
Hi,

On 2021-07-24 15:53:15 +0500, Andrey Borodin wrote:
> Are there any other problems with blocking cancels?

Unless you have command-line access to the server, it's not hard to get
into a situation where you can't change the configuration setting
because all connections are hanging, and you can't even log in to do an
ALTER SYSTEM etc. You can't kill the client applications to get rid of
the connections, because the backends will just continue to hang.

Greetings,

Andres Freund




Re: Removing "long int"-related limit on hash table sizes

2021-07-24 Thread Andres Freund
Hi,

On 2021-07-23 17:15:24 -0400, Tom Lane wrote:
> Per the discussion at [1], users on Windows are seeing nasty performance
> losses in v13/v14 (compared to prior releases) for hash aggregations that
> required somewhat more than 2GB in the prior releases.

Ugh :(.


> That's because they spill to disk where they did not before.  The easy
> answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> thanks to the ancient decision to do a lot of memory-space-related
> calculations in "long int", which is only 32 bits on Win64.

We really ought to just remove every single use of long. As Thomas
quipped on Twitter at some point, "long is the asbestos of C". I think
we've incurred far more cost from the weird workarounds needed to deal
with the difference in long width between Windows and everything else
than just removing all use of it outright would incur.

And perhaps once we've done that, we should experiment with putting
__attribute__((deprecated)) on long, but conditionalize it so it's only
used for building PG internal stuff, and doesn't leak into pg_config
output. Perhaps it'll be too painful due to external headers, but it
seems worth trying.

But obviously that doesn't help with the issue in the release branches.


> While I don't personally have the interest to fix that altogether,
> it does seem like we've got a performance regression that we ought
> to do something about immediately.  So I took a look at getting rid of
> this restriction for calculations associated with hash_mem_multiplier,
> and it doesn't seem to be too bad.  I propose the attached patch.
> (This is against HEAD; there are minor conflicts in v13 and v14.)

Hm. I wonder if we would avoid some overflow dangers on 32bit systems if
we made get_hash_memory_limit() and the relevant variables 64 bit,
rather than 32bit / size_t.  E.g.

> @@ -700,9 +697,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
> bool useskew,
>   inner_rel_bytes = ntuples * tupsize;
>
>   /*
> -  * Target in-memory hashtable size is hash_mem kilobytes.
> +  * Compute in-memory hashtable size limit from GUCs.
>*/
> - hash_table_bytes = hash_mem * 1024L;
> + hash_table_bytes = get_hash_memory_limit();
>
>   /*
>* Parallel Hash tries to use the combined hash_mem of all workers to
> @@ -710,7 +707,14 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
> bool useskew,
>* per worker and tries to process batches in parallel.
>*/
>   if (try_combined_hash_mem)
> - hash_table_bytes += hash_table_bytes * parallel_workers;
> + {
> + /* Careful, this could overflow size_t */
> + double  newlimit;
> +
> + newlimit = (double) hash_table_bytes * (double) 
> (parallel_workers + 1);
> + newlimit = Min(newlimit, (double) SIZE_MAX);
> + hash_table_bytes = (size_t) newlimit;
> + }

Wouldn't need to be as careful, I think?



> @@ -740,12 +747,26 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
> bool useskew,
>* size of skew bucket struct itself
>*--
>*/
> - *num_skew_mcvs = skew_table_bytes / (tupsize +
> - 
>  (8 * sizeof(HashSkewBucket *)) +
> - 
>  sizeof(int) +
> - 
>  SKEW_BUCKET_OVERHEAD);
> - if (*num_skew_mcvs > 0)
> - hash_table_bytes -= skew_table_bytes;
> + bytes_per_mcv = tupsize +
> + (8 * sizeof(HashSkewBucket *)) +
> + sizeof(int) +
> + SKEW_BUCKET_OVERHEAD;
> + skew_mcvs = hash_table_bytes / bytes_per_mcv;
> +
> + /*
> +  * Now scale by SKEW_HASH_MEM_PERCENT (we do it in this order 
> so as
> +  * not to worry about size_t overflow in the multiplication)
> +  */
> + skew_mcvs = skew_mcvs * SKEW_HASH_MEM_PERCENT / 100;

I always have to think about the evaluation order of things like this
(it's left to right for these), so I'd prefer parens around the
multiplication. I did wonder briefly whether the SKEW_HASH_MEM_PERCENT /
100 just evaluates to 0...
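
Concretely, the parenthesized version would be (same result, since the
left-to-right evaluation already performs the multiplication first):

	skew_mcvs = (skew_mcvs * SKEW_HASH_MEM_PERCENT) / 100;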


Looks like a good idea to me.

Greetings,

Andres Freund




Re: Removing "long int"-related limit on hash table sizes

2021-07-24 Thread Bruce Momjian
On Sat, Jul 24, 2021 at 06:25:53PM -0700, Andres Freund wrote:
> > That's because they spill to disk where they did not before.  The easy
> > answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> > the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> > thanks to the ancient decision to do a lot of memory-space-related
> > calculations in "long int", which is only 32 bits on Win64.
> 
> We really ought to just remove every single use of long. As Thomas
> quipped on twitter at some point, "long is the asbestos of C". I think
> we've incurred far more cost due to weird workarounds to deal with the
> difference in long width between windows and everything else, than just
> removing all use of it outright would incur.

+1

As I understand it, making long of undetermined length was to allow
someone to choose a data type that _might_ be longer than int if the
compiler/OS/CPU was optimized for that, but at this point, such
optimizations just don't seem to make sense, and we know every(?) CPU
supports long-long, so why not go for something concrete?  Do we really
want our feature limits to be determined by whether we have an optimized
type longer than int?
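
For concreteness, here is the width difference itself (a standalone
illustration, not PostgreSQL code; it assumes the usual LP64 model on
Linux/macOS and LLP64 on Win64):

#include <stdio.h>

int
main(void)
{
	/* LP64 (64-bit Linux/macOS): long is 8 bytes; LLP64 (Win64): 4 bytes.
	 * long long is 8 bytes on both, which is why it is the safer choice. */
	printf("sizeof(long) = %zu, sizeof(long long) = %zu\n",
		   sizeof(long), sizeof(long long));
	return 0;
}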

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness

2021-07-24 Thread Fujii Masao




On 2021/07/25 7:50, Andres Freund wrote:

Hi,

I've been repeatedly confused by the number of WAL files supposedly
added. Even when 100s of new WAL files are created the relevant portion
of log_checkpoints will only ever list zero or one added WAL file.

The reason for that is that CheckpointStats.ckpt_segs_added is only
incremented in PreallocXlogFiles(). Which has the following comment:
  * XXX this is currently extremely conservative, since it forces only one
  * future log segment to exist, and even that only if we are 75% done with
  * the current one.  This is only appropriate for very low-WAL-volume systems.

Whereas in real workloads WAL files are almost exclusively created via
XLogWrite()->XLogFileInit().

I think we should consider just removing that field. Or, even better, show
something accurate instead.


+1 to show something accurate instead.

It's also worth showing them in a monitoring stats view like pg_stat_wal?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-07-24 Thread Andres Freund
Hi,

On 2021-06-21 05:29:19 -0700, Andres Freund wrote:
> On 2021-06-16 12:12:23 -0700, Andres Freund wrote:
> > Could you share your testcase? I've been working on a series of patches
> > to address this (I'll share in a bit), and I've run quite a few tests,
> > and didn't hit any infinite loops.
> 
> Sorry for not yet doing that. Unfortunately I have an ongoing family
> health issue (& associated travel) claiming time and energy :(.
> 
> I've pushed the minimal fix due to beta 2.
> 
> Beyond beta 2 I am thinking of the below to unify the horizon
> determination:
> 
> static inline GlobalVisHorizonKind
> GlobalVisHorizonKindForRel(Relation rel)

I finally pushed this cleanup.

Greetings,

Andres Freund