Re: Decoding of two-phase xacts missing from CREATE_REPLICATION_SLOT command
On Fri, Jun 11, 2021 at 8:14 PM Amit Kapila wrote:
>
> Also, I can take care of the below cosmetic issues before committing
> if we decide to do this for PG-14.
>
> Few cosmetic issues:
> ==
> 1. git diff --check shows
> src/bin/pg_basebackup/t/030_pg_recvlogical.pl:109: new blank line at EOF.
>
> 2.
> +
> >The following example shows SQL interface that can be used to decode
> prepared >transactions. Before you use two-phase commit commands, you must set
>
> Spurious line addition.

Fixed.

> 3.
> /* Build query */
> appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\"", slot_name);
> if (is_temporary)
>     appendPQExpBufferStr(query, " TEMPORARY");
> +
> if (is_physical)
>
> Spurious line addition.

Fixed.

> 4.
> appendPQExpBuffer(query, " LOGICAL \"%s\"", plugin);
> + if (two_phase && PQserverVersion(conn) >= 14)
> +     appendPQExpBufferStr(query, " TWO_PHASE");
> +
> if (PQserverVersion(conn) >= 10)
>     /* pg_recvlogical doesn't use an exported snapshot, so suppress */
>     appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
>
> I think it might be better to append TWO_PHASE after NOEXPORT_SNAPSHOT
> but it doesn't matter much.

I haven't changed this; I prefer to keep it this way.

> 5.
> +$node->safe_psql('postgres',
> +   "BEGIN;INSERT INTO test_table values (11); PREPARE TRANSACTION 'test'");
>
> There is no space after BEGIN but there is a space after INSERT. For
> consistency's sake, I will have a space after BEGIN as well.

Changed this.

regards,
Ajin Cherian
Fujitsu Australia

v5-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATION.patch
Description: Binary data

v5-0002-Add-support-for-two-phase-decoding-in-pg_recvlogi.patch
Description: Binary data
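For reference, and based only on the snippet quoted in item 4 above, the replication command that pg_recvlogical would end up sending to a v14 server for a temporary two-phase logical slot would look roughly like this (the slot and plugin names here are made-up placeholders):

    CREATE_REPLICATION_SLOT "myslot" TEMPORARY LOGICAL "test_decoding" TWO_PHASE NOEXPORT_SNAPSHOT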
Re: Use singular number when appropriate
On Tue, 2021-06-15 at 04:59 +, David Fetter wrote:
> I thought about using the dual, but wasn't sure how many languages
> support it.

I think none of the languages for which we cater uses the dual.
But I guess you were joking, since the tests are not translated ...

> if (fail_count == 0 && fail_ignore_count == 0)
>     snprintf(buf, sizeof(buf),
> -            _(" All %d tests passed. "),
> -            success_count);
> +            _(" %s %d test%s passed. "),
> +            success_count == 1 ? "The" : "All",
> +            success_count,
> +            success_count == 1 ? "" : "s");

... and that wouldn't be translatable.

Yours,
Laurenz Albe
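To illustrate the translatability point: the usual way to handle this is to let the message catalog choose the plural form rather than splicing in English-only fragments. A minimal sketch, purely hypothetical here since (as noted) the regression driver's messages are not translated and pg_regress may not be wired up for ngettext() at all:

    if (fail_count == 0 && fail_ignore_count == 0)
        snprintf(buf, sizeof(buf),
                 /* ngettext() lets each language apply its own plural rules */
                 ngettext(" The %d test passed. ",
                          " All %d tests passed. ",
                          success_count),
                 success_count);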
Re: Error on pgbench logs
Hello Michaël,

>> I think we don't have to call doLog() before logAgg(). If we call doLog(),
>> we will count an extra transaction that is not actually processed because
>> accumStats() is called in this.
>
> Yes, calling both is weird.

The motivation to call doLog is to catch up zeros on slow rates, so as to avoid holes in the log, including at the end of the run. This "trick" was already used by the code. I agree that it would record a non-existent transaction, which is not desirable. I wanted to avoid a special parameter, but this seems unrealistic.

> Is using logAgg() directly in the context actually right when it comes
> to sample_rate?

The point is just to trigger the last display, which is not triggered by the previous one, I think, because of the precision: the start of the run is not exactly the start of the thread.

> We may not log anything on HEAD if sample_rate is enabled, but we would
> finish by logging something all the time with this patch.

I do not get it.

> If I am following this code correctly, we don't care about accumStats()
> in the code path of a thread we are done with, right?

Yes.

Attached a v3 which adds a boolean to distinguish recording vs flushing.

--
Fabien.
Re: Error on pgbench logs
On Tue, 15 Jun 2021 10:05:29 +0200 (CEST) Fabien COELHO wrote:
>
> Hello Michaël,
>
> >> I think we don't have to call doLog() before logAgg(). If we call doLog(),
> >> we will count an extra transaction that is not actually processed because
> >> accumStats() is called in this.
> >
> > Yes, calling both is weird.
>
> The motivation to call doLog is to catch up zeros on slow rates, so as to
> avoid holes in the log, including at the end of the run. This "trick" was
> already used by the code. I agree that it would record a non-existent
> transaction, which is not desirable. I wanted to avoid a special
> parameter, but this seems unrealistic.
>
> > Is using logAgg() directly in the context actually right when it comes
> > to sample_rate?
>
> The point is just to trigger the last display, which is not triggered by
> the previous I think because of the precision: the start of the run is
> not exactly the start of the thread.
>
> > We may not log anything on HEAD if sample_rate is enabled, but we would
> > finish by logging something all the time with this patch.
>
> I do not get it.

It was not a problem because --sampling-rate and --aggregate-interval cannot be used at the same time.

> > If I am following this code correctly, we don't care about accumStats()
> > in the code path of a thread we are done with, right?
>
> Yes.
>
> Attached a v3 which adds a boolean to distinguish recording vs flushing.

Sorry, but I can't find any patch attached...

--
Yugo NAGATA
Re: Signed vs. Unsigned (some)
At Fri, 11 Jun 2021 23:05:29 -0300, Ranier Vilela wrote in
> Hi,
>
> Removing legitimate warnings can it be worth it?

Where does the warning come from? And what is the exact message?

> -1 CAST can be wrong, when there is an invalid value defined
> (InvalidBucket, InvalidBlockNumber).
> I think depending on the compiler -1 CAST may be different from
> InvalidBucket or InvalidBlockNumber.

The definitions are not ((type) -1) but ((type) 0xFFFFFFFF), so actually they might be different if we forget to widen the constant when widening the types. Regarding the compiler behavior, I think we are assuming C99[1], and C99 defines that -1 is converted to Uxxx_MAX (6.3.1.3 Signed and unsigned integers).

I'm +0.2 on it. It might be worthwhile as a matter of style.

> pg_rewind is one special case.
> All cases of XLogSegNo (uint64) initialization are zero, but in pg_rewind
> was used -1?
> I did not find it InvalidXLogSegNo!

I'm not sure whether that is a thinko that the variable is signed or whether it is intentional to assign the maximum value. Anyway, there's actually no need for initializing the variable at all, so I don't think it's worth changing the initial value. If any compiler actually complains about the assignment, changing it to zero seems reasonable.

> Not tested.
>
> Trivial patch attached.

Please don't quickly update the patch responding to my comments alone. I might be a minority.

[1] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
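A tiny standalone illustration of the conversion rule being referenced (this is not Postgres code; the typedef and macro only mirror what block.h defines). Per C99 6.3.1.3, assigning -1 to an unsigned type yields that type's maximum value, so the two spellings are equivalent on any conforming compiler, even though MSVC at /W4 flags the implicit conversion with the C4245 warning quoted later in the thread:

    #include <stdint.h>
    #include <assert.h>

    typedef uint32_t BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    int main(void)
    {
        BlockNumber blkno = -1;              /* wraps to UINT32_MAX per C99 6.3.1.3 */
        assert(blkno == InvalidBlockNumber); /* always true on a conforming compiler */
        return 0;
    }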
Re: Error on pgbench logs
On Tue, Jun 15, 2021 at 05:15:14PM +0900, Yugo NAGATA wrote: > On Tue, 15 Jun 2021 10:05:29 +0200 (CEST) Fabien COELHO > wrote: > It was not a problem because --sampling-rate --aggregate-interval cannot be > used at the same time. Yep, you are right, thanks. I have missed that both options cannot be specified at the same time. -- Michael signature.asc Description: PGP signature
RE: psql - factor out echo code
>> Wouldn't it be better to comment it like any other function? > >Sure. Attached. Thank you for your revision. I think this patch is good, so I will move it to ready for committer. Best regards, Shinya Kato
RE: [PATCH] expand the units that pg_size_pretty supports on output
>> I don't see the need to extend the unit to YB. >> What use case do you have in mind? > >Practical or no, I saw no reason not to support all defined units. I assume >we’ll >get to a need sooner or later. :) Thank you for your reply! Hmmm, I didn't think YB was necessary, but what do others think? Best regards, Shinya Kato
Re: Error on pgbench logs
> Attached a v3 which adds a boolean to distinguish recording vs flushing.

Better with the attachment… sorry for the noise.

--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..3b27ffebf8 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -648,7 +648,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
                          PgBenchValue *retval);
 static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
                   StatsData *agg, bool skipped, double latency, double lag);
 static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
                              bool skipped, StatsData *agg);
@@ -3765,6 +3765,30 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
 	return CSTATE_END_COMMAND;
 }
 
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+	fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+			agg->start_time,
+			agg->cnt,
+			agg->latency.sum,
+			agg->latency.sum2,
+			agg->latency.min,
+			agg->latency.max);
+	if (throttle_delay)
+	{
+		fprintf(logfile, " %.0f %.0f %.0f %.0f",
+				agg->lag.sum,
+				agg->lag.sum2,
+				agg->lag.min,
+				agg->lag.max);
+		if (latency_limit)
+			fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+	}
+	fputc('\n', logfile);
+}
+
 /*
  * Print log entry after completing one transaction.
  *
@@ -3774,7 +3798,7 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
  * the cost of an extra syscall in all cases.
  */
 static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
@@ -3793,42 +3817,29 @@ doLog(TState *thread, CState *st,
 	/* should we aggregate the results or not? */
 	if (agg_interval > 0)
 	{
+		pg_time_usec_t next;
+
 		/*
 		 * Loop until we reach the interval of the current moment, and print
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-
-		while (agg->start_time + agg_interval <= now)
+		while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
 		{
-			/* print aggregated report to logfile */
-			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					agg->start_time,
-					agg->cnt,
-					agg->latency.sum,
-					agg->latency.sum2,
-					agg->latency.min,
-					agg->latency.max);
-			if (throttle_delay)
-			{
-				fprintf(logfile, " %.0f %.0f %.0f %.0f",
-						agg->lag.sum,
-						agg->lag.sum2,
-						agg->lag.min,
-						agg->lag.max);
-				if (latency_limit)
-					fprintf(logfile, " " INT64_FORMAT, agg->skipped);
-			}
-			fputc('\n', logfile);
+			logAgg(logfile, agg);
 
 			/* reset data and move to next interval */
-			initStats(agg, agg->start_time + agg_interval);
+			initStats(agg, next);
 		}
 
-		/* accumulate the current transaction */
-		accumStats(agg, skipped, latency, lag);
+		if (tx)
+			/* accumulate the current transaction */
+			accumStats(agg, skipped, latency, lag);
+		else
+			/* final call to show the last aggregate */
+			logAgg(logfile, agg);
 	}
-	else
+	else if (tx)
 	{
 		/* no, print raw transactions */
 		if (skipped)
@@ -3889,7 +3900,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 	st->cnt++;
 
 	if (use_log)
-		doLog(thread, st, agg, skipped, latency, lag);
+		doLog(thread, st, true, agg, skipped, latency, lag);
 
 	/* XXX could use a mutex here, but we choose not to */
 	if (per_script_stats)
@@ -6794,8 +6805,9 @@ done:
 	if (agg_interval > 0)
 	{
 		/* log aggregated but not yet reported transactions */
-		doLog(thread, state, &aggs, false, 0, 0);
+		doLog(thread, state, false, &aggs, false, 0.0, 0.0);
 	}
+
 	fclose(thread->logfile);
 	thread->logfile = NULL;
 }
RE: Transactions involving multiple postgres foreign servers, take 2
From: Robert Haas
> Well, we're talking about running this commit routine from within
> CommitTransaction(), right? So I think it is in fact running in the
> server. And if that's so, then you have to worry about how to make it
> respond to interrupts. You can't just call some function
> DBMSX_xa_commit() and wait for infinite time for it to return. Look at
> pgfdw_get_result() for an example of what real code to do this looks
> like.

Postgres can do that, but other implementations cannot necessarily do it, I'm afraid. But before that, the FDW interface documentation doesn't describe anything about how to handle interrupts. Actually, odbc_fdw and possibly other FDWs don't respond to interrupts.

> Honestly, I am not quite sure what any specification has to say about
> this. We're talking about what happens when a user does something with
> a foreign table and then types COMMIT. That's all about providing a set
> of behaviors that are consistent with how PostgreSQL works in other
> situations. You can't negotiate away the requirement to handle errors
> in a way that works with PostgreSQL's infrastructure, or the
> requirement that any lengthy operation handle interrupts properly, by
> appealing to a specification.

What we're talking about here is mainly whether commit should return success or failure when some participants failed to commit in the second phase of 2PC. That's new to Postgres, isn't it? Anyway, we should respect existing relevant specifications and (well-known) implementations before we conclude that we have to devise our own behavior.

> That sounds more like a limitation of the present implementation than
> a fundamental problem. We shouldn't reject the idea of having a
> resolver process handle this just because the initial implementation
> might be slow. If there's no fundamental problem with the idea,
> parallelism and concurrency can be improved in separate patches at a
> later time. It's much more important at this stage to reject ideas
> that are not theoretically sound.

We talked about that, and unfortunately, I haven't seen a good and feasible idea to enhance the current approach that involves the resolver from the beginning of 2PC processing. Honestly, I don't understand why such a "one prepare, one commit in turn" serialization approach can be allowed in PostgreSQL, where developers pursue best performance and even try to refrain from adding an if statement in a hot path. As I showed and Ikeda-san said, other implementations have each client session send prepare and commit requests. That's a natural way to achieve reasonable concurrency and performance.

Regards
Takayuki Tsunakawa
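For readers unfamiliar with the pattern Robert is pointing at, here is a rough sketch of how postgres_fdw's pgfdw_get_result() stays responsive to interrupts while waiting for a remote command. This is a simplified illustration only (error handling and multi-result handling omitted), not the actual function:

    /* Simplified sketch of an interrupt-aware wait for a libpq result. */
    static PGresult *
    get_result_interruptible(PGconn *conn)
    {
        while (PQisBusy(conn))
        {
            int     wc;

            /* Sleep until the socket is readable or our latch is set. */
            wc = WaitLatchOrSocket(MyLatch,
                                   WL_LATCH_SET | WL_SOCKET_READABLE |
                                   WL_EXIT_ON_PM_DEATH,
                                   PQsocket(conn), -1L, PG_WAIT_EXTENSION);

            if (wc & WL_LATCH_SET)
            {
                ResetLatch(MyLatch);
                CHECK_FOR_INTERRUPTS();     /* this is what makes the wait cancelable */
            }

            if ((wc & WL_SOCKET_READABLE) && !PQconsumeInput(conn))
                return NULL;                /* connection broken */
        }

        return PQgetResult(conn);
    }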
Re: Signed vs. Unsigned (some)
Hi Kyotaro,

Thanks for taking a look.

On Tue, Jun 15, 2021 at 05:17, Kyotaro Horiguchi <horikyota@gmail.com> wrote:
> At Fri, 11 Jun 2021 23:05:29 -0300, Ranier Vilela wrote in
> > Hi,
> >
> > Removing legitimate warnings can it be worth it?
>
> Where does the warning come from? And what is the exact message?

msvc 64-bit compiler (Level 4):
warning C4245: '=': conversion from 'int' to 'Bucket', signed/unsigned mismatch

> > -1 CAST can be wrong, when there is an invalid value defined
> > (InvalidBucket, InvalidBlockNumber).
> > I think depending on the compiler -1 CAST may be different from
> > InvalidBucket or InvalidBlockNumber.
>
> The definitions are not ((type) -1) but ((type) 0xFFFFFFFF), so
> actually they might be different if we forget to widen the constant
> when widening the types. Regarding the compiler behavior, I think
> we are assuming C99[1], and C99 defines that -1 is converted to
> Uxxx_MAX (6.3.1.3 Signed and unsigned integers).
>
> I'm +0.2 on it. It might be worthwhile as a matter of style.

I think it is about more than style. This is one of those tricks that should not be used.

> > pg_rewind is one special case.
> > All cases of XLogSegNo (uint64) initialization are zero, but in pg_rewind
> > was used -1?
> > I did not find it InvalidXLogSegNo!
>
> I'm not sure whether that is a thinko that the variable is signed or
> that it is intentional to assign the maximum value.

It is a thinko.

> Anyway, actually
> there's no need for initializing the variable at all. So I don't think
> it's worth changing the initial value.

Is it a case of removing the initialization, then?

> If any compiler actually
> complains about the assignment, changing it to zero seems reasonable.

Same case:
msvc 64-bit compiler (Level 4):
warning C4245: '=': initialization from 'int' to 'XLogSegNo', signed/unsigned mismatch

> > Not tested.
> >
> > Trivial patch attached.
>
> Please don't quickly update the patch responding to my comments alone.
> I might be a minority.

Ok.

best regards,
Ranier Vilela
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
On Fri, Jun 11, 2021 at 19:49, Andres Freund wrote:
> Hi,
>
> On 2020-04-23 14:36:15 +0900, Kyotaro Horiguchi wrote:
> > At Thu, 23 Apr 2020 01:21:21 -0300, Ranier Vilela wrote in
> > > On Wed, Apr 22, 2020 at 23:27, Kyotaro Horiguchi <horikyota@gmail.com> wrote:
> > > >
> > > > - strncpy(sqlca->sqlerrm.sqlerrmc, message,
> > > >           sizeof(sqlca->sqlerrm.sqlerrmc));
> > > > - sqlca->sqlerrm.sqlerrmc[sizeof(sqlca->sqlerrm.sqlerrmc) - 1] = 0;
> > > > + sqlca->sqlerrm.sqlerrmc[sizeof(sqlca->sqlerrm.sqlerrmc) - 1] = '\0';
> > > > + strncpy(sqlca->sqlerrm.sqlerrmc, message,
> > > >           sizeof(sqlca->sqlerrm.sqlerrmc) - 1);
> > > >
> > > > The existing strncpy then terminating by NUL works fine. I don't think
> > > > there's any point in doing it the reverse way. Actually
> > > > sizeof(sqlca->sqlerrm.sqlerrmc) - 1 is enough for the length, but the
> > > > existing code is not necessarily a bug.
> > >
> > > Without understanding, then, why does Coverity claim a bug here?
> >
> > Well, handling non-terminated strings with str* functions is a sign
> > of a bug in most cases. Coverity is very useful but false positives are
> > annoying. I wonder what would happen if we attached Coverity annotations
> > to such code.
>
> It might be worth doing something about this, for other reasons. We have
> disabled -Wstringop-truncation in 716585235b1. But I've enabled it in my
> debug build, because I find it useful. The only warning we're getting
> in non-optimized builds is
>
> /home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c: In function 'ECPGset_var':
> /home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c:565:17: warning:
> 'strncpy' output truncated before terminating nul copying 5 bytes from a
> string of the same length [-Wstringop-truncation]
>   565 |   strncpy(sqlca->sqlstate, "YE001", sizeof(sqlca->sqlstate));
>       |   ^~

memcpy would not suffer from this?

regards,
Ranier Vilela
Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)
On Mon, Jun 14, 2021 at 21:01, Ranier Vilela wrote:
> I took it a step further.
>
> Transactions:
>
>          HEAD          patched
>      10002207         10586781
>      10146167         10388685
>      10048919         10333359
> avg  10065764.333     10436275     improvement: 3.55021946687555%
>
> TPS:
>
>          HEAD          patched
>      33469.016009     35399.010472
>      33950.624679     34733.252336
>      33639.8429       34578.495043
> avg  33686.494529     34903.585950  improvement: 3.48700968070122%
>
> 3.55%. Is it worth touching procarray.c for real?
>
> With msvc 64 bits, the asm generated:
> HEAD:    213.731 bytes procarray.asm
> patched: 212.035 bytes procarray.asm
>
> Patch attached.

Added to next CF (https://commitfest.postgresql.org/33/3169/)

regards,
Ranier Vilela
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Mon, Jun 14, 2021 at 9:08 PM Robert Haas wrote:
>
> On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila wrote:
> > Why do you think we don't need to check index AM functions? Say we
> > have an index expression that uses a function, and if its parallel safety
> > is changed then probably that also impacts whether we can do the insert in
> > parallel. Because otherwise, we will end up executing some parallel-unsafe
> > function in parallel mode during index insertion.
>
> I'm not saying that we don't need to check index expressions. I agree
> that we need to check those. The index AM functions are things like
> btint4cmp(). I don't think that a function like that should ever be
> parallel-unsafe.

Okay, but I think if we go with your suggested model where, whenever there is a change in the parallel-safety of any function, we need to send the new invalidation, then I think it won't matter whether the function is associated with an index expression, a check constraint on the table, or is used in any other way.

> Even if this line of thinking is correct, there's a big issue for
> partitioning hierarchies because there you need to know stuff about
> relations that you don't have any other reason to open. I'm just
> arguing that if there's no partitioning, the problem seems reasonably
> solvable. Either you changed something about the relation, in which
> case you've got to lock it and issue invalidations, or you've changed
> something about the function, which could be handled via a new type of
> invalidation. I don't really see why the cost would be particularly
> bad. Suppose that for every relation, you have a flag which is either
> PARALLEL_DML_SAFE, PARALLEL_DML_RESTRICTED, PARALLEL_DML_UNSAFE, or
> PARALLEL_DML_SAFETY_UNKNOWN. When someone sends a message saying "some
> existing function's parallel-safety changed!" you reset that flag for
> every relation in the relcache to PARALLEL_DML_SAFETY_UNKNOWN. Then if
> somebody does DML on that relation and we want to consider
> parallelism, it's got to recompute that flag. None of that sounds
> horribly expensive.

Sounds reasonable. I will think more on this and see if anything else comes to mind apart from what you have mentioned.

> I mean, it could be somewhat annoying if you have 100k relations open
> and sit around all day flipping parallel-safety markings on and off
> and then doing a single-row insert after each flip, but if that's the
> only scenario where we incur significant extra overhead from this kind
> of design, it seems clearly better than forcing users to set a flag
> manually. Maybe it isn't, but I don't really see what the other
> problem would be right now. Except, of course, for partitioning, which
> I'm not quite sure what to do about.

Yeah, dealing with partitioned tables is tricky. I think if we don't want to check upfront the parallel safety of all the partitions, then the other option, as discussed, could be to ask the user to specify the parallel safety of partitioned tables. We can additionally check the parallel safety of partitions when we are trying to insert into a particular partition and error out if we detect any parallel-unsafe clause and we are in parallel mode. So, in this case, we won't be completely relying on the users. Users can either change the parallel-safe option of the table or remove/change the parallel-unsafe clause after the error. The new invalidation message as we are discussing would invalidate the parallel-safety for individual partitions but not the root partition (the partitioned table itself). For the root partition, we will rely on information specified by the user.

I am not sure if we have a simple way to check the parallel safety of partitioned tables. In some way, we need to rely on the user, either (a) by providing an option to specify whether parallel Inserts (and/or other DMLs) can be performed, or (b) by providing a GUC and/or rel option which indicates that we can check the parallel-safety of all the partitions. Yet another option that I don't like could be to parallelize inserts only on non-partitioned tables.

--
With Regards,
Amit Kapila.
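To make Robert's earlier description a bit more concrete, the per-relation flag he sketches could look something like the toy code below. Every name here is hypothetical; nothing like this exists in core today, and the real thing would live in the relcache rather than in a stub struct:

    /* Hypothetical per-relation parallel-DML-safety state. */
    typedef enum ParallelDMLSafety
    {
        PARALLEL_DML_SAFETY_UNKNOWN = 0,    /* must be recomputed before use */
        PARALLEL_DML_SAFE,
        PARALLEL_DML_RESTRICTED,
        PARALLEL_DML_UNSAFE
    } ParallelDMLSafety;

    typedef struct RelCacheEntryStub      /* stand-in for a relcache entry */
    {
        unsigned int        relid;
        ParallelDMLSafety   dml_safety;
    } RelCacheEntryStub;

    /*
     * On receiving a "some function's parallel-safety changed" invalidation,
     * forget everything; the next DML on each relation recomputes the flag.
     */
    static void
    invalidate_parallel_dml_safety(RelCacheEntryStub *entries, int nentries)
    {
        for (int i = 0; i < nentries; i++)
            entries[i].dml_safety = PARALLEL_DML_SAFETY_UNKNOWN;
    }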
RE: locking [user] catalog tables vs 2pc vs logical rep
On Tuesday, June 15, 2021 1:51 PM vignesh C wrote: > > Attached the patch-set that addressed those two comments. > > I also fixed the commit message a bit in the 2PC specific patch to HEAD. > > No other changes. > > > > Please check. > > Thanks for the updated patches, the patch applies cleanly in all branches. > Please add a commitfest entry for this, so that we don't miss it. Thank you. I've registered the patch-set in [1]. I'll wait for other reviews from other developers, if any. [1] - https://commitfest.postgresql.org/33/3170/ Best Regards, Takamichi Osumi
Re: Decoding speculative insert with toast leaks memory
On Mon, Jun 14, 2021 at 12:06 PM Dilip Kumar wrote:
>
> On Mon, Jun 14, 2021 at 9:44 AM Dilip Kumar wrote:
> >
> > On Mon, Jun 14, 2021 at 8:34 AM Amit Kapila wrote:
> > >
> > > I think the test in this patch is quite similar to what Noah has
> > > pointed out in the nearby thread [1] to be failing at some intervals. Can
> > > you also please verify the same, and if we can expect similar
> > > failures here then we might want to consider dropping the test in this
> > > patch for now? We can always come back to it once we find a good
> > > solution to make it pass consistently.
> >
> > test insert-conflict-do-nothing   ... ok          646 ms
> > test insert-conflict-do-nothing-2 ... ok         1994 ms
> > test insert-conflict-do-update    ... ok         1786 ms
> > test insert-conflict-do-update-2  ... ok         2689 ms
> > test insert-conflict-do-update-3  ... ok          851 ms
> > test insert-conflict-specconflict ... FAILED     3695 ms
> > test delete-abort-savept          ... ok         1238 ms
> >
> > Yeah, this is the same test that we have used as the base for our test, so
> > let's not push this test until it becomes stable.
>
> Patches without test case.

Pushed!

--
With Regards,
Amit Kapila.
[Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Currently, CREATE DATABASE forces a checkpoint, then copies all the files, then forces another checkpoint. The comments in the createdb() function explain the reasons for this. The attached patch fixes this problem by making CREATE DATABASE completely WAL-logged so that now we can avoid checkpoints. The patch modifies both CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE to be fully WAL-logged.

One main advantage of this change is that it will be cheaper. Forcing checkpoints on an idle system is no big deal, but when the system is under heavy write load, it's very expensive. Another advantage is that it makes things better for features like TDE, which might want the pages in the source database to be encrypted using a different key or nonce than the pages in the target database.

Design Idea:
------------
First, create the target database directory along with the version file, and WAL-log this operation. Create the "relation map file" in the target database and copy the content from the source database. For this, we can use a modified version of write_relmap_file() and WAL-log the relmap create operation along with the file content.

Now, read the relmap file to find the relfilenode for pg_class, and then read pg_class block by block and decode the tuples. For reading the pg_class blocks, we can use ReadBufferWithoutRelCache() so that we don't need the relcache. Nothing prevents us from checking visibility for tuples in another database because CLOG is global to the cluster, and nothing prevents us from deforming those tuples because the column definitions for pg_class have to be the same in every database. From that we can get the relfilenode of every file we need to copy, and prepare a list of all such relfilenodes.

Next, for each relfilenode in the source database, create a corresponding relfilenode in the target database (for all forks) using smgrcreate, which is already a WAL-logged operation. Now read the source relfilenode block by block using ReadBufferWithoutRelCache(), copy each block to the target relfilenode using smgrextend(), and WAL-log it using log_newpage(). For the source database, we cannot directly use smgrread(), because there could be some dirty buffers, so we will have to read them through the buffer manager interface; otherwise, we would have to flush all the dirty buffers.

WAL sequence using pg_waldump:

1. (new WAL to create the db dir and write the PG_VERSION file)
   rmgr: Database  desc: CREATE create dir 1663/16394
2. (new WAL to create and write the relmap file)
   rmgr: RelMap    desc: CREATE database 16394 tablespace 1663 size 512
3. (create relfilenode)
   rmgr: Storage   desc: CREATE base/16394/16384
   rmgr: Storage   desc: CREATE base/16394/2619
4. (write page data)
   rmgr: XLOG      desc: FPI , blkref #0: rel 1663/16394/2619 blk 0 FPW
   rmgr: XLOG      desc: FPI , blkref #0: rel 1663/16394/2619 blk 1 FPW
5. (create other forks)
   rmgr: Storage   desc: CREATE base/16394/2619_fsm
   rmgr: Storage   desc: CREATE base/16394/2619_vm
   ...

I have attached a POC patch which shows this idea. With this patch, all basic sanity testing and "check-world" are passing.

Open points:
------------
- This is a POC patch, so it needs more refactoring/cleanup and testing.
- Might need to relook into the SMGR-level API usage.

Credits:
--------
Thanks to Robert Haas for suggesting this idea and the high-level design.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

From e472d3cb744dc45641d36e919098f9570f80a8fd Mon Sep 17 00:00:00 2001
From: Dilip Kumar
Date: Sat, 5 Jun 2021 17:08:13 +0530
Subject: [PATCH v1] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch fixes this
problem by making create database completely WAL logged and so that we
can avoid the checkpoints.
---
 src/backend/access/rmgrdesc/dbasedesc.c  |   3 +-
 src/backend/access/rmgrdesc/relmapdesc.c |  10 +
 src/backend/access/transam/xlogutils.c   |  12 +-
 src/backend/commands/dbcommands.c        | 653 ---
 src/backend/storage/buffer/bufmgr.c      |  13 +-
 src/backend/utils/cache/relmapper.c      | 222 +++
 src/bin/pg_rewind/parsexlog.c            |   5 +
 src/include/commands/dbcommands_xlog.h   |   7 +-
 src/include/storage/bufmgr.h             |   3 +-
 src/include/utils/relmapper.h            |   6 +-
 10 files changed, 613 insertions(+), 321 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..5010f72 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;

-		appendStringInfo(buf, "copy di
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Jun 15, 2021 at 04:50:24PM +0530, Dilip Kumar wrote:
> Currently, CREATE DATABASE forces a checkpoint, then copies all the
> files, then forces another checkpoint. The comments in the createdb()
> function explain the reasons for this. The attached patch fixes this
> problem by making CREATE DATABASE completely WAL-logged so that now we
> can avoid checkpoints. The patch modifies both CREATE DATABASE and
> ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.
>
> One main advantage of this change is that it will be cheaper. Forcing
> checkpoints on an idle system is no big deal, but when the system is
> under heavy write load, it's very expensive. Another advantage is that
> it makes things better for features like TDE, which might want the
> pages in the source database to be encrypted using a different key or
> nonce than the pages in the target database.

I only had a quick look at the patch, but AFAICS your patch makes the new behavior mandatory. Wouldn't it make sense to have a way to use the previous approach? People wanting to copy a somewhat big database over slow replication may prefer to pay for 2 checkpoints rather than stream everything. The same goes for people who have an otherwise idle system (I often use that to make temporary backups and/or prepare multiple datasets, and most of the time the checkpoint is basically free).
Re: Race condition in recovery?
On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote:
> At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane wrote in
>> I think jacana uses msys[2?], so this likely indicates a problem
>> in path sanitization for the archive command. Andrew, any advice?
>
> Thanks for fixing it.
>
> # I still haven't succeeded in running the TAP tests in an MSYS2
> # environment. I cannot install IPC::Run for msys perl..
>
> regards.

Unpack the attached somewhere and point your PERL5LIB at it. That's all I do.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

IPC-Run-Win-0.94.tgz
Description: application/compressed-tar
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On 15/06/2021 14:20, Dilip Kumar wrote:
> Design Idea:
> First, create the target database directory along with the version
> file and WAL-log this operation. Create the "relation map file" in the
> target database and copy the content from the source database. For
> this, we can use some modified versions of the write_relmap_file() and
> WAL-log the relmap create operation along with the file content. Now,
> read the relmap file to find the relfilenode for pg_class and then we
> read pg_class block by block and decode the tuples. For reading the
> pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
> don't need the relcache. Nothing prevents us from checking visibility
> for tuples in another database because CLOG is global to the cluster.
> And nothing prevents us from deforming those tuples because the column
> definitions for pg_class have to be the same in every database. Then
> we can get the relfilenode of every file we need to copy, and prepare
> a list of all such relfilenodes.

I guess that would work, but you could also walk the database directory like copydir() does. How you find the relations to copy is orthogonal to whether you WAL-log them or use checkpoints. And whether you use the buffer cache is also orthogonal to the rest of the proposal; you could issue FlushDatabaseBuffers() instead of a checkpoint.

> Next, for each relfilenode in the source database, create a respective
> relfilenode in the target database (for all forks) using smgrcreate,
> which is already a WAL-logged operation. Now read the source
> relfilenode block by block using ReadBufferWithoutRelCache() and copy
> the block to the target relfilenode using smgrextend() and WAL-log
> them using log_newpage(). For the source database, we can not directly
> use the smgrread(), because there could be some dirty buffers so we
> will have to read them through the buffer manager interface,
> otherwise, we will have to flush all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be less weird than the current system. As Julien also pointed out, the question is, are there people relying on "CREATE DATABASE foo TEMPLATE bar" to copy a large source database, on the premise that it's fast because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged system if the database is small, and the old system with checkpoints if it's large. But I don't like the idea of having to maintain both.

- Heikki
Re: Race condition in recovery?
On Mon, Jun 14, 2021 at 3:47 PM Andrew Dunstan wrote: > So, will you feel happier with this applied? I haven't tested it yet but > I'm confident it will work. I'm not all that unhappy now, but yeah, that looks like an improvement to me. I'm still afraid that I will keep writing tests that blow up on Windows but that's a bigger problem than we can hope to fix on this thread, and I do think this discussion has helped. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error on pgbench logs
On Tue, 15 Jun 2021 18:01:18 +0900 Michael Paquier wrote: > On Tue, Jun 15, 2021 at 05:15:14PM +0900, Yugo NAGATA wrote: > > On Tue, 15 Jun 2021 10:05:29 +0200 (CEST) Fabien COELHO > > wrote: > > It was not a problem because --sampling-rate --aggregate-interval cannot be > > used at the same time. > > Yep, you are right, thanks. I have missed that both options cannot be > specified at the same time. Maybe, adding Assert(sample_rate == 0.0 || agg_interval == 0) or moving the check of sample_rate into the else block could improve code readability? -- Yugo NAGATA
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas wrote: > > On 15/06/2021 14:20, Dilip Kumar wrote: > > Design Idea: . Then > > we can get the relfilenode of every file we need to copy, and prepare > > a list of all such relfilenode. > > I guess that would work, but you could also walk the database directory > like copydir() does. How you find the relations to copy is orthogonal to > whether you WAL-log them or use checkpoints. And whether you use the > buffer cache is also orthogonal to the rest of the proposal; you could > issue FlushDatabaseBuffers() instead of a checkpoint. Yeah, that would also work, but I thought since we are already avoiding the checkpoint so let's avoid FlushDatabaseBuffers() also and directly use the lower level buffer manager API which doesn't need recache. And I am using pg_class to identify the useful relfilenode so that we can avoid processing some unwanted relfilenode but yeah I agree that this is orthogonal to whether we use checkpoint or not. > Yeah, WAL-logging the contents of the source database would certainly be > less weird than the current system. As Julien also pointed out, the > question is, are there people using on "CREATE DATABASE foo TEMPLATE > bar" to copy a large source database, on the premise that it's fast > because it skips WAL-logging? > > In principle, we could have both mechanisms, and use the new WAL-logged > system if the database is small, and the old system with checkpoints if > it's large. But I don't like idea of having to maintain both. Yeah, I agree in some cases, where we don't have many dirty buffers, checkpointing can be faster. I think code wise maintaining two approaches will not be a very difficult job because the old approach just calls copydir(), but I am thinking about how can we decide which approach is better in which scenario. I don't think we can take calls just based on the database size? It would also depend upon many other factors e.g. how busy your system is, how many total dirty buffers are there in the cluster right? because checkpoint will affect the performance of the operation going on in other databases in the cluster. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Error on pgbench logs
On Tue, 15 Jun 2021 11:38:00 +0200 (CEST) Fabien COELHO wrote: > > > Attached a v3 which adds a boolean to distinguish recording vs flushing. I am fine with this version, but I think it would be better if we have a comment explaining what "tx" is for. Also, how about adding Assert(tx) instead of using "else if (tx)" because we are assuming that tx is always true when agg_interval is not used, right? -- Yugo NAGATA
Re: unnesting multirange data types
On Mon, Jun 14, 2021 at 4:14 PM Alexander Korotkov wrote:
> On Mon, Jun 14, 2021 at 3:50 PM Jonathan S. Katz wrote:
> > On 6/13/21 5:18 PM, Alexander Korotkov wrote:
> >
> > >> "Expands an array into a set of rows. The array's elements are read out
> > >> in storage order."
> > >>
> > >> If we tweaked the multirange "unnest" function, we could change it to:
> > >>
> > >> +
> > >> +        Expands a multirange into a set of rows.
> > >> +        The ranges are read out in storage order (ascending).
> > >> +
> > >>
> > >> to match the array "unnest" function docs, or
> > >>
> > >> +
> > >> +        Expands a multirange into a set of rows that each
> > >> +        contain an individual range.
> > >> +        The ranges are read out in storage order (ascending).
> > >> +
> > >>
> > >> to be a bit more specific. However, I think this is also bordering on
> > >> overengineering the text, given there has been a lack of feedback on the
> > >> "unnest" array function description being confusing.
> > >
> > > I think it's not necessary to talk about rows here. Our
> > > documentation already has a number of examples where we
> > > describe a set of returned values without speaking about rows, including:
> > > json_array_elements, json_array_elements_text, json_object_keys,
> > > pg_listening_channels, pg_tablespace_databases...
> >
> > I do agree -- my main point was that I don't think we need to change
> > anything. I proposed alternatives just to show some other ways of
> > looking at it. But as I mentioned, at this point I think it's
> > overengineering the text.
> >
> > If folks are good with the method + code, I think this is ready.
>
> Cool, thank you for the summary. I'll wait for two days since I've
> published the last revision of the patch [1] (comes tomorrow), and
> push it if no new issues arise.

Pushed! Thanks to thread participants for raising this topic and review. I'll be around to resolve issues if any.

--
Regards,
Alexander Korotkov
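For anyone who wants to try the committed function, usage looks roughly like this (illustrative session; the int4multirange literal is just an example, and the ranges come out in storage order, i.e. ascending):

    SELECT unnest('{[1,3), [5,10)}'::int4multirange);

     unnest
    --------
     [1,3)
     [5,10)
    (2 rows)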
Re: doc issue missing type name "multirange" in chapter title
On Sun, Jun 13, 2021 at 2:48 PM Alexander Korotkov wrote: > On Fri, Jun 11, 2021 at 11:16 PM Tom Lane wrote: > > > > Alexander Korotkov writes: > > > What about "Range/Multirange Functions and Operators"? > > > > Better than a comma, I guess. Personally I didn't have a > > problem with the form with two "ands". > > Thank you. I propose to push the slash option because it both evades > double "and" and it's aligned with sibling section headers (we have > "Date/Time Functions and Operators"). > > Any objections? I heard no objection. So, pushed. -- Regards, Alexander Korotkov
Re: doc issue missing type name "multirange" in chapter title
On Tue, Jun 15, 2021 at 15:11, Alexander Korotkov wrote:
> On Sun, Jun 13, 2021 at 2:48 PM Alexander Korotkov wrote:
> > On Fri, Jun 11, 2021 at 11:16 PM Tom Lane wrote:
> > >
> > > Alexander Korotkov writes:
> > > > What about "Range/Multirange Functions and Operators"?
> > >
> > > Better than a comma, I guess. Personally I didn't have a
> > > problem with the form with two "ands".
> >
> > Thank you. I propose to push the slash option because it both evades
> > double "and" and it's aligned with sibling section headers (we have
> > "Date/Time Functions and Operators").
> >
> > Any objections?
>
> I heard no objection. So, pushed.

Thank you

Pavel

> --
> Regards,
> Alexander Korotkov
Re: Function scan FDW pushdown
Hi Alexander,

On Thu, May 20, 2021 at 11:13 PM Alexander Pyhalov wrote:
>
> Hi.
>
> The attached patch allows pushing joins with function RTEs to PostgreSQL
> data sources.
> This makes executing queries like this
>
> create foreign table f_pgbench_accounts (aid int, bid int, abalance int,
>     filler char(84)) SERVER local_srv OPTIONS (table_name 'pgbench_accounts');
> select * from f_pgbench_accounts join unnest(array[1,2,3]) ON unnest = aid;

It will be good to provide some practical examples where this is useful.

> more efficient.
>
> with patch:
>
> So far I don't know how to visualize the actual function expression used in
> the function RTE, as in postgresExplainForeignScan() es->rtable comes from
> queryDesc->plannedstmt->rtable, and rte->functions is already 0.

The actual function expression will be part of the Remote SQL of the ForeignScan node, so there is no need to visualize it separately.

The patch will have problems when there are multiple foreign tables, all on different servers or using different FDWs. In such a case the function scan's RelOptInfo will get the fpinfo based on the first foreign table the function scan is paired with during join planning. But that may not be the best foreign table to join. We should be able to plan all the possible joins. The current infra to add one fpinfo per RelOptInfo won't help there. We need something better.

The patch targets only postgres_fdw; how do you see this working with other FDWs?

If we come up with the right approach we could use it for
1. pushing down queries with an IN () clause
2. joining a small local table with a large foreign table by sending the local table rows down to the foreign server.

--
Best Wishes,
Ashutosh Bapat
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Am I mistaken in thinking that this would allow CREATE DATABASE to run inside a transaction block now, further reducing the DDL commands that are non-transactional?
Re: Case expression pushdown
Looks quite useful to me. Can you please add this to the next commitfest?

On Wed, Jun 9, 2021 at 5:25 PM Alexander Pyhalov wrote:
>
> Hi.
>
> This patch allows pushing case expressions to foreign servers, so that
> more types of updates could be executed directly.
>
> For example, without patch:
>
> EXPLAIN (VERBOSE, COSTS OFF)
> UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
> WHERE c1 > 1000;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Update on public.ft2 d
>    Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2 WHERE ctid = $1
>    ->  Foreign Scan on public.ft2 d
>          Output: CASE WHEN (c2 > 0) THEN c2 ELSE 0 END, ctid, d.*
>          Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM
>            "S 1"."T 1" WHERE (("C 1" > 1000)) FOR UPDATE
>
> EXPLAIN (VERBOSE, COSTS OFF)
> UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
> WHERE c1 > 1000;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Update on public.ft2 d
>    ->  Foreign Update on public.ft2 d
>          Remote SQL: UPDATE "S 1"."T 1" SET c2 = (CASE WHEN (c2 > 0)
>            THEN c2 ELSE 0 END) WHERE (("C 1" > 1000))
>
> --
> Best regards,
> Alexander Pyhalov,
> Postgres Professional

--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, 15 Jun 2021 at 21:24, wrote:
> Hmmm, I didn't think YB was necessary, but what do others think?

For me personally, without consulting Wikipedia, I know that Petabyte comes after Terabyte, and then I'm pretty sure it's Exabyte. After that, I'd need to check. Assuming I'm not the only person who can't tell exactly how many bytes are in a Yottabyte, would it actually be a readability improvement if we started showing these units to people?

I'd say there might be some argument to implement as far as PB one day, maybe not that far out into the future, especially if we got something like built-in clustering. But I just don't think there's any need to go all out and take it all the way to YB.

There's an above-zero chance we'll break something of someone's by doing this, so I think any changes here should be driven off an actual requirement. I really think this change is more likely to upset someone than please someone.

Just my thoughts.

David
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, 15 Jun 2021 at 05:24, wrote:
> >> I don't see the need to extend the unit to YB.
> >> What use case do you have in mind?
> >
> > Practical or no, I saw no reason not to support all defined units. I assume
> > we'll get to a need sooner or later. :)
>
> Thank you for your reply!
> Hmmm, I didn't think YB was necessary, but what do others think?

If I'm reading the code correctly, the difference between supporting YB and not supporting it is whether there is a line for it in the list of prefixes and their multiples. As such, I don't see why we're even discussing whether or not to include all the standard prefixes. A YB is still an absurd amount of storage, but that's not the point; just put in all the standard prefixes and be done with it. If actual code changes were required in the new code, as they are in the old, it might be worth discussing.

One question: why is there no "k" in the list of prefixes?

Also: why not have only the prefixes in the array, and use a single fixed output format "%s %sB" all the time? It feels like it should be possible to calculate the appropriate idx to use (while adjusting the number to print as is done now) and then just have one psprintf call for all cases.

A more significant question is YB vs. YiB. I know there is a long tradition within computer-related fields of saying that k = 1024, M = 1024^2, etc., but we're not special enough to override the more general principles of SI (Système International), which provide that k = 1000, M = 1000^2 and so on universally, and which provide the alternate prefixes ki, Mi, etc. that use 1024 as the multiple. So I would suggest we either display 2000000 as 2MB or as 1.907MiB.
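To make the size of that discrepancy concrete (simple arithmetic, not taken from the patch):

    2,000,000 B / 1000^2 = 2.000 MB      (SI prefix)
    2,000,000 B / 1024^2 ≈ 1.907 MiB     (binary prefix)

The gap widens with each step up the scale: 1 MiB is about 4.9% larger than 1 MB, while 1 YiB is about 20.9% larger than 1 YB (1.024^8 ≈ 1.209).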
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On 6/15/21 8:04 AM, Heikki Linnakangas wrote: > > Yeah, WAL-logging the contents of the source database would certainly > be less weird than the current system. As Julien also pointed out, the > question is, are there people using on "CREATE DATABASE foo TEMPLATE > bar" to copy a large source database, on the premise that it's fast > because it skips WAL-logging? I'm 100% certain there are. It's not even a niche case. > > In principle, we could have both mechanisms, and use the new > WAL-logged system if the database is small, and the old system with > checkpoints if it's large. But I don't like idea of having to maintain > both. > > Rather than use size, I'd be inclined to say use this if the source database is marked as a template, and use the copydir approach for anything that isn't. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: unnesting multirange data types
Alexander Korotkov writes: > Pushed! Thanks to thread participants for raising this topic and > review. I'll be around to resolve issues if any. Buildfarm is pretty thoroughly unhappy. Did you do a "check-world" before pushing? regards, tom lane
SSL/TLS instead of SSL in docs
In the NSS thread it was discussed (20210603210642.gf22...@momjian.us etc.) that we use SSL rather than TLS in the documentation, which is technically somewhat incorrect. Consensus came to using SSL/TLS instead for referring to encrypted connections. Since this isn't really limited to the NSS work, I'm breaking this out into a new thread.

Looking at the docs, it turns out that we have a mix of SSL (with one ssl), SSL/TLS and TLS for referring to the same thing. The attached changes the documentation to consistently use SSL/TLS when referring to an encrypted connection using a TLS protocol, leaving bare SSL and TLS only for referring to the actual protocols. I *think* I found all instances; there are many, so I might have missed some, but this version seemed like a good place to continue the discussion from the previous thread.

Admittedly it gets pretty unwieldy with the markup on SSL and TLS, but I opted for being consistent, since I don't know of any rules for when it can/should be omitted (and it seems quite arbitrary right now). Mentions in titles were previously not marked up, so I've left those as is. I've also left line breaks as an exercise for later to make the diff more readable.

While in there I added IMO missing items to the glossary and acronyms sections as well as fixed up markup around OpenSSL.

This only deals with docs, but if this is deemed interesting then user-facing messages in the code should use SSL/TLS as well, of course.

Thoughts?

--
Daniel Gustafsson        https://vmware.com/

v1-0003-docs-Consistent-OpenSSL-markup.patch
Description: Binary data

v1-0002-docs-Replace-usage-of-SSL-with-SSL-TLS.patch
Description: Binary data

v1-0001-docs-SSL-TLS-related-acronyms-and-glossary.patch
Description: Binary data
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Tue, Jun 15, 2021 at 7:05 AM Amit Kapila wrote:
> Okay, but I think if we go with your suggested model where whenever
> there is a change in parallel-safety of any function, we need to send
> the new invalidation then I think it won't matter whether the function
> is associated with index expression, check constraint in the table, or
> is used in any other way.

No, it will still matter, because I'm proposing that the parallel-safety of functions that we only access via operator classes will not even be checked. Also, if we decided to make the system more fine-grained - e.g. by invalidating on the specific OID of the function that was changed rather than doing something that is database-wide or global - then it matters even more.

> Yeah, dealing with partitioned tables is tricky. I think if we don't
> want to check upfront the parallel safety of all the partitions then
> the other option as discussed could be to ask the user to specify the
> parallel safety of partitioned tables.

Just to be clear here, I don't think it really matters what we *want* to do. I don't think it's reasonably *possible* to check all the partitions, because we don't hold locks on them. When we're assessing a bunch of stuff related to an individual relation, we have a lock on it. I think - though we should double-check tablecmds.c - that this is enough to prevent all of the dependent objects - triggers, constraints, etc. - from changing. So the stuff we care about is stable. But the situation with a partitioned table is different. In that case, we can't even examine that stuff without locking all the partitions. And even if we do lock all the partitions, the stuff could change immediately afterward and we wouldn't know. So I think it would be difficult to make it correct.

Now, maybe it could be done, and I think that's worth a little more thought. For example, perhaps whenever we invalidate a relation, we could also somehow send some new, special kind of invalidation for its parent saying, essentially, "hey, one of your children has changed in a way you might care about." But that's not good enough, because it only goes up one level. The grandparent would still be unaware that a change it potentially cares about has occurred someplace down in the partitioning hierarchy. That seems hard to patch up, again because of the locking rules. The child can know the OID of its parent without locking the parent, but it can't know the OID of its grandparent without locking its parent. Walking up the whole partitioning hierarchy might be an issue for a number of reasons, including possible deadlocks, and possible race conditions where we don't emit all of the right invalidations in the face of concurrent changes. So I don't quite see a way around this part of the problem, but I may well be missing something. In fact I hope I am missing something, because solving this problem would be really nice.

> We can additionally check the
> parallel safety of partitions when we are trying to insert into a
> particular partition and error out if we detect any parallel-unsafe
> clause and we are in parallel-mode. So, in this case, we won't be
> completely relying on the users. Users can either change the parallel
> safe option of the table or remove/change the parallel-unsafe clause
> after error. The new invalidation message as we are discussing would
> invalidate the parallel-safety for individual partitions but not the
> root partition (partitioned table itself). For root partition, we will
> rely on information specified by the user.

Yeah, that may be the best we can do. Just to be clear, I think we would want to check whether the relation is still parallel-safe at the start of the operation, but not have a run-time check at each function call.

> I am not sure if we have a simple way to check the parallel safety of
> partitioned tables. In some way, we need to rely on user either (a) by
> providing an option to specify whether parallel Inserts (and/or other
> DMLs) can be performed, or (b) by providing a guc and/or rel option
> which indicate that we can check the parallel-safety of all the
> partitions. Yet another option that I don't like could be to
> parallelize inserts on non-partitioned tables.

If we figure out a way to check the partitions automatically that actually works, we don't need a switch for it; we can (and should) just do it that way all the time. But if we can't come up with a correct algorithm for that, then we'll need to add some kind of option where the user declares whether it's OK.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan wrote: > > Rather than use size, I'd be inclined to say use this if the source > database is marked as a template, and use the copydir approach for > anything that isn't. Looks like a good approach.
Confused by the default privilege
Hi,

My use case is to create an isolated interface schema consisting of only views and functions (possibly many schemas, for multi-tenancy or multi-version), which has minimal access exposure. To reduce the mental and maintenance burden, I am inclined to create one role per interface schema, instead of creating separate roles for the owner and the user. As a consequence, the default privileges must be revoked from the owner.

Explicit revocation works just fine, except that it requires repetitive and forgettable statements for each object in the schema. The default privileges come to the rescue. It mostly works, despite a bit of confusion to me. The ending contents are some experiments and demonstrations.

To sum up, I have to either leave some non-critical privileges (e.g., trigger, references) to the owner via the default privilege mechanism, or manually revoke all privileges, to stop the owner having all the default privileges. Plus, the first alternative is not applicable to functions because there is only one privilege for functions (execute). To me, it is confusing and less intuitive. Or is there something I miss?

TL;DR: Revoking all default privileges is effectively equivalent to revoking nothing, because an empty string of access privileges is handled as 'default'. Maybe 'NULL' should mean 'default', and '' (empty string) should mean nothing?

Regards.

--

drop owned by owner;
drop role if exists owner, guest;

create role owner;
create role guest;

drop schema if exists s;
create schema if not exists s authorization owner;

DROP OWNED
DROP ROLE
CREATE ROLE
CREATE ROLE
DROP SCHEMA
CREATE SCHEMA

1. tables

1.1. no-op

set role to owner;
create or replace view s.v1 as select 1;

\dp+ s.v1
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v1   | view |                   |                   |

select * from information_schema.role_table_grants where table_name='v1';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v1         | INSERT         | YES          | NO
 owner   | owner   | postgres      | s            | v1         | SELECT         | YES          | YES
 owner   | owner   | postgres      | s            | v1         | UPDATE         | YES          | NO
 owner   | owner   | postgres      | s            | v1         | DELETE         | YES          | NO
 owner   | owner   | postgres      | s            | v1         | TRUNCATE       | YES          | NO
 owner   | owner   | postgres      | s            | v1         | REFERENCES     | YES          | NO
 owner   | owner   | postgres      | s            | v1         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v1;
 ?column?
        1

1.2. default privilege: revoke all from owner

alter default privileges for user owner revoke all on tables from owner;

\ddp+
 Owner | Schema | Type  | Access privileges
 owner |        | table |

set role to owner;
create or replace view s.v2 as select 1;

\dp+ s.v2
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v2   | view |                   |                   |

select * from information_schema.role_table_grants where table_name='v2';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v2         | INSERT         | YES          | NO
 owner   | owner   | postgres      | s            | v2         | SELECT         | YES          | YES
 owner   | owner   | postgres      | s            | v2         | UPDATE         | YES          | NO
 owner   | owner   | postgres      | s            | v2         | DELETE         | YES          | NO
 owner   | owner   | postgres      | s            | v2         | TRUNCATE       | YES          | NO
 owner   | owner   | postgres      | s            | v2         | REFERENCES     | YES          | NO
 owner   | owner   | postgres      | s            | v2         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v2;
 ?column?
        1

1.3. default privilege: revoke all but one from owner

alter default privileges for user owner revoke all on tables from owner;
alter default privileges for user owner grant trigger on tables to owner;

\ddp+
 Owner | Schema | Type  | Access privileges
 owner |        | table | owner=t/owner

set role to owner;
create or replace view s.v3 as select 1;

\dp+ s.v3
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v3   | view | owner=t/owner     |                   |

select * from information_schema.role_table_grants where table_name='v3';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v3         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v3;
ERROR:  42501: permission denied for view v3
LOCATION:  aclcheck_error, aclchk.c:3461

1.4. manual revoke all from owner

alter default privileges for user owner revoke all on tables from owner;

\ddp+
 Owner | Schema | Type  | Access privileges
 owner |        | table |

set role to owner;
create or replace view s.v4 as select 1;

\dp+ s.v4
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v4   | view |                   |                   |

select * from information_schema.role_table_grants where table_name='v4';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v4         | INSERT         | YES          | NO
 owner   | owner   | postgres      | s            | v4         | SELECT         | YES          | YES
 owner   | owner   | postgres      | s            | v4         | UPDATE         | YES          | NO
 owner   | owner   | postgres      | s            | v4         | DELETE         | YES          | NO
 owner   | owner   | postgres      | s            | v4         | TRUNCATE       | YES          | NO
 owner   | owner   | postgres      | s            | v4         | REFERENCES     | YES          | NO
 owner   | owner   | postgres      | s            | v4         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v4;
 ?column?
        1

So far, the situation is identical to s.v2.

set role to owner;
revoke all on table s.v4 from owner;

\dp+ s.v4
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v4   | view |                   |                   |

select
Re: Confused by the default privilege
Gee, I pasted the ending demonstration as html. Re-pasting a text version. -- ┌ │ drop owned by owner; │ drop role if exists owner, guest; │ │ create role owner; │ create role guest; │ │ drop schema if exists s; │ create schema if not exists s authorization owner; └ DROP OWNED DROP ROLE CREATE ROLE CREATE ROLE DROP SCHEMA CREATE SCHEMA 1 tables 1.1 no-op ┌ │ set role to owner; │ create or replace view s.v1 as select 1; └ ┌ │ \dp+ s.v1 └ Schema Name Type Access privileges Column privileges Policies s v1view ┌ │ select * from information_schema.role_table_grants where table_name='v1'; └ ━ grantor grantee table_catalog table_schema table_name privilege_type is_grantable with_hierarchy ─ ownerownerpostgres s v1 INSERT YES NO ownerownerpostgres s v1 SELECT YES YES ownerownerpostgres s v1 UPDATE YES NO ownerownerpostgres s v1 DELETE YES NO ownerownerpostgres s v1 TRUNCATE YES NO ownerownerpostgres s v1 REFERENCES YES NO ownerownerpostgres s v1 TRIGGER YES NO ━ ┌ │ set role to owner; │ select * from s.v1; └ ━━ ?column? ── 1 ━━ 1.2 default privilege: `revoke all from owner' ─── ┌ │ alter default privileges for user owner revoke all on tables from owner; │ \ddp+ └ ━ Owner Schema Type Access privileges ─ owner table ━ ┌ │ set role to owner; │ create or replace view s.v2 as select 1; └ ┌ │ \dp+ s.v2 └ Schema Name Type Access privileges Column privileges Policies s v2view ┌ │ select * from information_schema.role_table_grants where table_name='v2'; └ ━ grantor grantee table_catalog table_schema table_name privilege_type is_grantable with_hierarchy ─ ownerownerpostgres s v2 INSERT YES NO ownerownerpostgres s v2 SELECT YES YES ownerownerpostgres s v2 UPDATE YES NO ownerownerpostgres s v2 DELETE YES NO ownerownerpostgres s v2 TRUNCATE YES NO ownerownerpostgres s v2 REFERENCES YES NO ownerownerpostgres s v2 TRIGGER YES NO ━ ┌ │ set role to owner; │ select * from s.v2; └ ━━ ?column? ── 1 ━━ 1.3 default privilege: `revoke all but one from owner' ─── ┌ │ alter default privileges for user owner revoke all on tables from owner; │ alter default privileges for user owner grant trigger on tables to owner; │ \ddp+ └ ━ Owner Schema Type Access privileges ─ owner table owner=t/owner ━ ┌ │ set role to owner; │ create or replace view s.v3 as select 1; └ ┌ │ \dp+ s.v3 └ Schema Name Type Access privileges Column privileges Policies ─
Re: Fix around conn_duration in pgbench
On Mon, 14 Jun 2021 10:57:07 +0200 (CEST) Fabien COELHO wrote: > > However, I found that conn_duration is calculated even when -C/--connect > > is not specified, which is wasteful. So we can remove this code as fixed in > > the attached patch. > > I'm fine with the implied code simplification, but it deserves a comment. Thank you for adding comments! > > In addition, disconnection delays are not cumulated even under -C/--connect > > despite what the comment says. I also fixed this in the attached > > patch. > > I'm fine with that, even if it only concerns is_connect. I've updated the > command to work whether now is initially set or not. Ok. I agree with your update. > Also, there is the issue of connection failures: the attached version adds > an error message and exit for initial connections consistently. > This is not done with is_connect, though, and I'm unsure what we should > really do. Well, as to connection failures, I think we should discuss them in the other thread [1] where this issue was originally raised, or in a new thread, because this can be treated as a separate issue from the originally proposed patch. [1] https://www.postgresql.org/message-id/flat/TYCPR01MB5870057375ACA8A73099C649F5349%40TYCPR01MB5870.jpnprd01.prod.outlook.com. Regards, Yugo Nagata -- Yugo NAGATA
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Mon, Jun 14, 2021 at 9:08 PM Robert Haas wrote: > > On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila wrote: > > > Yeah, this could be one idea but I think even if we use pg_proc OID, > > we still need to check all the rel cache entries to find which one > > contains the invalidated OID and that could be expensive. I wonder > > can't we directly find the relation involved and register invalidation > > for the same? We are able to find the relation to which trigger > > function is associated during drop function via findDependentObjects > > by scanning pg_depend. Assuming, we are able to find the relation for > > trigger function by scanning pg_depend, what kinds of problems do we > > envision in registering the invalidation for the same? > > I don't think that finding the relation involved and registering an > invalidation for the same will work properly. Suppose there is a > concurrently-running transaction which has created a new table and > attached a trigger function to it. You can't see any of the catalog > entries for that relation yet, so you can't invalidate it, but > invalidation needs to happen. Even if you used some snapshot that can > see those catalog entries before they are committed, I doubt it fixes > the race condition. You can't hold any lock on that relation, because > the creating transaction holds AccessExclusiveLock, but the whole > invalidation mechanism is built around the assumption that the sender > puts messages into the shared queue first and then releases locks, > while the receiver first acquires a conflicting lock and then > processes messages from the queue. > Won't such messages be proceesed at start transaction time (AtStart_Cache->AcceptInvalidationMessages)? > Without locks, that synchronization > algorithm can't work reliably. As a consequence of all that, I believe > that, not just in this particular case but in general, the > invalidation message needs to describe the thing that has actually > changed, rather than any derived property. We can make invalidations > that say "some function's parallel-safety flag has changed" or "this > particular function's parallel-safety flag has changed" or "this > particular function has changed in some way" (this one, we have > already), but anything that involves trying to figure out what the > consequences of such a change might be and saying "hey, you, please > update XYZ because I changed something somewhere that could affect > that" is not going to be correct. > > > I think we probably need to worry about the additional cost to find > > dependent objects and if there are any race conditions in doing so as > > pointed out by Tom in his email [1]. The concern related to cost could > > be addressed by your idea of registering such an invalidation only > > when the user changes the parallel safety of the function which we > > don't expect to be a frequent operation. Now, here the race condition, > > I could think of could be that by the time we change parallel-safety > > (say making it unsafe) of a function, some of the other sessions might > > have already started processing insert on a relation where that > > function is associated via trigger or check constraint in which case > > there could be a problem. I think to avoid that we need to acquire an > > Exclusive lock on the relation as we are doing in Rename Policy kind > > of operations. > > Well, the big issue here is that we don't actually lock functions > while they are in use. 
So there's absolutely nothing that prevents a > function from being altered in any arbitrary way, or even dropped, > while code that uses it is running. I don't really know what happens > in practice if you do that sort of thing: can you get the same query > to run with one function definition for the first part of execution > and some other definition for the rest of execution? I tend to doubt > it, because I suspect we cache the function definition at some point. > It is possible that in the same statement execution a different function definition can be executed. Say, in session-1 we are inserting three rows, on first-row execution definition-1 of function in index expression gets executed. Now, from session-2, we change the definition of the function to definition-2. Now, in session-1, on second-row insertion, while executing definition-1 of function, we insert in another table that will accept the invalidation message registered in session-2. Now, on third-row insertion, the new definition (definition-2) of function will be executed. > If that's the case, caching the parallel-safety marking at the same > point seems OK too, or at least no worse than what we're doing > already. But on the other hand if it is possible for a query's notion > of the function definition to shift while the query is in flight, then > this is just another example of that and no worse than any other. > Instead of changing the parallel-safety flag, somebody could redefine > the function so that it divides by zero or pro
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, Jun 15, 2021 at 8:31 AM Isaac Morland wrote: > On Tue, 15 Jun 2021 at 05:24, wrote: > >> >> I don't see the need to extend the unit to YB. >> >> What use case do you have in mind? >> > >> >Practical or no, I saw no reason not to support all defined units. I >> assume we’ll >> >get to a need sooner or later. :) >> >> Thank you for your reply! >> Hmmm, I didn't think YB was necessary, but what do others think? >> > > If I’m reading the code correctly, the difference between supporting YB > and not supporting it is whether there is a line for it in the list of > prefixes and their multiples. As such, I don’t see why we’re even > discussing whether or not to include all the standard prefixes. A YB is > still an absurd amount of storage, but that’s not the point; just put all > the standard prefixes and be done with it. If actual code changes were > required in the new code as they are in the old it might be worth > discussing. > Agreed, this is why I went this way. One and done. > One question: why is there no “k” in the list of prefixes? > kB has a special-case code block before you get to this point. I didn't look into the reasons, but assume there are some. > Also: why not have only the prefixes in the array, and use a single fixed > output format "%s %sB" all the time? > > It feels like it should be possible to calculate the appropriate idx to > use (while adjusting the number to print as is done now) and then just have > one psprintf call for all cases. > Sure, if that seems more readable/understandable. > A more significant question is YB vs. YiB. I know there is a long > tradition within computer-related fields of saying that k = 1024, M = > 1024^2, etc., but we’re not special enough to override the more general > principles of SI (Système International) which provide that k = 1000, M = > 1000^2 and so on universally and provide the alternate prefixes ki, Mi, > etc. which use 1024 as the multiple. > > So I would suggest either display 200 as 2MB or as 1.907MiB. > Heh, I was just expanding the existing logic; if others want to have this particular battle go ahead and I'll adjust the code/prefixes, but obviously the logic will need to change if we want to support true MB instead of MiB as MB. Also, this will presumably be a breaking change for anyone using the existing units MB == 1024 * 1024, as we've had for something like 20 years. Changing these units to the *iB will be trivial with this patch, but not looking forward to garnering the consensus to change this part. David
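As a side note for readers following the refactoring discussion: the "only prefixes in the array, one format string" idea can be sketched in a few lines. The following is a standalone toy with invented names and naive rounding, not the backend function or the proposed patch (which operate on int64/numeric and keep pg_size_pretty()'s half-rounding rules):

#include <stdio.h>
#include <stdint.h>

/*
 * Unit suffixes only; each step is 1024 times the previous one.
 * Note: with a 64-bit input the last two entries can never be reached.
 */
static const char *const units[] = {
    "bytes", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
};

static void
format_size(uint64_t size, char *buf, size_t buflen)
{
    double      val = (double) size;
    int         idx = 0;
    int         last = (int) (sizeof(units) / sizeof(units[0])) - 1;

    /* move to the next unit while the number is still unwieldy */
    while (val >= 10240.0 && idx < last)
    {
        val /= 1024.0;
        idx++;
    }

    /* one fixed output format for every unit */
    snprintf(buf, buflen, "%.0f %s", val, units[idx]);
}

int
main(void)
{
    char        buf[64];

    format_size(UINT64_C(123456789012), buf, sizeof(buf));
    puts(buf);                  /* prints "115 GB" with this toy rounding */
    return 0;
}

With this shape, the YB-or-not and MB-vs-MiB questions only change the contents of the units[] array (and the divisor), not the control flow, which is the attraction of the table-driven version.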
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, Jun 15, 2021 at 8:26 AM David Rowley wrote: > On Tue, 15 Jun 2021 at 21:24, wrote: > > Hmmm, I didn't think YB was necessary, but what do others think? > > For me personally, without consulting Wikipedia, I know that Petabyte > comes after Terabyte and then I'm pretty sure it's Exabyte. After > that, I'd need to check. > > Assuming I'm not the only person who can't tell exactly how many bytes > are in a Yottabyte, would it actually be a readability improvement if > we started showing these units to people? > I hadn't really thought about that TBH; to me it seemed like an improvement, but I do see that others might not, and adding confusion is definitely not helpful. That said, it seems like having the code structured in a way that we can expand via adding an element to a table instead of the existing way it's written with nested if blocks is still a useful refactor, whatever we decide the cutoff units should be. > I'd say there might be some argument to implement as far as PB one > day, maybe not that far out into the future, especially if we got > something like built-in clustering. But I just don't think there's any > need to go all out and take it all the way to YB. There's an above > zero chance we'll break something of someones by doing this, so I > think any changes here should be driven off an actual requirement. > I got motivated to do this due to some (granted synthetic) work/workloads, where I was seeing 6+digit TB numbers and thought it was ugly. Looked at the code and thought the refactor was the way to go, and just stuck all of the known units in. > I really think this change is more likely to upset someone than please > someone. > I'd be interested to see reactions from people; to me, it seems a +1, but seems like -1, 0, +1 all valid opinions here; I'd expect more 0's and +1s, but I'm probably biased since I wrote this. :-)
Re: pg_stat_statements and "IN" conditions
> On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote: > > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote: > > On 1/5/21 10:51 AM, Zhihong Yu wrote: > > > > > > + int lastExprLenght = 0; > > > > > > Did you mean to name the variable lastExprLenghth ? > > > > > > w.r.t. extracting to helper method, the second and third > > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar. > > > It is up to you whether to create the helper method. > > > I am fine with the current formation. > > > > Dmitry, thoughts on this review? > > Oh, right. lastExprLenghth is obviously a typo, and as we agreed that > the helper is not strictly necessary I wanted to wait a bit hoping for > more feedback and eventually to post an accumulated patch. Doesn't make > sense to post another version only to fix one typo :) Hi, I've prepared a new rebased version to deal with the new way of computing query id, but as always there is one tricky part. From what I understand, now an external module can provide custom implementation for query id computation algorithm. It seems natural to think this machinery could be used instead of patch in the thread, i.e. one could create a custom logic that will enable constants collapsing as needed, so that same queries with different number of constants in an array will be hashed into the same record. But there is a limitation in how such queries will be normalized afterwards — to reduce level of surprise it's necessary to display the fact that a certain query in fact had more constants that are showed in pgss record. Ideally LocationLen needs to carry some bits of information on what exactly could be skipped, and generate_normalized_query needs to understand that, both are not reachable for an external module with custom query id logic (without replicating significant part of the existing code). Hence, a new version of the patch. >From 81a217485385452c7223a140dd4c98eeb5270945 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Thu, 10 Jun 2021 13:15:35 +0200 Subject: [PATCH v4] Prevent jumbling of every element in ArrayExpr pg_stat_statements produces multiple entries for queries like SELECT something FROM table WHERE col IN (1, 2, 3, ...) depending on number of parameters, because every element of ArrayExpr is jumbled. Make Consts (or any expression that could be reduced to a Const) contribute nothing to the jumble hash if they're part of a series and at position further that specified threshold. Do the same for similar queries with VALUES as well. --- .../expected/pg_stat_statements.out | 835 +- .../pg_stat_statements/pg_stat_statements.c | 42 +- .../sql/pg_stat_statements.sql| 163 src/backend/utils/misc/guc.c | 13 + src/backend/utils/misc/queryjumble.c | 274 +- src/include/utils/queryjumble.h | 11 +- 6 files changed, 1323 insertions(+), 15 deletions(-) diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out index 40b5109b55..3fc1978066 100644 --- a/contrib/pg_stat_statements/expected/pg_stat_statements.out +++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out @@ -205,7 +205,7 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C"; INSERT INTO test VALUES(generate_series($1, $2), $3) | 1 | 10 SELECT * FROM test ORDER BY a| 1 | 12 SELECT * FROM test WHERE a > $1 ORDER BY a | 2 |4 - SELECT * FROM test WHERE a IN ($1, $2, $3, $4, $5) | 1 |8 + SELECT * FROM test WHERE a IN ($1, $2, $3, $4, ...) 
| 1 |8 SELECT pg_stat_statements_reset()| 1 |1 SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 |0 UPDATE test SET b = $1 WHERE a = $2 | 6 |6 @@ -1067,4 +1067,837 @@ SELECT COUNT(*) FROM pg_stat_statements WHERE query LIKE '%SELECT GROUPING%'; 2 (1 row) +-- +-- Consts merging +-- +SET pg_stat_statements.merge_threshold = 5; +CREATE TABLE test_merge (id int, data int); +-- IN queries +-- Normal +SELECT pg_stat_statements_reset(); + pg_stat_statements_reset +-- + +(1 row) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3); + id | data ++-- +(0 rows) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT * FROM test_merge WHERE id IN ($1, $2, $3) | 1 + SELECT pg_stat_statements_reset() | 1
Re: Use singular number when appropriate
On Tue, Jun 15, 2021 at 09:37:11AM +0200, Laurenz Albe wrote: > On Tue, 2021-06-15 at 04:59 +, David Fetter wrote: > > I thought about using the dual, but wasn't sure how many languages > > support it. > > I think none of the languages for which we cater uses the dual. But > I guess you were joking, since the tests are not translated ... I was. > > if (fail_count == 0 && fail_ignore_count == 0) > > snprintf(buf, sizeof(buf), > > -_(" All %d tests passed. "), > > -success_count); > > +_(" %s %d test%s passed. "), > > +success_count == 1 ? "The" : "All", > > +success_count, > > +success_count == 1 ? "" : "s"); > > ... and that wouldn't be translatable. Thanks, will rearrange. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
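For the archives: if these messages are (or ever become) subject to translation, the usual way to get the singular right without splicing English fragments is gettext's plural API. A hedged, untested sketch of what the hunk could look like:

    /*
     * Sketch only (untested): ngettext() lets each language supply its own
     * plural forms, so the translator sees two complete sentences instead
     * of assembled fragments.  The original message's space padding is
     * omitted here.
     */
    if (fail_count == 0 && fail_ignore_count == 0)
        snprintf(buf, sizeof(buf),
                 ngettext(" %d test passed. ",
                          " All %d tests passed. ",
                          success_count),
                 success_count);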
Re: Duplicate history file?
Greetings, * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > At Fri, 11 Jun 2021 16:08:33 +0900, Michael Paquier > wrote in > > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote: > > > I think cp can be an example as far as we explain the limitations. (On > > > the other hand "test !-f" cannot since it actually prevents server > > > from working correctly.) > > > > Disagreed. I think that we should not try to change this area until > > we can document a reliable solution, and a simple "cp" is not that. > > Isn't removing cp from the documentation a change in this area? I > basically agree to not to change anything but the current example > "test ! -f && cp .." and relevant description has been known to > be problematic in a certain situation. [...] > - Write the full (known) requirements and use a pseudo tool-name in > the example? I'm generally in favor of just using a pseudo tool-name and then perhaps providing a link to a new place on .Org where people can ask to have their PG backup solution listed, or something along those lines. > - provide a minimal implement of the command? Having been down this road for a rather long time, I can't accept this as a serious suggestion. No, not even with Perl. Been there, done that, not going back. > - recommend some external tools (that we can guarantee that they >comform the requriements)? The requirements are things which are learned over years and changes over time. Trying to document them and keep up with them would be a pretty serious project all on its own. There are external projects who spend serious time and energy doing their best to provide the tooling needed here and we should be promoting those, not trying to pretend like this is a simple thing which anyone could write a short perl script to accomplish. > - not recommend any tools? This is the approach that has been tried and it's, objectively, failed miserably. Our users are ending up with invalid and unusable backups, corrupted WAL segments, inability to use PITR, and various other issues because we've been trying to pretend that this isn't a hard problem. We really need to stop that and accept that it's hard and promote the tools which have been explicitly written to address that hard problem. > > Hmm. A simple command that could be used as reference is for example > > "dd" that flushes the file by itself, or we could just revisit the > > discussions about having a pg_copy command, or we could document a > > small utility in perl that does the job. > > I think we should do that if pg_copy comforms the mandatory > requirements but maybe it's in the future. Showing the minimal > implement in perl looks good. Already tried doing it in perl. No, it's not simple and it's also entirely vaporware today and implies that we're going to develop this tool, improve it in the future as we realize it needs to be improved, and maintain it as part of core forever. If we want to actually adopt and pull in a backup tool to be part of core then we should talk about things which actually exist, such as the various existing projects that have been written to specifically work to address all the requirements which are understood today, not say "well, we can just write a simple perl script to do it" because it's not actually that simple. Providing yet another half solution would be doubling-down on the failed approach to document a "simple" solution and would be a disservice to our users. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Wed, 16 Jun 2021 at 02:58, David Christensen wrote: > That said, it seems like having the code structured in a way that we can > expand via adding an element to a table instead of the existing way it's > written with nested if blocks is still a useful refactor, whatever we decide > the cutoff units should be. I had not really looked at the patch, but if there's a cleanup portion to the same patch as you're adding the YB too, then maybe it's worth separating those out into another patch so that the two can be considered independently. David
Re: Delegating superuser tasks to new security roles
Greetings, * torikoshia (torikos...@oss.nttdata.com) wrote: > On 2021-06-14 23:53, Mark Dilger wrote: > >>On Jun 14, 2021, at 5:51 AM, torikoshia > >>wrote: > >>BTW, do these patches enable non-superusers to create user with > >>bypassrls? [...] > >Do you believe that functionality should be added? I have not thought > >much about that issue. > > I just noticed that because I was looking into operations that can only be > done by superusers. In general, I agree with the sentiment that we should be providing a way to have non-superusers able to do things that only a superuser can do today. I'd love to get rid of all of the explicit superuser checks in the backend except the one that makes a superuser a member of all roles. Thanks, Stephen signature.asc Description: PGP signature
Should wal receiver reply to wal sender more aggressively?
While working on some related issues I found that the wal receiver runs the walrcv_receive() loop to completion before replying with the write/flush/apply LSNs to the wal sender in XLogWalRcvSendReply(). It is possible for the walrcv_receive() loop to receive and write a lot of WAL, so it does not report those LSNs in time, which ultimately slows down transactions due to syncrep wait (assuming the default synchronous_commit). In my TPCB testing, the worst case I found was 10,466,469 bytes consumed in the walrcv_receive() loop. More seriously, we call XLogWalRcvSendReply(false, false) only after handling those bytes; the first argument false means no force, i.e. it notifies only if the maximum time given by the GUC wal_receiver_status_interval (10s by default) has been reached, so we may have to wait for later calls of XLogWalRcvSendReply() to notify the wal sender.

I thought about and tried enhancing this by force-replying to the wal sender each time a maximum number of bytes (e.g. 128K) has been received, but several things confused me:

- What's the purpose of the GUC wal_receiver_status_interval? The OS kernel is usually not efficient when handling small packets, but we are not replying that aggressively, so why is this GUC there?

- I ran a simple TPCB workload (1000 scaling, 200 connections, shared_buffers and max_connections tuned) but found no obvious performance difference with and without the code change. I did not see an obvious system (IO/CPU/network) bottleneck - probably the bottleneck is in PG itself. I have not investigated further at this moment, but the change should in theory help, no?

Another thing that came to my mind is the wal receiver logic: currently the wal receiver process does network IO, WAL write, and WAL flush in one process. Network IO is async, blocking at epoll/poll, and WAL write is mostly non-blocking, but the WAL flush could probably be decoupled into a dedicated process. And maybe use sync_file_range instead of WAL file fsync in issue_xlog_fsync()? We should sync the WAL contents with lower LSNs first and reply to the wal sender in time, right?

Below is the related code:

    /* See if we can read data immediately */
    len = walrcv_receive(wrconn, &buf, &wait_fd);
    if (len != 0)
    {
        /*
         * Process the received data, and any subsequent data we
         * can read without blocking.
         */
        for (;;)
        {
            if (len > 0)
            {
                /*
                 * Something was received from primary, so reset
                 * timeout
                 */
                last_recv_timestamp = GetCurrentTimestamp();
                ping_sent = false;
                XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            }
            else if (len == 0)
                break;
            else if (len < 0)
            {
                ereport(LOG,
                        (errmsg("replication terminated by primary server"),
                         errdetail("End of WAL reached on timeline %u at %X/%X.",
                                   startpointTLI,
                                   LSN_FORMAT_ARGS(LogstreamResult.Write))));
                endofwal = true;
                break;
            }
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* Let the primary know that we received some data. */
        XLogWalRcvSendReply(false, false);

        /*
         * If we've written some records, flush them to disk and
         * let the startup process and primary server know about
         * them.
         */
        XLogWalRcvFlush(false);

-- Paul Guo (Vmware)
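For what it's worth, the smallest version of the change described above is probably just to count the bytes handled inside that loop and force a reply (and possibly a flush) once a threshold is crossed. A rough, untested sketch, where bytes_since_reply and WALRCV_REPLY_THRESHOLD are invented names:

    /*
     * Sketch only -- untested.  Inside the "if (len > 0)" branch of the
     * loop quoted above, right after XLogWalRcvProcessMsg():
     */
    bytes_since_reply += len;
    if (bytes_since_reply >= WALRCV_REPLY_THRESHOLD)    /* e.g. 128 kB */
    {
        /*
         * For remote-flush synchronous commit it is the flush LSN that
         * matters, so flush before reporting; for remote-write the
         * reply alone already helps.
         */
        XLogWalRcvFlush(false);
        XLogWalRcvSendReply(true, false);   /* force an immediate reply */
        bytes_since_reply = 0;
    }

Whether something like this actually moves the needle presumably depends on where the real bottleneck is, which matches the observation above that the TPCB run showed no measurable difference.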
Re: unnesting multirange data types
On Tue, Jun 15, 2021 at 4:49 PM Tom Lane wrote: > Alexander Korotkov writes: > > Pushed! Thanks to thread participants for raising this topic and > > review. I'll be around to resolve issues if any. > > Buildfarm is pretty thoroughly unhappy. Did you do a "check-world" > before pushing? Yes, I'm looking at this now. I did run "check-world", but it passed for me. Probably the same reason it passed for some buildfarm animals... -- Regards, Alexander Korotkov
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 07:53:26AM +0200, Peter Eisentraut wrote: > On 15.06.21 07:37, Michael Paquier wrote: > > > > Actually, I was just thinking that default yes/no/on/off stuff maybe > > > > should be > > > > defined to mean "lz4" rather than meaning pglz for "backwards > > > > compatibility". > > > +1 > > I am not sure that we have any reasons to be that aggressive about > > this one either, and this would mean that wal_compression=on implies a > > different method depending on the build options. I would just stick > > with the past, careful practice that we have to use a default > > backward-compatible value as default, while being able to use a new > > option. You're right, I hadn't though this through all the way. There's precedent if the default is non-static (wal_sync_method). But I think yes/on/true/1 should be a compatibility alias for a static thing, and then the only option is pglz. Of course, changing the default to LZ4 is still a possibility. > If we think this new thing is strictly better than the old thing, then why > not make it the default. What would be the gain from sticking to the old > default? > > The point that the default would depend on build options is a valid one. > I'm not sure whether it's important enough by itself.
Re: a path towards replacing GEQO with something better
On Wed, Jun 9, 2021 at 9:24 PM John Naylor wrote: > 3) It actually improves the existing exhaustive search, because the > complexity of the join order problem depends on the query shape: a "chain" > shape (linear) vs. a "star" shape (as in star schema), for the most common > examples. The size of the DP table grows like this (for n >= 4): > > Chain: (n^3 - n) / 6 (including bushy plans) > Star: (n - 1) * 2^(n - 2) > > n chain star > > 4 10 12 > 5 20 32 > 6 35 80 > 7 56192 > 8 84448 > 9120 1024 > 10165 2304 > 11220 5120 > 12286 11264 > 13364 24576 > 14455 53248 > 15560 114688 > ... > 64 43680 290536219160925437952 I don't quite understand the difference between the "chain" case and the "star" case. Can you show sample queries for each one? e.g. SELECT ... FROM a_1, a_2, ..., a_n WHERE ? One idea I just ran across in https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf is to try to economize by skipping consideration of bushy plans. We could start doing that when some budget is exceeded, similar to what you are proposing here, but probably the budget for skipping consideration of bushy plans would be smaller than the budget for switching to IDP. The idea of skipping bushy plan generation in some cases makes sense to me intuitively because most of the plans PostgreSQL generates are mostly left-deep, and even when we do generate bushy plans, they're not always a whole lot better than the nearest equivalent left-deep plan. The paper argues that considering bushy plans makes measurable gains, but also that failure to consider such plans isn't catastrophically bad. -- Robert Haas EDB: http://www.enterprisedb.com
change logging defaults
I propose to change some defaults:

log_autovacuum_min_duration = 0
log_checkpoints = on
log_lock_waits = on (and log_recovery_conflict_waits too?)
log_temp_files = 128kB

Note that pg_regress does this:

| fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
| fputs("log_autovacuum_min_duration = 0\n", pg_conf);
| fputs("log_checkpoints = on\n", pg_conf);
| fputs("log_line_prefix = '%m %b[%p] %q%a '\n", pg_conf);
| fputs("log_lock_waits = on\n", pg_conf);
| fputs("log_temp_files = 128kB\n", pg_conf);
| fputs("max_prepared_transactions = 2\n", pg_conf);

-- Justin
Re: Duplicate history file?
On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote: > > The requirements are things which are learned over years and changes > over time. Trying to document them and keep up with them would be a > pretty serious project all on its own. There are external projects who > spend serious time and energy doing their best to provide the tooling > needed here and we should be promoting those, not trying to pretend like > this is a simple thing which anyone could write a short perl script to > accomplish. The fact that this is such a complex problem is the very reason why we should spend a lot of energy documenting the various requirements. Otherwise, how could anyone implement a valid program for that and how could anyone validate that a solution claiming to do its job actually does its job? > Already tried doing it in perl. No, it's not simple and it's also > entirely vaporware today and implies that we're going to develop this > tool, improve it in the future as we realize it needs to be improved, > and maintain it as part of core forever. If we want to actually adopt > and pull in a backup tool to be part of core then we should talk about > things which actually exist, such as the various existing projects that > have been written to specifically work to address all the requirements > which are understood today, not say "well, we can just write a simple > perl script to do it" because it's not actually that simple. Adopting a full backup solution seems like a bit extreme. On the other hand, having some real core implementation of an archive_command for the most general use cases (local copy, distant copy over ssh...) could make sense. This would remove that burden for some, probably most, of the 3rd party backup tools, and would also ensure that the various requirements are properly documented since it would be the implementation reference.
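To make the "reference implementation" idea concrete (and to illustrate why a plain cp is not enough), even the durability subset alone involves several steps: copy to a temporary name, fsync the file, rename it into place, and fsync the containing directory. Below is a minimal standalone sketch of just that subset, with invented names and error handling reduced to the bare minimum. It is deliberately not a complete archive_command: it does not deal with pre-existing archive files, partial copies left by earlier crashes, retries, or verification, which is exactly the kind of accumulated requirement the dedicated backup tools handle.

#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
durable_copy(const char *src, const char *dst)
{
    char        tmp[1024];
    char        dstdir[1024];
    char        buf[65536];
    ssize_t     nread;
    int         in, out, dirfd;

    snprintf(tmp, sizeof(tmp), "%s.tmp", dst);

    if ((in = open(src, O_RDONLY)) < 0)
        return -1;
    if ((out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
    {
        close(in);
        return -1;
    }

    while ((nread = read(in, buf, sizeof(buf))) > 0)
    {
        if (write(out, buf, (size_t) nread) != nread)
            goto fail;
    }
    if (nread < 0)
        goto fail;

    /* contents must reach disk before the file appears under its final name */
    if (fsync(out) < 0)
        goto fail;
    close(in);
    close(out);

    if (rename(tmp, dst) < 0)
        return -1;

    /* the rename itself must be made durable as well */
    strncpy(dstdir, dst, sizeof(dstdir) - 1);
    dstdir[sizeof(dstdir) - 1] = '\0';
    if ((dirfd = open(dirname(dstdir), O_RDONLY)) < 0)
        return -1;
    if (fsync(dirfd) < 0)
    {
        close(dirfd);
        return -1;
    }
    close(dirfd);
    return 0;

fail:
    close(in);
    close(out);
    unlink(tmp);
    return -1;
}

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <source-wal-file> <archive-destination>\n", argv[0]);
        return 1;
    }
    return durable_copy(argv[1], argv[2]) == 0 ? 0 : 1;
}

Whether something along these lines belongs in core, or whether pointing users at the existing tools is the better answer, is of course the question being debated here.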
Re: Case expression pushdown
Hi. Ashutosh Bapat писал 2021-06-15 16:24: Looks quite useful to me. Can you please add this to the next commitfest? Addded to commitfest. Here is an updated patch version. -- Best regards, Alexander Pyhalov, Postgres ProfessionalFrom 80d60eb9b1630ee55d1825964e0e976ae6c289a1 Mon Sep 17 00:00:00 2001 From: Alexander Pyhalov Date: Tue, 30 Mar 2021 13:24:14 +0300 Subject: [PATCH] Allow pushing CASE expression to foreign server --- contrib/postgres_fdw/deparse.c| 118 ++ .../postgres_fdw/expected/postgres_fdw.out| 47 +++ contrib/postgres_fdw/sql/postgres_fdw.sql | 22 3 files changed, 187 insertions(+) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 31919fda8c6..3621fed4b54 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -87,6 +87,7 @@ typedef struct foreign_loc_cxt { Oid collation; /* OID of current collation, if any */ FDWCollateState state; /* state of current collation choice */ + Expr *case_arg; /* the last case arg to inspect */ } foreign_loc_cxt; /* @@ -101,6 +102,7 @@ typedef struct deparse_expr_cxt * a base relation. */ StringInfo buf; /* output buffer to append to */ List **params_list; /* exprs that will become remote Params */ + List *case_args; /* list of args to deparse CaseTestExpr */ } deparse_expr_cxt; #define REL_ALIAS_PREFIX "r" @@ -186,6 +188,9 @@ static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno, deparse_expr_cxt *context); +static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context); +static void deparseCaseTestExpr(CaseTestExpr *node, deparse_expr_cxt *context); + /* * Helper functions */ @@ -254,6 +259,7 @@ is_foreign_expr(PlannerInfo *root, glob_cxt.relids = baserel->relids; loc_cxt.collation = InvalidOid; loc_cxt.state = FDW_COLLATE_NONE; + loc_cxt.case_arg = NULL; if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt)) return false; @@ -312,6 +318,7 @@ foreign_expr_walker(Node *node, /* Set up inner_cxt for possible recursion to child nodes */ inner_cxt.collation = InvalidOid; inner_cxt.state = FDW_COLLATE_NONE; + inner_cxt.case_arg = outer_cxt->case_arg; switch (nodeTag(node)) { @@ -509,6 +516,62 @@ foreign_expr_walker(Node *node, state = FDW_COLLATE_UNSAFE; } break; + case T_CaseExpr: + { +CaseExpr *ce = (CaseExpr *) node; +ListCell *arg; + +if (ce->arg) + inner_cxt.case_arg = ce->arg; + +foreach(arg, ce->args) +{ + CaseWhen *w = lfirst_node(CaseWhen, arg); + + if (!foreign_expr_walker((Node *) w->expr, + glob_cxt, &inner_cxt)) + return false; + + if (!foreign_expr_walker((Node *) w->result, + glob_cxt, &inner_cxt)) + return false; +} + +if (!foreign_expr_walker((Node *) ce->defresult, + glob_cxt, &inner_cxt)) + return false; + +collation = ce->casecollid; +if (collation == InvalidOid) + state = FDW_COLLATE_NONE; +else if (inner_cxt.state == FDW_COLLATE_SAFE && + collation == inner_cxt.collation) + state = FDW_COLLATE_SAFE; +else if (collation == DEFAULT_COLLATION_OID) + state = FDW_COLLATE_NONE; +else + state = FDW_COLLATE_UNSAFE; + } + break; + case T_CaseTestExpr: + { +Expr *arg; + +Assert(outer_cxt->case_arg != NULL); +arg = outer_cxt->case_arg; + +if (!foreign_expr_walker((Node *) arg, + glob_cxt, &inner_cxt)) + return false; + +/* + * Collation and state just bubble up from the previously + * saved case argument + */ +collation = inner_cxt.collation; +state = inner_cxt.state; + } + break; case T_OpExpr: case T_DistinctExpr: /* struct-equivalent to OpExpr */ { @@ -1019,6 +1082,7 
@@ deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *rel, context.foreignrel = rel; context.scanrel = IS_UPPER_REL(rel) ? fpinfo->outerrel : rel; context.params_list = params_list; + context.case_args = NIL; /* Construct SELECT clause */ deparseSelectSql(tlist, is_subquery, retrieved_attrs, &context); @@ -1598,6 +1662,7 @@ deparseFromExprForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *foreignrel, context.scanrel = foreignrel; context.root = root; context.params_list = params_list; + context.case_args = NIL; appendStringInfoChar(buf, '('); appendConditions(fpinfo->joinclauses, &context); @@ -1901,6 +1966,7 @@ deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root, context.scanrel = foreignrel; context.buf = buf; context.params_list = params_list; + context.case_args = NIL; appendStringInfoString(buf, "UPDATE "); deparseRelation(buf, rel); @@ -2008,6 +2074,7 @@ deparseDirectDeleteSql(StringInfo buf, PlannerInfo *root, context.scanrel = foreignrel; context.buf = buf;
Re: snapshot too old issues, first around wraparound and then more.
Robert Haas writes: > Oh, maybe I'm the one who misunderstood... So, it's well over a year later, and so far as I can see exactly nothing has been done about snapshot_too_old's problems. I never liked that feature to begin with, and I would be very glad to undertake the task of ripping it out. If someone thinks this should not happen, please commit to fixing it ... and not "eventually". regards, tom lane
Re: a path towards replacing GEQO with something better
Robert Haas writes: > One idea I just ran across in > https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf > is to try to economize by skipping consideration of bushy plans. We > could start doing that when some budget is exceeded, similar to what > you are proposing here, but probably the budget for skipping > consideration of bushy plans would be smaller than the budget for > switching to IDP. The idea of skipping bushy plan generation in some > cases makes sense to me intuitively because most of the plans > PostgreSQL generates are mostly left-deep, and even when we do > generate bushy plans, they're not always a whole lot better than the > nearest equivalent left-deep plan. The paper argues that considering > bushy plans makes measurable gains, but also that failure to consider > such plans isn't catastrophically bad. It's not catastrophically bad until you hit a query where the only correct plans are bushy. These do exist, and I think they're not that uncommon. Still, I take your point that maybe we could ratchet down the cost of exhaustive search by skimping on this part. Maybe it'd work to skip bushy so long as we'd found at least one left-deep or right-deep path for the current rel. regards, tom lane
Re: change logging defaults
Justin Pryzby writes: > I propose to change some defaults: > log_autovacuum_min_duration = 0 > log_checkpoints = on > log_lock_waits = on (and log_recovery_conflict_waits too?) > log_temp_files = 128kB Why? Based on reports that I see, some quite large percentage of Postgres DBAs never look at the postmaster log at all. So making the log bulkier isn't something that will be useful to them. People who do want these reports are certainly capable of turning them on. > Note that pg_regress does this: What we find useful for testing seems to me to be nearly unrelated to production needs. regards, tom lane
disfavoring unparameterized nested loops
>From time to time, someone tells me that they've configured enable_nestloop=false on postgresql.conf, which is a pretty bad idea since there are a significant number of cases where such plans are the only reasonable way of executing some query. However, it's no great secret that PostgreSQL's optimizer sometimes produces nested loops that are very, very, very slow, generally because it has grossly underestimated the cardinality of the inner side of the nested loop. The frustration which users experience as a result of such events is understandable. I read https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf today and found out that the authors of that paper did something a bit more nuanced which, in their experiments, was very successful. It sounds like what they basically did is disabled unparameterized nested loops. They argue that such nested loops figure to gain very little as compared with a hash join, but that they might turn out to lose a lot if the cardinality estimation is inaccurate, and they present experimental results to back up those claims. One observation that the paper makes along the way is that every system they tested is more likely to underestimate the cardinality of joins than to overestimate it, and that this tendency becomes more pronounced as the size of the join planning problem increases. On reflection, I believe this matches my experience, and it makes sense that it should be so, since it occasionally happens that the join selectivity estimate is essentially zero, and a bigger join problem is more likely to have at least one such case. On the other hand, the join selectivity estimate can never be +infinity. Hence, it's more important in general for a database system to be resilient against underestimates than to be resilient against overestimates. Being less willing to choose unparameterized nested loops is one way to move in that direction. How to do that is not very clear. One very simple thing we could do would be to introduce enable_nestloop=parameterized or enable_parameterized_nestloop=false. That is a pretty blunt instrument but the authors of the paper seem to have done that with positive results, so maybe it's actually not a dumb idea. Another approach would be to introduce a large fuzz factor for such nested loops e.g. keep them only if the cost estimate is better than the comparable hash join plan by at least a factor of N (or if no such plan is being generated for some reason). I'm not very confident that this would actually do what we want, though. In the problematic cases, a nested loop is likely to look extremely cheap, so just imagining that the cost might be higher is not very protective. Perhaps a better approach would be something like: if the estimated number of inner rows is less than 100, then re-estimate the cost of this approach and of the best available hash join on the assumption that there are 100 inner rows. If the hash join still wins, keep it; if it loses under that assumption, throw it out. I think it's likely that this approach would eliminate a large number of highly risky nested loop plans, probably even with s/100/10/g, without causing many other problems (except perhaps increased planner CPU consumption ... but maybe that's not too bad). Just to be clear, I do understand that there are cases where no Hash Join is possible, but anything we do here could be made to apply only when a hash join is in fact possible. 
We could also think about making the point of comparison the best other plans of any sort rather than a hash join specifically, which feels a little more principled but might actually be worse. When a Nested Loop is a stupid idea, it's stupid precisely because the inner side is big and we could've avoided recomputing it over and over by using a Hash Join instead, not because some Merge Join based plan turns out to be better. I mean, it is possible that some Merge Join plan does turn out to be better, but that's not rage-inducing in the same way. Nobody looks at a complicated join plan that happened to use a Nested Loop and says "obviously, this is inferior to a merge join," or if they do, they're probably full of hot air. But people look at complicated join plans that happen to use a Nested Loop and say "obviously, this is inferior to a hash join" *all the time* and assuming the inner path is unparameterized, they are very often correct. Thoughts? I'd be particularly curious to hear about any cases anyone knows about where an unparameterized nested loop and a hash join with batches=1 are both possible and where the unparameterized nested loop is *much* cheaper than the hash join. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
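To make the last comparison idea concrete, here is a toy sketch of the decision rule only -- standalone C with completely made-up cost formulas, not PostgreSQL's cost model or planner code:

#include <stdbool.h>
#include <stdio.h>

#define INNER_ROW_FLOOR 100.0

/* stand-in cost functions -- deliberately simplistic, not the real model */
static double
nestloop_cost(double outer_rows, double inner_rows)
{
    return outer_rows * inner_rows;         /* rescan inner side per outer row */
}

static double
hashjoin_cost(double outer_rows, double inner_rows)
{
    return outer_rows + 2.0 * inner_rows;   /* build once, probe once */
}

/*
 * Keep an unparameterized nested loop only if it still wins when we
 * assume the inner side has at least INNER_ROW_FLOOR rows.
 */
static bool
keep_unparam_nestloop(double outer_rows, double est_inner_rows)
{
    double      clamped = est_inner_rows < INNER_ROW_FLOOR ?
        INNER_ROW_FLOOR : est_inner_rows;

    return nestloop_cost(outer_rows, clamped) <=
        hashjoin_cost(outer_rows, clamped);
}

int
main(void)
{
    /* estimated 1 inner row: looks great naively, rejected after clamping */
    printf("est 1 inner row  -> keep nestloop? %s\n",
           keep_unparam_nestloop(1000.0, 1.0) ? "yes" : "no");
    /* truly tiny outer side: the nested loop can still survive */
    printf("outer side 1 row -> keep nestloop? %s\n",
           keep_unparam_nestloop(1.0, 1.0) ? "yes" : "no");
    return 0;
}

The interesting property shows up even with fake numbers: a nested loop that looked unbeatable at an estimated one inner row no longer survives once the inner side is assumed to be at least 100 rows, while plans with genuinely tiny outer sides still keep it.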
Re: snapshot too old issues, first around wraparound and then more.
Hi, On 2021-06-15 12:51:28 -0400, Tom Lane wrote: > Robert Haas writes: > > Oh, maybe I'm the one who misunderstood... > > So, it's well over a year later, and so far as I can see exactly > nothing has been done about snapshot_too_old's problems. > > I never liked that feature to begin with, and I would be very > glad to undertake the task of ripping it out. If someone thinks > this should not happen, please commit to fixing it ... and not > "eventually". I still think that's the most reasonable course. I actually like the feature, but I don't think a better implementation of it would share much if any of the current infrastructure. Greetings, Andres Freund
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 9:51 AM Tom Lane wrote: > So, it's well over a year later, and so far as I can see exactly > nothing has been done about snapshot_too_old's problems. FWIW I think that the concept itself is basically reasonable. The implementation is very flawed, though, so it hardly enters into it. > I never liked that feature to begin with, and I would be very > glad to undertake the task of ripping it out. If someone thinks > this should not happen, please commit to fixing it ... and not > "eventually". ISTM that this is currently everybody's responsibility, and therefore nobody's responsibility. That's probably why the problems haven't been resolved yet. I propose that the revert question be explicitly timeboxed. If the issues haven't been fixed by some date, then "snapshot too old" automatically gets reverted without further discussion. This gives qualified hackers the opportunity to save the feature if they feel strongly about it, and are actually willing to take responsibility for its ongoing maintenance. -- Peter Geoghegan
Re: a path towards replacing GEQO with something better
On Tue, Jun 15, 2021 at 1:00 PM Tom Lane wrote: > Still, I take your point that maybe we could ratchet down the cost of > exhaustive search by skimping on this part. Maybe it'd work to skip > bushy so long as we'd found at least one left-deep or right-deep path > for the current rel. Yes, that sounds better. -- Robert Haas EDB: http://www.enterprisedb.com
Re: unnesting multirange data types
Alexander Korotkov writes: > I did run "check-world", but it passed for me. Probably the same > reason it passed for some buildfarm animals... The only buildfarm animals that have passed since this went in are the ones that don't run the pg_dump or pg_upgrade tests. It looks to me like the proximate problem is that you should have taught pg_dump to skip these new auto-generated functions. However, I fail to see why we need auto-generated functions for this at all. Couldn't we have done it with one polymorphic function? I think this ought to be reverted and reviewed more carefully. regards, tom lane
Re: unnesting multirange data types
On 2021-Jun-15, Tom Lane wrote: > It looks to me like the proximate problem is that you should > have taught pg_dump to skip these new auto-generated functions. > However, I fail to see why we need auto-generated functions > for this at all. Couldn't we have done it with one polymorphic > function? I think such a function would need to take anycompatiblerangearray, which is not something we currently have. > I think this ought to be reverted and reviewed more carefully. It seems to me that removing the cast-to-range[] is a sufficient fix, and that we can do with only the unnest part for pg14; the casts can be added in 15 (if at all). That would mean reverting only half the patch. -- Álvaro Herrera39°49'30"S 73°17'W
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
Hi, On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: > memcpy would not suffer from it? It'd not be correct for short sqlstates - you'd read beyond the end of the source buffer. There are cases of it in the ecpg code. Greetings, Andres Freund
Re: unnesting multirange data types
Alvaro Herrera writes: > On 2021-Jun-15, Tom Lane wrote: >> I think this ought to be reverted and reviewed more carefully. > It seems to me that removing the cast-to-range[] is a sufficient fix, > and that we can do with only the unnest part for pg14; the casts can be > added in 15 (if at all). That would mean reverting only half the patch. Might be a reasonable solution. But right now I'm annoyed that the buildfarm is broken, and I'm also convinced that this didn't get adequate testing. I think "revert and reconsider" is the way forward for today. regards, tom lane
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
Andres Freund writes: > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: >> memcpy would not suffer from it? > It'd not be correct for short sqlstates - you'd read beyond the end of > the source buffer. There are cases of it in the ecpg code. What's a "short SQLSTATE"? They're all five characters by definition. regards, tom lane
Re: unnesting multirange data types
On 6/15/21 1:52 PM, Tom Lane wrote: > Alvaro Herrera writes: >> On 2021-Jun-15, Tom Lane wrote: >>> I think this ought to be reverted and reviewed more carefully. > >> It seems to me that removing the cast-to-range[] is a sufficient fix, >> and that we can do with only the unnest part for pg14; the casts can be >> added in 15 (if at all). That would mean reverting only half the patch. > > Might be a reasonable solution. But right now I'm annoyed that the > buildfarm is broken, and I'm also convinced that this didn't get > adequate testing. I had focused testing primarily on the "unnest" cases that I had described in my original note. I did a couple of casts and had no issue; I did not test with pg_dump / pg_upgrade, but noting to do so in the future in cases like this. > I think "revert and reconsider" is the way > forward for today. I don't want the buildfarm broken so I'm fine if this is the best way forward. If we can keep the "unnest" functionality I would strongly suggest it as that was the premise of the original note to complete the utility of multiranges for v14. The casting, while convenient, is not needed. Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: snapshot too old issues, first around wraparound and then more.
Peter Geoghegan writes: > On Tue, Jun 15, 2021 at 9:51 AM Tom Lane wrote: >> So, it's well over a year later, and so far as I can see exactly >> nothing has been done about snapshot_too_old's problems. > I propose that the revert question be explicitly timeboxed. If the > issues haven't been fixed by some date, then "snapshot too old" > automatically gets reverted without further discussion. This gives > qualified hackers the opportunity to save the feature if they feel > strongly about it, and are actually willing to take responsibility for > its ongoing maintenance. The goal I have in mind is for snapshot_too_old to be fixed or gone in v15. I don't feel a need to force the issue sooner than that, so there's plenty of time for someone to step up, if anyone wishes to. I imagine that we should just ignore the question of whether anything can be done for it in the back branches. Given the problems identified upthread, fixing it in a non-back-patchable way would be challenging enough. regards, tom lane
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 10:09 AM Robert Haas wrote: > How to do that is not very clear. One very simple thing we could do > would be to introduce enable_nestloop=parameterized or > enable_parameterized_nestloop=false. That is a pretty blunt instrument > but the authors of the paper seem to have done that with positive > results, so maybe it's actually not a dumb idea. I think that it's probably a good idea as-is. Simple heuristics that are very frequently wrong when considered in a naive way can work very well in practice. This seems to happen when they capture some kind of extreme naturally occurring cost/benefit asymmetry -- especially one with fixed, well-understood costs and unlimited benefits (this business with unparameterized nestloop joins is about *avoiding* the inverse asymmetry, but that seems very similar). My go-to example of such an asymmetry is the rightmost page split heuristic of applying leaf fillfactor regardless of any of the other specifics; we effectively assume that all indexes are on columns with ever-increasing values. Which is obviously wrong. We're choosing between two alternatives (unparameterized nested loop vs hash join) that are really very similar when things go as expected, but diverge sharply when there is a misestimation -- who wouldn't take the "conservative" choice here? I guess that there is a hesitation to introduce heuristics like this because they don't fit into some larger framework that captures risk -- it might be seen as an ugly special case. But isn't this already actually kind of special, whether or not we officially think so? -- Peter Geoghegan
Re: a path towards replacing GEQO with something better
On Tue, Jun 15, 2021 at 12:15 PM Robert Haas wrote: > > On Wed, Jun 9, 2021 at 9:24 PM John Naylor wrote: > > 3) It actually improves the existing exhaustive search, because the complexity of the join order problem depends on the query shape: a "chain" shape (linear) vs. a "star" shape (as in star schema), for the most common examples. The size of the DP table grows like this (for n >= 4): ... > I don't quite understand the difference between the "chain" case and > the "star" case. Can you show sample queries for each one? e.g. SELECT > ... FROM a_1, a_2, ..., a_n WHERE ? There's a very simple example in the optimizer README: -- SELECT * FROMtab1, tab2, tab3, tab4 WHERE tab1.col = tab2.col AND tab2.col = tab3.col AND tab3.col = tab4.col Tables 1, 2, 3, and 4 are joined as: {1 2},{2 3},{3 4} {1 2 3},{2 3 4} {1 2 3 4} (other possibilities will be excluded for lack of join clauses) SELECT * FROMtab1, tab2, tab3, tab4 WHERE tab1.col = tab2.col AND tab1.col = tab3.col AND tab1.col = tab4.col Tables 1, 2, 3, and 4 are joined as: {1 2},{1 3},{1 4} {1 2 3},{1 3 4},{1 2 4} {1 2 3 4} -- The first one is chain, and the second is star. Four is the smallest set where we have a difference. I should now point out an imprecision in my language: By "size of the DP table", the numbers in my first email refer to the number of joinrels times the number of possible joins (not paths, and ignoring commutativity). Here are the steps laid out, with cumulative counts: join_level, # joins, cumulative # joins: linear, n=4 2 3 3 3 4 7 4 3 10 star, n=4 2 3 3 3 6 9 4 3 12 And of course, the chain query also has three at the last level, because it tries two left- (or right-) deep joins and one bushy join. > One idea I just ran across in > https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf > is to try to economize by skipping consideration of bushy plans. That's a good paper, and it did influence my thinking. You likely already know this, but for the archives: If only chain queries could have bushy plans, it wouldn't matter because they are so cheap to enumerate. But, since star queries can introduce a large number of extra joins via equivalence (same join column or FK), making them resemble "clique" queries, bushy joins get excessively numerous. > We > could start doing that when some budget is exceeded, similar to what > you are proposing here, but probably the budget for skipping > consideration of bushy plans would be smaller than the budget for > switching to IDP. The idea of skipping bushy plan generation in some > cases makes sense to me intuitively because most of the plans > PostgreSQL generates are mostly left-deep, and even when we do > generate bushy plans, they're not always a whole lot better than the > nearest equivalent left-deep plan. The paper argues that considering > bushy plans makes measurable gains, but also that failure to consider > such plans isn't catastrophically bad. I think that makes sense. There are a few things we could do within the "grey zone" -- too many rels to finish exhaustive search, but not enough to justify starting directly with the greedy step -- to increase our chances of completing, and that's a very simple one. -- John Naylor EDB: http://www.enterprisedb.com
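For reference, the closed forms quoted above can be checked directly against the per-level counts John lists (reading his level/joins/cumulative triples: for n = 4 the chain shape has 3, 4, 3 joins at levels 2-4, the star shape 3, 6, 3):

\mathrm{chain}(n) = \frac{n^{3} - n}{6}, \qquad \mathrm{star}(n) = (n - 1)\,2^{\,n-2} \quad (n \ge 4)

\mathrm{chain}(4) = \frac{64 - 4}{6} = 10 = 3 + 4 + 3, \qquad \mathrm{star}(4) = 3 \cdot 2^{2} = 12 = 3 + 6 + 3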
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 11:01 AM Tom Lane wrote: > The goal I have in mind is for snapshot_too_old to be fixed or gone > in v15. I don't feel a need to force the issue sooner than that, so > there's plenty of time for someone to step up, if anyone wishes to. Seems more than reasonable to me. A year ought to be plenty of time if the feature truly is salvageable. What do other people think? Ideally we could commit to that hard deadline now. To me the important thing is to actually have a real deadline that forces the issue one way or another. This situation must not be allowed to drag on forever. -- Peter Geoghegan
Re: Duplicate history file?
Greetings, * Julien Rouhaud (rjuju...@gmail.com) wrote: > On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote: > > The requirements are things which are learned over years and changes > > over time. Trying to document them and keep up with them would be a > > pretty serious project all on its own. There are external projects who > > spend serious time and energy doing their best to provide the tooling > > needed here and we should be promoting those, not trying to pretend like > > this is a simple thing which anyone could write a short perl script to > > accomplish. > > The fact that this is such a complex problem is the very reason why we should > spend a lot of energy documenting the various requirements. Otherwise, how > could anyone implement a valid program for that and how could anyone validate > that a solution claiming to do its job actually does its job? Reading the code. > > Already tried doing it in perl. No, it's not simple and it's also > > entirely vaporware today and implies that we're going to develop this > > tool, improve it in the future as we realize it needs to be improved, > > and maintain it as part of core forever. If we want to actually adopt > > and pull in a backup tool to be part of core then we should talk about > > things which actually exist, such as the various existing projects that > > have been written to specifically work to address all the requirements > > which are understood today, not say "well, we can just write a simple > > perl script to do it" because it's not actually that simple. > > Adopting a full backup solution seems like a bit extreme. On the other hand, > having some real core implementation of an archive_command for the most > general > use cases (local copy, distant copy over ssh...) could make sense. This would > remove that burden for some, probably most, of the 3rd party backup tools, and > would also ensure that the various requirements are properly documented since > it would be the implementation reference. Having a database platform that hasn't got a full backup solution is a pretty awkward position to be in. I'd like to see something a bit more specific than handwaving about how core could provide something in this area which would remove the burden from other tools. Would also be good to know who is going to write that and maintain it. Thanks, Stephen signature.asc Description: PGP signature
Re: unnesting multirange data types
On 6/15/21 12:06 PM, Alexander Korotkov wrote: > On Tue, Jun 15, 2021 at 4:49 PM Tom Lane wrote: >> Alexander Korotkov writes: >>> Pushed! Thanks to thread participants for raising this topic and >>> review. I'll be around to resolve issues if any. >> Buildfarm is pretty thoroughly unhappy. Did you do a "check-world" >> before pushing? > Yes, I'm looking at this now. > > I did run "check-world", but it passed for me. Probably the same > reason it passed for some buildfarm animals... > Did you configure with --enable-tap-tests? If not, then `make check-world` won't run the tests that are failing here. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: unnesting multirange data types
On Tue, Jun 15, 2021 at 8:18 PM Tom Lane wrote: > Alexander Korotkov writes: > > I did run "check-world", but it passed for me. Probably the same > > reason it passed for some buildfarm animals... > > The only buildfarm animals that have passed since this went in > are the ones that don't run the pg_dump or pg_upgrade tests. > > It looks to me like the proximate problem is that you should > have taught pg_dump to skip these new auto-generated functions. > However, I fail to see why we need auto-generated functions > for this at all. Couldn't we have done it with one polymorphic > function? > > I think this ought to be reverted and reviewed more carefully. Thank you for your feedback. I've reverted the patch. I'm going to have a closer look at this tomorrow. -- Regards, Alexander Korotkov
Re: unnesting multirange data types
On 6/15/21 1:52 PM, Tom Lane wrote: > Alvaro Herrera writes: >> On 2021-Jun-15, Tom Lane wrote: >>> I think this ought to be reverted and reviewed more carefully. >> It seems to me that removing the cast-to-range[] is a sufficient fix, >> and that we can do with only the unnest part for pg14; the casts can be >> added in 15 (if at all). That would mean reverting only half the patch. > Might be a reasonable solution. But right now I'm annoyed that the > buildfarm is broken, and I'm also convinced that this didn't get > adequate testing. I think "revert and reconsider" is the way > forward for today. > > (RMT hat on) That would be my inclination at this stage. The commit message states that it's trivial, but it seems not to be, and I suspect it should not have been done at this stage of the development cycle. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
Hi, On 2021-06-15 13:53:08 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: > >> memcpy would not suffer from it? > > > It'd not be correct for short sqlstates - you'd read beyond the end of > > the source buffer. There are cases of it in the ecpg code. > > What's a "short SQLSTATE"? They're all five characters by definition. I thought there were places that just dealt with "00" etc. And there are - but it's just comparisons. I still don't fully feel comfortable just using memcpy() though, given that the sqlstates originate remotely / from libpq, making it hard to rely on the fact that the buffer "ought to" always be at least 5 bytes long? As far as I can tell there's no enforcement of PQresultErrorField(..., PG_DIAG_SQLSTATE) being that long. Greetings, Andres Freund
Re: Improving the isolationtester: fewer failures, less delay
Hi, On 2021-06-14 22:57:08 -0400, Tom Lane wrote: > This is a followup to the conversation at [1], in which we speculated > about constraining the isolationtester's behavior by annotating the > specfiles, in order to eliminate common buildfarm failures such as [2] > and reduce the need to use long delays to stabilize the test results. > > I've spent a couple days hacking on this idea, and I think it has worked > out really well. On my machine, the time needed for "make installcheck" > in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result > of removing all the multiple-second delays we used before. Very cool stuff. All the reliability things aside, isolationtester frequently is the slowest test in a parallel check world... > Also, while I'm not fool enough to claim that this will reduce the > rate of bogus failures to zero, I do think it addresses all the > repeating failures we've seen lately. And it should make it easier to fix some others and also to make it easier to write some tests that were too hard to make reliable today. > This is still WIP to some extent, as I've not spent much time looking at > specfiles other than the ones with big delays; there may be additional > improvements possible in some places. Also, I've not worried about > whether the tests pass in serializable mode, since we have problems there > already [4]. But this seemed like a good point at which to solicit > feedback and see what the cfbot thinks of it. Are there spec output changes / new failures, if you apply the patch, but do not apply the changes to the spec files? Will look at the patch itself in a bit. Greetings, Andres Freund
Re: Improving the isolationtester: fewer failures, less delay
Andres Freund writes: > On 2021-06-14 22:57:08 -0400, Tom Lane wrote: >> This is still WIP to some extent, as I've not spent much time looking at >> specfiles other than the ones with big delays; there may be additional >> improvements possible in some places. Also, I've not worried about >> whether the tests pass in serializable mode, since we have problems there >> already [4]. But this seemed like a good point at which to solicit >> feedback and see what the cfbot thinks of it. > Are there spec output changes / new failures, if you apply the patch, > but do not apply the changes to the spec files? If you make only the code changes, there are a bunch of diffs stemming from the removal of the 'error in steps' message prefix. If you just mechanically remove those from the .out files without touching the .spec files, most tests pass, but I don't recall whether that's 100% the case. > Will look at the patch itself in a bit. I'll have a v2 in a little bit --- the cfbot pointed out that there were some contrib tests I'd missed fixing, and I found a couple of other improvements. regards, tom lane
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 12:51 PM Tom Lane wrote: > So, it's well over a year later, and so far as I can see exactly > nothing has been done about snapshot_too_old's problems. Progress has been pretty limited, but not altogether nonexistent. 55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to fix, the time->XID mapping, which is one of the main things that Andres said was broken originally. Also, there are patches on this thread from Thomas Munro to add some test coverage for that case, another problem Andres noted in his original email. I guess it wouldn't be too hard to get something committed there, and I'm willing to do it if Thomas doesn't want to and if there's any prospect of salvaging the feature. But that's not clear to me. I'm not clear exactly how many problems we know about and need to fix in order to keep the feature, and I'm also not clear how deep the hole goes. Like, if we need to get a certain number of specific bugs fixed, I might be willing to do that. If we need to commit to a major rewrite of the current implementation, that's more than I can do. But I guess I don't understand exactly how bad the current problems are. Reviewing complaints from Andres from this thread: > Looking at TransactionIdLimitedForOldSnapshots() I think the ts == > update_ts threshold actually needs to be ts >= update_ts, right now we > don't handle being newer than the newest bin correctly afaict (mitigated > by autovacuum=on with naptime=1s doing a snapshot more often). It's hard > to say, because there's no comments. This seems specific enough to be analyzed and anything that is broken can be fixed. > The whole lock nesting is very hazardous. Most (all?) > TestForOldSnapshot() calls happen with locks on on buffers held, and can > acquire lwlocks itself. In some older branches we do entire *catalog > searches* with the buffer lwlock held (for RelationHasUnloggedIndex()). I think it's unclear whether there are live problems in master in this area. > GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the > basis to detect conflicts seems dangerous to me. Isn't that ignoring > inserts that are already in progress? Discussion on this point trailed off. Upon rereading, I think Andres is correct that there's an issue; the snapshot's LSN needs to be set to a value not older than the last xlog insertion that has been completed rather than, as now, the last one that is started. I guess to get that value we would need to do something like WaitXLogInsertionsToFinish(), or some approximation of it e.g. GetXLogWriteRecPtr() at the risk of unnecessary snapshot-too-old errors. > * In read-mostly workloads it can trigger errors in sessions that are > much younger than old_snapshot_threshold, if the transactionid is not > advancing. > > I've not tried to reproduce, but I suspect this can also cause wrong > query results. Because a later snapshot can have the same xmin as > older transactions, it sure looks like we can end up with data from an > older xmin getting removed, but the newer snapshot's whenTaken will > prevent TestForOldSnapshot_impl() from raising an error. I haven't really wrapped my head around this one, but it seems amenable to a localized fix. It basically amounts to a complaint that GetOldSnapshotThresholdTimestamp() is returning a newer value than it should. I don't know exactly what's required to make it not do that, but it doesn't seem intractable. > * I am fairly sure that it can cause crashes (or even data corruption), > because it assumes that DML never needs to check for old snapshots > (with no meaningful justificiation). Leading to heap_update/delete to > assume the page header is a tuple. I don't understand the issue here, really. I assume there might be a wrong word here, because assuming that the page header is a tuple doesn't sound like a thing that would actually happen. I think one of the key problems for this feature is figuring out whether you've got snapshot-too-old checks in all the right places. I think what is being alleged here is that heap_update() and heap_delete() need them, and that it's not good enough to rely on the scan that found the tuple to be updated or deleted having already performed those checks. It is not clear to me whether that is true, or how it could cause crashes. Andres may have explained this to me at some point, but if he did I have unfortunately forgotten. My general point here is that I would like to know whether we have a finite number of reasonably localized bugs or a three-ring disaster that is unrecoverable no matter what we do. Andres seems to think it is the latter, and I *think* Peter Geoghegan agrees, but I think that the point might be worth a little more discussion. I'm unclear whether Tom's dislike for the feature represents hostility to the concept - with which I would have to disagree - or a judgement on the quality of the implementation - which might be justified. For the record, and to Peter's point, I think it's reasonable to set v15 feature freeze as a drop-dead date for getting this feature into acceptable shape, but I would like to try to nail down what we think "acceptable" means in this context. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Improving the isolationtester: fewer failures, less delay
On 6/14/21 10:57 PM, Tom Lane wrote: > This is a followup to the conversation at [1], in which we speculated > about constraining the isolationtester's behavior by annotating the > specfiles, in order to eliminate common buildfarm failures such as [2] > and reduce the need to use long delays to stabilize the test results. > > I've spent a couple days hacking on this idea, and I think it has worked > out really well. On my machine, the time needed for "make installcheck" > in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result > of removing all the multiple-second delays we used before. Also, > while I'm not fool enough to claim that this will reduce the rate of > bogus failures to zero, I do think it addresses all the repeating > failures we've seen lately. > > In the credit-where-credit-is-due department, this owes some inspiration > to the patch Asim Praveen offered at [3], though this takes the idea a > good bit further. > > This is still WIP to some extent, as I've not spent much time looking at > specfiles other than the ones with big delays; there may be additional > improvements possible in some places. Also, I've not worried about > whether the tests pass in serializable mode, since we have problems there > already [4]. But this seemed like a good point at which to solicit > feedback and see what the cfbot thinks of it. > > Cool stuff. Minor gripe while we're on $subject - I wish we'd rename it. It's long outgrown the original purpose that gave it its name, and keeping the name makes it unnecessarily obscure. Yes, I know Lisp still has car and cdr, but we don't need to follow that example. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 2:04 PM Peter Geoghegan wrote: > I guess that there is a hesitation to not introduce heuristics like > this because it doesn't fit into some larger framework that captures > risk -- it might be seen as an ugly special case. But isn't this > already actually kind of special, whether or not we officially think > so? Yes, I think it is. Reading the paper really helped me crystallize my thoughts about this, because when I've studied the problems myself, I came, as you postulate here, to the conclusion that there's a lot of stuff the planner does where there is risk and uncertainty, and thus that a general framework would be necessary to deal with it. But the fact that an academic researcher called this problem out as the only one worth treating specially makes me think that perhaps it deserves special handling. In defense of that approach, note that this is a case where we know both that the Nested Loop is risky and that Hash Join is a similar alternative with probably similar cost. I am not sure there are any other cases where we can say quite so generally both that a certain thing is risky and what we could do instead. -- Robert Haas EDB: http://www.enterprisedb.com
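To make the heuristic under discussion concrete, here is a minimal sketch (hypothetical code, not part of PostgreSQL: the GUC name, the helper function, and the have_alternative_join flag are all assumptions, and the surrounding join-path machinery is elided):

#include "postgres.h"
#include "nodes/pathnodes.h"

/*
 * Hypothetical sketch of the blunt heuristic being discussed, not actual
 * PostgreSQL code: only keep a nested-loop path whose inner side is not
 * parameterized when no alternative join method (e.g. a hash join) exists
 * for the same pair of relations.
 */
static bool enable_unparameterized_nestloop = false;   /* assumed GUC */

static bool
consider_nestloop_path(Path *inner_path, bool have_alternative_join)
{
    /* Parameterized nested loops are treated exactly as today. */
    if (inner_path->param_info != NULL)
        return true;

    /* Unparameterized: keep it only as a join method of last resort. */
    return enable_unparameterized_nestloop || !have_alternative_join;
}

Whether the last-resort case should instead get a cost penalty rather than an outright refusal is exactly the fuzz-factor alternative raised upthread.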
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
On Tue, Jun 15, 2021 at 15:48, Andres Freund wrote: > Hi, > > On 2021-06-15 13:53:08 -0400, Tom Lane wrote: > > Andres Freund writes: > > > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: > > >> memcpy would not suffer from it? > > > > > It'd not be correct for short sqlstates - you'd read beyond the end of > > > the source buffer. There are cases of it in the ecpg code. > > > > What's a "short SQLSTATE"? They're all five characters by definition. > > I thought there were places that just dealt with "00" etc. And there are - > but > it's just comparisons. > > I still don't fully feel comfortable just using memcpy() though, given that > the sqlstates originate remotely / from libpq, making it hard to rely on > the > fact that the buffer "ought to" always be at least 5 bytes long? As far as > I > can tell there's no enforcement of PQresultErrorField(..., > PG_DIAG_SQLSTATE) > being that long. > And replacing with snprintf, what do you guys think?

n = snprintf(sqlca->sqlstate, sizeof(sqlca->sqlstate), "%s", sqlstate);
Assert(n >= 0 && n < sizeof(sqlca->sqlstate));

regards, Ranier Vilela fix_buffer_null_not_terminated.patch Description: Binary data
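One way to address the concern about sources that might be shorter than five bytes is a bounded copy that zero-fills; a rough, purely illustrative sketch (copy_sqlstate and SQLSTATE_LEN are made-up names, and the destination is assumed to be the usual 5-byte, non-NUL-terminated sqlca field):

#include <string.h>

#define SQLSTATE_LEN 5

/*
 * Hypothetical sketch only, not ecpg code: copy an untrusted SQLSTATE into
 * a fixed 5-byte, non-NUL-terminated destination without reading past the
 * end of a source that turns out to be shorter than 5 bytes; any missing
 * bytes are zero-filled.
 */
static void
copy_sqlstate(char *dest, const char *src)
{
    size_t  len = 0;

    /* bounded scan: never look past SQLSTATE_LEN bytes of the source */
    while (src && len < SQLSTATE_LEN && src[len] != '\0')
        len++;

    memset(dest, 0, SQLSTATE_LEN);
    if (len > 0)
        memcpy(dest, src, len);
}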
Re: snapshot too old issues, first around wraparound and then more.
Robert Haas writes: > My general point here is that I would like to know whether we have a > finite number of reasonably localized bugs or a three-ring disaster > that is unrecoverable no matter what we do. Andres seems to think it > is the latter, and I *think* Peter Geoghegan agrees, but I think that > the point might be worth a little more discussion. TBH, I am not clear on that either. > I'm unclear whether > Tom's dislike for the feature represents hostility to the concept - > with which I would have to disagree - or a judgement on the quality of > the implementation - which might be justified. I think it's a klugy, unprincipled solution to a valid real-world problem. I suspect the implementation issues are not unrelated to the kluginess of the concept. Thus, I would really like to see us throw this away and find something better. I admit I have nothing to offer about what a better solution to the problem would look like. But I would really like it to not involve random-seeming query failures. regards, tom lane
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 12:31 PM Robert Haas wrote: > Yes, I think it is. Reading the paper really helped me crystallize my > thoughts about this, because when I've studied the problems myself, I > came, as you postulate here, to the conclusion that there's a lot of > stuff the planner does where there is risk and uncertainty, and thus > that a general framework would be necessary to deal with it. It is an example (perhaps the only example in the optimizer) of an oasis of certainty in an ocean of uncertainty. As uncertain as everything is, we seemingly can make strong robust statements about the relative merits of each strategy *in general*, just in this particular instance. It's just not reasonable to make such a reckless choice, no matter what your general risk tolerance is. Goetz Graefe is interviewed here, and goes into his philosophy on robustness -- it seems really interesting to me: https://sigmodrecord.org/publications/sigmodRecord/2009/pdfs/05_Profiles_Graefe.pdf > In defense of that approach, note that this is a > case where we know both that the Nested Loop is risky and that Hash > Join is a similar alternative with probably similar cost. I am not > sure there are any other cases where we can say quite so generally > both that a certain thing is risky and what we could do instead. I tend to think of a hash join as like a nested loop join with an inner index scan where you build the index yourself, dynamically. That might be why I find it easy to make this mental leap. In theory you could do this by giving the nestloop join runtime smarts -- make it turn into a hash join adaptively. Like Graefe's G-Join design. That way you could do this in a theoretically pure way. I don't think that that's actually necessary just to deal with this case -- it probably really is as simple as it seems. I point this out because perhaps it's useful to have that theoretical anchoring. -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 12:49 PM Tom Lane wrote: > Robert Haas writes: > > My general point here is that I would like to know whether we have a > > finite number of reasonably localized bugs or a three-ring disaster > > that is unrecoverable no matter what we do. Andres seems to think it > > is the latter, and I *think* Peter Geoghegan agrees, but I think that > > the point might be worth a little more discussion. > > TBH, I am not clear on that either. I don't know for sure which it is, but that in itself isn't actually what matters to me. The most concerning thing is that I don't really know how to *assess* the design now. The clear presence of at least several very severe bugs doesn't necessarily prove anything (it just *hints* at major design problems). If I could make a very clear definitive statement on this then I'd probably have to do ~1/3 of the total required work -- that'd be my guess. If it was easy to be quite sure here then we wouldn't still be here 12 months later. In any case I don't think that the feature deserves to be treated all that differently to something that was committed much more recently, given what we know. Frankly it took me about 5 minutes to find a very serious bug in the feature, pretty much without giving it any thought. That is not a good sign. > I think it's a klugy, unprincipled solution to a valid real-world > problem. I suspect the implementation issues are not unrelated to > the kluginess of the concept. Thus, I would really like to see us > throw this away and find something better. I admit I have nothing > to offer about what a better solution to the problem would look like. > But I would really like it to not involve random-seeming query failures. I would be very happy to see somebody take this up, because it is important. The reality is that anybody that undertakes this task should start with the assumption that they're starting from scratch, at least until they learn otherwise. So ISTM that it might as well be true that it needs a total rewrite, even if it turns out to not be strictly true. -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 12:17 PM Robert Haas wrote: > My general point here is that I would like to know whether we have a > finite number of reasonably localized bugs or a three-ring disaster > that is unrecoverable no matter what we do. Andres seems to think it > is the latter, and I *think* Peter Geoghegan agrees, but I think that > the point might be worth a little more discussion. I'm unclear whether > Tom's dislike for the feature represents hostility to the concept - > with which I would have to disagree - or a judgement on the quality of > the implementation - which might be justified. For the record, and to > Peter's point, I think it's reasonable to set v15 feature freeze as a > drop-dead date for getting this feature into acceptable shape, but I > would like to try to nail down what we think "acceptable" means in > this context. What I had in mind was this: a committer adopting the feature themselves. The committer would be morally obligated to maintain the feature on an ongoing basis, just as if they were the original committer. This seems like the only sensible way of resolving this issue once and for all. If it really is incredibly important that we keep this feature, or one like it, then I have to imagine that somebody will step forward -- there is still ample opportunity. But if nobody steps forward, I'll be forced to conclude that perhaps it wasn't quite as important as I first thought. Anybody can agree that it's important in an abstract sense -- that's easy. What we need is a committer willing to sign on the dotted line, which we're no closer to today than we were a year ago. Actions speak louder than words. -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
On Wed, Apr 1, 2020 at 4:59 PM Andres Freund wrote: > The primary issue here is that there is no TestForOldSnapshot() in > heap_hot_search_buffer(). Therefore index fetches will never even try to > detect that tuples it needs actually have already been pruned away. This is still true, right? Nobody fixed this bug after 14 months? Even though we're talking about returning rows that are not visible to the xact's snapshot? -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
Peter Geoghegan writes: > What I had in mind was this: a committer adopting the feature > themselves. The committer would be morally obligated to maintain the > feature on an ongoing basis, just as if they were the original > committer. This seems like the only sensible way of resolving this > issue once and for all. Yeah, it seems clear that we need somebody to do that, given that Kevin Grittner has been inactive for awhile. Even if the known problems can be resolved by drive-by patches, I think this area needs an ongoing commitment from someone. regards, tom lane
Re: Error on pgbench logs
On Tue, Jun 15, 2021 at 09:53:06PM +0900, Yugo NAGATA wrote: > I am fine with this version, but I think it would be better if we have a > comment > explaining what "tx" is for. > > Also, how about adding Assert(tx) instead of using "else if (tx)" because > we are assuming that tx is always true when agg_interval is not used, right? Agreed on both points. From what I get, this code could be clarified much more, and perhaps partially refactored to have less spaghetti code between the point where we call it at the end of a thread or when gathering stats of a transaction mid-run, but that's not something to do post-beta1. I am not completely sure that the result would be worth it either. Let's document things and let the readers know the assumptions this area of the code relies on, for clarity. The dependency between agg_interval and sample_rate is one of those things; somebody now needs to look at the option parsing to see why only one is possible at a time. Using an extra tx flag to track what to do after the loop for the aggregate print to the log file is an improvement in this direction. -- Michael signature.asc Description: PGP signature
Improving isolationtester's data output
I've been spending a lot of time looking at isolationtester results over the past couple of days, and gotten really annoyed at how poorly it formats query results. In particular, any column heading or value that is 15 characters or longer is not separated from the next column, rendering the output quite confusing. Attached is a little hack that tries to improve that case while making minimal changes to the output files otherwise. There's still a good deal to be desired here: notably, the code still does nothing to ensure vertical alignment of successive lines when there are wide headings or values. But doing anything about that would involve much-more-invasive changes of the output files. If we wanted to buy into that, I'd think about discarding this ad-hoc code altogether in favor of using one of libpq's fe-print.c routines. But I'm not really sure that the small legibility gains that would result are worth massive changes in the output files. Thoughts? regards, tom lane diff --git a/src/test/isolation/expected/detach-partition-concurrently-3.out b/src/test/isolation/expected/detach-partition-concurrently-3.out index 96ee090d53..3f553711b3 100644 --- a/src/test/isolation/expected/detach-partition-concurrently-3.out +++ b/src/test/isolation/expected/detach-partition-concurrently-3.out @@ -9,7 +9,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -35,7 +35,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -54,7 +54,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -77,7 +77,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -95,7 +95,7 @@ a 1 step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach2: <... completed> @@ -121,7 +121,7 @@ a 1 step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach2: <... completed> @@ -150,7 +150,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -172,7 +172,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... 
completed> @@ -194,7 +194,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -213,7 +213,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -233,7 +233,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -252,7 +252,7 @@ a 1 step s2detach: ALTER TABLE d3_li
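As an illustration of the formatting problem described at the top of this message - a heading or value of 15 or more characters running into the next column - here is a minimal hypothetical sketch of the kind of padding fix involved (not the actual patch):

#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch of the kind of change being described, not the actual
 * patch: pad a column to 15 characters as before, but always emit at least
 * one separating space, so a heading or value of 15 or more characters no
 * longer runs into the next column.
 */
static void
print_column(const char *val)
{
    printf("%-15s", val);
    if (strlen(val) >= 15)
        putchar(' ');
}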
Re: Improving isolationtester's data output
On 2021-Jun-15, Tom Lane wrote: > I've been spending a lot of time looking at isolationtester results > over the past couple of days, and gotten really annoyed at how poorly > it formats query results. In particular, any column heading or value > that is 15 characters or longer is not separated from the next column, > rendering the output quite confusing. Yeah, I noticed this too. > Attached is a little hack that tries to improve that case while making > minimal changes to the output files otherwise. Seems pretty reasonable. > There's still a good deal to be desired here: notably, the code still > does nothing to ensure vertical alignment of successive lines when > there are wide headings or values. But doing anything about that > would involve much-more-invasive changes of the output files. > If we wanted to buy into that, I'd think about discarding this > ad-hoc code altogether in favor of using one of libpq's fe-print.c > routines. But I'm not really sure that the small legibility gains > that would result are worth massive changes in the output files. Shrug -- it's a one time change. It wouldn't bother me, for one. -- Álvaro Herrera Valdivia, Chile "Hay que recordar que la existencia en el cosmos, y particularmente la elaboración de civilizaciones dentro de él no son, por desgracia, nada idílicas" (Ijon Tichy)
Re: Improving isolationtester's data output
Alvaro Herrera writes: > On 2021-Jun-15, Tom Lane wrote: >> If we wanted to buy into that, I'd think about discarding this >> ad-hoc code altogether in favor of using one of libpq's fe-print.c >> routines. But I'm not really sure that the small legibility gains >> that would result are worth massive changes in the output files. > Shrug -- it's a one time change. It wouldn't bother me, for one. Going forward it wouldn't be a problem, but back-patching isolation test cases might find it annoying. On the other hand, my nearby patch to improve isolation test stability is already going to create issues of that sort. (Unless, dare I say it, we back-patch that.) I do find it a bit attractive to create some regression-testing coverage of fe-print.c. We are never going to remove that code, AFAICS, so getting some benefit from it would be nice. regards, tom lane
Re: Support for NSS as a libpq TLS backend
On Wed, 2021-06-16 at 00:08 +0200, Daniel Gustafsson wrote: > > On 15 Jun 2021, at 00:15, Jacob Champion wrote: > > Attached is a quick patch; does it work on your machine? > > It does, thanks! I've included it in the attached v37 along with a few tiny > non-functional improvements in comment spelling etc. Great, thanks! I've been tracking down reference leaks in the client. These open references prevent NSS from shutting down cleanly, which then makes it impossible to open a new context in the future. This probably affects other libpq clients more than it affects psql. The first step to fixing that is not ignoring failures during NSS shutdown, so I've tried a patch to pgtls_close() that pushes any failures through the pqInternalNotice(). That seems to be working well. The tests were still mostly green, so I taught connect_ok() to fail if any stderr showed up, and that exposed quite a few failures. I am currently stuck on one last failing test. This leak seems to only show up when using TLSv1.2 or below. There doesn't seem to be a substantial difference in libpq code coverage between 1.2 and 1.3, so I'm worried that either 1) there's some API we use that "requires" cleanup, but only on 1.2 and below, or 2) there's some bug in my version of NSS. Attached are a few work-in-progress patches. I think the reference cleanups themselves are probably solid, but the rest of it could use some feedback. Are there better ways to test for this? and can anyone reproduce the TLSv1.2 leak? --Jacob From 6976d15a4be50276ea2f77fd5c37d238493f9447 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Tue, 15 Jun 2021 10:01:14 -0700 Subject: [PATCH 1/3] nss: don't ignore failures during context shutdown The biggest culprit of a shutdown failure so far seems to be object leaks. A failure here may prevent future contexts from being created (and they'll fail in confusing ways), so don't ignore it. There's not a great way to signal errors from this layer of the stack, but a notice will at least give us a chance of seeing the problem. --- src/interfaces/libpq/fe-secure-nss.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/src/interfaces/libpq/fe-secure-nss.c b/src/interfaces/libpq/fe-secure-nss.c index 1b9bd4f1a5..90f57e0d99 100644 --- a/src/interfaces/libpq/fe-secure-nss.c +++ b/src/interfaces/libpq/fe-secure-nss.c @@ -139,7 +139,16 @@ pgtls_close(PGconn *conn) { if (conn->nss_context) { - NSS_ShutdownContext(conn->nss_context); + SECStatus status; + status = NSS_ShutdownContext(conn->nss_context); + + if (status != SECSuccess) + { + pqInternalNotice(&conn->noticeHooks, +"unable to shut down NSS context: %s", + pg_SSLerrmessage(PR_GetError())); + } + conn->nss_context = NULL; } } -- 2.25.1 From acabc4d51d38124db748a6be9dca0ff83693e039 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Tue, 15 Jun 2021 14:37:28 -0700 Subject: [PATCH 2/3] test: check for empty stderr during connect_ok() ...in a similar manner to command_like(), to catch notices-as-errors coming from NSS. 
--- src/test/authentication/t/001_password.pl | 2 +- src/test/authentication/t/002_saslprep.pl | 2 +- src/test/kerberos/t/001_auth.pl | 2 +- src/test/ldap/t/001_auth.pl | 2 +- src/test/perl/PostgresNode.pm | 5 - src/test/ssl/t/001_ssltests.pl| 4 ++-- src/test/ssl/t/002_scram.pl | 4 ++-- 7 files changed, 12 insertions(+), 9 deletions(-) diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl index 427a360198..327dfb4ed9 100644 --- a/src/test/authentication/t/001_password.pl +++ b/src/test/authentication/t/001_password.pl @@ -20,7 +20,7 @@ if (!$use_unix_sockets) } else { - plan tests => 23; + plan tests => 32; } diff --git a/src/test/authentication/t/002_saslprep.pl b/src/test/authentication/t/002_saslprep.pl index f080a0ccba..d6508df8bd 100644 --- a/src/test/authentication/t/002_saslprep.pl +++ b/src/test/authentication/t/002_saslprep.pl @@ -17,7 +17,7 @@ if (!$use_unix_sockets) } else { - plan tests => 12; + plan tests => 20; } # Delete pg_hba.conf from the given node, add a new entry to it diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl index b5594924ca..62015339a0 100644 --- a/src/test/kerberos/t/001_auth.pl +++ b/src/test/kerberos/t/001_auth.pl @@ -23,7 +23,7 @@ use Time::HiRes qw(usleep); if ($ENV{with_gssapi} eq 'yes') { - plan tests => 44; + plan tests => 54; } else { diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl index 0ae14e4c85..e857c26258 100644 --- a/src/test/ldap/t/001_auth.pl +++ b/src/test/ldap/t/001_auth.pl @@ -9
Re: disfavoring unparameterized nested loops
On Wed, 16 Jun 2021 at 05:09, Robert Haas wrote: > How to do that is not very clear. One very simple thing we could do > would be to introduce enable_nestloop=parameterized or > enable_parameterized_nestloop=false. That is a pretty blunt instrument > but the authors of the paper seem to have done that with positive > results, so maybe it's actually not a dumb idea. It's not great that people are having to use such blunt instruments to get the planner to behave. It might not be a terrible idea to provide them with something a bit sharper as you suggest. The GUC idea is certainly something that could be done without too much effort. There was some talk of doing that in [1]. > Another approach > would be to introduce a large fuzz factor for such nested loops e.g. > keep them only if the cost estimate is better than the comparable hash > join plan by at least a factor of N (or if no such plan is being > generated for some reason). In my experience, the most common reason that the planner chooses non-parameterized nested loops wrongly is when there's row underestimation that says there's just going to be 1 row returned by some set of joins. The problem often comes when some subsequent join is planned and the planner sees the given join rel only produces one row. The cheapest join method we have to join 1 row is Nested Loop. So the planner just sticks the 1-row join rel on the outer side thinking the executor will only need to scan the inner side of the join once. When the outer row count blows up, then we end up scanning that inner side many more times. The problem is compounded when you nest it a few joins deep. Most of the time when I see that happen it's down to either the selectivity of some correlated base-quals being multiplied down to a number low enough that we clamp the estimate to be 1 row. The other case is similar, but with join quals. It seems to me that each time we multiply 2 selectivities together that the risk of the resulting selectivity being out increases. The risk is likely lower when we have some extended statistics which allow us to do fewer selectivity multiplications. For that 1-row case, doing an unparameterized nested loop is only the cheapest join method by a tiny amount. It really wouldn't be much more expensive to just put that single row into a hash table. If that 1 estimated row turns out to be 10 actual rows then it's likely not too big a deal for the hash join code to accommodate the 9 additional rows. This makes me think that it's insanely risky for the planner to be picking Nested Loop joins in this case. And that puts me down the path of thinking that this problem should be solved by having the planner take into account risk as well as costs when generating plans. I don't really have any concrete ideas on that, but a basic idea that I have been considering is that a Path has a risk_factor field that is decided somewhere like clauselist_selectivity(). Maybe the risk can go up by 1 each time we multiply an individual selectivity. (As of master, estimate_num_groups() allows the estimation to pass back some further information to the caller. I added that for Result Cache so I could allow the caller to get visibility about when the estimate fell back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() maybe could get similar treatment to allow the risk_factor or number of nstats_used to be passed back). We'd then add a GUC, something like planner_risk_aversion, which could be set to anything from 0.0 to some positive number. During add_path() we could do the cost comparison like: path1.cost * (1.0 + path1.risk_factor * planner_risk_aversion) < path2.cost * (1.0 + path2.risk_factor * planner_risk_aversion). That way, if you set planner_risk_aversion to 0.0, then the planner does as it does today, i.e. takes big risks. The problem I have with this idea is that I really don't know how to properly calculate what the risk_factor should be set to. It seems easy at first to set it to something that has the planner avoid these silly 1-row estimate nested loop mistakes, but I think what we'd set the risk_factor to would become much more important when more and more Path types start using it. So if we did this and just guessed the risk_factor, that might be fine when only 1 of the paths being compared had a non-zero risk_factor, but as soon as both paths have one set, unless they're set to something sensible, then we just end up comparing garbage costs to garbage costs. Another common mistake the planner makes is around WHERE a = <value> ORDER BY b LIMIT n; where there are separate indexes on (a) and (b). Scanning the (b) index is pretty risky. All the "a" values you need might be right at the end of the index. It seems safer to scan the (a) index as we'd likely have statistics that tell us how many rows exist with a = <value>. We don't have any stats that tell us where in the (b) index are all the rows with a = <value>. I don't really think we should solve this by having nodeNestloop.c fall back on hashing when the going gets tough. Overloading our nodes that way is not a sustainable thing to do. I'd rather see the executor throw the plan back at the planner along with some hints about what was wrong with it. We could do that providing we've not sent anything back to the client yet. David
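To make the add_path() comparison above concrete, here is a minimal sketch under the stated assumptions (Path has no risk_factor field and no planner_risk_aversion GUC exists today; both are hypothetical, so the risk values are passed in explicitly):

#include "postgres.h"
#include "nodes/pathnodes.h"

/*
 * Purely hypothetical sketch of the comparison floated above: with the
 * assumed GUC at 0.0 this degenerates to the plain total-cost comparison
 * done today.
 */
static double planner_risk_aversion = 0.0;  /* assumed GUC */

static bool
path1_cheaper_with_risk(Path *path1, double risk1,
                        Path *path2, double risk2)
{
    double  adj1 = path1->total_cost * (1.0 + risk1 * planner_risk_aversion);
    double  adj2 = path2->total_cost * (1.0 + risk2 * planner_risk_aversion);

    return adj1 < adj2;
}

The open question raised in the message above, namely how to assign sensible risk values in the first place, is untouched by such a comparison.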
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 5:00 PM David Rowley wrote: > I don't really think we should solve this by having nodeNestloop.c > fall back on hashing when the going gets tough. Overloading our nodes > that way is not a sustainable thing to do. I'd rather see the > executor throw the plan back at the planner along with some hints > about what was wrong with it. We could do that providing we've not > sent anything back to the client yet. It wasn't a serious suggestion -- it was just a way of framing the issue at hand that I thought might be useful. If we did have something like that (which FWIW I think makes sense but is hard to do right in a general way) then it might be expected to preemptively refuse to even start down the road of using an unparameterized nestloop join very early, or even before execution time. Such an adaptive join algorithm/node might be expected to have a huge bias against this particular plan shape, that can be reduced to a simple heuristic. But you can have the simple heuristic without needing to build everything else. Whether or not we throw the plan back at the planner or "really change our minds at execution time" seems like a distinction without a difference. Either way we're changing our minds about the plan based on information that is fundamentally execution time information, not plan time information. Have I missed something? -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
Hi, On 2021-06-15 15:17:05 -0400, Robert Haas wrote: > But that's not clear to me. I'm not clear exactly how many > problems we know about and need to fix in order to keep the feature, > and I'm also not clear how deep the hole goes. Like, if we need to get > a certain number of specific bugs fixed, I might be willing to do > that. If we need to commit to a major rewrite of the current > implementation, that's more than I can do. But I guess I don't > understand exactly how bad the current problems are. Reviewing > complaints from Andres from this thread: One important complaint I think your (useful!) list missed is that there are missing *read side* checks that demonstrably lead to wrong query results: https://www.postgresql.org/message-id/CAH2-Wz%3DFQ9rbBKkt1nXvz27kmd4A8i1%2B7dcLTNqpCYibxX83VQ%40mail.gmail.com and that it's currently very hard to figure out where they need to be, because there's no real explained model of what needs to be checked and what does not. > > * I am fairly sure that it can cause crashes (or even data corruption), > > because it assumes that DML never needs to check for old snapshots > > (with no meaningful justificiation). Leading to heap_update/delete to > > assume the page header is a tuple. > > I don't understand the issue here, really. I assume there might be a > wrong word here, because assuming that the page header is a tuple > doesn't sound like a thing that would actually happen. I suspect what I was thinking of is that a tuple could get pruned away due to s_t_o, which would leave an LP_DEAD item around. As heap_update/delete neither checks s_t_o nor balks at targeting LP_DEAD items, we'd use the offset from the LP_DEAD item. ItemIdSetDead() sets lp_off to 0 - which would mean that the page header is interpreted as a tuple. Right? > I think one of the key problems for this feature is figuring out > whether you've got snapshot-too-old checks in all the right places. I > think what is being alleged here is that heap_update() and > heap_delete() need them, and that it's not good enough to rely on the > scan that found the tuple to be updated or deleted having already > performed those checks. It is not clear to me whether that is true, or > how it could cause crashes. Andres may have explained this to me at > some point, but if he did I have unfortunately forgotten. I don't think it is sufficient to rely on the scan. That works only as long as the page with the to-be-modified tuple is pinned (since that'd prevent pruning / vacuuming from working on the page), but I am fairly sure that there are plans where the target tuple is not pinned from the point it was scanned until it is modified. In which case it is entirely possible that the u/d target can be pruned away due to s_t_o between the scan checking s_t_o and the u/d executing. > My general point here is that I would like to know whether we have a > finite number of reasonably localized bugs or a three-ring disaster > that is unrecoverable no matter what we do. Andres seems to think it > is the latter Correct. I think there are numerous architectural issues in the way the feature is implemented right now, and that it'd be a substantial project to address them. > For the record, and to Peter's point, I think it's reasonable to set > v15 feature freeze as a drop-dead date for getting this feature into > acceptable shape, but I would like to try to nail down what we think > "acceptable" means in this context.

I think the absolute minimum would be to have:

- actually working tests
- a halfway thorough code review of the feature
- added documentation explaining where exactly s_t_o tests need to be
- bugfixes obviously

If I were to work on the feature, I cannot imagine being sufficiently confident the feature works as long as the xid->time mapping granularity is a minute. It's just not possible to write reasonable tests with the granularity being that high. Or even to do manual tests of it - I'm not that patient. But I "can accept" it if somebody actually doing the work differs on this. Greetings, Andres Freund
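For illustration of the hazard described above (ItemIdSetDead() zeroes lp_off, so a stale line pointer would make the "tuple" start at the page header), here is a minimal hypothetical sketch of the kind of check whose absence is at issue; it is not a proposed fix:

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/*
 * Illustrative only: if the target line pointer is no longer LP_NORMAL
 * (for example because pruning marked it LP_DEAD), bail out instead of
 * dereferencing an lp_off of 0, which would point at the page header.
 */
static HeapTupleHeader
fetch_target_tuple(Page page, OffsetNumber offnum)
{
    ItemId  lp = PageGetItemId(page, offnum);

    if (!ItemIdIsNormal(lp))
        elog(ERROR, "target tuple was concurrently pruned away");

    return (HeapTupleHeader) PageGetItem(page, lp);
}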
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 11:14:56AM -0500, Justin Pryzby wrote: > You're right, I hadn't thought this through all the way. > There's precedent if the default is non-static (wal_sync_method). > > But I think yes/on/true/1 should be a compatibility alias for a static thing, > and then the only option is pglz. > > Of course, changing the default to LZ4 is still a possibility. We have not yet reached a conclusion on the way we are going to parameterize all that, so let's adapt depending on the feedback. For now, I am really interested in this patch set, so I am going to run some tests of my own and test more around the compression levels we have at our disposal with the proposed algos. What I'd like us to finish with here is one new algorithm, able to cover a large range of cases as mentioned upthread, from low-CPU/low-compression to high-CPU/high-compression. It does not seem like a good idea to be stuck with an algo that only specializes in one or the other, for example. -- Michael signature.asc Description: PGP signature
Re: Duplicate history file?
On Tue, Jun 15, 2021 at 02:28:04PM -0400, Stephen Frost wrote: > > * Julien Rouhaud (rjuju...@gmail.com) wrote: > > On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote: > > > > The fact that this is such a complex problem is the very reason why we > > should > > spend a lot of energy documenting the various requirements. Otherwise, how > > could anyone implement a valid program for that and how could anyone > > validate > > that a solution claiming to do its job actually does its job? > > Reading the code. Oh, if it's as simple as that then surely documenting the various requirements won't be an issue.
Re: disfavoring unparameterized nested loops
On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan wrote: > Whether or not we throw the plan back at the planner or "really change > our minds at execution time" seems like a distinction without a > difference. What is "really change our minds at execution time"? Is that switching to another plan without consulting the planner? If so, what decides what that new plan should be? The planner is meant to be the expert in that department. The new information might cause the join order to completely change. It might not be as simple as swapping a Nested Loop for a Hash Join. > Either way we're changing our minds about the plan based > on information that is fundamentally execution time information, not > plan time information. Have I missed something? I don't really see why you think the number of rows that a given join might produce is execution information. It's exactly the sort of information the planner needs to make a good plan. It's just that today we get that information from statistics. Plenty of other DBMSs make decisions from sampling. FWIW, we already have minimalist sampling in get_actual_variable_range(). I'm just trying to highlight that I don't think overloading nodes is a good path to go down. It's not a sustainable practice. It's a path towards just having a single node that does everything. If your suggestion was not serious then there's no point in discussing it further. David