Re: Decoding of two-phase xacts missing from CREATE_REPLICATION_SLOT command
On Fri, Jun 11, 2021 at 8:14 PM Amit Kapila wrote:
>
> Also, I can take care of the below cosmetic issues before committing
> if we decide to do this for PG-14.
>
> Few cosmetic issues:
> ==
> 1. git diff --check shows
> src/bin/pg_basebackup/t/030_pg_recvlogical.pl:109: new blank line at EOF.
>
> 2.
> +
> >The following example shows SQL interface that can be used to decode
> prepared >transactions. Before you use two-phase commit commands, you must set
>
> Spurious line addition.

Fixed.

> 3.
> /* Build query */
> appendPQExpBuffer(query, "CREATE_REPLICATION_SLOT \"%s\"", slot_name);
> if (is_temporary)
>     appendPQExpBufferStr(query, " TEMPORARY");
> +
> if (is_physical)
>
> Spurious line addition.

Fixed.

> 4.
> appendPQExpBuffer(query, " LOGICAL \"%s\"", plugin);
> + if (two_phase && PQserverVersion(conn) >= 14)
> +     appendPQExpBufferStr(query, " TWO_PHASE");
> +
> if (PQserverVersion(conn) >= 10)
>     /* pg_recvlogical doesn't use an exported snapshot, so suppress */
>     appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
>
> I think it might be better to append TWO_PHASE after NOEXPORT_SNAPSHOT
> but it doesn't matter much.

I haven't changed this; I prefer to keep it this way.

> 5.
> +$node->safe_psql('postgres',
> +   "BEGIN;INSERT INTO test_table values (11); PREPARE TRANSACTION 'test'");
>
> There is no space after BEGIN but there is a space after INSERT. For
> consistency's sake, I will have a space after BEGIN as well.

Changed this.

regards,
Ajin Cherian
Fujitsu Australia

v5-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATION.patch
Description: Binary data

v5-0002-Add-support-for-two-phase-decoding-in-pg_recvlogi.patch
Description: Binary data
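For reference, and based only on the snippet quoted in item 4 above, the replication command that pg_recvlogical would end up sending to a v14 server for a temporary two-phase logical slot would look roughly like this (the slot and plugin names here are made-up placeholders):

    CREATE_REPLICATION_SLOT "myslot" TEMPORARY LOGICAL "test_decoding" TWO_PHASE NOEXPORT_SNAPSHOT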
Re: Use singular number when appropriate
On Tue, 2021-06-15 at 04:59 +, David Fetter wrote:
> I thought about using the dual, but wasn't sure how many languages
> support it.

I think none of the languages for which we cater uses the dual.
But I guess you were joking, since the tests are not translated ...

> if (fail_count == 0 && fail_ignore_count == 0)
>     snprintf(buf, sizeof(buf),
> -            _(" All %d tests passed. "),
> -            success_count);
> +            _(" %s %d test%s passed. "),
> +            success_count == 1 ? "The" : "All",
> +            success_count,
> +            success_count == 1 ? "" : "s");

... and that wouldn't be translatable.

Yours,
Laurenz Albe
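To illustrate the translatability point: the usual way to handle this is to let the message catalog choose the plural form rather than splicing in English-only fragments. A minimal sketch, purely hypothetical here since (as noted) the regression driver's messages are not translated and pg_regress may not be wired up for ngettext() at all:

    if (fail_count == 0 && fail_ignore_count == 0)
        snprintf(buf, sizeof(buf),
                 /* ngettext() lets each language apply its own plural rules */
                 ngettext(" The %d test passed. ",
                          " All %d tests passed. ",
                          success_count),
                 success_count);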
Re: Error on pgbench logs
Hello Michaël,

>> I think we don't have to call doLog() before logAgg(). If we call doLog(),
>> we will count an extra transaction that is not actually processed because
>> accumStats() is called in this.
>
> Yes, calling both is weird.

The motivation to call doLog is to catch up zeros on slow rates, so as to avoid holes in the log, including at the end of the run. This "trick" was already used by the code. I agree that it would record a non-existent transaction, which is not desirable. I wanted to avoid a special parameter, but this seems unrealistic.

> Is using logAgg() directly in the context actually right when it comes
> to sample_rate?

The point is just to trigger the last display, which is not triggered by the previous one, I think, because of the precision: the start of the run is not exactly the start of the thread.

> We may not log anything on HEAD if sample_rate is enabled, but we would
> finish by logging something all the time with this patch.

I do not get it.

> If I am following this code correctly, we don't care about accumStats()
> in the code path of a thread we are done with, right?

Yes.

Attached a v3 which adds a boolean to distinguish recording vs flushing.

--
Fabien.
Re: Error on pgbench logs
On Tue, 15 Jun 2021 10:05:29 +0200 (CEST) Fabien COELHO wrote:
>
> Hello Michaël,
>
> >> I think we don't have to call doLog() before logAgg(). If we call doLog(),
> >> we will count an extra transaction that is not actually processed because
> >> accumStats() is called in this.
> >
> > Yes, calling both is weird.
>
> The motivation to call doLog is to catch up zeros on slow rates, so as to
> avoid holes in the log, including at the end of the run. This "trick" was
> already used by the code. I agree that it would record a non-existent
> transaction, which is not desirable. I wanted to avoid a special
> parameter, but this seems unrealistic.
>
> > Is using logAgg() directly in the context actually right when it comes
> > to sample_rate?
>
> The point is just to trigger the last display, which is not triggered by
> the previous I think because of the precision: the start of the run is
> not exactly the start of the thread.
>
> > We may not log anything on HEAD if sample_rate is enabled, but we would
> > finish by logging something all the time with this patch.
>
> I do not get it.

It was not a problem because --sampling-rate and --aggregate-interval cannot be used at the same time.

> > If I am following this code correctly, we don't care about accumStats()
> > in the code path of a thread we are done with, right?
>
> Yes.
>
> Attached a v3 which adds a boolean to distinguish recording vs flushing.

Sorry, but I can't find any patch attached...

--
Yugo NAGATA
Re: Signed vs. Unsigned (some)
At Fri, 11 Jun 2021 23:05:29 -0300, Ranier Vilela wrote in
> Hi,
>
> Removing legitimate warnings can it be worth it?

Where does the warning come from? And what is the exact message?

> -1 CAST can be wrong, when there is an invalid value defined
> (InvalidBucket, InvalidBlockNumber).
> I think depending on the compiler -1 CAST may be different from
> InvalidBucket or InvalidBlockNumber.

The definitions are not ((type) -1) but ((type) 0xFFFFFFFF), so actually they might be different if we forget to widen the constant when widening the types. Regarding the compiler behavior, I think we are assuming C99[1], and C99 defines that -1 is converted to Uxxx_MAX (6.3.1.3 Signed and unsigned integers).

I'm +0.2 on it. It might be worthwhile as a matter of style.

> pg_rewind is one special case.
> All cases of XLogSegNo (uint64) initialization are zero, but in pg_rewind
> was used -1?
> I did not find it InvalidXLogSegNo!

I'm not sure whether that is a thinko that the variable is signed or whether it is intentional to assign the maximum value. Anyway, there's actually no need for initializing the variable at all, so I don't think it's worth changing the initial value. If any compiler actually complains about the assignment, changing it to zero seems reasonable.

> Not tested.
>
> Trivial patch attached.

Please don't quickly update the patch responding to my comments alone. I might be a minority.

[1] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
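A tiny standalone illustration of the conversion rule being referenced (this is not Postgres code; the typedef and macro only mirror what block.h defines). Per C99 6.3.1.3, assigning -1 to an unsigned type yields that type's maximum value, so the two spellings are equivalent on any conforming compiler, even though MSVC at /W4 flags the implicit conversion with the C4245 warning quoted later in the thread:

    #include <stdint.h>
    #include <assert.h>

    typedef uint32_t BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    int main(void)
    {
        BlockNumber blkno = -1;              /* wraps to UINT32_MAX per C99 6.3.1.3 */
        assert(blkno == InvalidBlockNumber); /* always true on a conforming compiler */
        return 0;
    }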
Re: Error on pgbench logs
On Tue, Jun 15, 2021 at 05:15:14PM +0900, Yugo NAGATA wrote: > On Tue, 15 Jun 2021 10:05:29 +0200 (CEST) Fabien COELHO > wrote: > It was not a problem because --sampling-rate --aggregate-interval cannot be > used at the same time. Yep, you are right, thanks. I have missed that both options cannot be specified at the same time. -- Michael signature.asc Description: PGP signature
RE: psql - factor out echo code
>> Wouldn't it be better to comment it like any other function? > >Sure. Attached. Thank you for your revision. I think this patch is good, so I will move it to ready for committer. Best regards, Shinya Kato
RE: [PATCH] expand the units that pg_size_pretty supports on output
>> I don't see the need to extend the unit to YB. >> What use case do you have in mind? > >Practical or no, I saw no reason not to support all defined units. I assume >we’ll >get to a need sooner or later. :) Thank you for your reply! Hmmm, I didn't think YB was necessary, but what do others think? Best regards, Shinya Kato
Re: Error on pgbench logs
> Attached a v3 which adds a boolean to distinguish recording vs flushing.

Better with the attachment… sorry for the noise.

--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..3b27ffebf8 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -648,7 +648,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
                          PgBenchValue *retval);
 static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
                   StatsData *agg, bool skipped, double latency, double lag);
 static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
                              bool skipped, StatsData *agg);
@@ -3765,6 +3765,30 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
 	return CSTATE_END_COMMAND;
 }
 
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+	fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+			agg->start_time,
+			agg->cnt,
+			agg->latency.sum,
+			agg->latency.sum2,
+			agg->latency.min,
+			agg->latency.max);
+	if (throttle_delay)
+	{
+		fprintf(logfile, " %.0f %.0f %.0f %.0f",
+				agg->lag.sum,
+				agg->lag.sum2,
+				agg->lag.min,
+				agg->lag.max);
+		if (latency_limit)
+			fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+	}
+	fputc('\n', logfile);
+}
+
 /*
  * Print log entry after completing one transaction.
  *
@@ -3774,7 +3798,7 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
  * the cost of an extra syscall in all cases.
  */
 static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
@@ -3793,42 +3817,29 @@ doLog(TState *thread, CState *st,
 	/* should we aggregate the results or not? */
 	if (agg_interval > 0)
 	{
+		pg_time_usec_t next;
+
 		/*
 		 * Loop until we reach the interval of the current moment, and print
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-
-		while (agg->start_time + agg_interval <= now)
+		while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
 		{
-			/* print aggregated report to logfile */
-			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					agg->start_time,
-					agg->cnt,
-					agg->latency.sum,
-					agg->latency.sum2,
-					agg->latency.min,
-					agg->latency.max);
-			if (throttle_delay)
-			{
-				fprintf(logfile, " %.0f %.0f %.0f %.0f",
-						agg->lag.sum,
-						agg->lag.sum2,
-						agg->lag.min,
-						agg->lag.max);
-				if (latency_limit)
-					fprintf(logfile, " " INT64_FORMAT, agg->skipped);
-			}
-			fputc('\n', logfile);
+			logAgg(logfile, agg);
 
 			/* reset data and move to next interval */
-			initStats(agg, agg->start_time + agg_interval);
+			initStats(agg, next);
 		}
 
-		/* accumulate the current transaction */
-		accumStats(agg, skipped, latency, lag);
+		if (tx)
+			/* accumulate the current transaction */
+			accumStats(agg, skipped, latency, lag);
+		else
+			/* final call to show the last aggregate */
+			logAgg(logfile, agg);
 	}
-	else
+	else if (tx)
 	{
 		/* no, print raw transactions */
 		if (skipped)
@@ -3889,7 +3900,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 	st->cnt++;
 
 	if (use_log)
-		doLog(thread, st, agg, skipped, latency, lag);
+		doLog(thread, st, true, agg, skipped, latency, lag);
 
 	/* XXX could use a mutex here, but we choose not to */
 	if (per_script_stats)
@@ -6794,8 +6805,9 @@ done:
 	if (agg_interval > 0)
 	{
 		/* log aggregated but not yet reported transactions */
-		doLog(thread, state, &aggs, false, 0, 0);
+		doLog(thread, state, false, &aggs, false, 0.0, 0.0);
 	}
+
 	fclose(thread->logfile);
 	thread->logfile = NULL;
 }
RE: Transactions involving multiple postgres foreign servers, take 2
From: Robert Haas
> Well, we're talking about running this commit routine from within
> CommitTransaction(), right? So I think it is in fact running in the
> server. And if that's so, then you have to worry about how to make it
> respond to interrupts. You can't just call some function
> DBMSX_xa_commit() and wait for infinite time for it to return. Look at
> pgfdw_get_result() for an example of what real code to do this looks
> like.

Postgres can do that, but other implementations cannot necessarily do it, I'm afraid. But before that, the FDW interface documentation doesn't describe anything about how to handle interrupts. Actually, odbc_fdw and possibly other FDWs don't respond to interrupts.

> Honestly, I am not quite sure what any specification has to say about
> this. We're talking about what happens when a user does something with
> a foreign table and then types COMMIT. That's all about providing a set
> of behaviors that are consistent with how PostgreSQL works in other
> situations. You can't negotiate away the requirement to handle errors
> in a way that works with PostgreSQL's infrastructure, or the
> requirement that any lengthy operation handle interrupts properly, by
> appealing to a specification.

What we're talking about here is mainly whether commit should return success or failure when some participants failed to commit in the second phase of 2PC. That's new to Postgres, isn't it? Anyway, we should respect existing relevant specifications and (well-known) implementations before we conclude that we have to devise our own behavior.

> That sounds more like a limitation of the present implementation than
> a fundamental problem. We shouldn't reject the idea of having a
> resolver process handle this just because the initial implementation
> might be slow. If there's no fundamental problem with the idea,
> parallelism and concurrency can be improved in separate patches at a
> later time. It's much more important at this stage to reject ideas
> that are not theoretically sound.

We talked about that, and unfortunately, I haven't seen a good and feasible idea to enhance the current approach that involves the resolver from the beginning of 2PC processing. Honestly, I don't understand why such a "one prepare, one commit in turn" serialization approach can be allowed in PostgreSQL, where developers pursue best performance and even try to refrain from adding an if statement in a hot path. As I showed and Ikeda-san said, other implementations have each client session send prepare and commit requests. That's a natural way to achieve reasonable concurrency and performance.

Regards
Takayuki Tsunakawa
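For readers unfamiliar with the pattern Robert is pointing at, here is a rough sketch of how postgres_fdw's pgfdw_get_result() stays responsive to interrupts while waiting for a remote command. This is a simplified illustration only (error handling and multi-result handling omitted), not the actual function:

    /* Simplified sketch of an interrupt-aware wait for a libpq result. */
    static PGresult *
    get_result_interruptible(PGconn *conn)
    {
        while (PQisBusy(conn))
        {
            int     wc;

            /* Sleep until the socket is readable or our latch is set. */
            wc = WaitLatchOrSocket(MyLatch,
                                   WL_LATCH_SET | WL_SOCKET_READABLE |
                                   WL_EXIT_ON_PM_DEATH,
                                   PQsocket(conn), -1L, PG_WAIT_EXTENSION);

            if (wc & WL_LATCH_SET)
            {
                ResetLatch(MyLatch);
                CHECK_FOR_INTERRUPTS();     /* this is what makes the wait cancelable */
            }

            if ((wc & WL_SOCKET_READABLE) && !PQconsumeInput(conn))
                return NULL;                /* connection broken */
        }

        return PQgetResult(conn);
    }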
Re: Signed vs. Unsigned (some)
Hi Kyotaro,

Thanks for taking a look.

On Tue, Jun 15, 2021 at 05:17, Kyotaro Horiguchi <horikyota@gmail.com> wrote:
> At Fri, 11 Jun 2021 23:05:29 -0300, Ranier Vilela wrote in
> > Hi,
> >
> > Removing legitimate warnings can it be worth it?
>
> Where does the warning come from? And what is the exact message?

msvc 64-bit compiler (Level 4):
warning C4245: '=': conversion from 'int' to 'Bucket', signed/unsigned mismatch

> > -1 CAST can be wrong, when there is an invalid value defined
> > (InvalidBucket, InvalidBlockNumber).
> > I think depending on the compiler -1 CAST may be different from
> > InvalidBucket or InvalidBlockNumber.
>
> The definitions are not ((type) -1) but ((type) 0xFFFFFFFF), so
> actually they might be different if we forget to widen the constant
> when widening the types. Regarding the compiler behavior, I think
> we are assuming C99[1], and C99 defines that -1 is converted to
> Uxxx_MAX (6.3.1.3 Signed and unsigned integers).
>
> I'm +0.2 on it. It might be worthwhile as a matter of style.

I think it is about more than style. This is one of those tricks that should not be used.

> > pg_rewind is one special case.
> > All cases of XLogSegNo (uint64) initialization are zero, but in pg_rewind
> > was used -1?
> > I did not find it InvalidXLogSegNo!
>
> I'm not sure whether that is a thinko that the variable is signed or
> that it is intentional to assign the maximum value.

It is a thinko.

> Anyway, actually
> there's no need for initializing the variable at all. So I don't think
> it's worth changing the initial value.

Is it a case of removing the initialization, then?

> If any compiler actually
> complains about the assignment, changing it to zero seems reasonable.

Same case:
msvc 64-bit compiler (Level 4):
warning C4245: '=': initialization from 'int' to 'XLogSegNo', signed/unsigned mismatch

> > Not tested.
> >
> > Trivial patch attached.
>
> Please don't quickly update the patch responding to my comments alone.
> I might be a minority.

Ok.

best regards,
Ranier Vilela
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
On Fri, Jun 11, 2021 at 19:49, Andres Freund wrote:
> Hi,
>
> On 2020-04-23 14:36:15 +0900, Kyotaro Horiguchi wrote:
> > At Thu, 23 Apr 2020 01:21:21 -0300, Ranier Vilela wrote in
> > > On Wed, Apr 22, 2020 at 23:27, Kyotaro Horiguchi <horikyota@gmail.com> wrote:
> > > >
> > > > - strncpy(sqlca->sqlerrm.sqlerrmc, message,
> > > >           sizeof(sqlca->sqlerrm.sqlerrmc));
> > > > - sqlca->sqlerrm.sqlerrmc[sizeof(sqlca->sqlerrm.sqlerrmc) - 1] = 0;
> > > > + sqlca->sqlerrm.sqlerrmc[sizeof(sqlca->sqlerrm.sqlerrmc) - 1] = '\0';
> > > > + strncpy(sqlca->sqlerrm.sqlerrmc, message,
> > > >           sizeof(sqlca->sqlerrm.sqlerrmc) - 1);
> > > >
> > > > The existing strncpy then terminating by NUL works fine. I don't think
> > > > there's any point in doing it the reverse way. Actually
> > > > sizeof(sqlca->sqlerrm.sqlerrmc) - 1 is enough for the length, but the
> > > > existing code is not necessarily a bug.
> > >
> > > Without understanding, then, why does Coverity claim a bug here?
> >
> > Well, handling non-terminated strings with str* functions is a sign
> > of a bug in most cases. Coverity is very useful but false positives are
> > annoying. I wonder what would happen if we attached Coverity annotations
> > to such code.
>
> It might be worth doing something about this, for other reasons. We have
> disabled -Wstringop-truncation in 716585235b1. But I've enabled it in my
> debug build, because I find it useful. The only warning we're getting
> in non-optimized builds is
>
> /home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c: In function 'ECPGset_var':
> /home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c:565:17: warning:
> 'strncpy' output truncated before terminating nul copying 5 bytes from a
> string of the same length [-Wstringop-truncation]
>   565 |   strncpy(sqlca->sqlstate, "YE001", sizeof(sqlca->sqlstate));
>       |   ^~

memcpy would not suffer from this?

regards,
Ranier Vilela
Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)
On Mon, Jun 14, 2021 at 21:01, Ranier Vilela wrote:
> I took it a step further.
>
> Transactions:
>
>          HEAD          patched
>      10002207         10586781
>      10146167         10388685
>      10048919         10333359
> avg  10065764.333     10436275     improvement: 3.55021946687555%
>
> TPS:
>
>          HEAD          patched
>      33469.016009     35399.010472
>      33950.624679     34733.252336
>      33639.8429       34578.495043
> avg  33686.494529     34903.585950  improvement: 3.48700968070122%
>
> 3.55%. Is it worth touching procarray.c for real?
>
> With msvc 64 bits, the asm generated:
> HEAD:    213.731 bytes procarray.asm
> patched: 212.035 bytes procarray.asm
>
> Patch attached.

Added to next CF (https://commitfest.postgresql.org/33/3169/)

regards,
Ranier Vilela
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Mon, Jun 14, 2021 at 9:08 PM Robert Haas wrote:
>
> On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila wrote:
> > Why do you think we don't need to check index AM functions? Say we
> > have an index expression that uses a function, and if its parallel safety
> > is changed then probably that also impacts whether we can do the insert in
> > parallel. Because otherwise, we will end up executing some parallel-unsafe
> > function in parallel mode during index insertion.
>
> I'm not saying that we don't need to check index expressions. I agree
> that we need to check those. The index AM functions are things like
> btint4cmp(). I don't think that a function like that should ever be
> parallel-unsafe.

Okay, but I think if we go with your suggested model where, whenever there is a change in the parallel-safety of any function, we need to send the new invalidation, then I think it won't matter whether the function is associated with an index expression, a check constraint on the table, or is used in any other way.

> Even if this line of thinking is correct, there's a big issue for
> partitioning hierarchies because there you need to know stuff about
> relations that you don't have any other reason to open. I'm just
> arguing that if there's no partitioning, the problem seems reasonably
> solvable. Either you changed something about the relation, in which
> case you've got to lock it and issue invalidations, or you've changed
> something about the function, which could be handled via a new type of
> invalidation. I don't really see why the cost would be particularly
> bad. Suppose that for every relation, you have a flag which is either
> PARALLEL_DML_SAFE, PARALLEL_DML_RESTRICTED, PARALLEL_DML_UNSAFE, or
> PARALLEL_DML_SAFETY_UNKNOWN. When someone sends a message saying "some
> existing function's parallel-safety changed!" you reset that flag for
> every relation in the relcache to PARALLEL_DML_SAFETY_UNKNOWN. Then if
> somebody does DML on that relation and we want to consider
> parallelism, it's got to recompute that flag. None of that sounds
> horribly expensive.

Sounds reasonable. I will think more on this and see if anything else comes to mind apart from what you have mentioned.

> I mean, it could be somewhat annoying if you have 100k relations open
> and sit around all day flipping parallel-safety markings on and off
> and then doing a single-row insert after each flip, but if that's the
> only scenario where we incur significant extra overhead from this kind
> of design, it seems clearly better than forcing users to set a flag
> manually. Maybe it isn't, but I don't really see what the other
> problem would be right now. Except, of course, for partitioning, which
> I'm not quite sure what to do about.

Yeah, dealing with partitioned tables is tricky. I think if we don't want to check upfront the parallel safety of all the partitions, then the other option, as discussed, could be to ask the user to specify the parallel safety of partitioned tables. We can additionally check the parallel safety of partitions when we are trying to insert into a particular partition and error out if we detect any parallel-unsafe clause and we are in parallel mode. So, in this case, we won't be completely relying on the users. Users can either change the parallel-safe option of the table or remove/change the parallel-unsafe clause after the error. The new invalidation message as we are discussing would invalidate the parallel-safety for individual partitions but not the root partition (the partitioned table itself). For the root partition, we will rely on information specified by the user.

I am not sure if we have a simple way to check the parallel safety of partitioned tables. In some way, we need to rely on the user, either (a) by providing an option to specify whether parallel Inserts (and/or other DMLs) can be performed, or (b) by providing a GUC and/or rel option which indicates that we can check the parallel-safety of all the partitions. Yet another option that I don't like could be to parallelize inserts only on non-partitioned tables.

--
With Regards,
Amit Kapila.
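To make Robert's earlier description a bit more concrete, the per-relation flag he sketches could look something like the toy code below. Every name here is hypothetical; nothing like this exists in core today, and the real thing would live in the relcache rather than in a stub struct:

    /* Hypothetical per-relation parallel-DML-safety state. */
    typedef enum ParallelDMLSafety
    {
        PARALLEL_DML_SAFETY_UNKNOWN = 0,    /* must be recomputed before use */
        PARALLEL_DML_SAFE,
        PARALLEL_DML_RESTRICTED,
        PARALLEL_DML_UNSAFE
    } ParallelDMLSafety;

    typedef struct RelCacheEntryStub      /* stand-in for a relcache entry */
    {
        unsigned int        relid;
        ParallelDMLSafety   dml_safety;
    } RelCacheEntryStub;

    /*
     * On receiving a "some function's parallel-safety changed" invalidation,
     * forget everything; the next DML on each relation recomputes the flag.
     */
    static void
    invalidate_parallel_dml_safety(RelCacheEntryStub *entries, int nentries)
    {
        for (int i = 0; i < nentries; i++)
            entries[i].dml_safety = PARALLEL_DML_SAFETY_UNKNOWN;
    }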
RE: locking [user] catalog tables vs 2pc vs logical rep
On Tuesday, June 15, 2021 1:51 PM vignesh C wrote: > > Attached the patch-set that addressed those two comments. > > I also fixed the commit message a bit in the 2PC specific patch to HEAD. > > No other changes. > > > > Please check. > > Thanks for the updated patches, the patch applies cleanly in all branches. > Please add a commitfest entry for this, so that we don't miss it. Thank you. I've registered the patch-set in [1]. I'll wait for other reviews from other developers, if any. [1] - https://commitfest.postgresql.org/33/3170/ Best Regards, Takamichi Osumi
Re: Decoding speculative insert with toast leaks memory
On Mon, Jun 14, 2021 at 12:06 PM Dilip Kumar wrote:
>
> On Mon, Jun 14, 2021 at 9:44 AM Dilip Kumar wrote:
> >
> > On Mon, Jun 14, 2021 at 8:34 AM Amit Kapila wrote:
> > >
> > > I think the test in this patch is quite similar to what Noah has
> > > pointed out in the nearby thread [1] to be failing at some intervals. Can
> > > you also please verify the same, and if we can expect similar
> > > failures here then we might want to consider dropping the test in this
> > > patch for now? We can always come back to it once we find a good
> > > solution to make it pass consistently.
> >
> > test insert-conflict-do-nothing   ... ok          646 ms
> > test insert-conflict-do-nothing-2 ... ok         1994 ms
> > test insert-conflict-do-update    ... ok         1786 ms
> > test insert-conflict-do-update-2  ... ok         2689 ms
> > test insert-conflict-do-update-3  ... ok          851 ms
> > test insert-conflict-specconflict ... FAILED     3695 ms
> > test delete-abort-savept          ... ok         1238 ms
> >
> > Yeah, this is the same test that we have used as the base for our test, so
> > let's not push this test until it becomes stable.
>
> Patches without test case.

Pushed!

--
With Regards,
Amit Kapila.
[Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Currently, CREATE DATABASE forces a checkpoint, then copies all the files, then forces another checkpoint. The comments in the createdb() function explain the reasons for this. The attached patch fixes this problem by making CREATE DATABASE completely WAL-logged so that now we can avoid checkpoints. The patch modifies both CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE to be fully WAL-logged.

One main advantage of this change is that it will be cheaper. Forcing checkpoints on an idle system is no big deal, but when the system is under heavy write load, it's very expensive. Another advantage is that it makes things better for features like TDE, which might want the pages in the source database to be encrypted using a different key or nonce than the pages in the target database.

Design Idea:
------------
First, create the target database directory along with the version file, and WAL-log this operation. Create the "relation map file" in the target database and copy the content from the source database. For this, we can use a modified version of write_relmap_file() and WAL-log the relmap create operation along with the file content.

Now, read the relmap file to find the relfilenode for pg_class, and then read pg_class block by block and decode the tuples. For reading the pg_class blocks, we can use ReadBufferWithoutRelCache() so that we don't need the relcache. Nothing prevents us from checking visibility for tuples in another database because CLOG is global to the cluster, and nothing prevents us from deforming those tuples because the column definitions for pg_class have to be the same in every database. From that we can get the relfilenode of every file we need to copy, and prepare a list of all such relfilenodes.

Next, for each relfilenode in the source database, create a corresponding relfilenode in the target database (for all forks) using smgrcreate, which is already a WAL-logged operation. Now read the source relfilenode block by block using ReadBufferWithoutRelCache(), copy each block to the target relfilenode using smgrextend(), and WAL-log it using log_newpage(). For the source database, we cannot directly use smgrread(), because there could be some dirty buffers, so we will have to read them through the buffer manager interface; otherwise, we would have to flush all the dirty buffers.

WAL sequence using pg_waldump:

1. (new WAL to create the db dir and write the PG_VERSION file)
   rmgr: Database  desc: CREATE create dir 1663/16394
2. (new WAL to create and write the relmap file)
   rmgr: RelMap    desc: CREATE database 16394 tablespace 1663 size 512
3. (create relfilenode)
   rmgr: Storage   desc: CREATE base/16394/16384
   rmgr: Storage   desc: CREATE base/16394/2619
4. (write page data)
   rmgr: XLOG      desc: FPI , blkref #0: rel 1663/16394/2619 blk 0 FPW
   rmgr: XLOG      desc: FPI , blkref #0: rel 1663/16394/2619 blk 1 FPW
5. (create other forks)
   rmgr: Storage   desc: CREATE base/16394/2619_fsm
   rmgr: Storage   desc: CREATE base/16394/2619_vm
   ...

I have attached a POC patch which shows this idea. With this patch, all basic sanity testing and "check-world" are passing.

Open points:
------------
- This is a POC patch, so it needs more refactoring/cleanup and testing.
- Might need to relook into the SMGR-level API usage.

Credits:
--------
Thanks to Robert Haas for suggesting this idea and the high-level design.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

From e472d3cb744dc45641d36e919098f9570f80a8fd Mon Sep 17 00:00:00 2001
From: Dilip Kumar
Date: Sat, 5 Jun 2021 17:08:13 +0530
Subject: [PATCH v1] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch fixes this
problem by making create database completely WAL logged and so that we
can avoid the checkpoints.
---
 src/backend/access/rmgrdesc/dbasedesc.c  |   3 +-
 src/backend/access/rmgrdesc/relmapdesc.c |  10 +
 src/backend/access/transam/xlogutils.c   |  12 +-
 src/backend/commands/dbcommands.c        | 653 ---
 src/backend/storage/buffer/bufmgr.c      |  13 +-
 src/backend/utils/cache/relmapper.c      | 222 +++
 src/bin/pg_rewind/parsexlog.c            |   5 +
 src/include/commands/dbcommands_xlog.h   |   7 +-
 src/include/storage/bufmgr.h             |   3 +-
 src/include/utils/relmapper.h            |   6 +-
 10 files changed, 613 insertions(+), 321 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..5010f72 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;

-		appendStringInfo(buf, "copy di
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Jun 15, 2021 at 04:50:24PM +0530, Dilip Kumar wrote:
> Currently, CREATE DATABASE forces a checkpoint, then copies all the
> files, then forces another checkpoint. The comments in the createdb()
> function explain the reasons for this. The attached patch fixes this
> problem by making CREATE DATABASE completely WAL-logged so that now we
> can avoid checkpoints. The patch modifies both CREATE DATABASE and
> ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.
>
> One main advantage of this change is that it will be cheaper. Forcing
> checkpoints on an idle system is no big deal, but when the system is
> under heavy write load, it's very expensive. Another advantage is that
> it makes things better for features like TDE, which might want the
> pages in the source database to be encrypted using a different key or
> nonce than the pages in the target database.

I only had a quick look at the patch, but AFAICS your patch makes the new behavior mandatory. Wouldn't it make sense to have a way to use the previous approach? People wanting to copy a somewhat big database over slow replication may prefer to pay for 2 checkpoints rather than stream everything. The same goes for people who have an otherwise idle system (I often use that to make temporary backups and/or prepare multiple datasets, and most of the time the checkpoint is basically free).
Re: Race condition in recovery?
On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote:
> At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane wrote in
>> I think jacana uses msys[2?], so this likely indicates a problem
>> in path sanitization for the archive command. Andrew, any advice?
>
> Thanks for fixing it.
>
> # I still haven't succeeded in running the TAP tests in an MSYS2
> # environment. I cannot install IPC::Run for msys perl..
>
> regards.

Unpack the attached somewhere and point your PERL5LIB at it. That's all I do.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

IPC-Run-Win-0.94.tgz
Description: application/compressed-tar
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On 15/06/2021 14:20, Dilip Kumar wrote:
> Design Idea:
> First, create the target database directory along with the version
> file and WAL-log this operation. Create the "relation map file" in the
> target database and copy the content from the source database. For
> this, we can use some modified versions of the write_relmap_file() and
> WAL-log the relmap create operation along with the file content. Now,
> read the relmap file to find the relfilenode for pg_class and then we
> read pg_class block by block and decode the tuples. For reading the
> pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
> don't need the relcache. Nothing prevents us from checking visibility
> for tuples in another database because CLOG is global to the cluster.
> And nothing prevents us from deforming those tuples because the column
> definitions for pg_class have to be the same in every database. Then
> we can get the relfilenode of every file we need to copy, and prepare
> a list of all such relfilenodes.

I guess that would work, but you could also walk the database directory like copydir() does. How you find the relations to copy is orthogonal to whether you WAL-log them or use checkpoints. And whether you use the buffer cache is also orthogonal to the rest of the proposal; you could issue FlushDatabaseBuffers() instead of a checkpoint.

> Next, for each relfilenode in the source database, create a respective
> relfilenode in the target database (for all forks) using smgrcreate,
> which is already a WAL-logged operation. Now read the source
> relfilenode block by block using ReadBufferWithoutRelCache() and copy
> the block to the target relfilenode using smgrextend() and WAL-log
> them using log_newpage(). For the source database, we can not directly
> use the smgrread(), because there could be some dirty buffers so we
> will have to read them through the buffer manager interface,
> otherwise, we will have to flush all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be less weird than the current system. As Julien also pointed out, the question is, are there people relying on "CREATE DATABASE foo TEMPLATE bar" to copy a large source database, on the premise that it's fast because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged system if the database is small, and the old system with checkpoints if it's large. But I don't like the idea of having to maintain both.

- Heikki
Re: Race condition in recovery?
On Mon, Jun 14, 2021 at 3:47 PM Andrew Dunstan wrote: > So, will you feel happier with this applied? I haven't tested it yet but > I'm confident it will work. I'm not all that unhappy now, but yeah, that looks like an improvement to me. I'm still afraid that I will keep writing tests that blow up on Windows but that's a bigger problem than we can hope to fix on this thread, and I do think this discussion has helped. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error on pgbench logs
On Tue, 15 Jun 2021 18:01:18 +0900 Michael Paquier wrote: > On Tue, Jun 15, 2021 at 05:15:14PM +0900, Yugo NAGATA wrote: > > On Tue, 15 Jun 2021 10:05:29 +0200 (CEST) Fabien COELHO > > wrote: > > It was not a problem because --sampling-rate --aggregate-interval cannot be > > used at the same time. > > Yep, you are right, thanks. I have missed that both options cannot be > specified at the same time. Maybe, adding Assert(sample_rate == 0.0 || agg_interval == 0) or moving the check of sample_rate into the else block could improve code readability? -- Yugo NAGATA
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas wrote: > > On 15/06/2021 14:20, Dilip Kumar wrote: > > Design Idea: . Then > > we can get the relfilenode of every file we need to copy, and prepare > > a list of all such relfilenode. > > I guess that would work, but you could also walk the database directory > like copydir() does. How you find the relations to copy is orthogonal to > whether you WAL-log them or use checkpoints. And whether you use the > buffer cache is also orthogonal to the rest of the proposal; you could > issue FlushDatabaseBuffers() instead of a checkpoint. Yeah, that would also work, but I thought since we are already avoiding the checkpoint so let's avoid FlushDatabaseBuffers() also and directly use the lower level buffer manager API which doesn't need recache. And I am using pg_class to identify the useful relfilenode so that we can avoid processing some unwanted relfilenode but yeah I agree that this is orthogonal to whether we use checkpoint or not. > Yeah, WAL-logging the contents of the source database would certainly be > less weird than the current system. As Julien also pointed out, the > question is, are there people using on "CREATE DATABASE foo TEMPLATE > bar" to copy a large source database, on the premise that it's fast > because it skips WAL-logging? > > In principle, we could have both mechanisms, and use the new WAL-logged > system if the database is small, and the old system with checkpoints if > it's large. But I don't like idea of having to maintain both. Yeah, I agree in some cases, where we don't have many dirty buffers, checkpointing can be faster. I think code wise maintaining two approaches will not be a very difficult job because the old approach just calls copydir(), but I am thinking about how can we decide which approach is better in which scenario. I don't think we can take calls just based on the database size? It would also depend upon many other factors e.g. how busy your system is, how many total dirty buffers are there in the cluster right? because checkpoint will affect the performance of the operation going on in other databases in the cluster. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Error on pgbench logs
On Tue, 15 Jun 2021 11:38:00 +0200 (CEST) Fabien COELHO wrote: > > > Attached a v3 which adds a boolean to distinguish recording vs flushing. I am fine with this version, but I think it would be better if we have a comment explaining what "tx" is for. Also, how about adding Assert(tx) instead of using "else if (tx)" because we are assuming that tx is always true when agg_interval is not used, right? -- Yugo NAGATA
Re: unnesting multirange data types
On Mon, Jun 14, 2021 at 4:14 PM Alexander Korotkov wrote:
> On Mon, Jun 14, 2021 at 3:50 PM Jonathan S. Katz wrote:
> > On 6/13/21 5:18 PM, Alexander Korotkov wrote:
> >
> > >> "Expands an array into a set of rows. The array's elements are read out
> > >> in storage order."
> > >>
> > >> If we tweaked the multirange "unnest" function, we could change it to:
> > >>
> > >> +
> > >> +        Expands a multirange into a set of rows.
> > >> +        The ranges are read out in storage order (ascending).
> > >> +
> > >>
> > >> to match the array "unnest" function docs, or
> > >>
> > >> +
> > >> +        Expands a multirange into a set of rows that each
> > >> +        contain an individual range.
> > >> +        The ranges are read out in storage order (ascending).
> > >> +
> > >>
> > >> to be a bit more specific. However, I think this is also bordering on
> > >> overengineering the text, given there has been a lack of feedback on the
> > >> "unnest" array function description being confusing.
> > >
> > > I think it's not necessary to talk about rows here. Our
> > > documentation already has a number of examples where we
> > > describe a set of returned values without speaking about rows, including:
> > > json_array_elements, json_array_elements_text, json_object_keys,
> > > pg_listening_channels, pg_tablespace_databases...
> >
> > I do agree -- my main point was that I don't think we need to change
> > anything. I proposed alternatives just to show some other ways of
> > looking at it. But as I mentioned, at this point I think it's
> > overengineering the text.
> >
> > If folks are good with the method + code, I think this is ready.
>
> Cool, thank you for the summary. I'll wait for two days since I've
> published the last revision of the patch [1] (comes tomorrow), and
> push it if no new issues arise.

Pushed! Thanks to thread participants for raising this topic and review. I'll be around to resolve issues if any.

--
Regards,
Alexander Korotkov
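For anyone who wants to try the committed function, usage looks roughly like this (illustrative session; the int4multirange literal is just an example, and the ranges come out in storage order, i.e. ascending):

    SELECT unnest('{[1,3), [5,10)}'::int4multirange);

     unnest
    --------
     [1,3)
     [5,10)
    (2 rows)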
Re: doc issue missing type name "multirange" in chapter title
On Sun, Jun 13, 2021 at 2:48 PM Alexander Korotkov wrote: > On Fri, Jun 11, 2021 at 11:16 PM Tom Lane wrote: > > > > Alexander Korotkov writes: > > > What about "Range/Multirange Functions and Operators"? > > > > Better than a comma, I guess. Personally I didn't have a > > problem with the form with two "ands". > > Thank you. I propose to push the slash option because it both evades > double "and" and it's aligned with sibling section headers (we have > "Date/Time Functions and Operators"). > > Any objections? I heard no objection. So, pushed. -- Regards, Alexander Korotkov
Re: doc issue missing type name "multirange" in chapter title
On Tue, Jun 15, 2021 at 15:11, Alexander Korotkov wrote:
> On Sun, Jun 13, 2021 at 2:48 PM Alexander Korotkov wrote:
> > On Fri, Jun 11, 2021 at 11:16 PM Tom Lane wrote:
> > >
> > > Alexander Korotkov writes:
> > > > What about "Range/Multirange Functions and Operators"?
> > >
> > > Better than a comma, I guess. Personally I didn't have a
> > > problem with the form with two "ands".
> >
> > Thank you. I propose to push the slash option because it both evades
> > double "and" and it's aligned with sibling section headers (we have
> > "Date/Time Functions and Operators").
> >
> > Any objections?
>
> I heard no objection. So, pushed.

Thank you

Pavel

> --
> Regards,
> Alexander Korotkov
Re: Function scan FDW pushdown
Hi Alexander,

On Thu, May 20, 2021 at 11:13 PM Alexander Pyhalov wrote:
>
> Hi.
>
> The attached patch allows pushing joins with function RTEs to PostgreSQL
> data sources.
> This makes executing queries like this
>
> create foreign table f_pgbench_accounts (aid int, bid int, abalance int,
>     filler char(84)) SERVER local_srv OPTIONS (table_name 'pgbench_accounts');
> select * from f_pgbench_accounts join unnest(array[1,2,3]) ON unnest = aid;

It will be good to provide some practical examples where this is useful.

> more efficient.
>
> with patch:
>
> So far I don't know how to visualize the actual function expression used in
> the function RTE, as in postgresExplainForeignScan() es->rtable comes from
> queryDesc->plannedstmt->rtable, and rte->functions is already 0.

The actual function expression will be part of the Remote SQL of the ForeignScan node, so there is no need to visualize it separately.

The patch will have problems when there are multiple foreign tables, all on different servers or using different FDWs. In such a case the function scan's RelOptInfo will get the fpinfo based on the first foreign table the function scan is paired with during join planning. But that may not be the best foreign table to join. We should be able to plan all the possible joins. The current infra to add one fpinfo per RelOptInfo won't help there. We need something better.

The patch targets only postgres_fdw; how do you see this working with other FDWs?

If we come up with the right approach we could use it for
1. pushing down queries with an IN () clause
2. joining a small local table with a large foreign table by sending the local table rows down to the foreign server.

--
Best Wishes,
Ashutosh Bapat
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Am I mistaken in thinking that this would allow CREATE DATABASE to run inside a transaction block now, further reducing the DDL commands that are non-transactional?
Re: Case expression pushdown
Looks quite useful to me. Can you please add this to the next commitfest?

On Wed, Jun 9, 2021 at 5:25 PM Alexander Pyhalov wrote:
>
> Hi.
>
> This patch allows pushing case expressions to foreign servers, so that
> more types of updates could be executed directly.
>
> For example, without patch:
>
> EXPLAIN (VERBOSE, COSTS OFF)
> UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
> WHERE c1 > 1000;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Update on public.ft2 d
>    Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2 WHERE ctid = $1
>    ->  Foreign Scan on public.ft2 d
>          Output: CASE WHEN (c2 > 0) THEN c2 ELSE 0 END, ctid, d.*
>          Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM
>            "S 1"."T 1" WHERE (("C 1" > 1000)) FOR UPDATE
>
> EXPLAIN (VERBOSE, COSTS OFF)
> UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
> WHERE c1 > 1000;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Update on public.ft2 d
>    ->  Foreign Update on public.ft2 d
>          Remote SQL: UPDATE "S 1"."T 1" SET c2 = (CASE WHEN (c2 > 0)
>            THEN c2 ELSE 0 END) WHERE (("C 1" > 1000))
>
> --
> Best regards,
> Alexander Pyhalov,
> Postgres Professional

--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, 15 Jun 2021 at 21:24, wrote:
> Hmmm, I didn't think YB was necessary, but what do others think?

For me personally, without consulting Wikipedia, I know that Petabyte comes after Terabyte, and then I'm pretty sure it's Exabyte. After that, I'd need to check. Assuming I'm not the only person who can't tell exactly how many bytes are in a Yottabyte, would it actually be a readability improvement if we started showing these units to people?

I'd say there might be some argument to implement as far as PB one day, maybe not that far out into the future, especially if we got something like built-in clustering. But I just don't think there's any need to go all out and take it all the way to YB.

There's an above-zero chance we'll break something of someone's by doing this, so I think any changes here should be driven off an actual requirement. I really think this change is more likely to upset someone than please someone.

Just my thoughts.

David
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, 15 Jun 2021 at 05:24, wrote:
> >> I don't see the need to extend the unit to YB.
> >> What use case do you have in mind?
> >
> > Practical or no, I saw no reason not to support all defined units. I assume
> > we'll get to a need sooner or later. :)
>
> Thank you for your reply!
> Hmmm, I didn't think YB was necessary, but what do others think?

If I'm reading the code correctly, the difference between supporting YB and not supporting it is whether there is a line for it in the list of prefixes and their multiples. As such, I don't see why we're even discussing whether or not to include all the standard prefixes. A YB is still an absurd amount of storage, but that's not the point; just put in all the standard prefixes and be done with it. If actual code changes were required in the new code, as they are in the old, it might be worth discussing.

One question: why is there no "k" in the list of prefixes?

Also: why not have only the prefixes in the array, and use a single fixed output format "%s %sB" all the time? It feels like it should be possible to calculate the appropriate idx to use (while adjusting the number to print as is done now) and then just have one psprintf call for all cases.

A more significant question is YB vs. YiB. I know there is a long tradition within computer-related fields of saying that k = 1024, M = 1024^2, etc., but we're not special enough to override the more general principles of SI (Système International), which provide that k = 1000, M = 1000^2 and so on universally, and which provide the alternate prefixes ki, Mi, etc. that use 1024 as the multiple. So I would suggest we either display 2000000 as 2MB or as 1.907MiB.
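To make the size of that discrepancy concrete (simple arithmetic, not taken from the patch):

    2,000,000 B / 1000^2 = 2.000 MB      (SI prefix)
    2,000,000 B / 1024^2 ≈ 1.907 MiB     (binary prefix)

The gap widens with each step up the scale: 1 MiB is about 4.9% larger than 1 MB, while 1 YiB is about 20.9% larger than 1 YB (1.024^8 ≈ 1.209).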
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On 6/15/21 8:04 AM, Heikki Linnakangas wrote: > > Yeah, WAL-logging the contents of the source database would certainly > be less weird than the current system. As Julien also pointed out, the > question is, are there people using on "CREATE DATABASE foo TEMPLATE > bar" to copy a large source database, on the premise that it's fast > because it skips WAL-logging? I'm 100% certain there are. It's not even a niche case. > > In principle, we could have both mechanisms, and use the new > WAL-logged system if the database is small, and the old system with > checkpoints if it's large. But I don't like idea of having to maintain > both. > > Rather than use size, I'd be inclined to say use this if the source database is marked as a template, and use the copydir approach for anything that isn't. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: unnesting multirange data types
Alexander Korotkov writes: > Pushed! Thanks to thread participants for raising this topic and > review. I'll be around to resolve issues if any. Buildfarm is pretty thoroughly unhappy. Did you do a "check-world" before pushing? regards, tom lane
SSL/TLS instead of SSL in docs
In the NSS thread it was discussed (20210603210642.gf22...@momjian.us etc.) that we use SSL rather than TLS in the documentation, which is technically somewhat incorrect. Consensus came to using SSL/TLS instead for referring to encrypted connections. Since this isn't really limited to the NSS work, I'm breaking this out into a new thread.

Looking at the docs, it turns out that we have a mix of SSL (with one ssl), SSL/TLS and TLS for referring to the same thing. The attached changes the documentation to consistently use SSL/TLS when referring to an encrypted connection using a TLS protocol, leaving bare SSL and TLS only for referring to the actual protocols. I *think* I found all instances; there are many, so I might have missed some, but this version seemed like a good place to continue the discussion from the previous thread.

Admittedly it gets pretty unwieldy with the markup on SSL and TLS, but I opted for being consistent, since I don't know of any rules for when it can/should be omitted (and it seems quite arbitrary right now). Mentions in titles were previously not marked up, so I've left those as is. I've also left line breaks as an exercise for later to make the diff more readable.

While in there I added IMO missing items to the glossary and acronyms sections as well as fixed up markup around OpenSSL.

This only deals with docs, but if this is deemed interesting then user-facing messages in the code should use SSL/TLS as well, of course.

Thoughts?

--
Daniel Gustafsson        https://vmware.com/

v1-0003-docs-Consistent-OpenSSL-markup.patch
Description: Binary data

v1-0002-docs-Replace-usage-of-SSL-with-SSL-TLS.patch
Description: Binary data

v1-0001-docs-SSL-TLS-related-acronyms-and-glossary.patch
Description: Binary data
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Tue, Jun 15, 2021 at 7:05 AM Amit Kapila wrote:
> Okay, but I think if we go with your suggested model where whenever
> there is a change in parallel-safety of any function, we need to send
> the new invalidation then I think it won't matter whether the function
> is associated with index expression, check constraint in the table, or
> is used in any other way.

No, it will still matter, because I'm proposing that the parallel-safety of functions that we only access via operator classes will not even be checked. Also, if we decided to make the system more fine-grained - e.g. by invalidating on the specific OID of the function that was changed rather than doing something that is database-wide or global - then it matters even more.

> Yeah, dealing with partitioned tables is tricky. I think if we don't
> want to check upfront the parallel safety of all the partitions then
> the other option as discussed could be to ask the user to specify the
> parallel safety of partitioned tables.

Just to be clear here, I don't think it really matters what we *want* to do. I don't think it's reasonably *possible* to check all the partitions, because we don't hold locks on them. When we're assessing a bunch of stuff related to an individual relation, we have a lock on it. I think - though we should double-check tablecmds.c - that this is enough to prevent all of the dependent objects - triggers, constraints, etc. - from changing. So the stuff we care about is stable. But the situation with a partitioned table is different. In that case, we can't even examine that stuff without locking all the partitions. And even if we do lock all the partitions, the stuff could change immediately afterward and we wouldn't know. So I think it would be difficult to make it correct.

Now, maybe it could be done, and I think that's worth a little more thought. For example, perhaps whenever we invalidate a relation, we could also somehow send some new, special kind of invalidation for its parent saying, essentially, "hey, one of your children has changed in a way you might care about." But that's not good enough, because it only goes up one level. The grandparent would still be unaware that a change it potentially cares about has occurred someplace down in the partitioning hierarchy. That seems hard to patch up, again because of the locking rules. The child can know the OID of its parent without locking the parent, but it can't know the OID of its grandparent without locking its parent. Walking up the whole partitioning hierarchy might be an issue for a number of reasons, including possible deadlocks, and possible race conditions where we don't emit all of the right invalidations in the face of concurrent changes. So I don't quite see a way around this part of the problem, but I may well be missing something. In fact I hope I am missing something, because solving this problem would be really nice.

> We can additionally check the
> parallel safety of partitions when we are trying to insert into a
> particular partition and error out if we detect any parallel-unsafe
> clause and we are in parallel-mode. So, in this case, we won't be
> completely relying on the users. Users can either change the parallel
> safe option of the table or remove/change the parallel-unsafe clause
> after error. The new invalidation message as we are discussing would
> invalidate the parallel-safety for individual partitions but not the
> root partition (partitioned table itself). For root partition, we will
> rely on information specified by the user.

Yeah, that may be the best we can do. Just to be clear, I think we would want to check whether the relation is still parallel-safe at the start of the operation, but not have a run-time check at each function call.

> I am not sure if we have a simple way to check the parallel safety of
> partitioned tables. In some way, we need to rely on user either (a) by
> providing an option to specify whether parallel Inserts (and/or other
> DMLs) can be performed, or (b) by providing a guc and/or rel option
> which indicate that we can check the parallel-safety of all the
> partitions. Yet another option that I don't like could be to
> parallelize inserts on non-partitioned tables.

If we figure out a way to check the partitions automatically that actually works, we don't need a switch for it; we can (and should) just do it that way all the time. But if we can't come up with a correct algorithm for that, then we'll need to add some kind of option where the user declares whether it's OK.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan wrote: > > Rather than use size, I'd be inclined to say use this if the source > database is marked as a template, and use the copydir approach for > anything that isn't. Looks like a good approach.
Confused by the default privilege
Hi,

My use case is to create an isolated interface schema consisting of only views and functions (possibly many schemas, for multi-tenancy or multi-version), which has minimal access exposure. To reduce the mental and maintenance burden, I am inclined to create one role per interface schema, instead of creating separate roles for the owner and the user. As a consequence, the default privileges must be revoked from the owner.

Explicit revocation works just fine, except that it requires repetitive and forgettable statements for each object in the schema. The default privileges come to the rescue. It mostly works, despite a bit of confusion to me. The ending contents are some experiments and demonstrations.

To sum up, I have to either leave some non-critical privileges (e.g., trigger, references) to the owner via the default privilege mechanism, or manually revoke all privileges, to stop the owner having all the default privileges. Plus, the first alternative is not applicable to functions because there is only one privilege for functions (execute). To me, it is confusing and less intuitive. Or is there something I miss?

TL;DR: Revoking all default privileges is effectively equivalent to revoking nothing, because an empty string of access privileges is handled as 'default'. Maybe 'NULL' should mean 'default', and '' (empty string) should mean nothing?

Regards.

--

drop owned by owner;
drop role if exists owner, guest;

create role owner;
create role guest;

drop schema if exists s;
create schema if not exists s authorization owner;

DROP OWNED
DROP ROLE
CREATE ROLE
CREATE ROLE
DROP SCHEMA
CREATE SCHEMA

1. tables

1.1. no-op

set role to owner;
create or replace view s.v1 as select 1;

\dp+ s.v1
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v1   | view |                   |                   |

select * from information_schema.role_table_grants where table_name='v1';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v1         | INSERT         | YES          | NO
 owner   | owner   | postgres      | s            | v1         | SELECT         | YES          | YES
 owner   | owner   | postgres      | s            | v1         | UPDATE         | YES          | NO
 owner   | owner   | postgres      | s            | v1         | DELETE         | YES          | NO
 owner   | owner   | postgres      | s            | v1         | TRUNCATE       | YES          | NO
 owner   | owner   | postgres      | s            | v1         | REFERENCES     | YES          | NO
 owner   | owner   | postgres      | s            | v1         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v1;
 ?column?
        1

1.2. default privilege: revoke all from owner

alter default privileges for user owner revoke all on tables from owner;

\ddp+
 Owner | Schema | Type  | Access privileges
 owner |        | table |

set role to owner;
create or replace view s.v2 as select 1;

\dp+ s.v2
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v2   | view |                   |                   |

select * from information_schema.role_table_grants where table_name='v2';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v2         | INSERT         | YES          | NO
 owner   | owner   | postgres      | s            | v2         | SELECT         | YES          | YES
 owner   | owner   | postgres      | s            | v2         | UPDATE         | YES          | NO
 owner   | owner   | postgres      | s            | v2         | DELETE         | YES          | NO
 owner   | owner   | postgres      | s            | v2         | TRUNCATE       | YES          | NO
 owner   | owner   | postgres      | s            | v2         | REFERENCES     | YES          | NO
 owner   | owner   | postgres      | s            | v2         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v2;
 ?column?
        1

1.3. default privilege: revoke all but one from owner

alter default privileges for user owner revoke all on tables from owner;
alter default privileges for user owner grant trigger on tables to owner;

\ddp+
 Owner | Schema | Type  | Access privileges
 owner |        | table | owner=t/owner

set role to owner;
create or replace view s.v3 as select 1;

\dp+ s.v3
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v3   | view | owner=t/owner     |                   |

select * from information_schema.role_table_grants where table_name='v3';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v3         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v3;
ERROR:  42501: permission denied for view v3
LOCATION:  aclcheck_error, aclchk.c:3461

1.4. manual revoke all from owner

alter default privileges for user owner revoke all on tables from owner;

\ddp+
 Owner | Schema | Type  | Access privileges
 owner |        | table |

set role to owner;
create or replace view s.v4 as select 1;

\dp+ s.v4
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v4   | view |                   |                   |

select * from information_schema.role_table_grants where table_name='v4';
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
 owner   | owner   | postgres      | s            | v4         | INSERT         | YES          | NO
 owner   | owner   | postgres      | s            | v4         | SELECT         | YES          | YES
 owner   | owner   | postgres      | s            | v4         | UPDATE         | YES          | NO
 owner   | owner   | postgres      | s            | v4         | DELETE         | YES          | NO
 owner   | owner   | postgres      | s            | v4         | TRUNCATE       | YES          | NO
 owner   | owner   | postgres      | s            | v4         | REFERENCES     | YES          | NO
 owner   | owner   | postgres      | s            | v4         | TRIGGER        | YES          | NO

set role to owner;
select * from s.v4;
 ?column?
        1

So far, the situation is identical to s.v2.

set role to owner;
revoke all on table s.v4 from owner;

\dp+ s.v4
 Schema | Name | Type | Access privileges | Column privileges | Policies
 s      | v4   | view |                   |                   |

select
Re: Confused by the default privilege
Gee, I pasted the ending demonstration as html. Re-pasting a text version. -- ┌ │ drop owned by owner; │ drop role if exists owner, guest; │ │ create role owner; │ create role guest; │ │ drop schema if exists s; │ create schema if not exists s authorization owner; └ DROP OWNED DROP ROLE CREATE ROLE CREATE ROLE DROP SCHEMA CREATE SCHEMA 1 tables 1.1 no-op ┌ │ set role to owner; │ create or replace view s.v1 as select 1; └ ┌ │ \dp+ s.v1 └ Schema Name Type Access privileges Column privileges Policies s v1view ┌ │ select * from information_schema.role_table_grants where table_name='v1'; └ ━ grantor grantee table_catalog table_schema table_name privilege_type is_grantable with_hierarchy ─ ownerownerpostgres s v1 INSERT YES NO ownerownerpostgres s v1 SELECT YES YES ownerownerpostgres s v1 UPDATE YES NO ownerownerpostgres s v1 DELETE YES NO ownerownerpostgres s v1 TRUNCATE YES NO ownerownerpostgres s v1 REFERENCES YES NO ownerownerpostgres s v1 TRIGGER YES NO ━ ┌ │ set role to owner; │ select * from s.v1; └ ━━ ?column? ── 1 ━━ 1.2 default privilege: `revoke all from owner' ─── ┌ │ alter default privileges for user owner revoke all on tables from owner; │ \ddp+ └ ━ Owner Schema Type Access privileges ─ owner table ━ ┌ │ set role to owner; │ create or replace view s.v2 as select 1; └ ┌ │ \dp+ s.v2 └ Schema Name Type Access privileges Column privileges Policies s v2view ┌ │ select * from information_schema.role_table_grants where table_name='v2'; └ ━ grantor grantee table_catalog table_schema table_name privilege_type is_grantable with_hierarchy ─ ownerownerpostgres s v2 INSERT YES NO ownerownerpostgres s v2 SELECT YES YES ownerownerpostgres s v2 UPDATE YES NO ownerownerpostgres s v2 DELETE YES NO ownerownerpostgres s v2 TRUNCATE YES NO ownerownerpostgres s v2 REFERENCES YES NO ownerownerpostgres s v2 TRIGGER YES NO ━ ┌ │ set role to owner; │ select * from s.v2; └ ━━ ?column? ── 1 ━━ 1.3 default privilege: `revoke all but one from owner' ─── ┌ │ alter default privileges for user owner revoke all on tables from owner; │ alter default privileges for user owner grant trigger on tables to owner; │ \ddp+ └ ━ Owner Schema Type Access privileges ─ owner table owner=t/owner ━ ┌ │ set role to owner; │ create or replace view s.v3 as select 1; └ ┌ │ \dp+ s.v3 └ Schema Name Type Access privileges Column privileges Policies ─
Re: Fix around conn_duration in pgbench
On Mon, 14 Jun 2021 10:57:07 +0200 (CEST) Fabien COELHO wrote: > > However, I found that conn_duration is calculated even when -C/--connect > > is not specified, which is wasteful. So we can remove this code as fixed in > > the attached patch. > > I'm fine with the implied code simplification, but it deserves a comment. Thank you for adding comments! > > In addition, disconnection delays are not cumulated even under -C/--connect > > despite what the comment says. I also fixed this in the attached > > patch. > > I'm fine with that, even if it only concerns is_connect. I've updated the > command to work whether now is initially set or not. Ok. I agree with your update. > Also, there is the issue of connection failures: the attached version adds > an error message and exit for initial connections consistently. > This is not done with is_connect, though, and I'm unsure what we should > really do. Well, as to connection failures, I think we should discuss them in the other thread [1] where this issue was originally raised, or in a new thread, because this can be treated as a separate issue from the originally proposed patch. [1] https://www.postgresql.org/message-id/flat/TYCPR01MB5870057375ACA8A73099C649F5349%40TYCPR01MB5870.jpnprd01.prod.outlook.com. Regards, Yugo Nagata -- Yugo NAGATA
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Mon, Jun 14, 2021 at 9:08 PM Robert Haas wrote: > > On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila wrote: > > > Yeah, this could be one idea but I think even if we use pg_proc OID, > > we still need to check all the rel cache entries to find which one > > contains the invalidated OID and that could be expensive. I wonder > > can't we directly find the relation involved and register invalidation > > for the same? We are able to find the relation to which trigger > > function is associated during drop function via findDependentObjects > > by scanning pg_depend. Assuming, we are able to find the relation for > > trigger function by scanning pg_depend, what kinds of problems do we > > envision in registering the invalidation for the same? > > I don't think that finding the relation involved and registering an > invalidation for the same will work properly. Suppose there is a > concurrently-running transaction which has created a new table and > attached a trigger function to it. You can't see any of the catalog > entries for that relation yet, so you can't invalidate it, but > invalidation needs to happen. Even if you used some snapshot that can > see those catalog entries before they are committed, I doubt it fixes > the race condition. You can't hold any lock on that relation, because > the creating transaction holds AccessExclusiveLock, but the whole > invalidation mechanism is built around the assumption that the sender > puts messages into the shared queue first and then releases locks, > while the receiver first acquires a conflicting lock and then > processes messages from the queue. > Won't such messages be proceesed at start transaction time (AtStart_Cache->AcceptInvalidationMessages)? > Without locks, that synchronization > algorithm can't work reliably. As a consequence of all that, I believe > that, not just in this particular case but in general, the > invalidation message needs to describe the thing that has actually > changed, rather than any derived property. We can make invalidations > that say "some function's parallel-safety flag has changed" or "this > particular function's parallel-safety flag has changed" or "this > particular function has changed in some way" (this one, we have > already), but anything that involves trying to figure out what the > consequences of such a change might be and saying "hey, you, please > update XYZ because I changed something somewhere that could affect > that" is not going to be correct. > > > I think we probably need to worry about the additional cost to find > > dependent objects and if there are any race conditions in doing so as > > pointed out by Tom in his email [1]. The concern related to cost could > > be addressed by your idea of registering such an invalidation only > > when the user changes the parallel safety of the function which we > > don't expect to be a frequent operation. Now, here the race condition, > > I could think of could be that by the time we change parallel-safety > > (say making it unsafe) of a function, some of the other sessions might > > have already started processing insert on a relation where that > > function is associated via trigger or check constraint in which case > > there could be a problem. I think to avoid that we need to acquire an > > Exclusive lock on the relation as we are doing in Rename Policy kind > > of operations. > > Well, the big issue here is that we don't actually lock functions > while they are in use. 
So there's absolutely nothing that prevents a > function from being altered in any arbitrary way, or even dropped, > while code that uses it is running. I don't really know what happens > in practice if you do that sort of thing: can you get the same query > to run with one function definition for the first part of execution > and some other definition for the rest of execution? I tend to doubt > it, because I suspect we cache the function definition at some point. > It is possible that in the same statement execution a different function definition can be executed. Say, in session-1 we are inserting three rows, on first-row execution definition-1 of function in index expression gets executed. Now, from session-2, we change the definition of the function to definition-2. Now, in session-1, on second-row insertion, while executing definition-1 of function, we insert in another table that will accept the invalidation message registered in session-2. Now, on third-row insertion, the new definition (definition-2) of function will be executed. > If that's the case, caching the parallel-safety marking at the same > point seems OK too, or at least no worse than what we're doing > already. But on the other hand if it is possible for a query's notion > of the function definition to shift while the query is in flight, then > this is just another example of that and no worse than any other. > Instead of changing the parallel-safety flag, somebody could redefine > the function so that it divides by zero or pro
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, Jun 15, 2021 at 8:31 AM Isaac Morland wrote: > On Tue, 15 Jun 2021 at 05:24, wrote: > >> >> I don't see the need to extend the unit to YB. >> >> What use case do you have in mind? >> > >> >Practical or no, I saw no reason not to support all defined units. I >> assume we’ll >> >get to a need sooner or later. :) >> >> Thank you for your reply! >> Hmmm, I didn't think YB was necessary, but what do others think? >> > > If I’m reading the code correctly, the difference between supporting YB > and not supporting it is whether there is a line for it in the list of > prefixes and their multiples. As such, I don’t see why we’re even > discussing whether or not to include all the standard prefixes. A YB is > still an absurd amount of storage, but that’s not the point; just put all > the standard prefixes and be done with it. If actual code changes were > required in the new code as they are in the old it might be worth > discussing. > Agreed, this is why I went this way. One and done. > One question: why is there no “k” in the list of prefixes? > kB has a special-case code block before you get to this point. I didn't look into the reasons, but assume there are some. > Also: why not have only the prefixes in the array, and use a single fixed > output format "%s %sB" all the time? > > It feels like it should be possible to calculate the appropriate idx to > use (while adjusting the number to print as is done now) and then just have > one psprintf call for all cases. > Sure, if that seems more readable/understandable. > A more significant question is YB vs. YiB. I know there is a long > tradition within computer-related fields of saying that k = 1024, M = > 1024^2, etc., but we’re not special enough to override the more general > principles of SI (Système International) which provide that k = 1000, M = > 1000^2 and so on universally and provide the alternate prefixes ki, Mi, > etc. which use 1024 as the multiple. > > So I would suggest either display 200 as 2MB or as 1.907MiB. > Heh, I was just expanding the existing logic; if others want to have this particular battle go ahead and I'll adjust the code/prefixes, but obviously the logic will need to change if we want to support true MB instead of MiB as MB. Also, this will presumably be a breaking change for anyone using the existing units MB == 1024 * 1024, as we've had for something like 20 years. Changing these units to the *iB will be trivial with this patch, but not looking forward to garnering the consensus to change this part. David
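As a side note for readers following the refactoring discussion: the "only prefixes in the array, one format string" idea can be sketched in a few lines. The following is a standalone toy with invented names and naive rounding, not the backend function or the proposed patch (which operate on int64/numeric and keep pg_size_pretty()'s half-rounding rules):

#include <stdio.h>
#include <stdint.h>

/*
 * Unit suffixes only; each step is 1024 times the previous one.
 * Note: with a 64-bit input the last two entries can never be reached.
 */
static const char *const units[] = {
    "bytes", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
};

static void
format_size(uint64_t size, char *buf, size_t buflen)
{
    double      val = (double) size;
    int         idx = 0;
    int         last = (int) (sizeof(units) / sizeof(units[0])) - 1;

    /* move to the next unit while the number is still unwieldy */
    while (val >= 10240.0 && idx < last)
    {
        val /= 1024.0;
        idx++;
    }

    /* one fixed output format for every unit */
    snprintf(buf, buflen, "%.0f %s", val, units[idx]);
}

int
main(void)
{
    char        buf[64];

    format_size(UINT64_C(123456789012), buf, sizeof(buf));
    puts(buf);                  /* prints "115 GB" with this toy rounding */
    return 0;
}

With this shape, the YB-or-not and MB-vs-MiB questions only change the contents of the units[] array (and the divisor), not the control flow, which is the attraction of the table-driven version.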
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Tue, Jun 15, 2021 at 8:26 AM David Rowley wrote: > On Tue, 15 Jun 2021 at 21:24, wrote: > > Hmmm, I didn't think YB was necessary, but what do others think? > > For me personally, without consulting Wikipedia, I know that Petabyte > comes after Terabyte and then I'm pretty sure it's Exabyte. After > that, I'd need to check. > > Assuming I'm not the only person who can't tell exactly how many bytes > are in a Yottabyte, would it actually be a readability improvement if > we started showing these units to people? > I hadn't really thought about that TBH; to me it seemed like an improvement, but I do see that others might not, and adding confusion is definitely not helpful. That said, it seems like having the code structured in a way that we can expand via adding an element to a table instead of the existing way it's written with nested if blocks is still a useful refactor, whatever we decide the cutoff units should be. > I'd say there might be some argument to implement as far as PB one > day, maybe not that far out into the future, especially if we got > something like built-in clustering. But I just don't think there's any > need to go all out and take it all the way to YB. There's an above > zero chance we'll break something of someones by doing this, so I > think any changes here should be driven off an actual requirement. > I got motivated to do this due to some (granted synthetic) work/workloads, where I was seeing 6+digit TB numbers and thought it was ugly. Looked at the code and thought the refactor was the way to go, and just stuck all of the known units in. > I really think this change is more likely to upset someone than please > someone. > I'd be interested to see reactions from people; to me, it seems a +1, but seems like -1, 0, +1 all valid opinions here; I'd expect more 0's and +1s, but I'm probably biased since I wrote this. :-)
Re: pg_stat_statements and "IN" conditions
> On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote: > > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote: > > On 1/5/21 10:51 AM, Zhihong Yu wrote: > > > > > > + int lastExprLenght = 0; > > > > > > Did you mean to name the variable lastExprLenghth ? > > > > > > w.r.t. extracting to helper method, the second and third > > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar. > > > It is up to you whether to create the helper method. > > > I am fine with the current formation. > > > > Dmitry, thoughts on this review? > > Oh, right. lastExprLenghth is obviously a typo, and as we agreed that > the helper is not strictly necessary I wanted to wait a bit hoping for > more feedback and eventually to post an accumulated patch. Doesn't make > sense to post another version only to fix one typo :) Hi, I've prepared a new rebased version to deal with the new way of computing query id, but as always there is one tricky part. From what I understand, now an external module can provide custom implementation for query id computation algorithm. It seems natural to think this machinery could be used instead of patch in the thread, i.e. one could create a custom logic that will enable constants collapsing as needed, so that same queries with different number of constants in an array will be hashed into the same record. But there is a limitation in how such queries will be normalized afterwards — to reduce level of surprise it's necessary to display the fact that a certain query in fact had more constants that are showed in pgss record. Ideally LocationLen needs to carry some bits of information on what exactly could be skipped, and generate_normalized_query needs to understand that, both are not reachable for an external module with custom query id logic (without replicating significant part of the existing code). Hence, a new version of the patch. >From 81a217485385452c7223a140dd4c98eeb5270945 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Thu, 10 Jun 2021 13:15:35 +0200 Subject: [PATCH v4] Prevent jumbling of every element in ArrayExpr pg_stat_statements produces multiple entries for queries like SELECT something FROM table WHERE col IN (1, 2, 3, ...) depending on number of parameters, because every element of ArrayExpr is jumbled. Make Consts (or any expression that could be reduced to a Const) contribute nothing to the jumble hash if they're part of a series and at position further that specified threshold. Do the same for similar queries with VALUES as well. --- .../expected/pg_stat_statements.out | 835 +- .../pg_stat_statements/pg_stat_statements.c | 42 +- .../sql/pg_stat_statements.sql| 163 src/backend/utils/misc/guc.c | 13 + src/backend/utils/misc/queryjumble.c | 274 +- src/include/utils/queryjumble.h | 11 +- 6 files changed, 1323 insertions(+), 15 deletions(-) diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out index 40b5109b55..3fc1978066 100644 --- a/contrib/pg_stat_statements/expected/pg_stat_statements.out +++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out @@ -205,7 +205,7 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C"; INSERT INTO test VALUES(generate_series($1, $2), $3) | 1 | 10 SELECT * FROM test ORDER BY a| 1 | 12 SELECT * FROM test WHERE a > $1 ORDER BY a | 2 |4 - SELECT * FROM test WHERE a IN ($1, $2, $3, $4, $5) | 1 |8 + SELECT * FROM test WHERE a IN ($1, $2, $3, $4, ...) 
| 1 |8 SELECT pg_stat_statements_reset()| 1 |1 SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 |0 UPDATE test SET b = $1 WHERE a = $2 | 6 |6 @@ -1067,4 +1067,837 @@ SELECT COUNT(*) FROM pg_stat_statements WHERE query LIKE '%SELECT GROUPING%'; 2 (1 row) +-- +-- Consts merging +-- +SET pg_stat_statements.merge_threshold = 5; +CREATE TABLE test_merge (id int, data int); +-- IN queries +-- Normal +SELECT pg_stat_statements_reset(); + pg_stat_statements_reset +-- + +(1 row) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3); + id | data ++-- +(0 rows) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT * FROM test_merge WHERE id IN ($1, $2, $3) | 1 + SELECT pg_stat_statements_reset() | 1
Re: Use singular number when appropriate
On Tue, Jun 15, 2021 at 09:37:11AM +0200, Laurenz Albe wrote: > On Tue, 2021-06-15 at 04:59 +, David Fetter wrote: > > I thought about using the dual, but wasn't sure how many languages > > support it. > > I think none of the languages for which we cater uses the dual. But > I guess you were joking, since the tests are not translated ... I was. > > if (fail_count == 0 && fail_ignore_count == 0) > > snprintf(buf, sizeof(buf), > > -_(" All %d tests passed. "), > > -success_count); > > +_(" %s %d test%s passed. "), > > +success_count == 1 ? "The" : "All", > > +success_count, > > +success_count == 1 ? "" : "s"); > > ... and that wouldn't be translatable. Thanks, will rearrange. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
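For the archives: if these messages are (or ever become) subject to translation, the usual way to get the singular right without splicing English fragments is gettext's plural API. A hedged, untested sketch of what the hunk could look like:

    /*
     * Sketch only (untested): ngettext() lets each language supply its own
     * plural forms, so the translator sees two complete sentences instead
     * of assembled fragments.  The original message's space padding is
     * omitted here.
     */
    if (fail_count == 0 && fail_ignore_count == 0)
        snprintf(buf, sizeof(buf),
                 ngettext(" %d test passed. ",
                          " All %d tests passed. ",
                          success_count),
                 success_count);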
Re: Duplicate history file?
Greetings, * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > At Fri, 11 Jun 2021 16:08:33 +0900, Michael Paquier > wrote in > > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote: > > > I think cp can be an example as far as we explain the limitations. (On > > > the other hand "test !-f" cannot since it actually prevents server > > > from working correctly.) > > > > Disagreed. I think that we should not try to change this area until > > we can document a reliable solution, and a simple "cp" is not that. > > Isn't removing cp from the documentation a change in this area? I > basically agree to not to change anything but the current example > "test ! -f && cp .." and relevant description has been known to > be problematic in a certain situation. [...] > - Write the full (known) requirements and use a pseudo tool-name in > the example? I'm generally in favor of just using a pseudo tool-name and then perhaps providing a link to a new place on .Org where people can ask to have their PG backup solution listed, or something along those lines. > - provide a minimal implement of the command? Having been down this road for a rather long time, I can't accept this as a serious suggestion. No, not even with Perl. Been there, done that, not going back. > - recommend some external tools (that we can guarantee that they >comform the requriements)? The requirements are things which are learned over years and changes over time. Trying to document them and keep up with them would be a pretty serious project all on its own. There are external projects who spend serious time and energy doing their best to provide the tooling needed here and we should be promoting those, not trying to pretend like this is a simple thing which anyone could write a short perl script to accomplish. > - not recommend any tools? This is the approach that has been tried and it's, objectively, failed miserably. Our users are ending up with invalid and unusable backups, corrupted WAL segments, inability to use PITR, and various other issues because we've been trying to pretend that this isn't a hard problem. We really need to stop that and accept that it's hard and promote the tools which have been explicitly written to address that hard problem. > > Hmm. A simple command that could be used as reference is for example > > "dd" that flushes the file by itself, or we could just revisit the > > discussions about having a pg_copy command, or we could document a > > small utility in perl that does the job. > > I think we should do that if pg_copy comforms the mandatory > requirements but maybe it's in the future. Showing the minimal > implement in perl looks good. Already tried doing it in perl. No, it's not simple and it's also entirely vaporware today and implies that we're going to develop this tool, improve it in the future as we realize it needs to be improved, and maintain it as part of core forever. If we want to actually adopt and pull in a backup tool to be part of core then we should talk about things which actually exist, such as the various existing projects that have been written to specifically work to address all the requirements which are understood today, not say "well, we can just write a simple perl script to do it" because it's not actually that simple. Providing yet another half solution would be doubling-down on the failed approach to document a "simple" solution and would be a disservice to our users. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] expand the units that pg_size_pretty supports on output
On Wed, 16 Jun 2021 at 02:58, David Christensen wrote: > That said, it seems like having the code structured in a way that we can > expand via adding an element to a table instead of the existing way it's > written with nested if blocks is still a useful refactor, whatever we decide > the cutoff units should be. I had not really looked at the patch, but if there's a cleanup portion to the same patch as you're adding the YB too, then maybe it's worth separating those out into another patch so that the two can be considered independently. David
Re: Delegating superuser tasks to new security roles
Greetings, * torikoshia (torikos...@oss.nttdata.com) wrote: > On 2021-06-14 23:53, Mark Dilger wrote: > >>On Jun 14, 2021, at 5:51 AM, torikoshia > >>wrote: > >>BTW, do these patches enable non-superusers to create user with > >>bypassrls? [...] > >Do you believe that functionality should be added? I have not thought > >much about that issue. > > I just noticed that because I was looking into operations that can only be > done by superusers. In general, I agree with the sentiment that we should be providing a way to have non-superusers able to do things that only a superuser can do today. I'd love to get rid of all of the explicit superuser checks in the backend except the one that makes a superuser a member of all roles. Thanks, Stephen signature.asc Description: PGP signature
Should wal receiver reply to wal sender more aggressively?
While working on some related issues I found that the wal receiver runs the walrcv_receive() loop to completion before replying with the write/flush/apply LSNs to the wal sender in XLogWalRcvSendReply(). It is possible for the walrcv_receive() loop to receive and write a lot of WAL, so it does not report those LSNs in time, which ultimately slows down transactions due to syncrep wait (assuming the default synchronous_commit). In my TPCB testing, the worst case I found was 10,466,469 bytes consumed in the walrcv_receive() loop. More seriously, we call XLogWalRcvSendReply(false, false) only after handling those bytes; the first argument false means no force, i.e. it notifies only if the maximum time given by the GUC wal_receiver_status_interval (10s by default) has been reached, so we may have to wait for later calls of XLogWalRcvSendReply() to notify the wal sender.

I thought about and tried enhancing this by force-replying to the wal sender each time a maximum number of bytes (e.g. 128K) has been received, but several things confused me:

- What's the purpose of the GUC wal_receiver_status_interval? The OS kernel is usually not efficient when handling small packets, but we are not replying that aggressively, so why is this GUC there?

- I ran a simple TPCB workload (1000 scaling, 200 connections, shared_buffers and max_connections tuned) but found no obvious performance difference with and without the code change. I did not see an obvious system (IO/CPU/network) bottleneck - probably the bottleneck is in PG itself. I have not investigated further at this moment, but the change should in theory help, no?

Another thing that came to my mind is the wal receiver logic: currently the wal receiver process does network IO, WAL write, and WAL flush in one process. Network IO is async, blocking at epoll/poll, and WAL write is mostly non-blocking, but the WAL flush could probably be decoupled into a dedicated process. And maybe use sync_file_range instead of WAL file fsync in issue_xlog_fsync()? We should sync the WAL contents with lower LSNs first and reply to the wal sender in time, right?

Below is the related code:

    /* See if we can read data immediately */
    len = walrcv_receive(wrconn, &buf, &wait_fd);
    if (len != 0)
    {
        /*
         * Process the received data, and any subsequent data we
         * can read without blocking.
         */
        for (;;)
        {
            if (len > 0)
            {
                /*
                 * Something was received from primary, so reset
                 * timeout
                 */
                last_recv_timestamp = GetCurrentTimestamp();
                ping_sent = false;
                XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            }
            else if (len == 0)
                break;
            else if (len < 0)
            {
                ereport(LOG,
                        (errmsg("replication terminated by primary server"),
                         errdetail("End of WAL reached on timeline %u at %X/%X.",
                                   startpointTLI,
                                   LSN_FORMAT_ARGS(LogstreamResult.Write))));
                endofwal = true;
                break;
            }
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* Let the primary know that we received some data. */
        XLogWalRcvSendReply(false, false);

        /*
         * If we've written some records, flush them to disk and
         * let the startup process and primary server know about
         * them.
         */
        XLogWalRcvFlush(false);

-- Paul Guo (Vmware)
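For what it's worth, the smallest version of the change described above is probably just to count the bytes handled inside that loop and force a reply (and possibly a flush) once a threshold is crossed. A rough, untested sketch, where bytes_since_reply and WALRCV_REPLY_THRESHOLD are invented names:

    /*
     * Sketch only -- untested.  Inside the "if (len > 0)" branch of the
     * loop quoted above, right after XLogWalRcvProcessMsg():
     */
    bytes_since_reply += len;
    if (bytes_since_reply >= WALRCV_REPLY_THRESHOLD)    /* e.g. 128 kB */
    {
        /*
         * For remote-flush synchronous commit it is the flush LSN that
         * matters, so flush before reporting; for remote-write the
         * reply alone already helps.
         */
        XLogWalRcvFlush(false);
        XLogWalRcvSendReply(true, false);   /* force an immediate reply */
        bytes_since_reply = 0;
    }

Whether something like this actually moves the needle presumably depends on where the real bottleneck is, which matches the observation above that the TPCB run showed no measurable difference.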
Re: unnesting multirange data types
On Tue, Jun 15, 2021 at 4:49 PM Tom Lane wrote: > Alexander Korotkov writes: > > Pushed! Thanks to thread participants for raising this topic and > > review. I'll be around to resolve issues if any. > > Buildfarm is pretty thoroughly unhappy. Did you do a "check-world" > before pushing? Yes, I'm looking at this now. I did run "check-world", but it passed for me. Probably the same reason it passed for some buildfarm animals... -- Regards, Alexander Korotkov
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 07:53:26AM +0200, Peter Eisentraut wrote: > On 15.06.21 07:37, Michael Paquier wrote: > > > > Actually, I was just thinking that default yes/no/on/off stuff maybe > > > > should be > > > > defined to mean "lz4" rather than meaning pglz for "backwards > > > > compatibility". > > > +1 > > I am not sure that we have any reasons to be that aggressive about > > this one either, and this would mean that wal_compression=on implies a > > different method depending on the build options. I would just stick > > with the past, careful practice that we have to use a default > > backward-compatible value as default, while being able to use a new > > option. You're right, I hadn't though this through all the way. There's precedent if the default is non-static (wal_sync_method). But I think yes/on/true/1 should be a compatibility alias for a static thing, and then the only option is pglz. Of course, changing the default to LZ4 is still a possibility. > If we think this new thing is strictly better than the old thing, then why > not make it the default. What would be the gain from sticking to the old > default? > > The point that the default would depend on build options is a valid one. > I'm not sure whether it's important enough by itself.
Re: a path towards replacing GEQO with something better
On Wed, Jun 9, 2021 at 9:24 PM John Naylor wrote: > 3) It actually improves the existing exhaustive search, because the > complexity of the join order problem depends on the query shape: a "chain" > shape (linear) vs. a "star" shape (as in star schema), for the most common > examples. The size of the DP table grows like this (for n >= 4): > > Chain: (n^3 - n) / 6 (including bushy plans) > Star: (n - 1) * 2^(n - 2) > > n chain star > > 4 10 12 > 5 20 32 > 6 35 80 > 7 56192 > 8 84448 > 9120 1024 > 10165 2304 > 11220 5120 > 12286 11264 > 13364 24576 > 14455 53248 > 15560 114688 > ... > 64 43680 290536219160925437952 I don't quite understand the difference between the "chain" case and the "star" case. Can you show sample queries for each one? e.g. SELECT ... FROM a_1, a_2, ..., a_n WHERE ? One idea I just ran across in https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf is to try to economize by skipping consideration of bushy plans. We could start doing that when some budget is exceeded, similar to what you are proposing here, but probably the budget for skipping consideration of bushy plans would be smaller than the budget for switching to IDP. The idea of skipping bushy plan generation in some cases makes sense to me intuitively because most of the plans PostgreSQL generates are mostly left-deep, and even when we do generate bushy plans, they're not always a whole lot better than the nearest equivalent left-deep plan. The paper argues that considering bushy plans makes measurable gains, but also that failure to consider such plans isn't catastrophically bad. -- Robert Haas EDB: http://www.enterprisedb.com
change logging defaults
I propose to change some defaults:

log_autovacuum_min_duration = 0
log_checkpoints = on
log_lock_waits = on (and log_recovery_conflict_waits too?)
log_temp_files = 128kB

Note that pg_regress does this:

| fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
| fputs("log_autovacuum_min_duration = 0\n", pg_conf);
| fputs("log_checkpoints = on\n", pg_conf);
| fputs("log_line_prefix = '%m %b[%p] %q%a '\n", pg_conf);
| fputs("log_lock_waits = on\n", pg_conf);
| fputs("log_temp_files = 128kB\n", pg_conf);
| fputs("max_prepared_transactions = 2\n", pg_conf);

-- Justin
Re: Duplicate history file?
On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote: > > The requirements are things which are learned over years and changes > over time. Trying to document them and keep up with them would be a > pretty serious project all on its own. There are external projects who > spend serious time and energy doing their best to provide the tooling > needed here and we should be promoting those, not trying to pretend like > this is a simple thing which anyone could write a short perl script to > accomplish. The fact that this is such a complex problem is the very reason why we should spend a lot of energy documenting the various requirements. Otherwise, how could anyone implement a valid program for that and how could anyone validate that a solution claiming to do its job actually does its job? > Already tried doing it in perl. No, it's not simple and it's also > entirely vaporware today and implies that we're going to develop this > tool, improve it in the future as we realize it needs to be improved, > and maintain it as part of core forever. If we want to actually adopt > and pull in a backup tool to be part of core then we should talk about > things which actually exist, such as the various existing projects that > have been written to specifically work to address all the requirements > which are understood today, not say "well, we can just write a simple > perl script to do it" because it's not actually that simple. Adopting a full backup solution seems like a bit extreme. On the other hand, having some real core implementation of an archive_command for the most general use cases (local copy, distant copy over ssh...) could make sense. This would remove that burden for some, probably most, of the 3rd party backup tools, and would also ensure that the various requirements are properly documented since it would be the implementation reference.
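To make the "reference implementation" idea concrete (and to illustrate why a plain cp is not enough), even the durability subset alone involves several steps: copy to a temporary name, fsync the file, rename it into place, and fsync the containing directory. Below is a minimal standalone sketch of just that subset, with invented names and error handling reduced to the bare minimum. It is deliberately not a complete archive_command: it does not deal with pre-existing archive files, partial copies left by earlier crashes, retries, or verification, which is exactly the kind of accumulated requirement the dedicated backup tools handle.

#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
durable_copy(const char *src, const char *dst)
{
    char        tmp[1024];
    char        dstdir[1024];
    char        buf[65536];
    ssize_t     nread;
    int         in, out, dirfd;

    snprintf(tmp, sizeof(tmp), "%s.tmp", dst);

    if ((in = open(src, O_RDONLY)) < 0)
        return -1;
    if ((out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
    {
        close(in);
        return -1;
    }

    while ((nread = read(in, buf, sizeof(buf))) > 0)
    {
        if (write(out, buf, (size_t) nread) != nread)
            goto fail;
    }
    if (nread < 0)
        goto fail;

    /* contents must reach disk before the file appears under its final name */
    if (fsync(out) < 0)
        goto fail;
    close(in);
    close(out);

    if (rename(tmp, dst) < 0)
        return -1;

    /* the rename itself must be made durable as well */
    strncpy(dstdir, dst, sizeof(dstdir) - 1);
    dstdir[sizeof(dstdir) - 1] = '\0';
    if ((dirfd = open(dirname(dstdir), O_RDONLY)) < 0)
        return -1;
    if (fsync(dirfd) < 0)
    {
        close(dirfd);
        return -1;
    }
    close(dirfd);
    return 0;

fail:
    close(in);
    close(out);
    unlink(tmp);
    return -1;
}

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <source-wal-file> <archive-destination>\n", argv[0]);
        return 1;
    }
    return durable_copy(argv[1], argv[2]) == 0 ? 0 : 1;
}

Whether something along these lines belongs in core, or whether pointing users at the existing tools is the better answer, is of course the question being debated here.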
Re: Case expression pushdown
Hi. Ashutosh Bapat писал 2021-06-15 16:24: Looks quite useful to me. Can you please add this to the next commitfest? Addded to commitfest. Here is an updated patch version. -- Best regards, Alexander Pyhalov, Postgres ProfessionalFrom 80d60eb9b1630ee55d1825964e0e976ae6c289a1 Mon Sep 17 00:00:00 2001 From: Alexander Pyhalov Date: Tue, 30 Mar 2021 13:24:14 +0300 Subject: [PATCH] Allow pushing CASE expression to foreign server --- contrib/postgres_fdw/deparse.c| 118 ++ .../postgres_fdw/expected/postgres_fdw.out| 47 +++ contrib/postgres_fdw/sql/postgres_fdw.sql | 22 3 files changed, 187 insertions(+) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 31919fda8c6..3621fed4b54 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -87,6 +87,7 @@ typedef struct foreign_loc_cxt { Oid collation; /* OID of current collation, if any */ FDWCollateState state; /* state of current collation choice */ + Expr *case_arg; /* the last case arg to inspect */ } foreign_loc_cxt; /* @@ -101,6 +102,7 @@ typedef struct deparse_expr_cxt * a base relation. */ StringInfo buf; /* output buffer to append to */ List **params_list; /* exprs that will become remote Params */ + List *case_args; /* list of args to deparse CaseTestExpr */ } deparse_expr_cxt; #define REL_ALIAS_PREFIX "r" @@ -186,6 +188,9 @@ static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno, deparse_expr_cxt *context); +static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context); +static void deparseCaseTestExpr(CaseTestExpr *node, deparse_expr_cxt *context); + /* * Helper functions */ @@ -254,6 +259,7 @@ is_foreign_expr(PlannerInfo *root, glob_cxt.relids = baserel->relids; loc_cxt.collation = InvalidOid; loc_cxt.state = FDW_COLLATE_NONE; + loc_cxt.case_arg = NULL; if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt)) return false; @@ -312,6 +318,7 @@ foreign_expr_walker(Node *node, /* Set up inner_cxt for possible recursion to child nodes */ inner_cxt.collation = InvalidOid; inner_cxt.state = FDW_COLLATE_NONE; + inner_cxt.case_arg = outer_cxt->case_arg; switch (nodeTag(node)) { @@ -509,6 +516,62 @@ foreign_expr_walker(Node *node, state = FDW_COLLATE_UNSAFE; } break; + case T_CaseExpr: + { +CaseExpr *ce = (CaseExpr *) node; +ListCell *arg; + +if (ce->arg) + inner_cxt.case_arg = ce->arg; + +foreach(arg, ce->args) +{ + CaseWhen *w = lfirst_node(CaseWhen, arg); + + if (!foreign_expr_walker((Node *) w->expr, + glob_cxt, &inner_cxt)) + return false; + + if (!foreign_expr_walker((Node *) w->result, + glob_cxt, &inner_cxt)) + return false; +} + +if (!foreign_expr_walker((Node *) ce->defresult, + glob_cxt, &inner_cxt)) + return false; + +collation = ce->casecollid; +if (collation == InvalidOid) + state = FDW_COLLATE_NONE; +else if (inner_cxt.state == FDW_COLLATE_SAFE && + collation == inner_cxt.collation) + state = FDW_COLLATE_SAFE; +else if (collation == DEFAULT_COLLATION_OID) + state = FDW_COLLATE_NONE; +else + state = FDW_COLLATE_UNSAFE; + } + break; + case T_CaseTestExpr: + { +Expr *arg; + +Assert(outer_cxt->case_arg != NULL); +arg = outer_cxt->case_arg; + +if (!foreign_expr_walker((Node *) arg, + glob_cxt, &inner_cxt)) + return false; + +/* + * Collation and state just bubble up from the previously + * saved case argument + */ +collation = inner_cxt.collation; +state = inner_cxt.state; + } + break; case T_OpExpr: case T_DistinctExpr: /* struct-equivalent to OpExpr */ { @@ -1019,6 +1082,7 
@@ deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *rel, context.foreignrel = rel; context.scanrel = IS_UPPER_REL(rel) ? fpinfo->outerrel : rel; context.params_list = params_list; + context.case_args = NIL; /* Construct SELECT clause */ deparseSelectSql(tlist, is_subquery, retrieved_attrs, &context); @@ -1598,6 +1662,7 @@ deparseFromExprForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *foreignrel, context.scanrel = foreignrel; context.root = root; context.params_list = params_list; + context.case_args = NIL; appendStringInfoChar(buf, '('); appendConditions(fpinfo->joinclauses, &context); @@ -1901,6 +1966,7 @@ deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root, context.scanrel = foreignrel; context.buf = buf; context.params_list = params_list; + context.case_args = NIL; appendStringInfoString(buf, "UPDATE "); deparseRelation(buf, rel); @@ -2008,6 +2074,7 @@ deparseDirectDeleteSql(StringInfo buf, PlannerInfo *root, context.scanrel = foreignrel; context.buf = buf;
Re: snapshot too old issues, first around wraparound and then more.
Robert Haas writes: > Oh, maybe I'm the one who misunderstood... So, it's well over a year later, and so far as I can see exactly nothing has been done about snapshot_too_old's problems. I never liked that feature to begin with, and I would be very glad to undertake the task of ripping it out. If someone thinks this should not happen, please commit to fixing it ... and not "eventually". regards, tom lane
Re: a path towards replacing GEQO with something better
Robert Haas writes: > One idea I just ran across in > https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf > is to try to economize by skipping consideration of bushy plans. We > could start doing that when some budget is exceeded, similar to what > you are proposing here, but probably the budget for skipping > consideration of bushy plans would be smaller than the budget for > switching to IDP. The idea of skipping bushy plan generation in some > cases makes sense to me intuitively because most of the plans > PostgreSQL generates are mostly left-deep, and even when we do > generate bushy plans, they're not always a whole lot better than the > nearest equivalent left-deep plan. The paper argues that considering > bushy plans makes measurable gains, but also that failure to consider > such plans isn't catastrophically bad. It's not catastrophically bad until you hit a query where the only correct plans are bushy. These do exist, and I think they're not that uncommon. Still, I take your point that maybe we could ratchet down the cost of exhaustive search by skimping on this part. Maybe it'd work to skip bushy so long as we'd found at least one left-deep or right-deep path for the current rel. regards, tom lane
Re: change logging defaults
Justin Pryzby writes: > I propose to change some defaults: > log_autovacuum_min_duration = 0 > log_checkpoints = on > log_lock_waits = on (and log_recovery_conflict_waits too?) > log_temp_files = 128kB Why? Based on reports that I see, some quite large percentage of Postgres DBAs never look at the postmaster log at all. So making the log bulkier isn't something that will be useful to them. People who do want these reports are certainly capable of turning them on. > Note that pg_regress does this: What we find useful for testing seems to me to be nearly unrelated to production needs. regards, tom lane
disfavoring unparameterized nested loops
>From time to time, someone tells me that they've configured enable_nestloop=false on postgresql.conf, which is a pretty bad idea since there are a significant number of cases where such plans are the only reasonable way of executing some query. However, it's no great secret that PostgreSQL's optimizer sometimes produces nested loops that are very, very, very slow, generally because it has grossly underestimated the cardinality of the inner side of the nested loop. The frustration which users experience as a result of such events is understandable. I read https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf today and found out that the authors of that paper did something a bit more nuanced which, in their experiments, was very successful. It sounds like what they basically did is disabled unparameterized nested loops. They argue that such nested loops figure to gain very little as compared with a hash join, but that they might turn out to lose a lot if the cardinality estimation is inaccurate, and they present experimental results to back up those claims. One observation that the paper makes along the way is that every system they tested is more likely to underestimate the cardinality of joins than to overestimate it, and that this tendency becomes more pronounced as the size of the join planning problem increases. On reflection, I believe this matches my experience, and it makes sense that it should be so, since it occasionally happens that the join selectivity estimate is essentially zero, and a bigger join problem is more likely to have at least one such case. On the other hand, the join selectivity estimate can never be +infinity. Hence, it's more important in general for a database system to be resilient against underestimates than to be resilient against overestimates. Being less willing to choose unparameterized nested loops is one way to move in that direction. How to do that is not very clear. One very simple thing we could do would be to introduce enable_nestloop=parameterized or enable_parameterized_nestloop=false. That is a pretty blunt instrument but the authors of the paper seem to have done that with positive results, so maybe it's actually not a dumb idea. Another approach would be to introduce a large fuzz factor for such nested loops e.g. keep them only if the cost estimate is better than the comparable hash join plan by at least a factor of N (or if no such plan is being generated for some reason). I'm not very confident that this would actually do what we want, though. In the problematic cases, a nested loop is likely to look extremely cheap, so just imagining that the cost might be higher is not very protective. Perhaps a better approach would be something like: if the estimated number of inner rows is less than 100, then re-estimate the cost of this approach and of the best available hash join on the assumption that there are 100 inner rows. If the hash join still wins, keep it; if it loses under that assumption, throw it out. I think it's likely that this approach would eliminate a large number of highly risky nested loop plans, probably even with s/100/10/g, without causing many other problems (except perhaps increased planner CPU consumption ... but maybe that's not too bad). Just to be clear, I do understand that there are cases where no Hash Join is possible, but anything we do here could be made to apply only when a hash join is in fact possible. 
We could also think about making the point of comparison the best other plans of any sort rather than a hash join specifically, which feels a little more principled but might actually be worse. When a Nested Loop is a stupid idea, it's stupid precisely because the inner side is big and we could've avoided recomputing it over and over by using a Hash Join instead, not because some Merge Join based plan turns out to be better. I mean, it is possible that some Merge Join plan does turn out to be better, but that's not rage-inducing in the same way. Nobody looks at a complicated join plan that happened to use a Nested Loop and says "obviously, this is inferior to a merge join," or if they do, they're probably full of hot air. But people look at complicated join plans that happen to use a Nested Loop and say "obviously, this is inferior to a hash join" *all the time* and assuming the inner path is unparameterized, they are very often correct. Thoughts? I'd be particularly curious to hear about any cases anyone knows about where an unparameterized nested loop and a hash join with batches=1 are both possible and where the unparameterized nested loop is *much* cheaper than the hash join. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
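To make the last comparison idea concrete, here is a toy sketch of the decision rule only -- standalone C with completely made-up cost formulas, not PostgreSQL's cost model or planner code:

#include <stdbool.h>
#include <stdio.h>

#define INNER_ROW_FLOOR 100.0

/* stand-in cost functions -- deliberately simplistic, not the real model */
static double
nestloop_cost(double outer_rows, double inner_rows)
{
    return outer_rows * inner_rows;         /* rescan inner side per outer row */
}

static double
hashjoin_cost(double outer_rows, double inner_rows)
{
    return outer_rows + 2.0 * inner_rows;   /* build once, probe once */
}

/*
 * Keep an unparameterized nested loop only if it still wins when we
 * assume the inner side has at least INNER_ROW_FLOOR rows.
 */
static bool
keep_unparam_nestloop(double outer_rows, double est_inner_rows)
{
    double      clamped = est_inner_rows < INNER_ROW_FLOOR ?
        INNER_ROW_FLOOR : est_inner_rows;

    return nestloop_cost(outer_rows, clamped) <=
        hashjoin_cost(outer_rows, clamped);
}

int
main(void)
{
    /* estimated 1 inner row: looks great naively, rejected after clamping */
    printf("est 1 inner row  -> keep nestloop? %s\n",
           keep_unparam_nestloop(1000.0, 1.0) ? "yes" : "no");
    /* truly tiny outer side: the nested loop can still survive */
    printf("outer side 1 row -> keep nestloop? %s\n",
           keep_unparam_nestloop(1.0, 1.0) ? "yes" : "no");
    return 0;
}

The interesting property shows up even with fake numbers: a nested loop that looked unbeatable at an estimated one inner row no longer survives once the inner side is assumed to be at least 100 rows, while plans with genuinely tiny outer sides still keep it.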
Re: snapshot too old issues, first around wraparound and then more.
Hi, On 2021-06-15 12:51:28 -0400, Tom Lane wrote: > Robert Haas writes: > > Oh, maybe I'm the one who misunderstood... > > So, it's well over a year later, and so far as I can see exactly > nothing has been done about snapshot_too_old's problems. > > I never liked that feature to begin with, and I would be very > glad to undertake the task of ripping it out. If someone thinks > this should not happen, please commit to fixing it ... and not > "eventually". I still think that's the most reasonable course. I actually like the feature, but I don't think a better implementation of it would share much if any of the current infrastructure. Greetings, Andres Freund
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 9:51 AM Tom Lane wrote: > So, it's well over a year later, and so far as I can see exactly > nothing has been done about snapshot_too_old's problems. FWIW I think that the concept itself is basically reasonable. The implementation is very flawed, though, so it hardly enters into it. > I never liked that feature to begin with, and I would be very > glad to undertake the task of ripping it out. If someone thinks > this should not happen, please commit to fixing it ... and not > "eventually". ISTM that this is currently everybody's responsibility, and therefore nobody's responsibility. That's probably why the problems haven't been resolved yet. I propose that the revert question be explicitly timeboxed. If the issues haven't been fixed by some date, then "snapshot too old" automatically gets reverted without further discussion. This gives qualified hackers the opportunity to save the feature if they feel strongly about it, and are actually willing to take responsibility for its ongoing maintenance. -- Peter Geoghegan
Re: a path towards replacing GEQO with something better
On Tue, Jun 15, 2021 at 1:00 PM Tom Lane wrote: > Still, I take your point that maybe we could ratchet down the cost of > exhaustive search by skimping on this part. Maybe it'd work to skip > bushy so long as we'd found at least one left-deep or right-deep path > for the current rel. Yes, that sounds better. -- Robert Haas EDB: http://www.enterprisedb.com
Re: unnesting multirange data types
Alexander Korotkov writes: > I did run "check-world", but it passed for me. Probably the same > reason it passed for some buildfarm animals... The only buildfarm animals that have passed since this went in are the ones that don't run the pg_dump or pg_upgrade tests. It looks to me like the proximate problem is that you should have taught pg_dump to skip these new auto-generated functions. However, I fail to see why we need auto-generated functions for this at all. Couldn't we have done it with one polymorphic function? I think this ought to be reverted and reviewed more carefully. regards, tom lane
Re: unnesting multirange data types
On 2021-Jun-15, Tom Lane wrote: > It looks to me like the proximate problem is that you should > have taught pg_dump to skip these new auto-generated functions. > However, I fail to see why we need auto-generated functions > for this at all. Couldn't we have done it with one polymorphic > function? I think such a function would need to take anycompatiblerangearray, which is not something we currently have. > I think this ought to be reverted and reviewed more carefully. It seems to me that removing the cast-to-range[] is a sufficient fix, and that we can do with only the unnest part for pg14; the casts can be added in 15 (if at all). That would mean reverting only half the patch. -- Álvaro Herrera39°49'30"S 73°17'W
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
Hi, On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: > memcpy would not suffer from it? It'd not be correct for short sqlstates - you'd read beyond the end of the source buffer. There are cases of it in the ecpg code. Greetings, Andres Freund
Re: unnesting multirange data types
Alvaro Herrera writes: > On 2021-Jun-15, Tom Lane wrote: >> I think this ought to be reverted and reviewed more carefully. > It seems to me that removing the cast-to-range[] is a sufficient fix, > and that we can do with only the unnest part for pg14; the casts can be > added in 15 (if at all). That would mean reverting only half the patch. Might be a reasonable solution. But right now I'm annoyed that the buildfarm is broken, and I'm also convinced that this didn't get adequate testing. I think "revert and reconsider" is the way forward for today. regards, tom lane
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
Andres Freund writes: > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: >> memcpy would not suffer from it? > It'd not be correct for short sqlstates - you'd read beyond the end of > the source buffer. There are cases of it in the ecpg code. What's a "short SQLSTATE"? They're all five characters by definition. regards, tom lane
Re: unnesting multirange data types
On 6/15/21 1:52 PM, Tom Lane wrote: > Alvaro Herrera writes: >> On 2021-Jun-15, Tom Lane wrote: >>> I think this ought to be reverted and reviewed more carefully. > >> It seems to me that removing the cast-to-range[] is a sufficient fix, >> and that we can do with only the unnest part for pg14; the casts can be >> added in 15 (if at all). That would mean reverting only half the patch. > > Might be a reasonable solution. But right now I'm annoyed that the > buildfarm is broken, and I'm also convinced that this didn't get > adequate testing. I had focused testing primarily on the "unnest" cases that I had described in my original note. I did a couple of casts and had no issue; I did not test with pg_dump / pg_upgrade, but noting to do so in the future in cases like this. > I think "revert and reconsider" is the way > forward for today. I don't want the buildfarm broken so I'm fine if this is the best way forward. If we can keep the "unnest" functionality I would strongly suggest it as that was the premise of the original note to complete the utility of multiranges for v14. The casting, while convenient, is not needed. Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: snapshot too old issues, first around wraparound and then more.
Peter Geoghegan writes: > On Tue, Jun 15, 2021 at 9:51 AM Tom Lane wrote: >> So, it's well over a year later, and so far as I can see exactly >> nothing has been done about snapshot_too_old's problems. > I propose that the revert question be explicitly timeboxed. If the > issues haven't been fixed by some date, then "snapshot too old" > automatically gets reverted without further discussion. This gives > qualified hackers the opportunity to save the feature if they feel > strongly about it, and are actually willing to take responsibility for > its ongoing maintenance. The goal I have in mind is for snapshot_too_old to be fixed or gone in v15. I don't feel a need to force the issue sooner than that, so there's plenty of time for someone to step up, if anyone wishes to. I imagine that we should just ignore the question of whether anything can be done for it in the back branches. Given the problems identified upthread, fixing it in a non-back-patchable way would be challenging enough. regards, tom lane
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 10:09 AM Robert Haas wrote: > How to do that is not very clear. One very simple thing we could do > would be to introduce enable_nestloop=parameterized or > enable_parameterized_nestloop=false. That is a pretty blunt instrument > but the authors of the paper seem to have done that with positive > results, so maybe it's actually not a dumb idea. I think that it's probably a good idea as-is. Simple heuristics that are very frequently wrong when considered in a naive way can work very well in practice. This seems to happen when they capture some kind of extreme naturally occurring cost/benefit asymmetry -- especially one with fixed, well-understood costs and unlimited benefits (this business with unparameterized nestloop joins is about *avoiding* the inverse asymmetry, but that seems very similar). My go-to example of such an asymmetry is the rightmost page split heuristic of applying leaf fillfactor regardless of any of the other specifics; we effectively assume that all indexes are on columns with ever-increasing values. Which is obviously wrong. We're choosing between two alternatives (unparameterized nested loop vs hash join) that are really very similar when things go as expected, but diverge sharply when there is a misestimation -- who wouldn't take the "conservative" choice here? I guess that there is a hesitation to introduce heuristics like this because they don't fit into some larger framework that captures risk -- it might be seen as an ugly special case. But isn't this already actually kind of special, whether or not we officially think so? -- Peter Geoghegan
Re: a path towards replacing GEQO with something better
On Tue, Jun 15, 2021 at 12:15 PM Robert Haas wrote: > > On Wed, Jun 9, 2021 at 9:24 PM John Naylor wrote: > > 3) It actually improves the existing exhaustive search, because the complexity of the join order problem depends on the query shape: a "chain" shape (linear) vs. a "star" shape (as in star schema), for the most common examples. The size of the DP table grows like this (for n >= 4): ... > I don't quite understand the difference between the "chain" case and > the "star" case. Can you show sample queries for each one? e.g. SELECT > ... FROM a_1, a_2, ..., a_n WHERE ? There's a very simple example in the optimizer README: -- SELECT * FROMtab1, tab2, tab3, tab4 WHERE tab1.col = tab2.col AND tab2.col = tab3.col AND tab3.col = tab4.col Tables 1, 2, 3, and 4 are joined as: {1 2},{2 3},{3 4} {1 2 3},{2 3 4} {1 2 3 4} (other possibilities will be excluded for lack of join clauses) SELECT * FROMtab1, tab2, tab3, tab4 WHERE tab1.col = tab2.col AND tab1.col = tab3.col AND tab1.col = tab4.col Tables 1, 2, 3, and 4 are joined as: {1 2},{1 3},{1 4} {1 2 3},{1 3 4},{1 2 4} {1 2 3 4} -- The first one is chain, and the second is star. Four is the smallest set where we have a difference. I should now point out an imprecision in my language: By "size of the DP table", the numbers in my first email refer to the number of joinrels times the number of possible joins (not paths, and ignoring commutativity). Here are the steps laid out, with cumulative counts: join_level, # joins, cumulative # joins: linear, n=4 2 3 3 3 4 7 4 3 10 star, n=4 2 3 3 3 6 9 4 3 12 And of course, the chain query also has three at the last level, because it tries two left- (or right-) deep joins and one bushy join. > One idea I just ran across in > https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf > is to try to economize by skipping consideration of bushy plans. That's a good paper, and it did influence my thinking. You likely already know this, but for the archives: If only chain queries could have bushy plans, it wouldn't matter because they are so cheap to enumerate. But, since star queries can introduce a large number of extra joins via equivalence (same join column or FK), making them resemble "clique" queries, bushy joins get excessively numerous. > We > could start doing that when some budget is exceeded, similar to what > you are proposing here, but probably the budget for skipping > consideration of bushy plans would be smaller than the budget for > switching to IDP. The idea of skipping bushy plan generation in some > cases makes sense to me intuitively because most of the plans > PostgreSQL generates are mostly left-deep, and even when we do > generate bushy plans, they're not always a whole lot better than the > nearest equivalent left-deep plan. The paper argues that considering > bushy plans makes measurable gains, but also that failure to consider > such plans isn't catastrophically bad. I think that makes sense. There are a few things we could do within the "grey zone" -- too many rels to finish exhaustive search, but not enough to justify starting directly with the greedy step -- to increase our chances of completing, and that's a very simple one. -- John Naylor EDB: http://www.enterprisedb.com
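For reference, the closed forms quoted above can be checked directly against the per-level counts John lists (reading his level/joins/cumulative triples: for n = 4 the chain shape has 3, 4, 3 joins at levels 2-4, the star shape 3, 6, 3):

\mathrm{chain}(n) = \frac{n^{3} - n}{6}, \qquad \mathrm{star}(n) = (n - 1)\,2^{\,n-2} \quad (n \ge 4)

\mathrm{chain}(4) = \frac{64 - 4}{6} = 10 = 3 + 4 + 3, \qquad \mathrm{star}(4) = 3 \cdot 2^{2} = 12 = 3 + 6 + 3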
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 11:01 AM Tom Lane wrote: > The goal I have in mind is for snapshot_too_old to be fixed or gone > in v15. I don't feel a need to force the issue sooner than that, so > there's plenty of time for someone to step up, if anyone wishes to. Seems more than reasonable to me. A year ought to be plenty of time if the feature truly is salvageable. What do other people think? Ideally we could commit to that hard deadline now. To me the important thing is to actually have a real deadline that forces the issue one way or another. This situation must not be allowed to drag on forever. -- Peter Geoghegan
Re: Duplicate history file?
Greetings, * Julien Rouhaud (rjuju...@gmail.com) wrote: > On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote: > > The requirements are things which are learned over years and changes > > over time. Trying to document them and keep up with them would be a > > pretty serious project all on its own. There are external projects who > > spend serious time and energy doing their best to provide the tooling > > needed here and we should be promoting those, not trying to pretend like > > this is a simple thing which anyone could write a short perl script to > > accomplish. > > The fact that this is such a complex problem is the very reason why we should > spend a lot of energy documenting the various requirements. Otherwise, how > could anyone implement a valid program for that and how could anyone validate > that a solution claiming to do its job actually does its job? Reading the code. > > Already tried doing it in perl. No, it's not simple and it's also > > entirely vaporware today and implies that we're going to develop this > > tool, improve it in the future as we realize it needs to be improved, > > and maintain it as part of core forever. If we want to actually adopt > > and pull in a backup tool to be part of core then we should talk about > > things which actually exist, such as the various existing projects that > > have been written to specifically work to address all the requirements > > which are understood today, not say "well, we can just write a simple > > perl script to do it" because it's not actually that simple. > > Adopting a full backup solution seems like a bit extreme. On the other hand, > having some real core implementation of an archive_command for the most > general > use cases (local copy, distant copy over ssh...) could make sense. This would > remove that burden for some, probably most, of the 3rd party backup tools, and > would also ensure that the various requirements are properly documented since > it would be the implementation reference. Having a database platform that hasn't got a full backup solution is a pretty awkward position to be in. I'd like to see something a bit more specific than handwaving about how core could provide something in this area which would remove the burden from other tools. Would also be good to know who is going to write that and maintain it. Thanks, Stephen signature.asc Description: PGP signature
Re: unnesting multirange data types
On 6/15/21 12:06 PM, Alexander Korotkov wrote: > On Tue, Jun 15, 2021 at 4:49 PM Tom Lane wrote: >> Alexander Korotkov writes: >>> Pushed! Thanks to thread participants for raising this topic and >>> review. I'll be around to resolve issues if any. >> Buildfarm is pretty thoroughly unhappy. Did you do a "check-world" >> before pushing? > Yes, I'm looking at this now. > > I did run "check-world", but it passed for me. Probably the same > reason it passed for some buildfarm animals... > Did you configure with --enable-tap-tests? If not, then `make check-world` won't run the tests that are failing here. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: unnesting multirange data types
On Tue, Jun 15, 2021 at 8:18 PM Tom Lane wrote: > Alexander Korotkov writes: > > I did run "check-world", but it passed for me. Probably the same > > reason it passed for some buildfarm animals... > > The only buildfarm animals that have passed since this went in > are the ones that don't run the pg_dump or pg_upgrade tests. > > It looks to me like the proximate problem is that you should > have taught pg_dump to skip these new auto-generated functions. > However, I fail to see why we need auto-generated functions > for this at all. Couldn't we have done it with one polymorphic > function? > > I think this ought to be reverted and reviewed more carefully. Thank you for your feedback. I've reverted the patch. I'm going to have a closer look at this tomorrow. -- Regards, Alexander Korotkov
Re: unnesting multirange data types
On 6/15/21 1:52 PM, Tom Lane wrote: > Alvaro Herrera writes: >> On 2021-Jun-15, Tom Lane wrote: >>> I think this ought to be reverted and reviewed more carefully. >> It seems to me that removing the cast-to-range[] is a sufficient fix, >> and that we can do with only the unnest part for pg14; the casts can be >> added in 15 (if at all). That would mean reverting only half the patch. > Might be a reasonable solution. But right now I'm annoyed that the > buildfarm is broken, and I'm also convinced that this didn't get > adequate testing. I think "revert and reconsider" is the way > forward for today. > > (RMT hat on) That would be my inclination at this stage. The commit message states that it's trivial, but it seems not to be, and I suspect it should not have been done at this stage of the development cycle. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
Hi, On 2021-06-15 13:53:08 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: > >> memcpy would not suffer from it? > > > It'd not be correct for short sqlstates - you'd read beyond the end of > > the source buffer. There are cases of it in the ecpg code. > > What's a "short SQLSTATE"? They're all five characters by definition. I thought there were places that just dealt with "00" etc. And there are - but it's just comparisons. I still don't fully feel comfortable just using memcpy() though, given that the sqlstates originate remotely / from libpq, making it hard to rely on the fact that the buffer "ought to" always be at least 5 bytes long? As far as I can tell there's no enforcement of PQresultErrorField(..., PG_DIAG_SQLSTATE) being that long. Greetings, Andres Freund
Re: Improving the isolationtester: fewer failures, less delay
Hi, On 2021-06-14 22:57:08 -0400, Tom Lane wrote: > This is a followup to the conversation at [1], in which we speculated > about constraining the isolationtester's behavior by annotating the > specfiles, in order to eliminate common buildfarm failures such as [2] > and reduce the need to use long delays to stabilize the test results. > > I've spent a couple days hacking on this idea, and I think it has worked > out really well. On my machine, the time needed for "make installcheck" > in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result > of removing all the multiple-second delays we used before. Very cool stuff. All the reliability things aside, isolationtester frequently is the slowest test in a parallel check world... > Also, while I'm not fool enough to claim that this will reduce the > rate of bogus failures to zero, I do think it addresses all the > repeating failures we've seen lately. And it should make it easier to fix some others and also to make it easier to write some tests that were too hard to make reliable today. > This is still WIP to some extent, as I've not spent much time looking at > specfiles other than the ones with big delays; there may be additional > improvements possible in some places. Also, I've not worried about > whether the tests pass in serializable mode, since we have problems there > already [4]. But this seemed like a good point at which to solicit > feedback and see what the cfbot thinks of it. Are there spec output changes / new failures, if you apply the patch, but do not apply the changes to the spec files? Will look at the patch itself in a bit. Greetings, Andres Freund
Re: Improving the isolationtester: fewer failures, less delay
Andres Freund writes: > On 2021-06-14 22:57:08 -0400, Tom Lane wrote: >> This is still WIP to some extent, as I've not spent much time looking at >> specfiles other than the ones with big delays; there may be additional >> improvements possible in some places. Also, I've not worried about >> whether the tests pass in serializable mode, since we have problems there >> already [4]. But this seemed like a good point at which to solicit >> feedback and see what the cfbot thinks of it. > Are there spec output changes / new failures, if you apply the patch, > but do not apply the changes to the spec files? If you make only the code changes, there are a bunch of diffs stemming from the removal of the 'error in steps' message prefix. If you just mechanically remove those from the .out files without touching the .spec files, most tests pass, but I don't recall whether that's 100% the case. > Will look at the patch itself in a bit. I'll have a v2 in a little bit --- the cfbot pointed out that there were some contrib tests I'd missed fixing, and I found a couple of other improvements. regards, tom lane
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 12:51 PM Tom Lane wrote: > So, it's well over a year later, and so far as I can see exactly > nothing has been done about snapshot_too_old's problems. Progress has been pretty limited, but not altogether nonexistent. 55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to fix, the time->XID mapping, which is one of the main things that Andres said was broken originally. Also, there are patches on this thread from Thomas Munro to add some test coverage for that case, another problem Andres noted in his original email. I guess it wouldn't be too hard to get something committed there, and I'm willing to do it if Thomas doesn't want to and if there's any prospect of salvaging the feature. But that's not clear to me. I'm not clear exactly how many problems we know about and need to fix in order to keep the feature, and I'm also not clear how deep the hole goes. Like, if we need to get a certain number of specific bugs fixed, I might be willing to do that. If we need to commit to a major rewrite of the current implementation, that's more than I can do. But I guess I don't understand exactly how bad the current problems are. Reviewing complaints from Andres from this thread: > Looking at TransactionIdLimitedForOldSnapshots() I think the ts == > update_ts threshold actually needs to be ts >= update_ts, right now we > don't handle being newer than the newest bin correctly afaict (mitigated > by autovacuum=on with naptime=1s doing a snapshot more often). It's hard > to say, because there's no comments. This seems specific enough to be analyzed and anything that is broken can be fixed. > The whole lock nesting is very hazardous. Most (all?) > TestForOldSnapshot() calls happen with locks on on buffers held, and can > acquire lwlocks itself. In some older branches we do entire *catalog > searches* with the buffer lwlock held (for RelationHasUnloggedIndex()). I think it's unclear whether there are live problems in master in this area. > GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the > basis to detect conflicts seems dangerous to me. Isn't that ignoring > inserts that are already in progress? Discussion on this point trailed off. Upon rereading, I think Andres is correct that there's an issue; the snapshot's LSN needs to be set to a value not older than the last xlog insertion that has been completed rather than, as now, the last one that is started. I guess to get that value we would need to do something like WaitXLogInsertionsToFinish(), or some approximation of it e.g. GetXLogWriteRecPtr() at the risk of unnecessary snapshot-too-old errors. > * In read-mostly workloads it can trigger errors in sessions that are > much younger than old_snapshot_threshold, if the transactionid is not > advancing. > > I've not tried to reproduce, but I suspect this can also cause wrong > query results. Because a later snapshot can have the same xmin as > older transactions, it sure looks like we can end up with data from an > older xmin getting removed, but the newer snapshot's whenTaken will > prevent TestForOldSnapshot_impl() from raising an error. I haven't really wrapped my head around this one, but it seems amenable to a localized fix. It basically amounts to a complaint that GetOldSnapshotThresholdTimestamp() is returning a newer value than it should. I don't know exactly what's required to make it not do that, but it doesn't seem intractable. > * I am fairly sure that it can cause crashes (or even data corruption), > because it assumes that DML never needs to check for old snapshots > (with no meaningful justificiation). Leading to heap_update/delete to > assume the page header is a tuple. I don't understand the issue here, really. I assume there might be a wrong word here, because assuming that the page header is a tuple doesn't sound like a thing that would actually happen. I think one of the key problems for this feature is figuring out whether you've got snapshot-too-old checks in all the right places. I think what is being alleged here is that heap_update() and heap_delete() need them, and that it's not good enough to rely on the scan that found the tuple to be updated or deleted having already performed those checks. It is not clear to me whether that is true, or how it could cause crashes. Andres may have explained this to me at some point, but if he did I have unfortunately forgotten. My general point here is that I would like to know whether we have a finite number of reasonably localized bugs or a three-ring disaster that is unrecoverable no matter what we do. Andres seems to think it is the latter, and I *think* Peter Geoghegan agrees, but I think that the point might be worth a little more discussion. I'm unclear whether Tom's dislike for the feature represents hostility to the concept - with which I would have to disagree - or a judgement on the quality of the implementation - which might be justified. For the record, and to Peter's point, I think it's reasonable to set v15 feature freeze as a drop-dead date for getting this feature into acceptable shape, but I would like to try to nail down what we think "acceptable" means in this context. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Improving the isolationtester: fewer failures, less delay
On 6/14/21 10:57 PM, Tom Lane wrote: > This is a followup to the conversation at [1], in which we speculated > about constraining the isolationtester's behavior by annotating the > specfiles, in order to eliminate common buildfarm failures such as [2] > and reduce the need to use long delays to stabilize the test results. > > I've spent a couple days hacking on this idea, and I think it has worked > out really well. On my machine, the time needed for "make installcheck" > in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result > of removing all the multiple-second delays we used before. Also, > while I'm not fool enough to claim that this will reduce the rate of > bogus failures to zero, I do think it addresses all the repeating > failures we've seen lately. > > In the credit-where-credit-is-due department, this owes some inspiration > to the patch Asim Praveen offered at [3], though this takes the idea a > good bit further. > > This is still WIP to some extent, as I've not spent much time looking at > specfiles other than the ones with big delays; there may be additional > improvements possible in some places. Also, I've not worried about > whether the tests pass in serializable mode, since we have problems there > already [4]. But this seemed like a good point at which to solicit > feedback and see what the cfbot thinks of it. > > Cool stuff. Minor gripe while we're on $subject - I wish we'd rename it. It's long outgrown the original purpose that gave it its name, and keeping the name makes it unnecessarily obscure. Yes, I know Lisp still has car and cdr, but we don't need to follow that example. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 2:04 PM Peter Geoghegan wrote: > I guess that there is a hesitation to not introduce heuristics like > this because it doesn't fit into some larger framework that captures > risk -- it might be seen as an ugly special case. But isn't this > already actually kind of special, whether or not we officially think > so? Yes, I think it is. Reading the paper really helped me crystallize my thoughts about this, because when I've studied the problems myself, I came, as you postulate here, to the conclusion that there's a lot of stuff the planner does where there is risk and uncertainty, and thus that a general framework would be necessary to deal with it. But the fact that an academic researcher called this problem out as the only one worth treating specially makes me think that perhaps it deserves special handling. In defense of that approach, note that this is a case where we know both that the Nested Loop is risky and that Hash Join is a similar alternative with probably similar cost. I am not sure there are any other cases where we can say quite so generally both that a certain thing is risky and what we could do instead. -- Robert Haas EDB: http://www.enterprisedb.com
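To make the heuristic under discussion concrete, here is a minimal sketch (hypothetical code, not part of PostgreSQL: the GUC name, the helper function, and the have_alternative_join flag are all assumptions, and the surrounding join-path machinery is elided):

#include "postgres.h"
#include "nodes/pathnodes.h"

/*
 * Hypothetical sketch of the blunt heuristic being discussed, not actual
 * PostgreSQL code: only keep a nested-loop path whose inner side is not
 * parameterized when no alternative join method (e.g. a hash join) exists
 * for the same pair of relations.
 */
static bool enable_unparameterized_nestloop = false;   /* assumed GUC */

static bool
consider_nestloop_path(Path *inner_path, bool have_alternative_join)
{
    /* Parameterized nested loops are treated exactly as today. */
    if (inner_path->param_info != NULL)
        return true;

    /* Unparameterized: keep it only as a join method of last resort. */
    return enable_unparameterized_nestloop || !have_alternative_join;
}

Whether the last-resort case should instead get a cost penalty rather than an outright refusal is exactly the fuzz-factor alternative raised upthread.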
Re: [PATCH] Fix buffer not null terminated on (ecpg lib)
On Tue, Jun 15, 2021 at 15:48, Andres Freund wrote: > Hi, > > On 2021-06-15 13:53:08 -0400, Tom Lane wrote: > > Andres Freund writes: > > > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote: > > >> memcpy would not suffer from it? > > > > > It'd not be correct for short sqlstates - you'd read beyond the end of > > > the source buffer. There are cases of it in the ecpg code. > > > > What's a "short SQLSTATE"? They're all five characters by definition. > > I thought there were places that just dealt with "00" etc. And there are - > but > it's just comparisons. > > I still don't fully feel comfortable just using memcpy() though, given that > the sqlstates originate remotely / from libpq, making it hard to rely on > the > fact that the buffer "ought to" always be at least 5 bytes long? As far as > I > can tell there's no enforcement of PQresultErrorField(..., > PG_DIAG_SQLSTATE) > being that long. > And replacing with snprintf, what do you guys think?

n = snprintf(sqlca->sqlstate, sizeof(sqlca->sqlstate), "%s", sqlstate);
Assert(n >= 0 && n < sizeof(sqlca->sqlstate));

regards, Ranier Vilela fix_buffer_null_not_terminated.patch Description: Binary data
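One way to address the concern about sources that might be shorter than five bytes is a bounded copy that zero-fills; a rough, purely illustrative sketch (copy_sqlstate and SQLSTATE_LEN are made-up names, and the destination is assumed to be the usual 5-byte, non-NUL-terminated sqlca field):

#include <string.h>

#define SQLSTATE_LEN 5

/*
 * Hypothetical sketch only, not ecpg code: copy an untrusted SQLSTATE into
 * a fixed 5-byte, non-NUL-terminated destination without reading past the
 * end of a source that turns out to be shorter than 5 bytes; any missing
 * bytes are zero-filled.
 */
static void
copy_sqlstate(char *dest, const char *src)
{
    size_t  len = 0;

    /* bounded scan: never look past SQLSTATE_LEN bytes of the source */
    while (src && len < SQLSTATE_LEN && src[len] != '\0')
        len++;

    memset(dest, 0, SQLSTATE_LEN);
    if (len > 0)
        memcpy(dest, src, len);
}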
Re: snapshot too old issues, first around wraparound and then more.
Robert Haas writes: > My general point here is that I would like to know whether we have a > finite number of reasonably localized bugs or a three-ring disaster > that is unrecoverable no matter what we do. Andres seems to think it > is the latter, and I *think* Peter Geoghegan agrees, but I think that > the point might be worth a little more discussion. TBH, I am not clear on that either. > I'm unclear whether > Tom's dislike for the feature represents hostility to the concept - > with which I would have to disagree - or a judgement on the quality of > the implementation - which might be justified. I think it's a klugy, unprincipled solution to a valid real-world problem. I suspect the implementation issues are not unrelated to the kluginess of the concept. Thus, I would really like to see us throw this away and find something better. I admit I have nothing to offer about what a better solution to the problem would look like. But I would really like it to not involve random-seeming query failures. regards, tom lane
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 12:31 PM Robert Haas wrote: > Yes, I think it is. Reading the paper really helped me crystallize my > thoughts about this, because when I've studied the problems myself, I > came, as you postulate here, to the conclusion that there's a lot of > stuff the planner does where there is risk and uncertainty, and thus > that a general framework would be necessary to deal with it. It is an example (perhaps the only example in the optimizer) of an oasis of certainty in an ocean of uncertainty. As uncertain as everything is, we seemingly can make strong robust statements about the relative merits of each strategy *in general*, just in this particular instance. It's just not reasonable to make such a reckless choice, no matter what your general risk tolerance is. Goetz Graefe is interviewed here, and goes into his philosophy on robustness -- it seems really interesting to me: https://sigmodrecord.org/publications/sigmodRecord/2009/pdfs/05_Profiles_Graefe.pdf > In defense of that approach, note that this is a > case where we know both that the Nested Loop is risky and that Hash > Join is a similar alternative with probably similar cost. I am not > sure there are any other cases where we can say quite so generally > both that a certain thing is risky and what we could do instead. I tend to think of a hash join as like a nested loop join with an inner index scan where you build the index yourself, dynamically. That might be why I find it easy to make this mental leap. In theory you could do this by giving the nestloop join runtime smarts -- make it turn into a hash join adaptively. Like Graefe's G-Join design. That way you could do this in a theoretically pure way. I don't think that that's actually necessary just to deal with this case -- it probably really is as simple as it seems. I point this out because perhaps it's useful to have that theoretical anchoring. -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 12:49 PM Tom Lane wrote: > Robert Haas writes: > > My general point here is that I would like to know whether we have a > > finite number of reasonably localized bugs or a three-ring disaster > > that is unrecoverable no matter what we do. Andres seems to think it > > is the latter, and I *think* Peter Geoghegan agrees, but I think that > > the point might be worth a little more discussion. > > TBH, I am not clear on that either. I don't know for sure which it is, but that in itself isn't actually what matters to me. The most concerning thing is that I don't really know how to *assess* the design now. The clear presence of at least several very severe bugs doesn't necessarily prove anything (it just *hints* at major design problems). If I could make a very clear definitive statement on this then I'd probably have to do ~1/3 of the total required work -- that'd be my guess. If it was easy to be quite sure here then we wouldn't still be here 12 months later. In any case I don't think that the feature deserves to be treated all that differently to something that was committed much more recently, given what we know. Frankly it took me about 5 minutes to find a very serious bug in the feature, pretty much without giving it any thought. That is not a good sign. > I think it's a klugy, unprincipled solution to a valid real-world > problem. I suspect the implementation issues are not unrelated to > the kluginess of the concept. Thus, I would really like to see us > throw this away and find something better. I admit I have nothing > to offer about what a better solution to the problem would look like. > But I would really like it to not involve random-seeming query failures. I would be very happy to see somebody take this up, because it is important. The reality is that anybody that undertakes this task should start with the assumption that they're starting from scratch, at least until they learn otherwise. So ISTM that it might as well be true that it needs a total rewrite, even if it turns out to not be strictly true. -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
On Tue, Jun 15, 2021 at 12:17 PM Robert Haas wrote: > My general point here is that I would like to know whether we have a > finite number of reasonably localized bugs or a three-ring disaster > that is unrecoverable no matter what we do. Andres seems to think it > is the latter, and I *think* Peter Geoghegan agrees, but I think that > the point might be worth a little more discussion. I'm unclear whether > Tom's dislike for the feature represents hostility to the concept - > with which I would have to disagree - or a judgement on the quality of > the implementation - which might be justified. For the record, and to > Peter's point, I think it's reasonable to set v15 feature freeze as a > drop-dead date for getting this feature into acceptable shape, but I > would like to try to nail down what we think "acceptable" means in > this context. What I had in mind was this: a committer adopting the feature themselves. The committer would be morally obligated to maintain the feature on an ongoing basis, just as if they were the original committer. This seems like the only sensible way of resolving this issue once and for all. If it really is incredibly important that we keep this feature, or one like it, then I have to imagine that somebody will step forward -- there is still ample opportunity. But if nobody steps forward, I'll be forced to conclude that perhaps it wasn't quite as important as I first thought. Anybody can agree that it's important in an abstract sense -- that's easy. What we need is a committer willing to sign on the dotted line, which we're no closer to today than we were a year ago. Actions speak louder than words. -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
On Wed, Apr 1, 2020 at 4:59 PM Andres Freund wrote: > The primary issue here is that there is no TestForOldSnapshot() in > heap_hot_search_buffer(). Therefore index fetches will never even try to > detect that tuples it needs actually have already been pruned away. This is still true, right? Nobody fixed this bug after 14 months? Even though we're talking about returning rows that are not visible to the xact's snapshot? -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
Peter Geoghegan writes: > What I had in mind was this: a committer adopting the feature > themselves. The committer would be morally obligated to maintain the > feature on an ongoing basis, just as if they were the original > committer. This seems like the only sensible way of resolving this > issue once and for all. Yeah, it seems clear that we need somebody to do that, given that Kevin Grittner has been inactive for awhile. Even if the known problems can be resolved by drive-by patches, I think this area needs an ongoing commitment from someone. regards, tom lane
Re: Error on pgbench logs
On Tue, Jun 15, 2021 at 09:53:06PM +0900, Yugo NAGATA wrote: > I am fine with this version, but I think it would be better if we have a > comment > explaining what "tx" is for. > > Also, how about adding Assert(tx) instead of using "else if (tx)" because > we are assuming that tx is always true when agg_interval is not used, right? Agreed on both points. From what I get, this code could be clarified much more, and perhaps partially refactored to have less spaghetti code between the point where we call it at the end of a thread or when gathering stats of a transaction mid-run, but that's not something to do post-beta1. I am not completely sure that the result would be worth it either. Let's document things and let the readers know the assumptions this area of the code relies on, for clarity. The dependency between agg_interval and sample_rate is one of those things; somebody now needs to look at the option parsing to see why only one is possible at a time. Using an extra tx flag to track what to do after the loop for the aggregate print to the log file is an improvement in this direction. -- Michael signature.asc Description: PGP signature
Improving isolationtester's data output
I've been spending a lot of time looking at isolationtester results over the past couple of days, and gotten really annoyed at how poorly it formats query results. In particular, any column heading or value that is 15 characters or longer is not separated from the next column, rendering the output quite confusing. Attached is a little hack that tries to improve that case while making minimal changes to the output files otherwise. There's still a good deal to be desired here: notably, the code still does nothing to ensure vertical alignment of successive lines when there are wide headings or values. But doing anything about that would involve much-more-invasive changes of the output files. If we wanted to buy into that, I'd think about discarding this ad-hoc code altogether in favor of using one of libpq's fe-print.c routines. But I'm not really sure that the small legibility gains that would result are worth massive changes in the output files. Thoughts? regards, tom lane diff --git a/src/test/isolation/expected/detach-partition-concurrently-3.out b/src/test/isolation/expected/detach-partition-concurrently-3.out index 96ee090d53..3f553711b3 100644 --- a/src/test/isolation/expected/detach-partition-concurrently-3.out +++ b/src/test/isolation/expected/detach-partition-concurrently-3.out @@ -9,7 +9,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -35,7 +35,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -54,7 +54,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -77,7 +77,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -95,7 +95,7 @@ a 1 step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach2: <... completed> @@ -121,7 +121,7 @@ a 1 step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach2: <... completed> @@ -150,7 +150,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -172,7 +172,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... 
completed> @@ -194,7 +194,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -213,7 +213,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -233,7 +233,7 @@ a 1 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; -pg_cancel_backendpg_sleep +pg_cancel_backend pg_sleep t step s2detach: <... completed> @@ -252,7 +252,7 @@ a 1 step s2detach: ALTER TABLE d3_li
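As an illustration of the formatting problem described at the top of this message - a heading or value of 15 or more characters running into the next column - here is a minimal hypothetical sketch of the kind of padding fix involved (not the actual patch):

#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch of the kind of change being described, not the actual
 * patch: pad a column to 15 characters as before, but always emit at least
 * one separating space, so a heading or value of 15 or more characters no
 * longer runs into the next column.
 */
static void
print_column(const char *val)
{
    printf("%-15s", val);
    if (strlen(val) >= 15)
        putchar(' ');
}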
Re: Improving isolationtester's data output
On 2021-Jun-15, Tom Lane wrote: > I've been spending a lot of time looking at isolationtester results > over the past couple of days, and gotten really annoyed at how poorly > it formats query results. In particular, any column heading or value > that is 15 characters or longer is not separated from the next column, > rendering the output quite confusing. Yeah, I noticed this too. > Attached is a little hack that tries to improve that case while making > minimal changes to the output files otherwise. Seems pretty reasonable. > There's still a good deal to be desired here: notably, the code still > does nothing to ensure vertical alignment of successive lines when > there are wide headings or values. But doing anything about that > would involve much-more-invasive changes of the output files. > If we wanted to buy into that, I'd think about discarding this > ad-hoc code altogether in favor of using one of libpq's fe-print.c > routines. But I'm not really sure that the small legibility gains > that would result are worth massive changes in the output files. Shrug -- it's a one time change. It wouldn't bother me, for one. -- Álvaro Herrera Valdivia, Chile "Hay que recordar que la existencia en el cosmos, y particularmente la elaboración de civilizaciones dentro de él no son, por desgracia, nada idílicas" (Ijon Tichy)
Re: Improving isolationtester's data output
Alvaro Herrera writes: > On 2021-Jun-15, Tom Lane wrote: >> If we wanted to buy into that, I'd think about discarding this >> ad-hoc code altogether in favor of using one of libpq's fe-print.c >> routines. But I'm not really sure that the small legibility gains >> that would result are worth massive changes in the output files. > Shrug -- it's a one time change. It wouldn't bother me, for one. Going forward it wouldn't be a problem, but back-patching isolation test cases might find it annoying. On the other hand, my nearby patch to improve isolation test stability is already going to create issues of that sort. (Unless, dare I say it, we back-patch that.) I do find it a bit attractive to create some regression-testing coverage of fe-print.c. We are never going to remove that code, AFAICS, so getting some benefit from it would be nice. regards, tom lane
Re: Support for NSS as a libpq TLS backend
On Wed, 2021-06-16 at 00:08 +0200, Daniel Gustafsson wrote: > > On 15 Jun 2021, at 00:15, Jacob Champion wrote: > > Attached is a quick patch; does it work on your machine? > > It does, thanks! I've included it in the attached v37 along with a few tiny > non-functional improvements in comment spelling etc. Great, thanks! I've been tracking down reference leaks in the client. These open references prevent NSS from shutting down cleanly, which then makes it impossible to open a new context in the future. This probably affects other libpq clients more than it affects psql. The first step to fixing that is not ignoring failures during NSS shutdown, so I've tried a patch to pgtls_close() that pushes any failures through the pqInternalNotice(). That seems to be working well. The tests were still mostly green, so I taught connect_ok() to fail if any stderr showed up, and that exposed quite a few failures. I am currently stuck on one last failing test. This leak seems to only show up when using TLSv1.2 or below. There doesn't seem to be a substantial difference in libpq code coverage between 1.2 and 1.3, so I'm worried that either 1) there's some API we use that "requires" cleanup, but only on 1.2 and below, or 2) there's some bug in my version of NSS. Attached are a few work-in-progress patches. I think the reference cleanups themselves are probably solid, but the rest of it could use some feedback. Are there better ways to test for this? and can anyone reproduce the TLSv1.2 leak? --Jacob From 6976d15a4be50276ea2f77fd5c37d238493f9447 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Tue, 15 Jun 2021 10:01:14 -0700 Subject: [PATCH 1/3] nss: don't ignore failures during context shutdown The biggest culprit of a shutdown failure so far seems to be object leaks. A failure here may prevent future contexts from being created (and they'll fail in confusing ways), so don't ignore it. There's not a great way to signal errors from this layer of the stack, but a notice will at least give us a chance of seeing the problem. --- src/interfaces/libpq/fe-secure-nss.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/src/interfaces/libpq/fe-secure-nss.c b/src/interfaces/libpq/fe-secure-nss.c index 1b9bd4f1a5..90f57e0d99 100644 --- a/src/interfaces/libpq/fe-secure-nss.c +++ b/src/interfaces/libpq/fe-secure-nss.c @@ -139,7 +139,16 @@ pgtls_close(PGconn *conn) { if (conn->nss_context) { - NSS_ShutdownContext(conn->nss_context); + SECStatus status; + status = NSS_ShutdownContext(conn->nss_context); + + if (status != SECSuccess) + { + pqInternalNotice(&conn->noticeHooks, +"unable to shut down NSS context: %s", + pg_SSLerrmessage(PR_GetError())); + } + conn->nss_context = NULL; } } -- 2.25.1 From acabc4d51d38124db748a6be9dca0ff83693e039 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Tue, 15 Jun 2021 14:37:28 -0700 Subject: [PATCH 2/3] test: check for empty stderr during connect_ok() ...in a similar manner to command_like(), to catch notices-as-errors coming from NSS. 
--- src/test/authentication/t/001_password.pl | 2 +- src/test/authentication/t/002_saslprep.pl | 2 +- src/test/kerberos/t/001_auth.pl | 2 +- src/test/ldap/t/001_auth.pl | 2 +- src/test/perl/PostgresNode.pm | 5 - src/test/ssl/t/001_ssltests.pl| 4 ++-- src/test/ssl/t/002_scram.pl | 4 ++-- 7 files changed, 12 insertions(+), 9 deletions(-) diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl index 427a360198..327dfb4ed9 100644 --- a/src/test/authentication/t/001_password.pl +++ b/src/test/authentication/t/001_password.pl @@ -20,7 +20,7 @@ if (!$use_unix_sockets) } else { - plan tests => 23; + plan tests => 32; } diff --git a/src/test/authentication/t/002_saslprep.pl b/src/test/authentication/t/002_saslprep.pl index f080a0ccba..d6508df8bd 100644 --- a/src/test/authentication/t/002_saslprep.pl +++ b/src/test/authentication/t/002_saslprep.pl @@ -17,7 +17,7 @@ if (!$use_unix_sockets) } else { - plan tests => 12; + plan tests => 20; } # Delete pg_hba.conf from the given node, add a new entry to it diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl index b5594924ca..62015339a0 100644 --- a/src/test/kerberos/t/001_auth.pl +++ b/src/test/kerberos/t/001_auth.pl @@ -23,7 +23,7 @@ use Time::HiRes qw(usleep); if ($ENV{with_gssapi} eq 'yes') { - plan tests => 44; + plan tests => 54; } else { diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl index 0ae14e4c85..e857c26258 100644 --- a/src/test/ldap/t/001_auth.pl +++ b/src/test/ldap/t/001_auth.pl @@ -9
Re: disfavoring unparameterized nested loops
On Wed, 16 Jun 2021 at 05:09, Robert Haas wrote: > How to do that is not very clear. One very simple thing we could do > would be to introduce enable_nestloop=parameterized or > enable_parameterized_nestloop=false. That is a pretty blunt instrument > but the authors of the paper seem to have done that with positive > results, so maybe it's actually not a dumb idea. It's not great that people are having to use such blunt instruments to get the planner to behave. It might not be a terrible idea to provide them with something a bit sharper as you suggest. The GUC idea is certainly something that could be done without too much effort. There was some talk of doing that in [1]. > Another approach > would be to introduce a large fuzz factor for such nested loops e.g. > keep them only if the cost estimate is better than the comparable hash > join plan by at least a factor of N (or if no such plan is being > generated for some reason). In my experience, the most common reason that the planner chooses non-parameterized nested loops wrongly is when there's row underestimation that says there's just going to be 1 row returned by some set of joins. The problem often comes when some subsequent join is planned and the planner sees the given join rel only produces one row. The cheapest join method we have to join 1 row is Nested Loop. So the planner just sticks the 1-row join rel on the outer side thinking the executor will only need to scan the inner side of the join once. When the outer row count blows up, then we end up scanning that inner side many more times. The problem is compounded when you nest it a few joins deep. Most of the time when I see that happen it's down to either the selectivity of some correlated base-quals being multiplied down to a number low enough that we clamp the estimate to be 1 row. The other case is similar, but with join quals. It seems to me that each time we multiply 2 selectivities together that the risk of the resulting selectivity being out increases. The risk is likely lower when we have some extended statistics which allow us to do fewer selectivity multiplications. For that 1-row case, doing an unparameterized nested loop is only the cheapest join method by a tiny amount. It really wouldn't be much more expensive to just put that single row into a hash table. If that 1 estimated row turns out to be 10 actual rows then it's likely not too big a deal for the hash join code to accommodate the 9 additional rows. This makes me think that it's insanely risky for the planner to be picking Nested Loop joins in this case. And that puts me down the path of thinking that this problem should be solved by having the planner take into account risk as well as costs when generating plans. I don't really have any concrete ideas on that, but a basic idea that I have been considering is that a Path has a risk_factor field that is decided somewhere like clauselist_selectivity(). Maybe the risk can go up by 1 each time we multiply an individual selectivity. (As of master, estimate_num_groups() allows the estimation to pass back some further information to the caller. I added that for Result Cache so I could allow the caller to get visibility about when the estimate fell back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() maybe could get similar treatment to allow the risk_factor or number of nstats_used to be passed back). We'd then add a GUC, something like planner_risk_aversion, which could be set to anything from 0.0 to some positive number. During add_path() we could do the cost comparison like: path1.cost * (1.0 + path1.risk_factor * planner_risk_aversion) < path2.cost * (1.0 + path2.risk_factor * planner_risk_aversion). That way, if you set planner_risk_aversion to 0.0, then the planner does as it does today, i.e. takes big risks. The problem I have with this idea is that I really don't know how to properly calculate what the risk_factor should be set to. It seems easy at first to set it to something that has the planner avoid these silly 1-row estimate nested loop mistakes, but I think what we'd set the risk_factor to would become much more important when more and more Path types start using it. So if we did this and just guessed the risk_factor, that might be fine when only 1 of the paths being compared had a non-zero risk_factor, but as soon as both paths have one set, unless they're set to something sensible, then we just end up comparing garbage costs to garbage costs. Another common mistake the planner makes is around WHERE a = <value> ORDER BY b LIMIT n; where there are separate indexes on (a) and (b). Scanning the (b) index is pretty risky. All the "a" values you need might be right at the end of the index. It seems safer to scan the (a) index as we'd likely have statistics that tell us how many rows exist with a = <value>. We don't have any stats that tell us where in the (b) index are all the rows with a = <value>. I don't really think we should solve this by having nodeNestloop.c fall back on hashing when the going gets tough. Overloading our nodes that way is not a sustainable thing to do. I'd rather see the executor throw the plan back at the planner along with some hints about what was wrong with it. We could do that providing we've not sent anything back to the client yet. David
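To make the add_path() comparison above concrete, here is a minimal sketch under the stated assumptions (Path has no risk_factor field and no planner_risk_aversion GUC exists today; both are hypothetical, so the risk values are passed in explicitly):

#include "postgres.h"
#include "nodes/pathnodes.h"

/*
 * Purely hypothetical sketch of the comparison floated above: with the
 * assumed GUC at 0.0 this degenerates to the plain total-cost comparison
 * done today.
 */
static double planner_risk_aversion = 0.0;  /* assumed GUC */

static bool
path1_cheaper_with_risk(Path *path1, double risk1,
                        Path *path2, double risk2)
{
    double  adj1 = path1->total_cost * (1.0 + risk1 * planner_risk_aversion);
    double  adj2 = path2->total_cost * (1.0 + risk2 * planner_risk_aversion);

    return adj1 < adj2;
}

The open question raised in the message above, namely how to assign sensible risk values in the first place, is untouched by such a comparison.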
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 5:00 PM David Rowley wrote: > I don't really think we should solve this by having nodeNestloop.c > fall back on hashing when the going gets tough. Overloading our nodes > that way is not a sustainable thing to do. I'd rather see the > executor throw the plan back at the planner along with some hints > about what was wrong with it. We could do that providing we've not > sent anything back to the client yet. It wasn't a serious suggestion -- it was just a way of framing the issue at hand that I thought might be useful. If we did have something like that (which FWIW I think makes sense but is hard to do right in a general way) then it might be expected to preemptively refuse to even start down the road of using an unparameterized nestloop join very early, or even before execution time. Such an adaptive join algorithm/node might be expected to have a huge bias against this particular plan shape, that can be reduced to a simple heuristic. But you can have the simple heuristic without needing to build everything else. Whether or not we throw the plan back at the planner or "really change our minds at execution time" seems like a distinction without a difference. Either way we're changing our minds about the plan based on information that is fundamentally execution time information, not plan time information. Have I missed something? -- Peter Geoghegan
Re: snapshot too old issues, first around wraparound and then more.
Hi, On 2021-06-15 15:17:05 -0400, Robert Haas wrote: > But that's not clear to me. I'm not clear exactly how many > problems we know about and need to fix in order to keep the feature, > and I'm also not clear how deep the hole goes. Like, if we need to get > a certain number of specific bugs fixed, I might be willing to do > that. If we need to commit to a major rewrite of the current > implementation, that's more than I can do. But I guess I don't > understand exactly how bad the current problems are. Reviewing > complaints from Andres from this thread: One important complaint I think your (useful!) list missed is that there are missing *read side* checks that demonstrably lead to wrong query results: https://www.postgresql.org/message-id/CAH2-Wz%3DFQ9rbBKkt1nXvz27kmd4A8i1%2B7dcLTNqpCYibxX83VQ%40mail.gmail.com and that it's currently very hard to figure out where they need to be, because there's no real explained model of what needs to be checked and what does not. > > * I am fairly sure that it can cause crashes (or even data corruption), > > because it assumes that DML never needs to check for old snapshots > > (with no meaningful justificiation). Leading to heap_update/delete to > > assume the page header is a tuple. > > I don't understand the issue here, really. I assume there might be a > wrong word here, because assuming that the page header is a tuple > doesn't sound like a thing that would actually happen. I suspect what I was thinking of is that a tuple could get pruned away due to s_t_o, which would leave an LP_DEAD item around. As heap_update/delete neither checks s_t_o nor balks at targeting LP_DEAD items, we'd use the offset from the LP_DEAD item. ItemIdSetDead() sets lp_off to 0 - which would mean that the page header is interpreted as a tuple. Right? > I think one of the key problems for this feature is figuring out > whether you've got snapshot-too-old checks in all the right places. I > think what is being alleged here is that heap_update() and > heap_delete() need them, and that it's not good enough to rely on the > scan that found the tuple to be updated or deleted having already > performed those checks. It is not clear to me whether that is true, or > how it could cause crashes. Andres may have explained this to me at > some point, but if he did I have unfortunately forgotten. I don't think it is sufficient to rely on the scan. That works only as long as the page with the to-be-modified tuple is pinned (since that'd prevent pruning / vacuuming from working on the page), but I am fairly sure that there are plans where the target tuple is not pinned from the point it was scanned until it is modified. In which case it is entirely possible that the u/d target can be pruned away due to s_t_o between the scan checking s_t_o and the u/d executing. > My general point here is that I would like to know whether we have a > finite number of reasonably localized bugs or a three-ring disaster > that is unrecoverable no matter what we do. Andres seems to think it > is the latter Correct. I think there are numerous architectural issues in the way the feature is implemented right now, and that it'd be a substantial project to address them. > For the record, and to Peter's point, I think it's reasonable to set > v15 feature freeze as a drop-dead date for getting this feature into > acceptable shape, but I would like to try to nail down what we think > "acceptable" means in this context.

I think the absolute minimum would be to have:

- actually working tests
- a halfway thorough code review of the feature
- added documentation explaining where exactly s_t_o tests need to be
- bugfixes obviously

If I were to work on the feature, I cannot imagine being sufficiently confident the feature works as long as the xid->time mapping granularity is a minute. It's just not possible to write reasonable tests with the granularity being that high. Or even to do manual tests of it - I'm not that patient. But I "can accept" it if somebody actually doing the work differs on this. Greetings, Andres Freund
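For illustration of the hazard described above (ItemIdSetDead() zeroes lp_off, so a stale line pointer would make the "tuple" start at the page header), here is a minimal hypothetical sketch of the kind of check whose absence is at issue; it is not a proposed fix:

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/*
 * Illustrative only: if the target line pointer is no longer LP_NORMAL
 * (for example because pruning marked it LP_DEAD), bail out instead of
 * dereferencing an lp_off of 0, which would point at the page header.
 */
static HeapTupleHeader
fetch_target_tuple(Page page, OffsetNumber offnum)
{
    ItemId  lp = PageGetItemId(page, offnum);

    if (!ItemIdIsNormal(lp))
        elog(ERROR, "target tuple was concurrently pruned away");

    return (HeapTupleHeader) PageGetItem(page, lp);
}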
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 11:14:56AM -0500, Justin Pryzby wrote: > You're right, I hadn't thought this through all the way. > There's precedent if the default is non-static (wal_sync_method). > > But I think yes/on/true/1 should be a compatibility alias for a static thing, > and then the only option is pglz. > > Of course, changing the default to LZ4 is still a possibility. We have not yet reached a conclusion on the way we are going to parameterize all that, so let's adapt depending on the feedback. For now, I am really interested in this patch set, so I am going to run some tests of my own and test more around the compression levels we have at our disposal with the proposed algos. What I'd like us to finish with here is one new algorithm, able to cover a large range of cases as mentioned upthread, from low-CPU/low-compression to high-CPU/high-compression. It does not seem like a good idea to be stuck with an algo that only specializes in one or the other, for example. -- Michael signature.asc Description: PGP signature
Re: Duplicate history file?
On Tue, Jun 15, 2021 at 02:28:04PM -0400, Stephen Frost wrote: > > * Julien Rouhaud (rjuju...@gmail.com) wrote: > > On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote: > > > > The fact that this is such a complex problem is the very reason why we > > should > > spend a lot of energy documenting the various requirements. Otherwise, how > > could anyone implement a valid program for that and how could anyone > > validate > > that a solution claiming to do its job actually does its job? > > Reading the code. Oh, if it's as simple as that then surely documenting the various requirements won't be an issue.
Re: disfavoring unparameterized nested loops
On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan wrote: > Whether or not we throw the plan back at the planner or "really change > our minds at execution time" seems like a distinction without a > difference. What is "really change our minds at execution time"? Is that switching to another plan without consulting the planner? If so, what decides what that new plan should be? The planner is meant to be the expert in that department. The new information might cause the join order to completely change. It might not be as simple as swapping a Nested Loop for a Hash Join. > Either way we're changing our minds about the plan based > on information that is fundamentally execution time information, not > plan time information. Have I missed something? I don't really see why you think the number of rows that a given join might produce is execution information. It's exactly the sort of information the planner needs to make a good plan. It's just that today we get that information from statistics. Plenty of other DBMSs make decisions from sampling. FWIW, we already have minimalist sampling in get_actual_variable_range(). I'm just trying to highlight that I don't think overloading nodes is a good path to go down. It's not a sustainable practice. It's a path towards just having a single node that does everything. If your suggestion was not serious then there's no point in discussing it further. David