Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
On 2022-Feb-22, Imseih (AWS), Sami wrote:

> On 13.5 a wal flush PANIC is encountered after a standby is promoted.
>
> With debugging, it was found that when a standby skips a missing
> continuation record on recovery, the missingContrecPtr is not
> invalidated after the record is skipped.  Therefore, when the standby
> is promoted to a primary it writes an overwrite_contrecord with an LSN
> of the missingContrecPtr, which is now in the past.  On flush time,
> this causes a PANIC.  From what I can see, this failure scenario can
> only occur after a standby is promoted.

Ooh, nice find and diagnosis.  I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.

I attach the same patch, with the test file put in its final place
rather than as a patch.  Due to recent xlog.c changes this needs a bit
of work to apply to back branches; I'll see about getting it in all
branches soon.

-- 
Álvaro Herrera           PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg0.php

>From 7948d1acdcd25b5b425c7804fad0d46cfb4b14b0 Mon Sep 17 00:00:00 2001
From: "Sami Imseih (AWS)"
Date: Tue, 22 Feb 2022 19:09:36 +
Subject: [PATCH v2] Fix "missing continuation record" after standby promotion

Fix a condition where a recently promoted standby attempts to write
an OVERWRITE_RECORD with an LSN of the previously read aborted record.
Author: Sami Imseih
Discussion: https://postgr.es/m/44d259de-7542-49c4-8a52-2ab01534d...@amazon.com
---
 src/backend/access/transam/xlog.c             |  16 ++-
 .../t/029_overwrite_contrecord_promotion.pl   | 111 ++
 2 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/029_overwrite_contrecord_promotion.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..62e942f41a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5423,11 +5423,25 @@ StartupXLOG(void)
 	 * made it through and start writing after the portion that persisted.
 	 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 	 * we'll do as soon as we're open for writing new WAL.)
+	 *
+	 * If the last wal record is ahead of the missing contrecord, this is a
+	 * recently promoted primary and we should not write an overwrite
+	 * contrecord.
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
-		EndOfLog = missingContrecPtr;
+		if (endOfRecoveryInfo->lastRec < missingContrecPtr)
+		{
+			elog(DEBUG2, "setting end of wal to missing continuation record %X/%X",
+				 LSN_FORMAT_ARGS(missingContrecPtr));
+			EndOfLog = missingContrecPtr;
+		}
+		else
+		{
+			elog(DEBUG2, "resetting aborted record");
+			abortedRecPtr = InvalidXLogRecPtr;
+		}
 	}
 
 	/*

diff --git a/src/test/recovery/t/029_overwrite_contrecord_promotion.pl b/src/test/recovery/t/029_overwrite_contrecord_promotion.pl
new file mode 100644
index 00..ea4ebb32c0
--- /dev/null
+++ b/src/test/recovery/t/029_overwrite_contrecord_promotion.pl
@@ -0,0 +1,111 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Tests for resetting the "aborted record" after a promotion.
+
+use strict;
+use warnings;
+
+use FindBin;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test: Create a physical replica that's missing the last WAL file,
+# then restart the primary to create a divergent WAL file and observe
+# that the replica resets the "aborted record" after a promotion.
+
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+# We need these settings for stability of WAL behavior.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+wal_keep_size = 1GB
+log_min_messages = DEBUG2
+));
+$node->start;
+
+$node->safe_psql('postgres', 'create table filler (a int, b text)');
+
+# Now consume all remaining room in the current WAL segment, leaving
+# space enough only for the start of a largish record.
+$node->safe_psql(
+	'postgres', q{
+DO $$
+DECLARE
+    wal_segsize int := setting::int FROM pg_settings WHERE name = 'wal_segment_size';
+    remain int;
+    iters int := 0;
+BEGIN
+  LOOP
+    INSERT into filler
+      select g, repeat(md5(g::text), (random() * 60 + 1)::int)
+      from generate_series(1, 10) g;
+
+    remain := wal_segsize - (pg_current_wal_insert_lsn() - '0/0') % wal_segsize;
+    IF remain < 2 * setting::int from pg_settings where name = 'block_size' THEN
+      RAISE log 'exiting after % iterations, % bytes to end of WAL segment', iters, remain;
+      EXIT;
+    END IF;
+    iters := iters + 1;
+  END LOOP;
+END
+$$;
+});
+
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
I think the change to ImmediateCheckpointRequested() makes no sense.
Before this patch, that function merely inquires whether there's an
immediate checkpoint queued.  After this patch, it ... changes a
progress-reporting flag?  I think it would make more sense to make the
progress-report flag change in whatever is the place that *requests* an
immediate checkpoint rather than here.

I think the use of capitals in CHECKPOINT and CHECKPOINTER in the
documentation is excessive.  (Same for terms such as MULTIXACT and
others in those docs; we typically use those in lowercase when
user-facing; and do we really use the term CLOG anymore?  Don't we call
it "commit log" nowadays?)

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)
Re: convert libpq uri-regress tests to tap test
On 2022-Feb-23, Andres Freund wrote:

> When verifying that the meson port actually runs all perl based tests I came
> across src/interfaces/libpq/test/regress.pl. Instead of running tests yet
> another way, it seems better to convert it to a tap test.
>
> I hope others agree?

WFM.

> Perhaps we should just rename src/test/modules/libpq_pipeline to
> src/test/modules/libpq and move uri-regress in there as well?

Well, building multiple binaries would require some trickery that might
be excessive for this little tool.  But +1 to that on principle.

-- 
Álvaro Herrera        Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)
Re: convert libpq uri-regress tests to tap test
On 2022-Feb-23, Andres Freund wrote:

> Separately: I don't really understand why we do the whole if USE_PGXS dance in
> src/test/modules?

Yeah, it seems a bit silly.  I'm not opposed to making these makefiles
unconditionally do the PGXS thing -- or the non-PGXS thing, if that
makes it easier to build multiple binaries.

-- 
Álvaro Herrera        Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La grandeza es una experiencia transitoria.  Nunca es consistente.
Depende en gran parte de la imaginación humana creadora de mitos"
(Irulan)
Re: Add id's to various elements in protocol.sgml
On 2022-Feb-24, Dagfinn Ilmari Mannsåker wrote:

> Peter Eisentraut writes:
>
> > Is there a way to obtain those URLs other than going into the HTML
> > sources and checking if there is an anchor near where you want to go?
>
> I use the jump-to-anchor extension:
> https://github.com/brettz9/jump-to-anchor/
>
> Some sites have javascript that adds a link next to the element that
> becomes visible when hovering, e.g. the NAME and other headings on
> https://metacpan.org/pod/perl.

Would it be possible to create such anchor links as part of the XSL
stylesheets for HTML?

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Re: support for MERGE
On 2022-Jan-28, Andres Freund wrote:

> Any chance you could split this into something more reviewable? The overall
> diff stats are: 102 files changed, 8589 insertions(+), 234 deletions(-)
> that's pretty hard to really review. Incremental commits don't really help
> with that.

I'll work on splitting this next week.

> One thing from skimming: There's not enough documentation about the approach
> imo - it's a complicated enough feature to deserve more than the one paragraph
> in src/backend/executor/README.

Sure, I'll have a look at that.

> I'm still [1],[2] uncomfortable with execMerge.c kinda-but-not-really being an
> executor node.

I think we should make a decision on code arrangement here.  From my
perspective, MERGE isn't its own executor node; rather it's just another
"method" in ModifyTable.  Which makes sense, given that all it does is
call parts of INSERT, UPDATE, DELETE, which are the other ModifyTable
methods.  Having a separate file doesn't strike me as great, but on the
other hand it's true that merely moving all the execMerge.c code into
nodeModifyTable.c makes the latter too large.  However, I don't want to
create a .h file that means exposing all those internal functions to the
outside world.

My ideal would be to have each of INSERT, UPDATE, DELETE, MERGE as its
own separate .c file, which would be #included from nodeModifyTable.c.
We don't use that pattern anywhere though.  Any opposition to that?
(The prototypes for each file would have to live in nodeModifyTable.c
itself.)

> > diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
> > index 1a9c1ac290..280ac40e63 100644
> > --- a/src/backend/commands/trigger.c
> > +++ b/src/backend/commands/trigger.c
>
> This stuff seems like one candidate for splitting out.

Yeah, I had done that.  It's now posted as 0001.

> > +	/*
> > +	 * We maintain separate transition tables for UPDATE/INSERT/DELETE since
> > +	 * MERGE can run all three actions in a single statement.  Note that UPDATE
> > +	 * needs both old and new transition tables whereas INSERT needs only new,
> > +	 * and DELETE needs only old.
> > +	 */
> > +
> > +	/* "old" transition table for UPDATE, if any */
> > +	Tuplestorestate *old_upd_tuplestore;
> > +	/* "new" transition table for UPDATE, if any */
> > +	Tuplestorestate *new_upd_tuplestore;
> > +	/* "old" transition table for DELETE, if any */
> > +	Tuplestorestate *old_del_tuplestore;
> > +	/* "new" transition table for INSERT, if any */
> > +	Tuplestorestate *new_ins_tuplestore;
> > +
> > 	TupleTableSlot *storeslot;	/* for converting to tuplestore's format */
> > };
>
> Do we need to do something slightly clever about the memory limits for the
> tuplestore? Each of them having a separate work_mem makes nodeModifyTable.c
> one memory hungry node (the worst of all maybe?).

Not sure how that would work.  The memory handling is inside
tuplestore.c IIRC ... I think that would require some largish
refactoring.  Merely dividing by four doesn't seem like a great answer
either.  Perhaps we could divide by two and be optimistic about it.

> > +					/*
> > +					 * Project the tuple.  In case of a partitioned table, the
> > +					 * projection was already built to use the root's descriptor,
> > +					 * so we don't need to map the tuple here.
> > +					 */
> > +					actionInfo.actionState = action;
> > +					insert_slot = ExecProject(action->mas_proj);
> > +
> > +					(void) ExecInsert(mtstate, rootRelInfo,
> > +									  insert_slot, slot,
> > +									  estate, &actionInfo,
> > +									  mtstate->canSetTag);
> > +					InstrCountFiltered1(&mtstate->ps, 1);
>
> What happens if somebody concurrently inserts a conflicting row?

An open question to which I don't have any good answer RN.

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)
Re: support for MERGE
FYI I intend to get the ModifyTable split patch (0001+0003) pushed
hopefully on Tuesday or Wednesday next week, unless something really
ugly is found in it.

As for MERGE proper, I'm aiming at getting that one pushed on the week
starting March 21st, though of course I'll spend some more time on it
and will probably post further versions of it before that.

-- 
Álvaro Herrera           PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays),
for surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)
Re: Support logical replication of DDLs
Hello

I think this is a pretty interesting and useful feature.

Did you see some old code I wrote towards this goal?
https://www.postgresql.org/message-id/20150215044814.gl3...@alvh.no-ip.org
The intention was that DDL would produce some JSON blob that accurately
describes the DDL that was run; the caller can acquire that and use it
to produce working DDL that doesn't depend on runtime conditions.  There
was lots of discussion on doing things this way.  It was ultimately
abandoned, but I think it's valuable.

-- 
Álvaro Herrera        Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Porque francamente, si para saber manejarse a uno mismo hubiera que
rendir examen... ¿Quién es el machito que tendría carnet?"  (Mafalda)
Re: support for MERGE
On 2022-Mar-17, Alvaro Herrera wrote:

> I'll see what to do about Instrumentation->nfiltered{1,2,3} that was
> complained about by Andres upthread.  Maybe some additional macros will
> help.

This turns out not to be as simple as I expected, mostly because we want
to keep Instrumentation as a node-agnostic struct.  Things are already a
bit wonky with nfiltered/nfiltered2, and the addition of nfiltered3
makes things a lot uglier, and my impulse of using a union to separate
what is used for scans/joins vs. what is used for MERGE results in an
even more node-specific definition rather than the other way around.

Looking at the history I came across this older thread where this was
discussed
https://postgr.es/m/20180409215851.idwc75ct2bzi6tea@alvherre.pgsql
particularly this message from Robert,
https://www.postgresql.org/message-id/CA%2BTgmoaE3R6%3DV0G6zbht2L_LE%2BTsuYuSTPJXjLW%2B9_tpMGubZQ%40mail.gmail.com

I'll keep looking at this in the coming days, see if I can come up with
something sensible.

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Y una voz del caos me habló y me dijo
"Sonríe y sé feliz, podría ser peor".
Y sonreí. Y fui feliz.
Y fue peor.
Re: a misbehavior of partition row movement (?)
I rebased this patch; v15 attached.  Other than fixing the (very large)
conflicts due to the nodeModifyTable.c rework, the most important change
is moving GetAncestorResultRels into execMain.c and renaming it to have
the "Exec-" prefix.  The reason is that what this code does is affect
struct ResultRelInfo, which is owned by execMain.c, so it seemed bogus
to do that in nodeModifyTable.c and then let execMain.c's
ExecCloseResultRelations() do the cleanup.  I added a little commentary
in the latter routine too.

-- 
Álvaro Herrera        Valdivia, Chile  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"        (Andrew Morton)

>From 345ed49718708d8ebde9e2dcf06bf963190bc5c8 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 18 Mar 2022 11:01:24 +0100
Subject: [PATCH v15] Enforce foreign key correctly during cross-partition
 updates

When an update on a partitioned table referenced in foreign key
constraints causes a row to move from one partition to another, which is
implemented by deleting the old row from the source leaf partition
followed by inserting the new row into the destination leaf partition,
firing the foreign key triggers on that delete event can result in
surprising outcomes for those keys.  For example, a given foreign key's
delete trigger which implements the ON DELETE CASCADE clause of that key
will delete any referencing rows when triggered for that internal
DELETE, although it should not, because the referenced row is simply
being moved from one partition of the referenced root partitioned table
into another, not being deleted from it.

This commit teaches trigger.c to skip queuing such delete trigger events
on the leaf partitions in favor of an UPDATE event fired on the root
target relation.  Doing so makes sense because both the old and the new
tuple "logically" belong to the root relation.
The after trigger event queuing interface now allows passing the source
and the destination partitions of a particular cross-partition update
when registering the update event for the root partitioned table.  Along
with the 2 ctids of the old and the new tuple, an after trigger event
now also stores the OIDs of those partitions.  The tuples fetched from
the source and the destination partitions are converted into the root
table format before they are passed to the trigger function.

The implementation currently has a limitation that only the foreign keys
pointing into the query's target relation are considered, not those of
its sub-partitioned partitions.  That seems like a reasonable
limitation, because it sounds rare to have distinct foreign keys
pointing into sub-partitioned partitions, but not into the root table.

Author: Amit Langote
---
 doc/src/sgml/ref/update.sgml              |   7 +
 src/backend/commands/trigger.c            | 389 +++---
 src/backend/executor/execMain.c           |  86 -
 src/backend/executor/execReplication.c    |   5 +-
 src/backend/executor/nodeModifyTable.c    | 155 -
 src/backend/utils/adt/ri_triggers.c       |   6 +
 src/include/commands/trigger.h            |   8 +-
 src/include/executor/executor.h           |   4 +-
 src/include/nodes/execnodes.h             |   6 +
 src/test/regress/expected/foreign_key.out | 204 +++-
 src/test/regress/sql/foreign_key.sql      | 135 +++-
 11 files changed, 925 insertions(+), 80 deletions(-)

diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 3fa54e5f70..3ba13010e7 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -316,6 +316,13 @@ UPDATE count
    partition (provided the foreign data wrapper supports tuple routing), they
    cannot be moved from a foreign-table partition to another partition.
+
+   An attempt of moving a row from one partition to another will fail if a
+   foreign key is found to directly reference a non-root partitioned table
+   in the partition tree, unless that table is also directly mentioned
+   in the UPDATE query.
+
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e08bd9a370..a9aa043981 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -95,10 +95,13 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
 					 Instrumentation *instr,
 					 MemoryContext per_tuple_context);
 static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
+								  ResultRelInfo *src_partinfo,
+								  ResultRelInfo *dst_partinfo,
 								  int event, bool row_trigger,
 								  TupleTableSlot *oldtup, TupleTableSlot *newtup,
 								  List *recheckIndexes, Bitmapset *modifiedCols,
-								  TransitionCaptureState *transition_capture);
+								  TransitionCaptureState *transition_capture,
+								  bool is_crosspart_update);
 static void AfterTriggerEnlargeQueryState(void);
 static bool before_stmt_triggers_fired(Oi
Re: support for MERGE
On 2022-Mar-10, Simon Riggs wrote:

> Duplicate rows should produce a uniqueness violation error in one of
> the transactions, assuming there is a constraint to define the
> conflict. Without such a constraint there is no conflict.
>
> Concurrent inserts are checked by merge-insert-update.spec, which
> correctly raises an ERROR in this case, as documented.

Agreed -- I think this is reasonable.

> Various cases were not tested in the patch - additional patch
> attached, but nothing surprising there.

Thank you, I've included this in all versions of the patch since you
sent it.

> ExecInsert() does not return from such an ERROR, so the code fragment
> appears correct to me.

I think trying to deal with it in a different way (namely: suspend
processing the inserting WHEN NOT MATCHED clause and switch to
processing the row using WHEN MATCHED clauses) would require us to use
speculative tokens, similar to the way INSERT ON CONFLICT does.  I'm
not sure we want to go there, but it seems okay to leave that for a
later patch.  Moreover, there would be no compatibility hit from doing
so.

-- 
Álvaro Herrera        Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La persona que no quería pecar / estaba obligada a sentarse
en duras y empinadas sillas / desprovistas, por cierto
de blandos atenuantes"                          (Patricio Vogel)
Re: a misbehavior of partition row movement (?)
On 2022-Mar-18, Zhihong Yu wrote:

> +#define AFTER_TRIGGER_OFFSET			0x07FF	/* must be low-order bits */
> +#define AFTER_TRIGGER_DONE				0x8000
> +#define AFTER_TRIGGER_IN_PROGRESS		0x4000
>
> Is it better if the order of AFTER_TRIGGER_DONE
> and AFTER_TRIGGER_IN_PROGRESS is swapped (for the ordinal values to be
> sequential) ?

They *are* sequential -- see
https://www.postgresql.org/message-id/202201172215.2tse3vjjgi2b%40alvherre.pgsql

> +#define AFTER_TRIGGER_CP_UPDATE			0x0800
>
> It would be better to add a comment for this constant, explaining what CP
> means (cross partition).

Sure.

> +	if (!partRel->rd_rel->relispartition)
> +		elog(ERROR, "cannot find ancestors of a non-partition result relation");
>
> It would be better to include the relation name in the error message.

I don't think it matters.  We don't really expect to hit this.

> +	/* Ignore the root ancestor, because ...?? */
>
> Please fill out the remainder of the comment.

I actually would like to know what's the rationale for this myself.
Amit?

> +			if (!trig->tgisclone &&
> +				RI_FKey_trigger_type(trig->tgfoid) == RI_TRIGGER_PK)
> +			{
> +				has_noncloned_fkey = true;
>
> The variable says fkey, but the constant is not RI_TRIGGER_FK.  Maybe add a
> comment explaining why.

Well, the constant is about the trigger *function*, not about any
constraint.  This code is testing "is this a noncloned trigger, and does
that trigger use an FK-related function?"  If you have a favorite
comment to include, I'm all ears.

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html

>From a71baa9ab81d6f9ed04ce2c37c86d806ef36aa8b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 18 Mar 2022 11:01:24 +0100
Subject: [PATCH v16] Enforce foreign key correctly during cross-partition
 updates

When an update on a partitioned table referenced in foreign key
constraints causes a row to move from one partition to another, which is
implemented by deleting the old row from the source leaf partition
followed by inserting the new row into the destination leaf partition,
firing the foreign key triggers on that delete event can result in
surprising outcomes for those keys.  For example, a given foreign key's
delete trigger which implements the ON DELETE CASCADE clause of that key
will delete any referencing rows when triggered for that internal
DELETE, although it should not, because the referenced row is simply
being moved from one partition of the referenced root partitioned table
into another, not being deleted from it.

This commit teaches trigger.c to skip queuing such delete trigger events
on the leaf partitions in favor of an UPDATE event fired on the root
target relation.  Doing so makes sense because both the old and the new
tuple "logically" belong to the root relation.

The after trigger event queuing interface now allows passing the source
and the destination partitions of a particular cross-partition update
when registering the update event for the root partitioned table.  Along
with the 2 ctids of the old and the new tuple, an after trigger event
now also stores the OIDs of those partitions.  The tuples fetched from
the source and the destination partitions are converted into the root
table format before they are passed to the trigger function.

The implementation currently has a limitation that only the foreign keys
pointing into the query's target relation are considered, not those of
its sub-partitioned partitions.
That seems like a reasonable limitation, because it sounds rare to have
distinct foreign keys pointing into sub-partitioned partitions, but not
into the root table.

Author: Amit Langote
---
 doc/src/sgml/ref/update.sgml              |   7 +
 src/backend/commands/trigger.c            | 394 +++---
 src/backend/executor/execMain.c           |  86 -
 src/backend/executor/execReplication.c    |   5 +-
 src/backend/executor/nodeModifyTable.c    | 151 -
 src/backend/utils/adt/ri_triggers.c       |   6 +
 src/include/commands/trigger.h            |   8 +-
 src/include/executor/executor.h           |   4 +-
 src/include/nodes/execnodes.h             |   6 +
 src/test/regress/expected/foreign_key.out | 204 ++-
 src/test/regress/sql/foreign_key.sql      | 135 +++-
 11 files changed, 926 insertions(+), 80 deletions(-)

diff --git a/doc/src/sgml/ref/update.sgml b/doc/s
Re: a misbehavior of partition row movement (?)
On 2022-Mar-20, Amit Langote wrote:

> On Sun, Mar 20, 2022 at 5:13 AM Alvaro Herrera wrote:
> > On 2022-Mar-18, Zhihong Yu wrote:
> > > +	if (!partRel->rd_rel->relispartition)
> > > +		elog(ERROR, "cannot find ancestors of a non-partition result relation");
> > >
> > > It would be better to include the relation name in the error message.
> >
> > I don't think it matters.  We don't really expect to hit this.
>
> I tend to think maybe showing at least the OID in the error message
> doesn't hurt, but maybe we don't need to.

Since we don't even know of a situation in which this error message
would be raised, I'm hardly bothered by failing to print the OID.  If
any users complain, we can add more detail.

> I've fixed the comment as:
>
> -	/* Ignore the root ancestor, because ...?? */
> +	/* Root ancestor's triggers will be processed. */

Okay, included that.

> A description of what we're looking for with this code is in the
> comment above the loop:
>
>     /*
>      * For any foreign keys that point directly into a non-root ancestors of
>      * the source partition,...
>
> So finding triggers in those non-root ancestors whose function is
> RI_TRIGGER_PK tells us that those relations have foreign keys pointing
> into it or that it is the PK table in that relationship.  Other than
> the comment, the code itself seems self-documenting with regard to
> what's being done (given the function/macro/variable naming), so maybe
> we're better off without additional commentary here.

Yeah, WFM.

> I've attached a delta patch on v16 for the above comment and a couple
> of other changes.

Merged that in, and pushed.  I made a couple of wording changes in
comments here and there as well.

I lament the fact that this fix is not going to hit Postgres 12-14, but
the ratio of effort to reward seems a bit too high.  I think we could
backpatch the two involved commits if someone is motivated enough to
verify everything and come up with solutions for the necessary ABI
changes.

Thank you, Amit, for your perseverance in getting this bug fixed!

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On 2022-Mar-04, Michael Paquier wrote:

> d6d317d has solved the issue of tablespace paths across multiple nodes
> with the new GUC called allow_in_place_tablespaces, and is getting
> successfully used in the recovery tests as of 027_stream_regress.pl.

OK, but that means that the test suite is now not backpatchable.  The
implication here is that either we're going to commit the fix without
any tests at all on older branches, or that we're going to fix it only
in branch master.  Are you thinking that it's okay to leave this bug
unfixed in older branches?  That seems embarrassing.

-- 
Álvaro Herrera           PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
no me acuerdo."                 (Augusto Pinochet a una corte de justicia)
Re: Column Filtering in Logical Replication
Hello,

Please add me to the list of authors of this patch.  I made a large
number of nontrivial changes to it early on.  Thanks.

I have modified the entry in the CF app.  (It sorts alphabetically; it
was not my intention to put my name first.)

-- 
Álvaro Herrera           PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
I had a look at this latest version of the patch, and found some things
to tweak.  Attached is v21 with three main changes from Kyotaro's v20:

1. The XLogFlush is only done if consistent state has not been reached.
   As I understand, it's not needed in normal mode.  (In any case, if
   we do call XLogFlush in normal mode, what it does is not advance the
   recovery point, so the comment would be incorrect.)

2. Use %u to print OIDs rather than %d.

3. I changed the warning message wording to this:

+		ereport(WARNING,
+				(errmsg("skipping replay of database creation WAL record"),
+				 errdetail("The source database directory \"%s\" was not found.",
+						   src_path),
+				 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));

I also renamed the function XLogReportMissingDir to
XLogRememberMissingDir (which matches the "forget" part) and changed the
DEBUG2 messages in that function to DEBUG1 (all the calls in other
functions remain DEBUG2, because ISTM they are not as interesting).

Finally, I made the TAP test search for the WARNING line in the log.

-- 
Álvaro Herrera           48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso" (Carlos Caszeli)

>From 6a6fc73a93768a44ec026720c115f77c67d5cda2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 21 Mar 2022 12:34:34 +0100
Subject: [PATCH v21] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the directories
could be legitimately missing.  Consider a sequence of WAL records as
follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.
This patch adds a mechanism similar to the invalid-page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at the
end of crash recovery, the standby can safely enter archive recovery.

Diagnosed-by: Paul Guo
Author: Paul Guo
Author: Kyotaro Horiguchi
Author: Asim R Praveen
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27dn5xwj2p9-roh16e4...@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c   |   6 +
 src/backend/access/transam/xlogutils.c      | 159 +++-
 src/backend/commands/dbcommands.c           |  57 +++
 src/backend/commands/tablespace.c           |  17 +++
 src/include/access/xlogutils.h              |   4 +
 src/test/recovery/t/029_replay_tsp_drops.pl |  67 +
 src/tools/pgindent/typedefs.list            |   2 +
 7 files changed, 311 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..f48d8d51fb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 	 */
 	XLogCheckInvalidPages();
 
+	/*
+	 * Check if the XLOG sequence contained any unresolved references to
+	 * missing directories.
+	 */
+	XLogCheckMissingDirs();
+
 	reachedConsistency = true;
 	ereport(LOG,
 			(errmsg("consistent recovery state reached at %X/%X",

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..8c1b8216be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -54,6 +54,164 @@
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode?
 Only valid in startup process, see xlogutils.h */
 HotStandbyState standbyState = STANDBY_DISABLED;
 
+
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens
+ * when the directories are removed by replay of subsequent drop records.
+ * Note that this problem happens only on standby and not on master.  On
+ * master, a checkpoint is created at the end of the create database
+ * operation.  On standby, however, such a strategy (creating restart points
+ * during replay) is not viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to the invalid page table above, the expectation is that each
+ * missing di
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On 2022-Mar-14, Robert Haas wrote: > 2. Why not instead change the code so that the operation can succeed, > by creating the prerequisite parent directories? Do we not have enough > information for that? I'm not saying that we definitely should do it > that way rather than this way, but I think we do take that approach in > some cases. It seems we can choose freely between these two implementations -- I mean I don't see any upsides or downsides to either one. The current one has the advantage that it never makes the datadir "dirty", to use Kyotaro's term. It verifies that the creation/drop form a pair. A possible downside is that if there's a bug, we could end up with a spurious PANIC at the end of recovery, and no way to recover. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: LogwrtResult contended spinlock
On 2022-Mar-21, Andres Freund wrote: > This patch currently doesn't apply: http://cfbot.cputube.org/patch_37_2716.log Updated. > Are you aiming this for v15? Otherwise I'd like to move the entry to the next > CF. Marked as waiting-on-author. I'd like to get 0001 pushed to pg15, yes. I'll let 0002 sit here for discussion, but I haven't seen any evidence that we need it. If others vouch for it, I can push that one too, but I'd rather have it be a separate thing. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El miedo atento y previsor es la madre de la seguridad" (E. Burke) >From 4d5a8d915b497b79bbe62e18352f7b9d8c1b9bca Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v6 1/2] Change XLogCtl->LogwrtResult to use atomic ops MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, access to LogwrtResult is protected by a spinlock. This becomes severely contended in some scenarios, such as with a largish replication flock: walsenders all calling GetFlushRecPtr repeatedly cause the processor heat up to the point where eggs can be fried on top. This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult with a struct containing a pair of atomically accessed variables. In addition, this commit splits the process-local copy of these values (kept in the freestanding LogwrtResult struct) into two separate LogWriteResult and LogFlushResult. This is not strictly necessary, but it makes it clearer that these are updated separately, each on their own schedule. 
Author: Álvaro Herrera Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 195 +++--- src/include/port/atomics.h| 29 + src/tools/pgindent/typedefs.list | 1 + 3 files changed, 125 insertions(+), 100 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4ac3871c74..311a8a1192 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -285,16 +285,14 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * LogWrtResult indicates the byte positions we have already written/fsynced. + * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * each member of LogwrtResult (LogWriteResult and LogFlushResult), each of + * which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. 
@@ -317,18 +315,18 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 of write position */ + pg_atomic_uint64 Flush; /* last byte + 1 of flush position */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ XLogRecPtr Flush; /* last byte + 1 to flush */ } XLogwrtRqst; -typedef struct XLogwrtResult -{ - XLogRecPtr Write; /* last byte + 1 written out */ - XLogRecPtr Flush; /* last byte + 1 flushed */ -} XLogwrtResult; - /* * Inserting to WAL is protected by a small fixed number of WAL insertion * locks. To insert to the WAL, you must hold one of the locks - it doesn't @@ -480,6 +478,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -497,12 +496,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -622,10
Re: SQL/JSON: JSON_TABLE
On 2022-Mar-22, Andrew Dunstan wrote: > I'm planning on pushing the functions patch set this week and json-table > next week. I think it'd be a good idea to push the patches one by one and let a day between each. That would make it easier to chase buildfarm issues individually, and make sure they are fixed before the next step. Each patch in each series is already big enough. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: LogwrtResult contended spinlock
On 2022-Mar-22, Tom Lane wrote: > I looked briefly at 0001, and I've got to say that I disagree with > your decision to rearrange the representation of the local LogwrtResult > copy. It clutters the patch tremendously and makes it hard to > understand what the actual functional change is. Moreover, I'm > not entirely convinced that it's a notational improvement in the > first place. > > Perhaps it'd help if you split 0001 into two steps, one to do the > mechanical change of the representation and then a second patch that > converts the shared variable to atomics. Since you've moved around > the places that read the shared variable, that part is subtler than > one could wish and really needs to be studied on its own. Hmm, I did it the other way around: first change to use atomics, then the mechanical change. I think that makes the usefulness of the change more visible, because before the atomics use the use of the combined struct as a unit remains sensible. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ Officer Krupke, what are we to do? Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke") >From dd9b53878faeedba921ea7027e98ddbb433e8647 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v7 1/3] Change XLogCtl->LogwrtResult to use atomic ops MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, access to LogwrtResult is protected by a spinlock. This becomes severely contended in some scenarios, such as with a largish replication flock: walsenders all calling GetFlushRecPtr repeatedly cause the processor heat up to the point where eggs can be fried on top. This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult with a struct containing a pair of atomically accessed variables. 
In addition, this commit splits the process-local copy of these values (kept in the freestanding LogwrtResult struct) into two separate LogWriteResult and LogFlushResult. This is not strictly necessary, but it makes it clearer that these are updated separately, each on their own schedule. Author: Álvaro Herrera Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 85 +++ src/include/port/atomics.h| 29 +++ src/tools/pgindent/typedefs.list | 1 + 3 files changed, 72 insertions(+), 43 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4ac3871c74..6f2eb494fe 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -285,16 +285,13 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * LogWrtResult indicates the byte positions we have already written/fsynced. + * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. 
* * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -317,6 +314,12 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 written out */ + pg_atomic_uint64 Flush; /* last byte + 1 flushed */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ @@ -480,6 +483,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -497,12 +501,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtRes
Re: LogwrtResult contended spinlock
So I've been wondering about this block at the bottom of XLogWrite: /* * Make sure that the shared 'request' values do not fall behind the * 'result' values. This is not absolutely essential, but it saves some * code in a couple of places. */ { SpinLockAcquire(&XLogCtl->info_lck); if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write) XLogCtl->LogwrtRqst.Write = LogwrtResult.Write; if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush) XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush; SpinLockRelease(&XLogCtl->info_lck); } I just noticed that my 0001 makes the comment a lie: it is now quite possible that 'result' is advanced beyond 'request'. Before the patch that never happened because they were both advanced in the region locked by the spinlock. I think we could still maintain this promise if we just moved this entire block before the first pg_atomic_monotonic_advance_u64 setting XLogCtl->LogwrtResult.Write. Or we could halve the whole block, and put one acquire/test/set/release stanza before each monotonic increase of the corresponding variable. However, I wonder if this is still necessary. This code was added in 4d14fe0048c (March 2001) and while everything else was quite different back then, this hasn't changed at all. I can't quite figure out what are those "couple of places" that would need additional code if this block is just removed. I tried running the tests (including wal_consistency_checking), and nothing breaks. Reading the code surrounding the other accesses of XLogCtl->LogwrtRqst, there's nothing that looks to me like it depends on these values not lagging behind LogwrtResult. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint)
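A standalone model may help illustrate the monotonic-advance primitive this discussion revolves around. The sketch below uses C11 stdatomic as a stand-in for PostgreSQL's pg_atomic_uint64; the function name mirrors the patch's pg_atomic_monotonic_advance_u64, but the body here is an illustrative assumption, not the committed implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the monotonic-advance primitive: advance *target to newval
 * only if newval is larger, and return whatever value ends up stored.
 * Concurrent updaters can therefore never move the pointer backwards.
 */
static uint64_t
monotonic_advance_u64(_Atomic uint64_t *target, uint64_t newval)
{
	uint64_t	currval = atomic_load(target);

	while (currval < newval)
	{
		/* On failure, currval is reloaded with the now-current value. */
		if (atomic_compare_exchange_weak(target, &currval, newval))
			return newval;
	}
	return currval;				/* someone else already advanced past newval */
}
```

Because the compare-and-swap succeeds only while the stored value is still smaller, backends can race to publish their Write/Flush positions without a spinlock, and the shared pointers can only move forward.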
Re: Column Filtering in Logical Replication
On 2022-Mar-19, Tomas Vondra wrote: > @@ -174,7 +182,13 @@ ALTER PUBLICATION noinsert SET (publish = 'update, > delete'); > > Add some tables to the publication: > > -ALTER PUBLICATION mypublication ADD TABLE users, departments; > +ALTER PUBLICATION mypublication ADD TABLE users (user_id, firstname), > departments; > + > + > + > + Change the set of columns published for a table: > + > +ALTER PUBLICATION mypublication SET TABLE users (user_id, firstname, > lastname), TABLE departments; > > > Hmm, it seems to me that if you've removed the feature to change the set of columns published for a table, then the second example should be removed as well. > +/* > + * Transform the publication column lists expression for all the relations > + * in the list. > + * > + * XXX The name is a bit misleading, because we don't really transform > + * anything here - we merely check the column list is compatible with the > + * definition of the publication (with publish_via_partition_root=false) > + * we only allow column lists on the leaf relations. So maybe rename it? > + */ > +static void > +TransformPubColumnList(List *tables, const char *queryString, > +bool pubviaroot) > +{ I agree with renaming this function. Maybe CheckPubRelationColumnList() ? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "This is a foot just waiting to be shot"(Andrew Dunstan)
Re: ExecRTCheckPerms() and many prunable partitions
On 2022-Mar-23, Amit Langote wrote: > As the changes being made with the patch are non-trivial and the patch > hasn't been reviewed very significantly since Alvaro's comments back > in Sept 2021 which I've since addressed, I'm thinking of pushing this > one into the version 16 dev cycle. Let's not get ahead of ourselves. The commitfest is not yet over. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La virtud es el justo medio entre dos defectos" (Aristóteles)
Re: Column Filtering in Logical Replication
On 2022-Mar-23, Amit Kapila wrote: > On Wed, Mar 23, 2022 at 12:54 AM Alvaro Herrera > wrote: > > > > On 2022-Mar-19, Tomas Vondra wrote: > > > > > @@ -174,7 +182,13 @@ ALTER PUBLICATION noinsert SET (publish = 'update, > > > delete'); > > > > > > Add some tables to the publication: > > > > > > -ALTER PUBLICATION mypublication ADD TABLE users, departments; > > > +ALTER PUBLICATION mypublication ADD TABLE users (user_id, firstname), > > > departments; > > > + > > > + > > > + > > > + Change the set of columns published for a table: > > > + > > > +ALTER PUBLICATION mypublication SET TABLE users (user_id, firstname, > > > lastname), TABLE departments; > > > > > > > > > > > > > Hmm, it seems to me that if you've removed the feature to change the set > > of columns published for a table, then the second example should be > > removed as well. > > As per my understanding, the removed feature is "Alter Publication ... > Alter Table ...". The example here "Alter Publication ... Set Table > .." should still work as mentioned in my email[1]. Ah, I see. Yeah, that makes sense. In that case, the leading text seems a bit confusing. I would suggest "Change the set of tables in the publication, specifying a different set of columns for one of them:" I think it would make the example more useful if the table for which the columns are changing is a different one. Maybe do this: Add some tables to the publication: -ALTER PUBLICATION mypublication ADD TABLE users, departments; +ALTER PUBLICATION mypublication ADD TABLE users (user_id, firstname), departments; + + + + Change the set of tables in the publication, keeping the column list + in the users table and specifying a different column list for the + departments table.
Note that previously published tables not mentioned + in this command are removed from the publication: + + +ALTER PUBLICATION mypublication SET TABLE users (user_id, firstname), TABLE departments (dept_id, deptname); so that it is clear that if you want to keep the column list unchanged in one table, you are forced to specify it again. (Frankly, this ALTER PUBLICATION SET command seems pretty useless from a user PoV.) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Investigación es lo que hago cuando no sé lo que estoy haciendo" (Wernher von Braun)
Re: prevent immature WAL streaming
On 2021-Sep-03, Alvaro Herrera wrote: > The last commit is something I noticed in pg_rewind ... I had missed this one; it's pushed now. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "I can see support will not be a problem. 10 out of 10."(Simon Wittber) (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
Re: [DOC] Document concurrent index builds waiting on each other
On 2020-Sep-30, Michael Paquier wrote: > + > + CREATE INDEX (including the > CONCURRENTLY > + option) commands are included when VACUUM calculates > what > + dead tuples are safe to remove even on tables other than the one being > indexed. > + > FWIW, this is true as well for REINDEX CONCURRENTLY because both use > the same code paths for index builds and validation, with basically > the same waiting phases. But is CREATE INDEX the correct place for > that? Wouldn't it be better to tell about such things on the VACUUM > doc? Yeah, I think it might be more sensible to document this in maintenance.sgml, as part of the paragraph that discusses removing tuples "to save space". But making it inline with the rest of the flow, it seems to distract from higher-level considerations, so I suggest to make it a footnote instead. I'm not sure on the wording to use; what about this? >From 6c9ad72e4e61dbf05f34146cb67706dd675a38f0 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 30 Nov 2020 18:50:06 -0300 Subject: [PATCH v5] Note CIC and RC in vacuum's doc Per James Coleman --- doc/src/sgml/maintenance.sgml | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 4d8ad754f8..d68d7f936e 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -159,7 +159,17 @@ concurrency control (MVCC, see ): the row version must not be deleted while it is still potentially visible to other transactions. But eventually, an outdated or deleted row version is no -longer of interest to any transaction. The space it occupies must then be +longer of interest to any transaction. + + + Note that VACUUM needs to retain tuples that are + nominally visible to transactions running + CREATE INDEX CONCURRENTLY or + REINDEX CONCURRENTLY, even when run on tables + other than the tables being indexed. 
+ + +The space it occupies must then be reclaimed for reuse by new rows, to avoid unbounded growth of disk space requirements. This is done by running VACUUM. -- 2.20.1
Re: runtime error copying oids field
On 2020-Nov-30, Zhihong Yu wrote: > This was the line runtime error was raised: > > memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts); > > From RelationBuildPartitionDesc we can see that: > > if (nparts > 0) > { > PartitionBoundInfo boundinfo; > int*mapping; > int next_index = 0; > > result->oids = (Oid *) palloc0(nparts * sizeof(Oid)); > > The cause was oids field was not assigned due to nparts being 0. > This is verified by additional logging added just prior to the memcpy call. > > I want to get the community's opinion on whether a null check should be > added prior to the memcpy() call. As far as I understand, we do want to avoid memcpy's of null pointers; see [1]. In this case I think it'd be sane to skip the complete block, not just the memcpy, something like diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index ca24620fd0..d35deb433a 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId, if (partitioned) { + PartitionDesc partdesc; + /* * Unless caller specified to skip this step (via ONLY), process each * partition to make sure they all contain a corresponding index. * * If we're called internally (no stmt->relation), recurse always. */ - if (!stmt->relation || stmt->relation->inh) + partdesc = RelationGetPartitionDesc(rel); + if ((!stmt->relation || stmt->relation->inh) && partdesc->nparts > 0) { - PartitionDesc partdesc = RelationGetPartitionDesc(rel); int nparts = partdesc->nparts; Oid*part_oids = palloc(sizeof(Oid) * nparts); boolinvalidate_parent = false; [1] https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com
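For context, the sanitizer complains because passing a null pointer to memcpy is undefined behavior even when the byte count is zero. A minimal standalone illustration of the guard follows; the function name and Oid typedef are stand-ins (the actual fix above skips the whole DefineIndex block when partdesc->nparts is 0):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/*
 * Guarded copy: memcpy with a NULL source is undefined behavior even
 * for a zero length, so skip the copy entirely when there are no
 * partitions (src is NULL exactly when nparts == 0).
 */
static void
copy_part_oids(Oid *dst, const Oid *src, size_t nparts)
{
	if (nparts > 0)
		memcpy(dst, src, sizeof(Oid) * nparts);
}
```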
Re: error_severity of brin work item
The more I look at this, the less I like it. This would set a precedent that any action that can be initiated from an autovac work-item has a requirement of silently being discarded when it references a non-existent relation. I'd rather have the code that drops the index go through the list of work-items and delete those that reference that relation. I'm not sure if this is something that ought to be done in index_drop(); one objection might be that if the drop is rolled back, the work-items are lost. It's the easiest, though; and work-items are supposed to be lossy anyway, and vacuum would fix the lack of summarization eventually. So, not pretty, but not all that bad. (Hopefully rolled-back drops are not all that common.)
Re: Huge memory consumption on partitioned table with FKs
On 2020-Nov-26, Kyotaro Horiguchi wrote: > This shares RI_ConstraintInfo cache by constraints that shares the > same parent constraints. But you forgot that the cache contains some > members that can differ among partitions. > > Consider the case of attaching a partition that have experienced a > column deletion. I think this can be solved easily in the patch, by having ri_BuildQueryKey() compare the child constraint's fk_attnums to the parent's; if they are equal then use the parent's constraint_id, otherwise use the child constraint. That way, the cache entry is reused in the common case where they are identical. I would embed all this knowledge in ri_BuildQueryKey though, without adding the new function ri_GetParentConstOid. I don't think that function adds meaningful abstraction value, and instead it would make what I suggest more difficult.
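A hypothetical sketch of the suggested behavior, with invented struct and function names (the real RI_ConstraintInfo is far more elaborate): use the parent constraint's OID as the cache key only when the attribute arrays match, so partitions that diverged (e.g. after a dropped column) get their own entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef unsigned int Oid;
typedef int16_t AttrNumber;

/* Simplified stand-in for the relevant RI_ConstraintInfo fields. */
typedef struct ConstraintInfoModel
{
	Oid			constraint_id;
	int			nkeys;
	AttrNumber	fk_attnums[8];
} ConstraintInfoModel;

/*
 * Pick the OID used as the query-plan cache key: share the parent's
 * entry when fk_attnums are identical, otherwise keep the child
 * constraint separate so differing attribute mappings don't collide.
 */
static Oid
choose_cache_key(const ConstraintInfoModel *child,
				 const ConstraintInfoModel *parent)
{
	if (child->nkeys == parent->nkeys)
	{
		bool		same = true;

		for (int i = 0; i < child->nkeys; i++)
			if (child->fk_attnums[i] != parent->fk_attnums[i])
				same = false;
		if (same)
			return parent->constraint_id;	/* reuse the shared entry */
	}
	return child->constraint_id;	/* attnums differ: cache separately */
}
```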
Re: runtime error copying oids field
On 2020-Nov-30, Zhihong Yu wrote: > Alvaro, et al: > Please let me know how to proceed with the patch. > > Running test suite with the patch showed no regression. That's good to hear. I'll get it pushed today. Thanks for following up.
Re: error_severity of brin work item
On 2020-Nov-30, Justin Pryzby wrote: > On Mon, Nov 30, 2020 at 08:47:32PM -0300, Alvaro Herrera wrote: > > The more I look at this, the less I like it. This would set a precedent > > that any action that can be initiated from an autovac work-item has a > > requirement of silently being discarded when it references a > > non-existent relation. > > My original request was to change to INFO, which is what vacuum proper does at > vacuum_open_relation(). I realize that still means that new work item > handlers > would have to do the LockOid, try_open dance - maybe it could be factored out. As I understand, INFO is not well suited to messages that are not going to the client. Anyway, my point is about the contortions that are needed to support the case, rather than what exactly we do when it happens. Retrospectively, it's strange that this problem (what happens when indexes with pending work-items are dropped) hadn't manifested. It seems a pretty obvious one. > > I'd rather have the code that drops the index go through the list of > > work-items and delete those that reference that relation. > > > > I'm not sure if this is something that ought to be done in index_drop(); > > One objection might be that if the drop is rolled back, the work-items > > are lost. > > Should it be done in an AtCommit hook or something like that ? I didn't like this idea much on first read, on extensibility grounds, but perhaps it's not so bad because we can generalize it whenever there's pressure to add a second type of work-item (*if* that ever happens). I suppose the idea is that index_drop saves the index OID when a BRIN index with autosummarization=on is dropped, and then the AtCommit_WorkItems() call scans the items list and drops those that match any OIDs in the list. (It does require being mindful of subxact aborts, of course.)
Re: runtime error copying oids field
On 2020-Dec-01, Alvaro Herrera wrote: > On 2020-Nov-30, Zhihong Yu wrote: > > > Alvaro, et al: > > Please let me know how to proceed with the patch. > > > > Running test suite with the patch showed no regression. > > That's good to hear. I'll get it pushed today. Thanks for following > up. Done, thanks for reporting this.
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Hi Justin, Thanks for all the comments. I'll incorporate everything and submit an updated version later. On 2020-Nov-30, Justin Pryzby wrote: > On Tue, Nov 03, 2020 at 08:56:06PM -0300, Alvaro Herrera wrote: > > +++ b/src/bin/psql/describe.c > > - printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s"), > > parent_name, > > - partdef); > > + printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s%s"), > > parent_name, > > + partdef, > > + strcmp(detached, "t") > > == 0 ? " DETACHED" : ""); > > The attname "detached" is a stretch of what's intuitive (it's more like > "detachING" or half-detached). But I think psql should for sure show > something > more obvious to users. Esp. seeing as psql output isn't documented. Let's > figure out what to show to users and then maybe rename the column that, too. OK. I agree that "being detached" is the state we want users to see, or maybe "detach pending", or "unfinisheddetach" (ugh). I'm not sure that "pg_inherits.inhbeingdetached" is a great column name. Opinions welcome. > > +PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL) > > Instead of finalize .. deferred ? Or ?? Well, I'm thinking that this has to be a verb in the imperative mood. The user is commanding the server to "finalize this detach operation". I'm not sure that DEFERRED fits that grammatical role. If there are other ideas, let's discuss them. ALTER TABLE tst DETACH PARTITION tst_1 FINALIZE <-- decent ALTER TABLE tst DETACH PARTITION tst_1 COMPLETE <-- I don't like it ALTER TABLE tst DETACH PARTITION tst_1 DEFERRED <-- grammatically faulty? > ATExecDetachPartition: > Doesn't this need to lock the table before testing for default partition ? Correct, it does.
> I ended up with apparently broken constraint when running multiple loops > around > a concurrent detach / attach: > > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR VALUES > FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; do :; > done& > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR VALUES > FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; do :; > done& > > "p1_check" CHECK (true) > "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2) Not good.
Re: error_severity of brin work item
On 2020-Dec-01, Justin Pryzby wrote: > This was an idea I made up - I don't know any of the details of this, but if > you give a hint I could look at it more. There'd (still) be a race window, > but > I think that's ok. See CommitTransaction() and friends, where AtEOXact_on_commit_actions() and others are called. You'd have to create a new routine (say AtEOXact_Autovacuum or more specific AtEOXact_AutovacuumWorkItems), to be called at the right places in xact.c. Keep a global variable, say a list of OIDs. On subxact commit, the list is reassigned to its parent transaction; on subxact abort, the list is discarded. On top xact commit, the list of OIDs is passed to some new routine in autovacuum.c that scans the workitem array and deletes items as appropriate. Not sure what's a good place for OIDs to be added to the list. We don't have AM-specific entry points for relation drop. I think this is the weakest point of this. > Another idea is if perform_work_item() were responsible for discarding > relations which disappear. Currently it does this, which is racy since it > holds no lock. That has the property that it remains contained in autovacuum.c, but no other advantages I think.
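The bookkeeping described above can be modeled as one OID list per (sub)transaction nesting level: sub-commit hands the list up to the parent level, sub-abort discards it, and top-level commit hands the surviving OIDs to the work-item array. All names below are invented for illustration; this is a toy model of the proposed mechanics, not PostgreSQL code:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

#define MAX_NEST	8
#define MAX_OIDS	16

/* One list of dropped-index OIDs per nesting level; level 1 is the
 * top-level transaction, as in PostgreSQL. Bounds checks omitted. */
static Oid	dropped[MAX_NEST][MAX_OIDS];
static int	ndropped[MAX_NEST];

static void
remember_dropped_index(int nestlevel, Oid indexoid)
{
	dropped[nestlevel][ndropped[nestlevel]++] = indexoid;
}

/* Subtransaction commit: reassign the list to the parent level. */
static void
at_subcommit(int nestlevel)
{
	for (int i = 0; i < ndropped[nestlevel]; i++)
		dropped[nestlevel - 1][ndropped[nestlevel - 1]++] =
			dropped[nestlevel][i];
	ndropped[nestlevel] = 0;
}

/* Subtransaction abort: the drop never happened, forget the OIDs. */
static void
at_subabort(int nestlevel)
{
	ndropped[nestlevel] = 0;
}

/* Top-level commit: return how many OIDs would go to autovacuum for
 * work-item deletion, and reset the list. */
static int
at_commit_workitems(void)
{
	int			n = ndropped[1];

	ndropped[1] = 0;
	return n;
}
```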
Re: room for improvement in amcheck btree checking?
On 2020-Dec-01, Mark Dilger wrote: > 7) Run a SQL query that uses an index scan on the table and see that it > errors with something like: > >ERROR: could not read block 0 in file "base/13097/16391": read only 0 of > 8192 bytes > > I found it surprising that even when precisely zero of the tids in the > index exist in the table the index checks all come back clean. Yeah, I've seen this kind of symptom in production databases (indexes pointing to non-existent heap pages). I think one useful cross-check that amcheck could do is verify that if a heap page is referenced from the index, then the heap page must exist. Otherwise, it's a future index corruption of sorts: the old index entries will point to the wrong new heap tuples as soon as the table grows again.
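The proposed cross-check boils down to comparing each heap block number extracted from the index's TIDs against the heap's current length (what RelationGetNumberOfBlocks() would return). A simplified standalone sketch, with the heap length as a plain parameter and the function name invented:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Count index TIDs whose heap block lies past the end of the heap.
 * Each such TID is an entry that would produce "could not read block"
 * errors now -- or silently match unrelated tuples once the table
 * grows back over that block range.
 */
static int
count_dangling_tids(const BlockNumber *tid_blocks, size_t ntids,
					BlockNumber heap_nblocks)
{
	int			dangling = 0;

	for (size_t i = 0; i < ntids; i++)
		if (tid_blocks[i] >= heap_nblocks)
			dangling++;
	return dangling;
}
```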
Re: [DOC] Document concurrent index builds waiting on each other
On 2020-Nov-30, James Coleman wrote: > On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera > wrote: > > > > On 2020-Sep-30, Michael Paquier wrote: > > Yeah, I think it might be more sensible to document this in > > maintenance.sgml, as part of the paragraph that discusses removing > > tuples "to save space". But making it inline with the rest of the flow, > > it seems to distract from higher-level considerations, so I suggest to > > make it a footnote instead. > > I have mixed feelings about wholesale moving it; users aren't likely > to read the vacuum doc when considering how running CIC might impact > their system, though I do understand why it otherwise fits there. Makes sense. ISTM that if we want to have a cautionary blurb in the CIC docs, it should go in REINDEX CONCURRENTLY as well. > > I'm not sure on the wording to use; what about this? > > The wording seems fine to me. Great, thanks. > This is a replacement for what was 0002 earlier? And 0001 from earlier > still seems to be a useful standalone patch? 0001 is the one that I got pushed yesterday, I think -- correct? src/tools/git_changelog says: Author: Alvaro Herrera Branch: master [58ebe967f] 2020-11-30 18:24:55 -0300 Branch: REL_13_STABLE [3fe0e7c3f] 2020-11-30 18:24:55 -0300 Branch: REL_12_STABLE [b2603f16a] 2020-11-30 18:24:55 -0300 Branch: REL_11_STABLE [ed9c9b033] 2020-11-30 18:24:55 -0300 Branch: REL_10_STABLE [d3bd36a63] 2020-11-30 18:24:55 -0300 Branch: REL9_6_STABLE [b3d33bf59] 2020-11-30 18:24:55 -0300 Branch: REL9_5_STABLE [968a537b4] 2020-11-30 18:24:55 -0300 Document concurrent indexes waiting on each other Because regular CREATE INDEX commands are independent, and there's no logical data dependency, it's not immediately obvious that transactions held by concurrent index builds on one table will block the second phase of concurrent index creation on an unrelated table, so document this caveat. Backpatch this all the way back. In branch master, mention that only some indexes are involved.
Author: James Coleman Reviewed-by: David Johnston Discussion: https://postgr.es/m/CAAaqYe994=purn8cjz4ueo_s-ffrr_3ogeryhtdghab2wg_...@mail.gmail.com
Re: wrong link in acronyms.sgml
On 2020-Dec-02, Erik Rijkers wrote: > Hi > > I just noticed that in > > https://www.postgresql.org/docs/13/acronyms.html > (i.e., doc/src/sgml/acronyms.sgml) > > there is under lemma 'HOT' a link with URL: > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/README.HOT;hb=HEAD > > Is this deliberate? Surely pointing 13-docs into HEAD-docs is wrong? Yeah, I noticed this while working on the glossary. This link works: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/README.HOT;hb=REL_13_STABLE but the problem with this one is that we'll have to update once per release. > Otherwise it may be better to remove the link altogether and just have the > acronym lemma: 'HOT' and explanation 'Heap-Only Tuple'. Yeah, I don't think pointing to the source-code README file is a great resource. I think starting with 13 we should add an entry in the glossary for Heap-Only Tuple, and make the acronym entry point to the glossary. In fact, we could do this with several other acronyms too, such as DDL, DML, GEQO, LSN, MVCC, SQL, TID, WAL. The links to external resources can then be put in the glossary definition, if needed.
Re: convert elog(LOG) calls to ereport
On 2020-Dec-02, Peter Eisentraut wrote:

> There are a number of elog(LOG) calls that appear to be user-facing, so
> they should be ereport()s.  This patch changes them.  There are more
> elog(LOG) calls remaining, but they all appear to be some kind of
> debugging support.  Also, I changed a few elog(FATAL)s that were nearby,
> but I didn't specifically look for them.
>
> -	elog(LOG, "WSAIoctl(SIO_KEEPALIVE_VALS) failed: %ui",
> -		 WSAGetLastError());
> +	ereport(LOG,
> +			(errmsg("WSAIoctl(SIO_KEEPALIVE_VALS) failed: %ui",
> +					WSAGetLastError())));

Please take the opportunity to move the flag name out of the message in
this one, also.  I do wonder if it'd be a good idea to move the syscall
name itself out of the message, too; that would reduce the number of
messages to translate ~50x, to just "%s(%s) failed: %m" instead of one
message per distinct syscall.

Should fd.c messages do errcode_for_file_access() like elsewhere?

Overall, it looks good to me.  Thanks
Re: Huge memory consumption on partitioned table with FKs
Hello I haven't followed this thread's latest posts, but I'm unclear on the lifetime of the new struct that's being allocated in TopMemoryContext. At what point are those structs freed? Also, the comment that was in RI_ConstraintInfo now appears in RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now undocumented. What is the relationship between those two structs? I see that they have pointers to each other, but I think the relationship should be documented more clearly. Thanks!
Re: Autovacuum on partitioned table (autoanalyze)
Hello Yuzuko,

On 2020-Dec-02, yuzuko wrote:

> The problem Horiguchi-san mentioned is as follows:
> [explanation]

Hmm, I see.  So the problem is that if some ancestor is analyzed first,
then an analyze of one of its partitions will cause a redundant analyze
of the ancestor, because the number of tuples that is propagated from
the partition represents a set that had already been included in the
ancestor's analysis.

If the problem was just that, then I think it would be very simple to
solve: just make sure to sort the tables to vacuum so that all leaves
are vacuumed first, and then all ancestors, sorted from the bottom up.
Problem solved.

But I'm not sure that that's the whole story, for two reasons: one, two
workers can run simultaneously, where one analyzes the partition and the
other analyzes the ancestor.  Then the order is not guaranteed (and each
process will get no effect from remembering whether it did that one or
not).  Second, manual analyzes can occur in any order.

Maybe it's more useful to think about this in terms of remembering that
partition P had changed_tuples set to N when we analyzed ancestor A.
Then, when we analyze partition P, we send the message listing A as
ancestor; on receipt of that message, we see M+N changed tuples in P,
but we know that we had already seen N, so we only record M.

I'm not sure how to implement this idea however, since on analyze of
ancestor A we don't have the list of partitions, so we can't know the N
for each partition.
Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On 2020-Dec-04, Dmitry Dolgov wrote: > * This one is mostly for me to understand. There are couple of places > with a commentary that 'PROC_IN_SAFE_IC is not necessary, because the > transaction only takes a snapshot to do some catalog manipulation'. > But for some of them I don't immediately see in the relevant code > anything related to snapshots. E.g. one in DefineIndex is followed by > WaitForOlderSnapshots (which seems to only do waiting, not taking a > snapshot), index_set_state_flags and CacheInvalidateRelcacheByRelid. > Is taking a snapshot hidden somewhere there inside? Well, they're actually going to acquire an Xid, not a snapshot, so the comment is slightly incorrect; I'll fix it, thanks for pointing that out. The point stands: because those transactions are of very short duration (at least of very short duration after the point where the XID is obtained), it's not necessary to set the PROC_IN_SAFE_IC flag, since it won't cause any disruption to other processes. It is possible that I copied the wrong comment in DefineIndex. (I only noticed that one after I had mulled over the ones in ReindexRelationConcurrently, each of which is skipping setting the flag for slightly different reasons.)
Re: walsender bug: stuck during shutdown
On 2020-Nov-26, Fujii Masao wrote:

> Yes, so the problem here is that walsender goes into the busy loop
> in that case.  Seems this happens only in logical replication walsender.
> In physical replication walsender, WaitLatchOrSocket() in WalSndLoop()
> seems to work as expected and prevent it from entering into busy loop
> even in that case.
>
>     /*
>      * If postmaster asked us to stop, don't wait anymore.
>      *
>      * It's important to do this check after the recomputation of
>      * RecentFlushPtr, so we can send all remaining data before shutting
>      * down.
>      */
>     if (got_STOPPING)
>         break;
>
> The above code in WalSndWaitForWal() seems to cause this issue.  But
> I've not come up with an idea about how to fix it yet.

With DEBUG1 I observe that walsender is getting a lot of 'r' messages
(standby reply) with all zeroes:

  2020-12-01 21:01:24.100 -03 [15307] DEBUG:  write 0/0 flush 0/0 apply 0/0

However, while doing that I also observed that if I do send some
activity to the logical replication stream, with the provided program,
it will *still* have the 'write' pointer set to 0/0, and the 'flush'
pointer has moved forward to what was sent.  I'm not clear on what
causes the write pointer to move forward in logical replication.

Still, the previously proposed patch does resolve the problem in either
case.
Re: POC: Better infrastructure for automated testing of concurrency issues
On 2020-Nov-25, Alexander Korotkov wrote: > In the view of above, I'd like to propose a POC patch, which implements new > builtin infrastructure for reproduction of concurrency issues in automated > test suites. The general idea is so-called "stop events", which are > special places in the code, where the execution could be stopped on some > condition. Stop event also exposes a set of parameters, encapsulated into > jsonb value. The condition over stop event parameters is defined using > jsonpath language. +1 for the idea. I agree we have a need for something on this area; there are *many* scenarios currently untested because of the lack of what you call "stop points". I don't know if jsonpath is the best way to implement it, but at least it is readily available and it seems a decent way to go at it.
Re: Improper use about DatumGetInt32
On 2020-Dec-03, Peter Eisentraut wrote: > On 2020-11-30 16:32, Alvaro Herrera wrote: > > On 2020-Nov-30, Peter Eisentraut wrote: > > > > > Patch updated this way. I agree it's better that way. > > > > Thanks, LGTM. > > For a change like this, do we need to change the C symbol names, so that > there is no misbehavior if the shared library is not updated at the same > time as the extension is upgraded in SQL? Good question. One point is that since the changes to the arguments are just in the way we read the values from the Datum C-values, there's no actual ABI change. So if I understand correctly, there's no danger of a crash; there's just a danger of misinterpreting a value. I don't know if it's possible to determine (at function execution time) that we're running with the old extension version; if so it might suffice to throw a warning but still have the SQL function run the same C function. If we really think that we ought to differentiate, then we could do what pg_stat_statement does, and have a separate C function that's called with the obsolete signature (pg_stat_statements_1_8 et al).
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-Dec-04, Michael Paquier wrote: > VacuumOption does that since 6776142, and ClusterOption since 9ebe057, > so switching ReindexOption to just match the two others still looks > like the most consistent move. 9ebe057 goes to show why this is a bad idea, since it has this: +typedef enum ClusterOption +{ + CLUOPT_RECHECK, /* recheck relation state */ + CLUOPT_VERBOSE /* print progress info */ +} ClusterOption; and then you do things like + if ($2) + n->options |= CLUOPT_VERBOSE; and then tests like + if ((options & VACOPT_VERBOSE) != 0) Now if you were to ever define third and fourth values in that enum, this would immediately start malfunctioning. FWIW I'm with Peter on this.
Re: A few new options for CHECKPOINT
On the UI of this patch, you're proposing to add the option FAST.  I'm
not a fan of this option name and propose that (if we have it) we use
the name SPREAD instead (defaults to false).

Now we don't actually explain the term "spread" much in the
documentation; we just say "the writes are spread".  But it seems more
natural to build on that adjective rather than "fast/slow".

I think starting a spread checkpoint has some usefulness, if your
checkpoint interval is very large but your completion target is not
very close to 1.  In that case, you're expressing that you want a
checkpoint to start now and not impact production unduly, so that you
know when it finishes and therefore when it is a good time to start a
backup.  (You will still have some WAL to replay, but it won't be as
much as if you just ignored checkpoint considerations completely.)

On the subject of measuring replay times for backups taken while
pgbench is pounding the database, I think a realistic test does *not*
have pgbench running at top speed; rather you have some non-maximal
"-R xyz" option.  You would probably determine a value to use by
running without -R, observing what's a typical transaction rate, and
using some fraction (say, half) of that in the real run.
Re: Removal of operator_precedence_warning
On 2020-Dec-04, Tom Lane wrote: > I think it's time for $SUBJECT. We added this GUC in 9.5, which > will be EOL by the time of our next major release, and it was never > meant as more than a transitional aid. Moreover, it's been buggy > as heck (cf abb164655, 05104f693, 01e0cbc4f, 4cae471d1), and the > fact that some of those errors went undetected for years shows that > it's not really gotten much field usage. > > Hence, I propose the attached. Comments? I wonder if it'd be fruitful to ask the submitters of those bugs about their experiences with the feature. Did they find it useful in finding precedence problems in their code? Did they experience other problems that they didn't report? Reading the reports mentioned in those commits, it doesn't look like any of them were actually using the feature -- they all seem to have come across the problems by accidents of varying nature.
Re: A few new options for CHECKPOINT
On 2020-Dec-04, Bossart, Nathan wrote: > On 12/4/20, 1:47 PM, "Alvaro Herrera" wrote: > > On the UI of this patch, you're proposing to add the option FAST. I'm > > not a fan of this option name and propose that (if we have it) we use > > the name SPREAD instead (defaults to false). > > > > Now we don't actually explain the term "spread" much in the documentation; > > we just say "the writes are spread". But it seems more natural to build > > on that adjective rather than "fast/slow". > > Here is a version of the patch that uses SPREAD instead of FAST. WFM. Instead of adding checkpt_option_list, how about utility_option_list? It seems intended for reuse.
Re: Change definitions of bitmap flags to bit-shifting style
On 2020-Dec-05, Tom Lane wrote: > FWIW, personally I'd vote for doing the exact opposite. When you are > debugging and examining the contents of a bitmask variable, it's easier to > correlate a value like "0x03" with definitions made in the former style. > Or at least I think so; maybe others see it differently. The hexadecimal representation is more natural to me than bit-shifting, so I would prefer to use that style too. But maybe I'm trained to it because of looking at t_infomask symbols constantly.
Re: A few new options for CHECKPOINT
On 2020-Dec-05, Stephen Frost wrote: > So- just to be clear, CHECKPOINTs are more-or-less always happening in > PG, and running this command might do something or might end up doing > nothing depending on if a checkpoint is already in progress and this > request just gets consolidated into an existing one, and it won't > actually reduce the amount of WAL replay except in the case where > checkpoint completion target is set to make a checkpoint happen in less > time than checkpoint timeout, which ultimately isn't a great way to run > the system anyway. You keep making this statement, and I don't necessarily disagree, but if that is the case, please explain why don't we have checkpoint_completion_target set to 0.9 by default? Should we change that? > Assuming we actually want to do this, which I still generally don't > agree with since it isn't really clear if it'll actually end up doing > something, or not, wouldn't it make more sense to have a command that > just sits and waits for the currently running (or next) checkpoint to > complete..? For the use-case that was explained, at least, we don't > actually need to cause another checkpoint to happen, we just want to > know when a checkpoint has completed, right? Yes, I agree that the use case for this is unclear.
Re: Huge memory consumption on partitioned table with FKs
On 2020-Dec-07, Amit Langote wrote: > On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi > wrote: > > > Also, the comment that was in RI_ConstraintInfo now appears in > > > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now > > > undocumented. What is the relationship between those two structs? I > > > see that they have pointers to each other, but I think the relationship > > > should be documented more clearly. > > > > I'm not sure the footprint of this patch worth doing but here is a bit > > more polished version. > > I noticed that the foreign_key test fails and it may have to do with > the fact that a partition's param info remains attached to the > parent's RI_ConstraintInfo even after it's detached from the parent > table using DETACH PARTITION. I think this bit about splitting the struct is a distraction. Let's get a patch that solves the bug first, and then we can discuss what further refinements we want to do. I think we should get your patch in CA+HiwqEOrfN9b=f3sdmyspgc4go-l_vmfhxllxvmmdp34e6...@mail.gmail.com committed (which I have not read yet.) Do you agree with this plan?
Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently
Does "ALTER TABLE ONLY parent" work correctly? Namely, do not affect existing partitions, but cause future partitions to acquire the new setting. This sounds very much related to previous discussion on REPLICA IDENTITY not propagating to partitions; see https://postgr.es/m/201902041630.gpadougzab7v@alvherre.pgsql particularly Robert Haas' comments at http://postgr.es/m/CA+TgmoYjksObOzY8b1U1=bsp_m+702eof42fqrtqtyio1nu...@mail.gmail.com and downstream discussion from there.
Re: Commitfest statistics
On 2020-Dec-07, Tom Lane wrote: > Anastasia Lubennikova writes: > > Firstly, we use it to track patches that we want to see in the nearest > > releases and concentrate our efforts on. And current CFM guideline [1] > > reflects this idea. It suggests, that after the commitfest closure date > > we relentlessly throw to RWF patches that got at least some feedback. To > > be honest, I was reluctant to return semi-ready patches, because it > > means that they will get lost somewhere in mailing lists. And it seems > > like other CFMs did the same. > > Yeah, the aggressive policy suggested in "Sudden Death Overtime" is > certainly not what's been followed lately. I agree that that's > probably too draconic. On the other hand, if a patch sits in the > queue for several CFs without getting committed, that suggests that > maybe we ought to reject it on the grounds of "apparently nobody but > the author cares about this". That argument is easier to make for > features than bug fixes of course, so maybe the policy needs to > distinguish what kind of change is being considered. Note that this checklist was written in 2013 and has never been updated since then. I think there is nothing in that policy that we do use. I'm thinking that rather than try to fine-tune that document, we ought to rewrite one from scratch. For one thing, "a beer or three" only at end of CF is surely not sufficient.
Re: range_agg
On 2020-Dec-08, Alexander Korotkov wrote:

> I also found a problem in multirange types naming logic.  Consider the
> following example.
>
> create type a_multirange AS (x float, y float);
> create type a as range(subtype=text, collation="C");
> create table tbl (x __a_multirange);
> drop type a_multirange;
>
> If you dump this database, the dump couldn't be restored.  The
> multirange type is named __a_multirange, because the type named
> a_multirange already exists.  However, it might appear that
> a_multirange type is already deleted.  When the dump is restored, a
> multirange type is named a_multirange, and the corresponding table
> fails to be created.  The same thing doesn't happen with arrays,
> because arrays are not referenced in dumps by their internal names.
>
> I think we probably should add an option to specify multirange type
> names while creating a range type.  Then dump can contain exact type
> names used in the database, and restore wouldn't have a names
> collision.

Hmm, good point.  I agree that a dump must preserve the name, since once
created it is user-visible.  I had not noticed this problem, but it's
obvious in retrospect.

> In general, I wonder if we can make the binary format of multiranges
> more efficient.  It seems that every function involving multiranges
> starts from multirange_deserialize().  I think we can make functions
> like multirange_contains_elem() much more efficient.  Multirange is
> basically an array of ranges.  So we can pack it as follows.
> 1. Typeid and rangecount
> 2. Tightly packed array of flags (1-byte for each range)
> 3. Array of indexes of boundaries (4-byte for each range).  Or even
>    better we can combine offsets and lengths to be compression-friendly
>    like jsonb JEntry's do.
> 4. Boundary values
> Using this format, we can implement multirange_contains_elem(),
> multirange_contains_range() without deserialization and using binary
> search.  That would be much more efficient.  What do you think?

I also agree.
I spent some time staring at the I/O code a couple of months back but was unable to focus on it for long enough. I don't know JEntry's format, but I do remember that the storage format for JSONB was widely discussed back then; it seems wise to apply similar logic or at least similar reasoning.
Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently
On 2020-Dec-08, tsunakawa.ta...@fujitsu.com wrote:

> From: Alvaro Herrera
> > Does "ALTER TABLE ONLY parent" work correctly?  Namely, do not affect
> > existing partitions, but cause future partitions to acquire the new
> > setting.
>
> Yes, it works correctly in the sense that ALTER TABLE ONLY on a
> partitioned table does nothing because it has no storage and therefore
> logged/unlogged has no meaning.

But what happens when you create another partition after you change the
"loggedness" of the partitioned table?

> > This sounds very much related to previous discussion on REPLICA
> > IDENTITY not propagating to partitions; see
> > https://postgr.es/m/201902041630.gpadougzab7v@alvherre.pgsql
> > particularly Robert Haas' comments at
> > http://postgr.es/m/CA+TgmoYjksObOzY8b1U1=BsP_m+702eOf42fqRtQTYio1nu...@mail.gmail.com
> > and downstream discussion from there.
>
> There was a hot discussion.  I've read through it.  I hope I'm not
> doing anything strange in light of that discussion.

Well, my main take from that is we should strive to change all
subcommands together, in some principled way, rather than change only
one or some, in arbitrary ways.
Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently
On 2020-Dec-09, tsunakawa.ta...@fujitsu.com wrote: > From: Alvaro Herrera > > But what happens when you create another partition after you change the > > "loggedness" of the partitioned table? > > The new partition will have a property specified when the user creates > it. That is, while the storage property of each storage unit > (=partition) is basically independent, ALTER TABLE on a partitioned > table is a convenient way to set the same property value to all its > underlying storage units. Well, that definition seems unfriendly to me. I prefer the stance that if you change the value for the parent, then future partitions inherit that value. > I got an impression from the discussion that some form of consensus on > the principle was made and you were trying to create a fix for REPLICA > IDENTITY. Do you think the principle was unclear and we should state > it first (partly to refresh people's memories)? I think the principle was sound -- namely, that we should make all those commands recurse by default, and for cases where it matters, the parent's setting should determine the default for future children. > I'm kind of for it, but I'm hesitant to create the fix for all ALTER > actions, because it may take a lot of time and energy as you were > afraid. Can we define the principle, and then create individual fixes > independently based on that principle? That seems acceptable to me, as long as we get all changes in the same release. What we don't want (according to my reading of that discussion) is to change the semantics of a few subcommands in one release, and the semantics of a few other subcommands in another release.
Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently
On 2020-Dec-09, Bharath Rupireddy wrote: > I'm not sure how many more of such commands exist which require changes. The other thread has a list. I think it is complete, but if you want to double-check, that would be helpful. > How about doing it this way? > > 1) Have a separate common thread listing the commands and specifying > clearly the agreed behaviour for such commands. > 2) Whenever the separate patches are submitted(hopefully) by the > hackers for such commands, after the review we can make a note of that > patch in the common thread. > 3) Once the patches for all the listed commands are > received(hopefully) , then they can be committed together in one > release. Sounds good. I think this thread is a good place to collect those patches, but if you would prefer to have a new thread, feel free to start one (I'd suggest CC'ing me and Tsunakawa-san).
Re: Change default of checkpoint_completion_target
Howdy,

On 2020-Dec-10, Stephen Frost wrote:

> * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> > On Tue, 2020-12-08 at 17:29 +, Bossart, Nathan wrote:
> > > +1 to setting checkpoint_completion_target to 0.9 by default.
> >
> > +1 for changing the default or getting rid of it, as Tom suggested.
>
> Attached is a patch to change it from a GUC to a compile-time #define
> which is set to 0.9, with accompanying documentation updates.

I think we should leave a doc stub or at least an index entry, to let
people know the GUC has been removed rather than just making it
completely invisible.  (Maybe piggyback on the stuff in [1]?)

[1] https://postgr.es/m/CAGRY4nyA=jmBNa4LVwgGO1GyO-RnFmfkesddpT_uO+3=mot...@mail.gmail.com
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
By the way-- What did you think of the idea of explicitly marking the
types used for bitmasks using types bits32 and friends, instead of
plain int, which is harder to spot?
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-Dec-12, Peter Eisentraut wrote:

> On 2020-12-11 21:27, Alvaro Herrera wrote:
> > By the way-- What did you think of the idea of explicitly marking the
> > types used for bitmasks using types bits32 and friends, instead of
> > plain int, which is harder to spot?
>
> If we want to make it clearer, why not turn the thing into a struct,
> as in the attached patch, and avoid the bit fiddling altogether.

I don't like this idea too much, because adding an option causes an ABI
break.  I don't think we commonly add options in backbranches, but it
has happened.  The bitmask is much easier to work with in that regard.
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-Dec-23, Michael Paquier wrote:

> bool
> -reindex_relation(Oid relid, int flags, int options)
> +reindex_relation(Oid relid, int flags, ReindexOptions *options)
> {
>     Relation    rel;
>     Oid         toast_relid;

Wait a minute.  reindex_relation has 'flags' and *also* 'options' with
an embedded 'flags' member?  Surely that's not right.  I see that they
carry orthogonal sets of options ... but why aren't they a single
bitmask instead of two separate ones?  This looks weird and confusing.

Also: it seems a bit weird to me to put the flags inside the options
struct.  I would keep them separate -- so initially the options struct
would only have the tablespace OID, on API cleanliness grounds:

    struct ReindexOptions
    {
        Oid     tablespaceOid;
    };

    extern bool reindex_relation(Oid relid, bits32 flags,
                                 ReindexOptions *options);

I guess you could argue that it's more performant to set up only two
arguments to the function call instead of three ... but I doubt that's
measurable for anything in DDL-land.

But also, are we really envisioning that these routines would have all
that many additional options?  Maybe it is sufficient to do just

    extern bool reindex_relation(Oid relid, bits32 flags,
                                 Oid tablespaceOid);
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-Dec-23, Justin Pryzby wrote:

> This was getting ugly:
>
> extern void reindex_index(Oid indexId, bool skip_constraint_checks,
>                           char relpersistence, int options,
>                           Oid tablespaceOid);

Is this what I suggested?
Re: list of extended statistics on psql
On 2021-Jan-07, Tomas Vondra wrote: > On 1/7/21 1:46 AM, Tatsuro Yamada wrote: > > I overlooked the check for MCV in the logic building query > > because I created the patch as a new feature on PG14. > > I'm not sure whether we should do back patch or not. However, I'll > > add the check on the next patch because it is useful if you decide to > > do the back patch on PG10, 11, 12, and 13. > > BTW perhaps a quick look at the other \d commands would show if there are > precedents. I didn't have time for that. Yes, we do promise that new psql works with older servers. I think we would not backpatch any of this, though. -- Álvaro Herrera
Re: Key management with tests
On 2021-Jan-07, Bruce Momjian wrote: > All the tests pass now. The current src/test directory is 19MB, and > adding these tests takes it to 23MB, or a 20% increase. That seems like > a lot. It is testing 128-bit and 256-bit keys --- should we do fewer > tests, or just test 256, or use gzip to compress the tests by 50%? > (Does every platform have gzip?) So the tests are about 95% of the patch ... do we really need that many tests? -- Álvaro Herrera
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On 2020-Dec-01, Alvaro Herrera wrote:
> On 2020-Nov-30, Justin Pryzby wrote:
> Thanks for all the comments.  I'll incorporate everything and submit an
> updated version later.

Here's a rebased version 5, with the typos fixed.  More comments below.

> > The attname "detached" is a stretch of what's intuitive (it's more
> > like "detachING" or half-detached).  But I think psql should for sure
> > show something more obvious to users.  Esp. seeing as psql output
> > isn't documented.  Let's figure out what to show to users and then
> > maybe rename the column that, too.
>
> OK.  I agree that "being detached" is the state we want users to see,
> or maybe "detach pending", or "unfinisheddetach" (ugh).  I'm not sure
> that pg_inherits.inhbeingdetached is a great column name.  Opinions
> welcome.

I haven't changed this yet; I can't make up my mind about what I like
best.

  Partition of: parent FOR VALUES IN (1) UNFINISHED DETACH
  Partition of: parent FOR VALUES IN (1) UNDER DETACH
  Partition of: parent FOR VALUES IN (1) BEING DETACHED

> > ATExecDetachPartition:
> > Doesn't this need to lock the table before testing for default
> > partition?
>
> Correct, it does.

I failed to point out that by the time ATExecDetachPartition is called,
the relation has already been locked by the invoking ALTER TABLE
support code.

> > I ended up with an apparently broken constraint when running multiple
> > loops around a concurrent detach / attach:
> >
> > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR
> > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1
> > CONCURRENTLY"; do :; done&
> > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR
> > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1
> > CONCURRENTLY"; do :; done&
> >
> > "p1_check" CHECK (true)
> > "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2)
>
> Not good.

Haven't had time to investigate this problem yet.
-- Álvaro Herrera >From d21acb0e99ed3d296db70fed2d5d036d721d611b Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 13 Jul 2020 20:15:30 -0400 Subject: [PATCH v5 1/2] Let ALTER TABLE exec routines deal with the relation This means that ATExecCmd relies on AlteredRelationInfo->rel instead of keeping the relation as a local variable; this is useful if the subcommand needs to modify the relation internally. For example, a subcommand that internally commits its transaction needs this. --- src/backend/commands/tablecmds.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 993da56d43..a8528a3423 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -156,6 +156,8 @@ typedef struct AlteredTableInfo Oid relid; /* Relation to work on */ char relkind; /* Its relkind */ TupleDesc oldDesc; /* Pre-modification tuple descriptor */ + /* Transiently set during Phase 2, normally set to NULL */ + Relation rel; /* Information saved by Phase 1 for Phase 2: */ List *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */ /* Information saved by Phases 1/2 for Phase 3: */ @@ -353,7 +355,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd, AlterTableUtilityContext *context); static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, AlterTableUtilityContext *context); -static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, +static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass, AlterTableUtilityContext *context); static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, @@ -4405,7 +4407,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, { AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab); List *subcmds = tab->subcmds[pass]; - Relation rel; ListCell *lcmd; if (subcmds == NIL) @@ -4414,10 +4415,10 @@ ATRewriteCatalogs(List 
**wqueue, LOCKMODE lockmode, /* * Appropriate lock was obtained by phase 1, needn't get it again */ - rel = relation_open(tab->relid, NoLock); + tab->rel = relation_open(tab->relid, NoLock); foreach(lcmd, subcmds) -ATExecCmd(wqueue, tab, rel, +ATExecCmd(wqueue, tab, castNode(AlterTableCmd, lfirst(lcmd)), lockmode, pass, context); @@ -4429,7 +4430,11 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, if (pass == AT_PASS_ALTER_TYPE) ATPostAlterTypeCleanup(wqueue, tab, lockmode); - relation
Re: A failure of standby to follow timeline switch
Masao-san: Are you intending to act as committer for these? Since the bug is mine I can look into it, but since you already did all the reviewing work, I'm good with you giving it the final push.

0001 looks good to me; let's get that one committed quickly so that we can focus on the interesting stuff. While the implementation of find_in_log is quite dumb (not this patch's fault), it seems sufficient to deal with small log files. We can improve the implementation later, if needed, but we have to get the API right on the first try.

0003: The fix looks good to me. I verified that the test fails without the fix, and it passes with the fix.

The test added in 0002 is a bit optimistic regarding timing, as well as potentially slow; it loops 1000 times and sleeps 100 milliseconds each time. In a very slow server (valgrind or clobber_cache animals) this might not be sufficient time, while on fast servers it may end up waiting longer than needed. Maybe we can do something like this:

for (my $i = 0; $i < 1000; $i++)
{
    my $current_log_size = determine_current_log_size();

    if ($node_standby_3->find_in_log(
            "requested WAL segment [0-9A-F]+ has already been removed",
            $logstart))
    {
        last;
    }
    elsif ($node_standby_3->find_in_log(
            "End of WAL reached on timeline",
            $logstart))
    {
        $success = 1;
        last;
    }

    $logstart = $current_log_size;
    while (determine_current_log_size() == $current_log_size)
    {
        usleep(10_000);    # with a retry count?
    }
}

With test patch, make check PROVE_FLAGS="--timer" PROVE_TESTS=t/001_stream_rep.pl
ok  6386 ms ( 0.00 usr  0.00 sys +  1.14 cusr  0.93 csys =  2.07 CPU)
ok  6352 ms ( 0.00 usr  0.00 sys +  1.10 cusr  0.94 csys =  2.04 CPU)
ok  6255 ms ( 0.01 usr  0.00 sys +  0.99 cusr  0.97 csys =  1.97 CPU)

without test patch:
ok  4954 ms ( 0.00 usr  0.00 sys +  0.71 cusr  0.64 csys =  1.35 CPU)
ok  5033 ms ( 0.01 usr  0.00 sys +  0.71 cusr  0.73 csys =  1.45 CPU)
ok  4991 ms ( 0.01 usr  0.00 sys +  0.73 cusr  0.59 csys =  1.33 CPU)

-- 
Álvaro Herrera
Re: PATCH: Batch/pipelining support for libpq
On 2021-Feb-16, Craig Ringer wrote: > FWIW I'm also thinking of revising the docs to mostly use the term > "pipeline" instead of "batch". Use "pipelining and batching" in the chapter > topic, and mention "batch" in the index, and add a para that explains how > to run batches on top of pipelining, but otherwise use the term "pipeline" > not "batch". Hmm, this is a good point. It means I have a lot of API renaming to do. I'll get on it now and come back with a proposal. -- Álvaro Herrera Valdivia, Chile
Re: PATCH: Batch/pipelining support for libpq
Here's a new version, where I've renamed everything to "pipeline". I think the docs could use some additional tweaks now in order to make a coherent story on pipeline mode, how it can be used in a batched fashion, etc. Here's the renames I applied. It's mostly mechanical, except PQbatchSendQueue is now PQsendPipeline: PQBatchStatus -> PGpipelineStatus (enum) PQBATCH_MODE_OFF -> PQ_PIPELINE_OFF PQBATCH_MODE_ON -> PQ_PIPELINE_ON PQBATCH_MODE_ABORTED -> PQ_PIPELINE_ABORTED PQbatchStatus -> PQpipelineStatus (function) PQenterBatchMode -> PQenterPipelineMode PQexitBatchMode -> PQexitPipelineMode PQbatchSendQueue -> PQsendPipeline PGRES_BATCH_END -> PGRES_PIPELINE_END PGRES_BATCH_ABORTED -> PGRES_PIPELINE_ABORTED Also, PQbatchStatus(conn) returns enum PGpipelineStatus (it previously returned int). I'm tempted to rename PGASYNC_QUEUED to PGASYNC_PIPELINE_IDLE (not sure if PGASYNC_PIPELINE_READY fits better with the existing one). In pgbench, I changed the metacommands to be \startpipeline and \endpipeline. There's a failing Assert() there which I commented out; needs fixed. -- Álvaro Herrera39°49'30"S 73°17'W diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index b7a82453f0..9f98257dbe 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -3099,6 +3099,31 @@ ExecStatusType PQresultStatus(const PGresult *res); + + + PGRES_PIPELINE_END + + +The PGresult represents the end of a pipeline. +This status occurs only when pipeline mode has been selected. + + + + + + PGRES_PIPELINE_ABORTED + + +The PGresult represents a pipeline that's +received an error from the server. PQgetResult +must be called repeatedly, and it will return this status code, +until the end of the current pipeline, at which point it will return +PGRES_PIPELINE_END and normal processing can +resume. 
+ + + + If the result status is PGRES_TUPLES_OK or @@ -4857,6 +4882,494 @@ int PQflush(PGconn *conn); + + Pipeline Mode + + + libpq + pipeline mode + + + + pipelining + in libpq + + + + batch mode + in libpq + + + + libpq pipeline mode allows applications to + send a query without having to read the result of the previously + sent query. Taking advantage of the pipeline mode, a client will wait + less for the server, since multiple queries/results can be sent/ + received in a single network transaction. + + + + While pipeline mode provides a significant performance boost, writing + clients using the pipeline mode is more complex because it involves + managing a queue of pending queries and finding which result + corresponds to which query in the queue. + + + + Pipeline mode also generally consumes more memory on both the client and server, + though careful and aggressive management of the send/receive queue can mitigate + this. This applies whether or not the connection is in blocking or non-blocking + mode. + + + + Using Pipeline Mode + + +To issue pipelines the application must switch a connection into pipeline mode. +Enter pipeline mode with +or test whether pipeline mode is active with +. +In pipeline mode, only asynchronous operations +are permitted, and COPY is disallowed. +Using any synchronous command execution functions, +such as PQfn, or PQexec +and its sibling functions, is an error condition. + + + + + It is best to use pipeline mode with libpq in + non-blocking mode. If used + in blocking mode it is possible for a client/server deadlock to occur. + + +The client will block trying to send queries to the server, but the +server will block trying to send results to the client from queries +it has already processed. This only occurs when the client sends +enough queries to fill its output buffer and the server's receive +buffer before switching to processing input from the server, +but it's hard to predict exactly when that will happen. 
+ + + + + + +Issuing Queries + + + After entering pipeline mode the application dispatches requests using + normal asynchronous libpq functions, such as: + PQsendQueryParams, PQsendPrepare, + PQsendQueryPrepared, PQsendDescribePortal, + and PQsendDescribePrepared. + The asynchronous requests are followed by a + + call to mark the end of the pipeline. The client need not + call PQgetResult immediately after + dispatching each operation. + Result processing + is handled separately. + + + + The serve
Re: PATCH: Batch/pipelining support for libpq
Hello, thanks for looking at this patch. On 2021-Feb-16, Zhihong Yu wrote: > + if (querymode == QUERY_SIMPLE) > + { > + commandFailed(st, "startpipeline", "cannot use pipeline mode > with the simple query protocol"); > + st->state = CSTATE_ABORTED; > + return CSTATE_ABORTED; > > I wonder why the st->state is only assigned for this if block. The state is > not set for other cases where CSTATE_ABORTED is returned. Yeah, that's a simple oversight. We don't need to set st->state, because the caller sets it to the value we return. > Should PQ_PIPELINE_OFF be returned for the following case ? > > +PQpipelineStatus(const PGconn *conn) > +{ > + if (!conn) > + return false; Yep. Thanks -- Álvaro Herrera39°49'30"S 73°17'W
Re: PATCH: Batch/pipelining support for libpq
On 2021-Jan-21, Zhihong Yu wrote: > It seems '\\gset or \\aset is not ' would correspond to the check more > closely. > > + if (my_command->argc != 1) > + syntax_error(source, lineno, my_command->first_line, > my_command->argv[0], > > It is possible that my_command->argc == 0 (where my_command->argv[0] > shouldn't be accessed) ? No -- process_backslash_command reads the word and sets it as argv[0]. > + appendPQExpBufferStr(&conn->errorMessage, > + libpq_gettext("cannot queue commands > during COPY\n")); > + return false; > + break; > > Is the break necessary ? Similar comment for pqBatchProcessQueue(). Not necessary. I removed them. > +int > +PQexitBatchMode(PGconn *conn) > > Since either 0 or 1 is returned, maybe change the return type to bool. I was kinda tempted to do that, but in reality a lot of libpq's API is defined like that -- to return 0 (failure)/1 (success) as ints, not bools. For example see PQsendQuery(). I'm not inclined to do different for these new functions. (One curious case is PQsetvalue, which returns int, and is documented as "returns zero for failure" and then uses "return false"). > Also, the function operates on PGconn - should the function be static > (pqBatchProcessQueue is) ? I don't understand this suggestion. How would an application call it, if we make it static? Thanks -- Álvaro Herrera Valdivia, Chile "Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en La Feria de las Tinieblas, R. Bradbury)
Re: PATCH: Batch/pipelining support for libpq
Hi, On 2021-Feb-19, Zhihong Yu wrote: > Hi, > +static int pqBatchProcessQueue(PGconn *conn); > > I was suggesting changing: > > +int > +PQexitBatchMode(PGconn *conn) > > to: > > +static int > +PQexitBatchMode(PGconn *conn) I understand that, but what I'm saying is that it doesn't work. pqBatchProcessQueue can be static because it's only used internally in libpq, but PQexitBatchMode is supposed to be called by the client application. -- Álvaro Herrera Valdivia, Chile "No necesitamos banderas No reconocemos fronteras" (Jorge González)
Re: PATCH: Batch/pipelining support for libpq
On 2021-Feb-20, Craig Ringer wrote:

> On Wed, 17 Feb 2021, 07:13 Alvaro Herrera, wrote:
> > I think the docs could use some additional tweaks now in order to make a
> > coherent story on pipeline mode, how it can be used in a batched
> > fashion, etc.
>
> I'll do that soon and send an update.

I started to do that, but was sidetracked having a look at error handling while checking one of the claims. So now it starts with this:

    Issuing Queries

    After entering pipeline mode, the application dispatches requests
    using , or its prepared-query sibling .  These requests are queued
    on the client side until flushed to the server; this occurs when  is
    used to establish a synchronization point in the pipeline, or when
    is called.  The functions , , and  also work in pipeline mode.
    Result processing is handled separately.

    The server executes statements, and returns results, in the order
    the client sends them.  The server will begin executing the commands
    in the pipeline immediately, not waiting for the end of the
    pipeline.  If any statement encounters an error, the server aborts
    the current transaction and skips processing commands in the
    pipeline until the next synchronization point established by
    PQsendPipeline.  (This remains true even if the commands in the
    pipeline would roll back the transaction.)  Query processing resumes
    after the synchronization point.

(Note I changed the wording from "the pipeline is ended by PQsendPipeline" to "a synchronization point is established". Is this easily understandable? On re-reading it, I'm not sure it's really an improvement.)

BTW we don't explain why PQsendQuery doesn't work (to wit: because it uses the "simple" query protocol and thus doesn't require the Sync message). I think we should either document that, or change things so that PQsendQuery no longer uses a 'Q' message when in pipeline mode; instead use the extended query protocol. I don't see why we'd force people to use PQsendQueryParams when not necessary.
BTW, in my experimentation, the sequence of PGresults that you get when handling a pipeline with errors doesn't make a lot of sense. I'll spend some more time on it.

While at it, as for the PGresult sequence and NULL returns: I think for PQexec and maybe PQsendQuery, it makes sense to loop until PQgetResult returns NULL, because you never know how many results you are going to get. But for PQsendQueryParams, this is no longer true, because you can't send multiple queries in one query string. So the only way to get multiple results is by using single-row mode. But that already has its own protocol for multiple results, namely to get a stream of PGRES_SINGLE_TUPLE terminated by a zero-rows PGRES_TUPLES_OK. So I'm not sure there is a strong need for the mandatory NULL result.

-- 
Álvaro Herrera
39°49'30"S 73°17'W
Re: Printing page request trace from buffer manager
On 2020-Dec-08, Irodotos Terpizis wrote: > Initially, I modified the code within the BufferAlloc method in the > bufmgr.c file, > to log the pages that were requested and were already in the cache, > the pages that were evicted and the pages that > replaced them. However, I feel that this might not be the most optimal > way, as the log file is a mess and > it is hard to analyze. I was wondering if there is a more optimal way > to achieve this. Hi Irodotos, Did you find an answer to this question? Can you explain in what way the log is a mess? Maybe you need to start by defining how would you like the log file to look. -- Álvaro Herrera39°49'30"S 73°17'W "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
Re: Bizarre behavior of \w in a regular expression bracket construct
On 2021-Feb-21, Joel Jacobson wrote:

>    regex    | engine | deduced_ranges
> ------------+--------+----------------
>  ^([a-z])$  | pg     | [a-z]
>  ^([a-z])$  | pl     | [a-z]
>  ^([a-z])$  | v8     | [a-z]
>  ^([\d-a])$ | pg     |
>  ^([\d-a])$ | pl     | [-0-9a]
>  ^([\d-a])$ | v8     | [-0-9a]
>  ^([\w-;])$ | pg     |
>  ^([\w-;])$ | pl     | [-0-9;A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
>  ^([\w-;])$ | v8     | [-0-9;A-Z_a-z]
>  ^([\w-_])$ | pg     | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
>  ^([\w-_])$ | pl     | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
>  ^([\w-_])$ | v8     | [-0-9A-Z_a-z]
>  ^([\w])$   | pg     | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
>  ^([\w])$   | pl     | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
>  ^([\w])$   | v8     | [0-9A-Z_a-z]
>  ^([\W])$   | pg     |
>  ^([\W])$   | pl     | [\x01-/:-@[-^`{-©«-´¶-¹»-¿×÷]
>  ^([\W])$   | v8     | [\x01-/:-@[-^`{-ÿ]
>  ^([\w-a])$ | pg     | [0-9A-Z_-zªµºÀ-ÖØ-öø-ÿ]
>  ^([\w-a])$ | pl     | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
>  ^([\w-a])$ | v8     | [-0-9A-Z_a-z]

It looks like the interpretation of these other engines is that [\d-a] is the set of \d, the literal character "-", and the literal character "a". In other words, the - preceded by \d or \w (or any other character class, I guess?) loses its special meaning of identifying a character range.

This one I didn't understand:

>  ^([\W])$   | pg     |

-- 
Álvaro Herrera
Valdivia, Chile

"Porque francamente, si para saber manejarse a uno mismo hubiera que rendir examen... ¿Quién es el machito que tendría carnet?" (Mafalda)
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On 2021-Feb-05, Julien Rouhaud wrote: > Thanks, that's way better, copied in v3. Thank you, pushed. The code changes are only relevant in master, but I did back-patch the README.tuplock to all live branches. > I'm still a bit worried about that description though, as that flag > isn't consistently set for the FOR UPDATE case. Well, to be more > precise it's maybe consistently set when the hint bits are computed, > but in some cases the flag is later cleared, so you won't reliably > find it in the tuple. Is that right? I think compute_new_xmax_infomask would set the bit to infomask2_old_tuple (its last output argument) if it's needed, and so the bit would find its way to infomask2 eventually. Do you have a case where it doesn't achieve that? -- Álvaro Herrera Valdivia, Chile "E pur si muove" (Galileo Galilei)
Re: Fix typo about generate_gather_paths
On 2020-Dec-22, Tomas Vondra wrote:

> Thanks. I started looking at this a bit more closely, and I think most of
> the changes are fine - the code was changed to call a different function,
> but the comments still reference generate_gather_paths().

Hi, this was forgotten. It seemed better to fix at least some of the wrong references than not do anything, so I pushed the parts that seemed 100% correct.

Regarding this one:

> The one exception seems to be create_ordered_paths(), because that comment
> also makes statements about what generate_gather_paths is doing. And some
> of it does not apply to generate_useful_gather_paths.
> For example it says it generates order-preserving Gather Merge paths, but
> generate_useful_gather_paths also generates paths with sorts (which are
> clearly not order-preserving).

I left this one out. If Hou or Tomas want to propose/push a further patch, that'd be great. Thanks!

-- 
Álvaro Herrera
39°49'30"S 73°17'W

"I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
Re: Bizarre behavior of \w in a regular expression bracket construct
On 2021-Feb-23, Tom Lane wrote: > * Create infrastructure to allow treating \w as a character class > in its own right. (I did not expose [[:word:]] as a class name, > though it would be a little more symmetric to do so; should we?) Apparently [:word:] is a GNU extension (or at least a "bash-specific character class"[1] but apparently Emacs also supports it?); all the others are mandated by POSIX[2]. [1] https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions [2] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05 I think it'd be fine to expose [:word:] ... > [1] https://www.regular-expressions.info/charclasssubtract.html I had never heard of this subtraction thing. Nightmarish and confusing syntax, but useful. > +Also, the character class shorthands \D > +and \W will match a newline regardless of this mode. > +(Before PostgreSQL 14, they did not match > +newlines in newline-sensitive mode.) This seems an acceptable change to me, but then I only work here. -- Álvaro Herrera39°49'30"S 73°17'W
Re: [PATCH] pgbench: Remove ecnt, a member variable of CState
On 2021-Feb-26, Michael Paquier wrote: > On Fri, Feb 26, 2021 at 05:39:31PM +0900, miyake_kouta wrote: > > Also, the current pgbench's client abandons processing after hitting error, > > so this variable is no need, I think. > > Agreed. Its last use was in 12788ae, as far as I can see. So let's > just cleanup that. +1 -- Álvaro Herrera Valdivia, Chile
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On 2021-Jan-10, Justin Pryzby wrote: > On Fri, Jan 08, 2021 at 04:14:33PM -0300, Alvaro Herrera wrote: > > > > I ended up with apparently broken constraint when running multiple > > > > loops around > > > > a concurrent detach / attach: > > > > > > > > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR > > > > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 > > > > CONCURRENTLY"; do :; done& > > > > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR > > > > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 > > > > CONCURRENTLY"; do :; done& > > > > > > > > "p1_check" CHECK (true) > > > > "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2) > > > > > > Not good. > > > > Haven't had time to investigate this problem yet. > > I guess it's because you commited the txn and released lock in the middle of > the command. Hmm, but if we take this approach, then we're still vulnerable to the problem that somebody can do DETACH CONCURRENTLY and cancel the wait (or crash the server), then mess up the state before doing DETACH FINALIZE: when they cancel the wait, the lock will be released. I think the right fix is to disallow any action on a partition which is pending detach other than DETACH FINALIZE. (Didn't do that here.) Here's a rebase to current sources; there are no changes from v5. -- Álvaro Herrera Valdivia, Chile "No hay ausente sin culpa ni presente sin disculpa" (Prov. francés) >From d042b99ad3368ba0b658ac9260450ed8e39accea Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 13 Jul 2020 20:15:30 -0400 Subject: [PATCH v6 1/2] Let ALTER TABLE exec routines deal with the relation This means that ATExecCmd relies on AlteredRelationInfo->rel instead of keeping the relation as a local variable; this is useful if the subcommand needs to modify the relation internally. For example, a subcommand that internally commits its transaction needs this. 
--- src/backend/commands/tablecmds.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index b2457a6924..b990063d38 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -156,6 +156,8 @@ typedef struct AlteredTableInfo Oid relid; /* Relation to work on */ char relkind; /* Its relkind */ TupleDesc oldDesc; /* Pre-modification tuple descriptor */ + /* Transiently set during Phase 2, normally set to NULL */ + Relation rel; /* Information saved by Phase 1 for Phase 2: */ List *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */ /* Information saved by Phases 1/2 for Phase 3: */ @@ -353,7 +355,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd, AlterTableUtilityContext *context); static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, AlterTableUtilityContext *context); -static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, +static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass, AlterTableUtilityContext *context); static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, @@ -4513,7 +4515,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, { AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab); List *subcmds = tab->subcmds[pass]; - Relation rel; ListCell *lcmd; if (subcmds == NIL) @@ -4522,10 +4523,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, /* * Appropriate lock was obtained by phase 1, needn't get it again */ - rel = relation_open(tab->relid, NoLock); + tab->rel = relation_open(tab->relid, NoLock); foreach(lcmd, subcmds) -ATExecCmd(wqueue, tab, rel, +ATExecCmd(wqueue, tab, castNode(AlterTableCmd, lfirst(lcmd)), lockmode, pass, context); @@ -4537,7 +4538,11 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, if (pass == AT_PASS_ALTER_TYPE) ATPostAlterTypeCleanup(wqueue, tab, lockmode); - 
relation_close(rel, NoLock); + if (tab->rel) + { +relation_close(tab->rel, NoLock); +tab->rel = NULL; + } } } @@ -4563,11 +4568,12 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, * ATExecCmd: dispatch a subcommand to appropriate execution routine */ static void -ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, +ATExecCmd(List **wqueue, AlteredTableInfo *tab, AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass, AlterTableUtilityContext *context) { ObjectAddress
Re: [PATCH] Bug fix in initdb output
On 2021-Mar-01, Juan José Santamaría Flecha wrote:

> On Mon, Mar 1, 2021 at 5:52 AM Nitin Jadhav wrote:
>
> > Please share your thoughts on this. If we go ahead with this change,
> > then we need to back-patch. I would be happy to create those patches.
>
> A full path works, even with the slashes. The committer will take care of
> back-patching, if needed. As for HEAD at least, this LGTM.

I don't get it. I thought the Windows API accepted both forward slashes and backslashes as path separators. Did you try the command and see it fail?

-- 
Álvaro Herrera
39°49'30"S 73°17'W

"Entristecido, Wutra (canción de Las Barreras) echa a Freyr a rodar y a nosotros al mar"
Re: [PATCH] Bug fix in initdb output
On 2021-Mar-01, Juan José Santamaría Flecha wrote:

> This is not a problem with the API, but the shell. e.g. when using CMD:
>
> - This works:
> c:\>c:\Windows\System32\notepad.exe
> c:\>c:/Windows/System32/notepad.exe
> c:\>/Windows/System32/notepad.exe
>
> - This doesn't:
> c:\>./Windows/System32/notepad.exe
> '.' is not recognized as an internal or external command,
> operable program or batch file.
> c:\>Windows/System32/notepad.exe
> 'Windows' is not recognized as an internal or external command,
> operable program or batch file.

Ah, so another way to fix it would be to make the path to pg_ctl absolute?

-- 
Álvaro Herrera
39°49'30"S 73°17'W
Re: [PATCH] Bug fix in initdb output
On 2021-Mar-01, Juan José Santamaría Flecha wrote: > On Mon, Mar 1, 2021 at 7:50 PM Alvaro Herrera > wrote: > > Ah, so another way to fix it would be to make the path to pg_ctl be > > absolute? > > Yes, that's right. If you call initdb with an absolute path you won't see a > problem. So, is make_native_path a better fix than make_absolute_path? (I find it pretty surprising that initdb shows a relative path, but maybe that's just me.) -- Álvaro Herrera39°49'30"S 73°17'W "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
Re: [PATCH] Bug fix in initdb output
On 2021-Mar-01, Juan José Santamaría Flecha wrote: > Uhm, now that you point it out, an absolute path would make the message > more consistent and reusable. Well. This code was introduced in a00c58314745, with discussion at http://postgr.es/m/CAHeEsBeAe1FeBypT3E8R1ZVZU0e8xv3A-7BHg6bEOi=jzny...@mail.gmail.com which did not touch on the point of the pg_ctl path being relative or absolute. The previous decision to use relative seems to have been made here in commit ee814b4511ec, which was backed by this discussion https://www.postgresql.org/message-id/flat/200411020134.52513.peter_e%40gmx.net So I'm not sure that anybody would love me if I change it again to absolute. -- Álvaro Herrera Valdivia, Chile Voy a acabar con todos los humanos / con los humanos yo acabaré voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)
Re: [PATCH] Bug fix in initdb output
On 2021-Mar-02, Nitin Jadhav wrote: > > FWIW, I don't think that it is a good idea to come back to this > > decision for *nix platforms, so I would let it as-is, and use > > relative paths if initdb is called using a relative path. > > The command to be displayed either in absolute path or relative path > depends on the way the user is calling initdb. I agree with the above > comment to keep this behaviour as-is, that is if the initdb is called > using relative path, then it displays relative path otherwise absolute > path. Yeah. > > However, if you can write something that makes the path printed > > compatible for a WIN32 terminal while keeping the behavior > > consistent with other platforms, people would have no objections to > > that. > > I feel the patch attached above handles this scenario. Agreed. I'll push the original patch then. Thanks everybody for the discussion. -- Álvaro Herrera39°49'30"S 73°17'W "Porque francamente, si para saber manejarse a uno mismo hubiera que rendir examen... ¿Quién es el machito que tendría carnet?" (Mafalda)
Re: WIP: BRIN multi-range indexes
On 2021-Mar-03, Tomas Vondra wrote:

> 1) The 0001 patch allows passing of all scan keys to BRIN opclasses,
> which is needed for the minmax-multi to work. But it also modifies the
> existing opclasses (minmax and inclusion) to do this - but for those
> opclasses it does not make much difference, I think. Initially this was
> done because the patch did this for all opclasses, but then we added the
> detection based on number of parameters. So I wonder if we should just
> remove this, to make the patch a bit smaller. It'll also test the other
> code path (for opclasses without the last parameter).

I think it makes sense to just do them all in one pass. I think trying to keep the old way working just because it's how it was working previously does not have much benefit. I don't think we care about the *patch* being small as much as the resulting *code* being as simple as possible (but no simpler).

> 2) This needs bsearch_arg, but we only have that defined/used in
> extended_statistics_internal.h - it seems silly to include that here, as
> this has nothing to do with extended stats, so I simply added another
> prototype (which gets linked correctly). But, I suppose a better way
> would be to define our "portable" variant pg_bsearch_arg, next to
> pg_qsort etc.

Yeah, I think it makes sense to move bsearch_arg to a place where it's more generally accessible (src/port/bsearch_arg.c I suppose), and make both places use that.

-- 
Álvaro Herrera
Valdivia, Chile
Re: WIP: BRIN multi-range indexes
On 2021-Mar-03, Tomas Vondra wrote: > That's kinda my point - I agree the size of the patch is not the primary > concern, but it makes the minmax/inclusion code a bit more complicated > (because they now have to loop over the keys), with very little benefit > (there might be some speedup, but IMO it's rather negligible). Yeah, OK. > Alternatively we could simply remove the code supporting the old API > with "consistent" functions without the additional parameter. But the > idea was to seamlessly support existing opclasses / not breaking them > unnecessarily (I know we don't guarantee that in major upgrades, but as > they may not benefit from this, why break them?). It'd simplify the code > in brin.c a little bit, but the opclasses a bit more complex. Well, I doubt any opclass-support functions exist outside of core. Or am I just outdated and we do know of some? -- Álvaro Herrera Valdivia, Chile "Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)
Re: Huge memory consumption on partitioned table with FKs
On 2021-Mar-03, Amit Langote wrote: > I don't know of any unaddressed comments on the patch, so I've marked > the entry Ready for Committer. Thanks, I'll look at it later this week. -- Álvaro Herrera39°49'30"S 73°17'W #error "Operator lives in the wrong universe" ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)
Re: PATCH: Batch/pipelining support for libpq
On 2021-Mar-03, k.jami...@fujitsu.com wrote:

> I tried applying this patch to test it on top of Iwata-san's libpq trace log [1].
> In my environment, the compiler complains.
> It seems that in libpqwalreceiver.c: libpqrcv_exec()
> the switch for PQresultStatus needs to handle the
> cases for PGRES_PIPELINE_END and PGRES_PIPELINE_ABORTED too.

Yeah, I noticed this too and decided to handle these cases as errors.

Here's v28, which includes that fix, and also gets rid of the PGASYNC_QUEUED state; I instead made the rest of the code work using the existing PGASYNC_IDLE state. This allows us to get rid of a few cases where we had to report "internal error: IDLE state unexpected", which was rather pointless. With the current formulation, both pipeline mode and normal mode share pretty much all state.

Also, I've renamed PGRES_PIPELINE_END to PGRES_PIPELINE_SYNC, because that's closer to what it means: that result state by no means implies that pipeline mode has ended. It only means that you called PQsendPipeline(), so the server gives you a Sync message.

I'm much more comfortable with this version, so I'm marking the patch as Ready for Committer in case anybody else wants to look at this before I push it. However: on errors, I think it's a bit odd that you get a NULL from PQgetResult after getting PGRES_PIPELINE_ABORTED. Maybe I should suppress that. I'm no longer wrestling with the NULL after PGRES_PIPELINE_END, however, because that's not critical and you can just not ask for it and things work fine.

(Also, the test pipeline.c program in src/test/modules/test_libpq has a new "transaction" mode that is not exercised by the TAP program; I'll plug that in before commit.)
-- Álvaro Herrera Valdivia, Chile diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index cc20b0f234..4ea8cb8c93 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -3199,6 +3199,32 @@ ExecStatusType PQresultStatus(const PGresult *res); + + + PGRES_PIPELINE_SYNC + + +The PGresult represents a point in a +pipeline where a synchronization point has been established. +This status occurs only when pipeline mode has been selected. + + + + + + PGRES_PIPELINE_ABORTED + + +The PGresult represents a pipeline that's +received an error from the server. PQgetResult +must be called repeatedly, and it will return this status code, +until the end of the current pipeline, at which point it will return +PGRES_PIPELINE_SYNC and normal processing can +resume. + + + + If the result status is PGRES_TUPLES_OK or @@ -4957,6 +4983,498 @@ int PQflush(PGconn *conn); + + Pipeline Mode + + + libpq + pipeline mode + + + + pipelining + in libpq + + + + batch mode + in libpq + + + + libpq pipeline mode allows applications to + send a query without having to read the result of the previously + sent query. Taking advantage of the pipeline mode, a client will wait + less for the server, since multiple queries/results can be sent/ + received in a single network transaction. + + + + While pipeline mode provides a significant performance boost, writing + clients using the pipeline mode is more complex because it involves + managing a queue of pending queries and finding which result + corresponds to which query in the queue. + + + + Pipeline mode also generally consumes more memory on both the client and server, + though careful and aggressive management of the send/receive queue can mitigate + this. This applies whether or not the connection is in blocking or non-blocking + mode. + + + + Using Pipeline Mode + + +To issue pipelines the application must switch a connection into pipeline mode. +Enter pipeline mode with +or test whether pipeline mode is active with +. 
+In pipeline mode, only asynchronous operations +are permitted, and COPY is disallowed. +Using any synchronous command execution functions, +such as PQfn, or PQexec +and its sibling functions, is an error condition. + + + + + It is best to use pipeline mode with libpq in + non-blocking mode. If used + in blocking mode it is possible for a client/server deadlock to occur. + + +The client will block trying to send queries to the server, but the +server will block trying to send results to the client from queries +it has already processed. This only occurs when the client sends +enough queries to fill both its output buffer and the server's receive +buffer before switching to processing input from the server, +but it's hard to predict exactly when that will happen. + + +
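Tying the documentation fragments above together, here is a minimal sketch of the dispatch-then-collect pattern they describe. It uses the function names as they exist in the committed PostgreSQL 14 API (PQenterPipelineMode, PQpipelineSync, PQexitPipelineMode, PGRES_PIPELINE_SYNC/ABORTED); intermediate patch versions in this thread may spell some of these differently. It assumes a reachable server and abbreviates error handling, so it is illustrative rather than a drop-in test:

```c
#include <stdio.h>
#include <libpq-fe.h>

static void
run_pipeline(PGconn *conn)
{
	PGresult   *res;

	if (PQenterPipelineMode(conn) != 1)
		return;

	/* Dispatch several queries without reading any results in between. */
	PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
	PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);

	/* Establish a synchronization point; this also flushes the queue. */
	PQpipelineSync(conn);

	/*
	 * Collect results in the order the queries were sent.  Each query's
	 * results are terminated by a NULL from PQgetResult.
	 */
	for (int i = 0; i < 2; i++)
	{
		while ((res = PQgetResult(conn)) != NULL)
		{
			switch (PQresultStatus(res))
			{
				case PGRES_TUPLES_OK:
					printf("row value: %s\n", PQgetvalue(res, 0, 0));
					break;
				case PGRES_PIPELINE_ABORTED:
					/* an earlier query failed; skipped until the Sync */
					break;
				default:
					fprintf(stderr, "%s", PQerrorMessage(conn));
					break;
			}
			PQclear(res);
		}
	}

	/* One extra result marks the synchronization point itself. */
	res = PQgetResult(conn);
	if (res != NULL && PQresultStatus(res) == PGRES_PIPELINE_SYNC)
		PQclear(res);

	PQexitPipelineMode(conn);
}
```

As the docs note, pairing this with non-blocking mode avoids the client/server deadlock that can occur when both output buffers fill up.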
Re: PATCH: Batch/pipelining support for libpq
On 2021-Mar-03, 'Alvaro Herrera' wrote: > I'm much more comfortable with this version, so I'm marking the patch as > Ready for Committer in case anybody else wants to look at this before I > push it. Actually, I just noticed a pretty serious problem, which is that when we report an error, we don't set "conn->errQuery" to the current query but to the *previous* query, leading to non-sensical error reports like $ ./pipeline pipeline_abort aborted pipeline... ERROR: no existe la función no_such_function(integer) LINE 1: INSERT INTO pq_pipeline_demo(itemno) VALUES ($1); ^ HINT: Ninguna función coincide en el nombre y tipos de argumentos. Puede ser necesario agregar conversión explícita de tipos. ok (Sorry about the Spanish there, but the gist of the problem is that we're positioning the error cursor on a query that succeeded prior to the query that raised the error.) This should obviously not occur. I'm trying to figure out how to repair it and not break everything again ... -- Álvaro Herrera Valdivia, Chile
Re: PATCH: Batch/pipelining support for libpq
On 2021-Mar-03, 'Alvaro Herrera' wrote: > This should obviously not occur. I'm trying to figure out how to repair > it and not break everything again ... I think trying to set up the connection state so that the next query appears in conn->last_query prior to PQgetResult being called again leads to worse breakage. The simplest fix seems to make fe-protocol3.c aware that in this case, the query is in conn->cmd_queue_head instead, as in the attached patch. -- Álvaro Herrera39°49'30"S 73°17'W diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c index 584922b340..b89f500902 100644 --- a/src/interfaces/libpq/fe-protocol3.c +++ b/src/interfaces/libpq/fe-protocol3.c @@ -962,9 +962,21 @@ pqGetErrorNotice3(PGconn *conn, bool isError) * Save the active query text, if any, into res as well; but only if we * might need it for an error cursor display, which is only true if there * is a PG_DIAG_STATEMENT_POSITION field. + * + * Note that in pipeline mode, we have not yet advanced the query pointer + * to the next query, so we have to look at that. */ - if (have_position && conn->last_query && res) - res->errQuery = pqResultStrdup(res, conn->last_query); + if (have_position && res) + { + if (conn->pipelineStatus != PQ_PIPELINE_OFF && + conn->asyncStatus == PGASYNC_IDLE) + { + if (conn->cmd_queue_head) +res->errQuery = pqResultStrdup(res, conn->cmd_queue_head->query); + } + else if (conn->last_query) + res->errQuery = pqResultStrdup(res, conn->last_query); + } /* * Now build the "overall" error message for PQresultErrorMessage.
Re: PATCH: Batch/pipelining support for libpq
On 2021-Mar-03, 'Alvaro Herrera' wrote: > On 2021-Mar-03, 'Alvaro Herrera' wrote: > > > This should obviously not occur. I'm trying to figure out how to repair > > it and not break everything again ... > > I think trying to set up the connection state so that the next query > appears in conn->last_query prior to PQgetResult being called again > leads to worse breakage. The simplest fix seems to make fe-protocol3.c > aware that in this case, the query is in conn->cmd_queue_head instead, > as in the attached patch. I wonder if it would make sense to get rid of conn->last_query completely and just rely always on conn->cmd_queue_head, where the normal (non-pipeline) mode would use the first entry of the command queue. That might end up being simpler than pipeline mode "pretending" to take over ->last_query. -- Álvaro Herrera 39°49'30"S 73°17'W "La verdad no siempre es bonita, pero el hambre de ella sí"
Re: make coverage-html would fail within build directory separate from source tree
On 2021-Mar-04, walker wrote: > Hi, hackers > > During installation from source code, I created a build directory separate > from the source tree, and executed the following commands in the build > directory: > /home/postgres/postgresql-13.2/configure --enable-coverage > make > make check > make coverage-html > > > However, while executing make coverage-html, it failed with the following > error messages: > /bin/lcov --gcov-tool /bin/gcov -q --no-external -c -i -d . -d > /home/postgres/postgresql-13.2/ -o lcov_base.info > ... > geninfo: ERROR: no .gcno files found in /home/postgres/postgresql-13.2/! > make: *** [lcov_base.info] Error 255 > make: *** Deleting file 'lcov_base.info' Hmm, it works fine for me. config.log says I do this (in /pgsql/build/master-coverage): $ /pgsql/source/REL_13_STABLE/configure --enable-debug --enable-depend --enable-cassert --enable-coverage --cache-file=/home/alvherre/run/pgconfig.master-coverage.cache --enable-thread-safety --enable-tap-tests --with-python --with-perl --with-tcl --with-openssl --with-libxml --with-tclconfig=/usr/lib/tcl8.6 PYTHON=/usr/bin/python3 --prefix=/pgsql/install/master-coverage --with-pgport=55451 I do run "make install" too, though (and "make -C contrib install"). Not sure if that makes a difference. But for sure there are no .gcno files in the source dir -- they're all in the build dir. -- Álvaro Herrera Valdivia, Chile "La persona que no quería pecar / estaba obligada a sentarse en duras y empinadas sillas/ desprovistas, por cierto de blandos atenuantes" (Patricio Vogel)
Re: make coverage-html would fail within build directory separate from source tree
On 2021-Mar-04, Alvaro Herrera wrote: > On 2021-Mar-04, walker wrote: > > > Hi, hackers > > > > During installation from source code, I created a build directory separate > > from the source tree, and execute the following command in the build > > directory: > > /home/postgres/postgresql-13.2/configure -- enable-coverage > > make > > make check > > make coverage-html > > > > > > However, while executing make coverage-html, it failed with the following > > error messages: > > /bin/lcov --gcov-tool /bin/gcov -q --no-external -c -i -d . -d > > /home/postgres/postgresql-13.2/ -o lcve_base.info > > ... > > geninfo: ERROR: no .gcno files found in /home/postgres/postgresql-13.2/! "make coverage-html" outputs this: note that I get a WARNING about the source directory rather than an ERROR: /usr/bin/lcov --gcov-tool /usr/bin/gcov -q --no-external -c -i -d . -d /pgsql/source/REL_13_STABLE/ -o lcov_base.info geninfo: WARNING: no .gcno files found in /pgsql/source/REL_13_STABLE/ - skipping! /usr/bin/lcov --gcov-tool /usr/bin/gcov -q --no-external -c -d . -d /pgsql/source/REL_13_STABLE/ -o lcov_test.info geninfo: WARNING: no .gcda files found in /pgsql/source/REL_13_STABLE/ - skipping! rm -rf coverage /usr/bin/genhtml -q --legend -o coverage --title='PostgreSQL 13.2' --num-spaces=4 lcov_base.info lcov_test.info touch coverage-html-stamp (In my system, /pgsql is a symlink to /home/alvhere/Code/pgsql/) $ geninfo --version geninfo: LCOV version 1.13 -- Álvaro Herrera Valdivia, Chile
Re: make coverage-html would fail within build directory separate from source tree
On 2021-Mar-04, walker wrote: > thanks for your reply. It is indeed the case that there are no .gcno files in the source tree > directory; they're in the build tree directory, which results in failures. > > > That's a bit weird. > > > Add more detailed testing steps: > mkdir build_dir > cd build_dir > /home/postgres/postgresql-13.2/configure --enable-coverage > make > make check > make coverage-html Hmm, my build dir is not inside the source dir -- is yours? Maybe that makes a difference? Also, what version of lcov do you have? -- Álvaro Herrera 39°49'30"S 73°17'W "Someone said that it is at least an order of magnitude more work to do production software than a prototype. I think he is wrong by at least an order of magnitude." (Brian Kernighan)
Re: make coverage-html would fail within build directory separate from source tree
On 2021-Mar-04, walker wrote: > The same, the build directory is outside the source tree. > > > the version of lcov is 1.10 That seems *really* ancient. Please try with a fresher version. -- Álvaro Herrera 39°49'30"S 73°17'W "How amazing is that? I call it a night and come back to find that a bug has been identified and patched while I sleep." (Robert Davidson) http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
Re: [HACKERS] PATCH: Batch/pipelining support for libpq
On 2021-Mar-04, Ibrar Ahmed wrote: > The build is failing for this patch, can you please take a look at this? > > https://cirrus-ci.com/task/4568547922804736 > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.129221 > > > I am marking the patch "Waiting on Author" I don't know why you chose to respond to such an old message, but here's a rebase which also includes the workaround I suggested last night for the problem of the outdated query printed by error messages, as well as Justin's suggested fixes. (I also made a few tweaks to the TAP test). -- Álvaro Herrera Valdivia, Chile "No renuncies a nada. No te aferres a nada."
Re: [HACKERS] PATCH: Batch/pipelining support for libpq
Apparently, the archives system or the commitfest system is not picking up new messages to the thread, so the CF app is trying to apply a very old patch version. I'm not sure what's up with that. Thomas, any clues on where to look? -- Álvaro Herrera Valdivia, Chile "Oh, great altar of passive entertainment, bestow upon me thy discordant images at such speed as to render linear thought impossible" (Calvin a la TV)
Re: [HACKERS] PATCH: Batch/pipelining support for libpq
> I think it's just because you forgot the patch. > https://www.postgresql.org/message-id/20210304142627.GA5978%40alvherre.pgsql -- Álvaro Herrera39°49'30"S 73°17'W diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 0553279314..c16befa314 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -3173,6 +3173,33 @@ ExecStatusType PQresultStatus(const PGresult *res); + + + PGRES_PIPELINE_SYNC + + +The PGresult represents a +synchronization point in pipeline mode, requested by +. +This status occurs only when pipeline mode has been selected. + + + + + + PGRES_PIPELINE_ABORTED + + +The PGresult represents a pipeline that has +received an error from the server. PQgetResult +must be called repeatedly, and each time it will return this status code +until the end of the current pipeline, at which point it will return +PGRES_PIPELINE_SYNC and normal processing can +resume. + + + + If the result status is PGRES_TUPLES_OK or @@ -4919,6 +4946,498 @@ int PQflush(PGconn *conn); + + Pipeline Mode + + + libpq + pipeline mode + + + + pipelining + in libpq + + + + batch mode + in libpq + + + + libpq pipeline mode allows applications to + send a query without having to read the result of the previously + sent query. Taking advantage of the pipeline mode, a client will wait + less for the server, since multiple queries/results can be sent/ + received in a single network transaction. + + + + While pipeline mode provides a significant performance boost, writing + clients using the pipeline mode is more complex because it involves + managing a queue of pending queries and finding which result + corresponds to which query in the queue. + + + + Pipeline mode also generally consumes more memory on both the client and server, + though careful and aggressive management of the send/receive queue can mitigate + this. This applies whether or not the connection is in blocking or non-blocking + mode. 
+ + + + Using Pipeline Mode + + +To issue pipelines, the application must switch a connection into pipeline mode. +Enter pipeline mode with +or test whether pipeline mode is active with +. +In pipeline mode, only asynchronous operations +are permitted, and COPY is disallowed. +Using any synchronous command execution functions, +such as PQfn, or PQexec +and its sibling functions, is an error condition. + + + + + It is best to use pipeline mode with libpq in + non-blocking mode. If used + in blocking mode it is possible for a client/server deadlock to occur. + + +The client will block trying to send queries to the server, but the +server will block trying to send results to the client from queries +it has already processed. This only occurs when the client sends +enough queries to fill both its output buffer and the server's receive +buffer before it switches to processing input from the server, +but it's hard to predict exactly when that will happen. + + + + + + +Issuing Queries + + + After entering pipeline mode, the application dispatches requests using + , + or its prepared-query sibling + . + These requests are queued on the client-side until flushed to the server; + this occurs when is used to + establish a synchronization point in the pipeline, + or when is called. + The functions , + , and + also work in pipeline mode. + Result processing is described below. + + + + The server executes statements, and returns results, in the order the + client sends them. The server will begin executing the commands in the + pipeline immediately, not waiting for the end of the pipeline. + If any statement encounters an error, the server aborts the current + transaction and skips processing commands in the pipeline until the + next synchronization point established by PQsendPipeline. + (This remains true even if the commands in the pipeline would rollback + the transaction.) + Query processing resumes after the synchronization point. 
+ + + + It's fine for one operation to depend on the results of a + prior one; for example, one query may define a table that the next + query in the same pipeline uses. Similarly, an application may + create a named prepared statement and execute it with later + statements in the same pipeline. + + + + +Processing Results + + + To process the result of one query in a pipeline, the
Re: [HACKERS] PATCH: Batch/pipelining support for libpq
v30 contains changes to hopefully make it build on MSVC. -- Álvaro Herrera Valdivia, Chile diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 0553279314..c16befa314 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -3173,6 +3173,33 @@ ExecStatusType PQresultStatus(const PGresult *res); + + + PGRES_PIPELINE_SYNC + + +The PGresult represents a +synchronization point in pipeline mode, requested by +. +This status occurs only when pipeline mode has been selected. + + + + + + PGRES_PIPELINE_ABORTED + + +The PGresult represents a pipeline that has +received an error from the server. PQgetResult +must be called repeatedly, and each time it will return this status code +until the end of the current pipeline, at which point it will return +PGRES_PIPELINE_SYNC and normal processing can +resume. + + + + If the result status is PGRES_TUPLES_OK or @@ -4919,6 +4946,498 @@ int PQflush(PGconn *conn); + + Pipeline Mode + + + libpq + pipeline mode + + + + pipelining + in libpq + + + + batch mode + in libpq + + + + libpq pipeline mode allows applications to + send a query without having to read the result of the previously + sent query. Taking advantage of the pipeline mode, a client will wait + less for the server, since multiple queries/results can be sent/ + received in a single network transaction. + + + + While pipeline mode provides a significant performance boost, writing + clients using the pipeline mode is more complex because it involves + managing a queue of pending queries and finding which result + corresponds to which query in the queue. + + + + Pipeline mode also generally consumes more memory on both the client and server, + though careful and aggressive management of the send/receive queue can mitigate + this. This applies whether or not the connection is in blocking or non-blocking + mode. + + + + Using Pipeline Mode + + +To issue pipelines, the application must switch a connection into pipeline mode. 
+Enter pipeline mode with +or test whether pipeline mode is active with +. +In pipeline mode, only asynchronous operations +are permitted, and COPY is disallowed. +Using any synchronous command execution functions, +such as PQfn, or PQexec +and its sibling functions, is an error condition. + + + + + It is best to use pipeline mode with libpq in + non-blocking mode. If used + in blocking mode it is possible for a client/server deadlock to occur. + + +The client will block trying to send queries to the server, but the +server will block trying to send results to the client from queries +it has already processed. This only occurs when the client sends +enough queries to fill both its output buffer and the server's receive +buffer before it switches to processing input from the server, +but it's hard to predict exactly when that will happen. + + + + + + +Issuing Queries + + + After entering pipeline mode, the application dispatches requests using + , + or its prepared-query sibling + . + These requests are queued on the client-side until flushed to the server; + this occurs when is used to + establish a synchronization point in the pipeline, + or when is called. + The functions , + , and + also work in pipeline mode. + Result processing is described below. + + + + The server executes statements, and returns results, in the order the + client sends them. The server will begin executing the commands in the + pipeline immediately, not waiting for the end of the pipeline. + If any statement encounters an error, the server aborts the current + transaction and skips processing commands in the pipeline until the + next synchronization point established by PQsendPipeline. + (This remains true even if the commands in the pipeline would rollback + the transaction.) + Query processing resumes after the synchronization point. 
+ + + + It's fine for one operation to depend on the results of a + prior one; for example, one query may define a table that the next + query in the same pipeline uses. Similarly, an application may + create a named prepared statement and execute it with later + statements in the same pipeline. + + + + +Processing Results + + + To process the result of one query in a pipeline, the application calls + PQgetResult repeatedly and handles each result + until PQgetResult re
Re: [HACKERS] PATCH: Batch/pipelining support for libpq
On 2021-Mar-04, Alvaro Herrera wrote: > v30 contains changes to hopefully make it build on MSVC. Hm, that didn't work -- appveyor still says: Project "C:\projects\postgresql\pgsql.sln" (1) is building "C:\projects\postgresql\pipeline.vcxproj" (75) on node 1 (default targets). PrepareForBuild: Creating directory ".\Release\pipeline\". Creating directory ".\Release\pipeline\pipeline.tlog\". InitializeBuildStatus: Creating ".\Release\pipeline\pipeline.tlog\unsuccessfulbuild" because "AlwaysCreate" was specified. ClCompile: C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_amd64\CL.exe /c /Isrc/include /Isrc/include/port/win32 /Isrc/include/port/win32_msvc /Zi /nologo /W3 /WX- /Ox /D WIN32 /D _WINDOWS /D __WINDOWS__ /D __WIN32__ /D WIN32_STACK_RLIMIT=4194304 /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _MBCS /GF /Gm- /EHsc /MD /GS /fp:precise /Zc:wchar_t /Zc:forScope /Fo".\Release\pipeline\\" /Fd".\Release\pipeline\vc120.pdb" /Gd /TC /wd4018 /wd4244 /wd4273 /wd4102 /wd4090 /wd4267 /errorReport:queue /MP src/test/modules/test_libpq/pipeline.c pipeline.c src/test/modules/test_libpq/pipeline.c(11): fatal error C1083: Cannot open include file: 'libpq-fe.h': No such file or directory [C:\projects\postgresql\pipeline.vcxproj] Done Building Project "C:\projects\postgresql\pipeline.vcxproj" (default targets) -- FAILED. Project "C:\projects\postgresql\pgsql.sln" (1) is building "C:\projects\postgresql\test_parser.vcxproj" (76) on node 1 (default targets). I think the problem is that the project is called pipeline and not test_libpq, so there's no match in the name. I'm going to rename the whole thing to src/test/modules/libpq_pipeline/ and see if the msvc tooling likes that better. -- Álvaro Herrera39°49'30"S 73°17'W