Re: Improve description of XLOG_RUNNING_XACTS
At Fri, 9 Sep 2022 09:48:05 +0900, Masahiko Sawada wrote in > On Tue, Aug 23, 2022 at 11:53 AM Kyotaro Horiguchi > wrote: > > > > At Mon, 15 Aug 2022 11:16:56 +0900, Masahiko Sawada > > wrote in > > > Or we can output the "subxid overflowed" first. > > > > (I prefer this, as that doesn't change the output in the normal case > > but the anomaly will be easily seen if it happens.) > > > > Updated the patch accordingly. Thanks! Considering the discussion so far, how about adding a comment like this? + appendStringInfoString(buf, "; subxid overflowed"); + ++ /* ++ * subxids and subxid_overflow are mutually exclusive, but we deliberately ++ * print both simultaneously in case the record is broken. ++ */ + if (xlrec->subxcnt > 0) + { + appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt); + for (i = 0; i < xlrec->subxcnt; i++) + appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]); + } regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Perform streaming logical transactions by background workers and parallel apply
Here are my review comments for the v28-0001 patch: (There may be some overlap with other people's review comments and/or some fixes already made). == 1. Commit Message In addition, the patch extends the logical replication STREAM_ABORT message so that abort_time and abort_lsn can also be sent which can be used to update the replication origin in parallel apply worker when the streaming transaction is aborted. ~ Should this also mention that, because this message extension is needed to support parallel streaming, parallel streaming is not supported for publications on servers < PG16? == 2. doc/src/sgml/config.sgml Specifies maximum number of logical replication workers. This includes -both apply workers and table synchronization workers. +apply leader workers, parallel apply workers, and table synchronization +workers. "apply leader workers" -> "leader apply workers" ~~~ 3. max_logical_replication_workers (integer) Specifies maximum number of logical replication workers. This includes apply leader workers, parallel apply workers, and table synchronization workers. Logical replication workers are taken from the pool defined by max_worker_processes. The default value is 4. This parameter can only be set at server start. ~ I did not really understand why the default is 4. Because the default tablesync workers is 2, and the default parallel workers is 2, but what about accounting for the apply worker? Therefore, shouldn't max_logical_replication_workers default be 5 instead of 4? == 4. src/backend/commands/subscriptioncmds.c - defGetStreamingMode + } + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("%s requires a Boolean value or \"parallel\"", + def->defname))); + return SUBSTREAM_OFF; /* keep compiler quiet */ +} Some whitespace before the ereport and the return might be tidier. == 5. src/backend/libpq/pqmq.c + { + if (IsParallelWorker()) + SendProcSignal(pq_mq_parallel_leader_pid, +PROCSIG_PARALLEL_MESSAGE, +pq_mq_parallel_leader_backend_id); + else + { + Assert(IsLogicalParallelApplyWorker()); + SendProcSignal(pq_mq_parallel_leader_pid, +PROCSIG_PARALLEL_APPLY_MESSAGE, +pq_mq_parallel_leader_backend_id); + } + } This code can be simplified if you want to. For example, { ProcSignalReason reason; Assert(IsParallelWorker() || IsLogicalParallelApplyWorker()); reason = IsParallelWorker() ? PROCSIG_PARALLEL_MESSAGE : PROCSIG_PARALLEL_APPLY_MESSAGE; SendProcSignal(pq_mq_parallel_leader_pid, reason, pq_mq_parallel_leader_backend_id); } == 6. src/backend/replication/logical/applyparallelworker.c Is there a reason why this file is called applyparallelworker.c instead of parallelapplyworker.c? Now this name is out of step with names of all the new typedefs etc. ~~~ 7. +/* + * There are three fields in each message received by parallel apply worker: + * start_lsn, end_lsn and send_time. Because we have updated these statistics + * in leader apply worker, we could ignore these fields in parallel apply + * worker (see function LogicalRepApplyLoop). + */ +#define SIZE_STATS_MESSAGE (2 * sizeof(XLogRecPtr) + sizeof(TimestampTz)) SUGGESTION (Just added the word "the" and changed "could" -> "can") There are three fields in each message received by the parallel apply worker: start_lsn, end_lsn and send_time. Because we have updated these statistics in the leader apply worker, we can ignore these fields in the parallel apply worker (see function LogicalRepApplyLoop). ~~~ 8. +/* + * List that stores the information of parallel apply workers that were + * started. 
Newly added worker information will be removed from the list at the + * end of the transaction when there are enough workers in the pool. Besides, + * exited workers will be removed from the list after being detected. + */ +static List *ParallelApplyWorkersList = NIL; Perhaps this comment can give more explanation of what is meant by the part that says "when there are enough workers in the pool". ~~~ 9. src/backend/replication/logical/applyparallelworker.c - parallel_apply_can_start + /* + * Don't start a new parallel worker if not in streaming parallel mode. + */ + if (MySubscription->stream != SUBSTREAM_PARALLEL) + return false; "streaming parallel mode." -> "parallel streaming mode." ~~~ 10. + /* + * For streaming transactions that are being applied using parallel apply + * worker, we cannot decide whether to apply the change for a relation that + * is not in the READY state (see should_apply_changes_for_rel) as we won't + * know remote_final_lsn by that time. So, we don't start the new parallel + * apply worker in this case. + */ + if (!AllTablesyncsReady()) + return false; "using parallel apply worker" -> "using a parallel apply worker" ~~~ 11. + /* + * Do not allow parallel apply worker to be started in the parallel apply + * worker. + */ + if (am_parallel_apply_worker()) + return false; I guess the
Re: [BUG] wrong FK constraint name when colliding name on ATTACH
On Fri, Sep 9, 2022 at 4:54 AM Tom Lane wrote: > Andres Freund writes: > > Something here doesn't look to be quite right. Starting with this commit CI > > [1] started to fail on freebsd (stack trace [2]), and in the meson branch > > I've > > also seen the crash on windows (CI run[3], stack trace [4]). > > The crash seems 100% reproducible if I remove the early-exit optimization > from GetForeignKeyActionTriggers: Indeed, reproduced here. > diff --git a/src/backend/commands/tablecmds.c > b/src/backend/commands/tablecmds.c > index 53b0f3a9c1..112ca77d97 100644 > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -10591,8 +10591,6 @@ GetForeignKeyActionTriggers(Relation trigrel, > Assert(*updateTriggerOid == InvalidOid); > *updateTriggerOid = trgform->oid; > } > - if (OidIsValid(*deleteTriggerOid) && OidIsValid(*updateTriggerOid)) > - break; > } > > if (!OidIsValid(*deleteTriggerOid)) > > With that in place, it's probabilistic whether the Asserts notice anything > wrong, and mostly they don't. But there are multiple matching triggers: > > regression=# select oid, tgconstraint, tgrelid,tgconstrrelid, tgtype, tgname > from pg_trigger where tgconstraint = 104301; > oid | tgconstraint | tgrelid | tgconstrrelid | tgtype |tgname > +--+-+---++--- > 104302 | 104301 | 104294 |104294 | 9 | > RI_ConstraintTrigger_a_104302 > 104303 | 104301 | 104294 |104294 | 17 | > RI_ConstraintTrigger_a_104303 > 104304 | 104301 | 104294 |104294 | 5 | > RI_ConstraintTrigger_c_104304 > 104305 | 104301 | 104294 |104294 | 17 | > RI_ConstraintTrigger_c_104305 > (4 rows) > > I suspect that the filter conditions being applied are inadequate > for the case of a self-referential FK, which this evidently is > given that tgrelid and tgconstrrelid are equal. Yes, the loop in GetForeignKeyActionTriggers() needs this: + /* Only ever look at "action" triggers on the PK side. */ + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_PK) + continue; Likewise, GetForeignKeyCheckTriggers() needs this: + /* Only ever look at "check" triggers on the FK side. */ + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_FK) + continue; We evidently missed this in f4566345cf40b0. > I'd counsel dropping the early-exit optimization; it doesn't > save much I expect, and it evidently hides bugs. Or maybe > make it conditional on !USE_ASSERT_CHECKING. While neither of these functions are called in hot paths, I am inclined to keep the early-exit bit in non-assert builds. Attached a patch. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com action-check-trigger-filter.patch Description: Binary data
Re: proposal: possibility to read dumped table's name from file
On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson wrote: > [v3] Note that the grammar has shift-reduce conflicts. If you run a fairly recent Bison, you can show them like this: bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr] filterparse.y: warning: shift/reduce conflict on token C_INCLUDE [-Wcounterexamples] Example: • C_INCLUDE include_object pattern Shift derivation Filters ↳ 3: Filter ↳ 4: • C_INCLUDE include_object pattern Reduce derivation Filters ↳ 2: Filters Filter ↳ 1: ε • ↳ 4: C_INCLUDE include_object pattern filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE [-Wcounterexamples] Example: • C_EXCLUDE exclude_object pattern Shift derivation Filters ↳ 3: Filter ↳ 5: • C_EXCLUDE exclude_object pattern Reduce derivation Filters ↳ 2: Filters Filter ↳ 1: ε • ↳ 5: C_EXCLUDE exclude_object pattern -- John Naylor EDB: http://www.enterprisedb.com
Possible crash on standby
Hello. While I played with some patch, I met an assertion failure. #2 0x00b350e0 in ExceptionalCondition ( conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()", errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c", lineNumber=4190) at assert.c:69 #3 0x00586f9c in XLogFileRead (segno=61, emode=13, tli=1, source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190 #4 0x005871d2 in XLogFileReadAnyTLI (segno=61, emode=13, source=XLOG_FROM_ANY) at xlogrecovery.c:4296 #5 0x0058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360, randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1, replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727 This is reproducible by the following steps. 1. insert a sleep(1) in WaitForWALToBecomeAvailable(). >* WAL that we restore from archive. >*/ > + sleep(1); > if (WalRcvStreaming()) > XLogShutdownWalRcv(); 2. create a primary with archiving enabled. 3. create a standby recovering from the primary's archive and with an unconnectable primary_conninfo. 4. start the primary. 5. switch wal on the primary. 6. Kaboom. This is because WaitForWALToBecomeAvailable doesn't call XLogShutdownWalRcv() when walreceiver has been stopped before we reach the WalRcvStreaming() call cited above. But we need to set InstallXLogFileSegmentActive to false even in that case, since no one other than the startup process does that. Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might need to correct the dependencies between the flag and walreceiver state, but it is not mandatory because XLogShutdownWalRcv() is designed so that it can be called even after walreceiver is stopped. I don't have a clear memory about why we did that at the time, though, but recovery check runs successfully with this. This code was introduced at PG12. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 12100bb9c041c660eaad675d8e6ed69b49270009 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 9 Sep 2022 17:06:47 +0900 Subject: [PATCH v1] Do not skip calling XLogShutdownWalRcv() XLogShutdownWalRcv() not only stops the wal receiver but also flips down the InstallXLogFileSegmentActive flag. The function is not called when entering archive recovery mode with walreceiver already stopped. Thus archive recovery afterwards faces an assertion failure on the flag in that case. Fix this by always calling the function to keep InstallXLogFileSegmentActive consistent. --- src/backend/access/transam/xlogrecovery.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 9a80084a68..20bd1f1d2e 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -3532,8 +3532,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * walreceiver is not active, so that it won't overwrite * WAL that we restore from archive. */ - if (WalRcvStreaming()) - XLogShutdownWalRcv(); + XLogShutdownWalRcv(); /* * Before we sleep, re-scan for possible new timelines if -- 2.31.1
Re: proposal: possibility to read dumped table's name from file
> On Sep 9, 2022, at 5:53 PM, John Naylor wrote: > > On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson wrote: >> [v3] > > Note that the grammar has shift-reduce conflicts. If you run a fairly > recent Bison, you can show them like this: > > bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y > > filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr] > filterparse.y: warning: shift/reduce conflict on token C_INCLUDE > [-Wcounterexamples] > Example: • C_INCLUDE include_object pattern > Shift derivation >Filters >↳ 3: Filter > ↳ 4: • C_INCLUDE include_object pattern > Reduce derivation >Filters >↳ 2: Filters Filter > ↳ 1: ε • ↳ 4: C_INCLUDE include_object pattern > filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE > [-Wcounterexamples] > Example: • C_EXCLUDE exclude_object pattern > Shift derivation >Filters >↳ 3: Filter > ↳ 5: • C_EXCLUDE exclude_object pattern > Reduce derivation >Filters >↳ 2: Filters Filter > ↳ 1: ε • ↳ 5: C_EXCLUDE exclude_object pattern > Looks like the last rule for Filters should not be there. I do wonder whether we should be using bison/flex here, seems like using a sledgehammer to crack a nut. Cheers Andrew
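If the redundant alternative is indeed the problem, the conflict-free rule would presumably be plain left recursion with an empty base case — a sketch only, since the exact rule bodies here are inferred from the bison counterexample output above rather than from the posted patch:

	Filters:
			/* EMPTY */
			| Filters Filter
			;

With just these two alternatives there is only one way to derive a sequence of Filter items, so the shift/reduce conflicts on C_INCLUDE and C_EXCLUDE should go away.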
RE: Perform streaming logical transactions by background workers and parallel apply
On Friday, September 9, 2022 3:02 PM Peter Smith wrote: > > Here are my review comments for the v28-0001 patch: > > (There may be some overlap with other people's review comments and/or > some fixes already made). > Thanks for the comments. > 3. > > max_logical_replication_workers (integer) > Specifies maximum number of logical replication workers. This > includes apply leader workers, parallel apply workers, and table > synchronization workers. > Logical replication workers are taken from the pool defined by > max_worker_processes. > The default value is 4. This parameter can only be set at server start. > > ~ > > I did not really understand why the default is 4. Because the default > tablesync workers is 2, and the default parallel workers is 2, but > what about accounting for the apply worker? Therefore, shouldn't > max_logical_replication_workers default be 5 instead of 4? The parallel apply is disabled by default, so it's not a must to increase this global default value as discussed[1] [1] https://www.postgresql.org/message-id/CAD21AoCwaU8SqjmC7UkKWNjDg3Uz4FDGurMpis3zw5SEC%2B27jQ%40mail.gmail.com > 6. src/backend/replication/logical/applyparallelworker.c > > Is there a reason why this file is called applyparallelworker.c > instead of parallelapplyworker.c? Now this name is out of step with > names of all the new typedefs etc. It was suggested which is consistent with the "vacuumparallel.c", but I am fine with either name. I can change this if more people think parallelapplyworker.c is better. > 16. src/backend/replication/logical/applyparallelworker.c - > parallel_apply_find_worker > > + /* Return the cached parallel apply worker if valid. */ > + if (stream_apply_worker != NULL) > + return stream_apply_worker; > > Perhaps 'cur_stream_parallel_apply_winfo' is a better name for this var? This looks a bit long to me. > /* Now wait until it attaches. */ > - WaitForReplicationWorkerAttach(worker, generation, bgw_handle); > + return WaitForReplicationWorkerAttach(worker, generation, bgw_handle); > > The comment feels a tiny bit misleading, because there is a chance > that this might not attach at all and return false if something goes > wrong. I feel it might be better to fix this via a separate patch. > Now that there is an apply_action enum I felt it is better for this > code to be using a switch instead of all the if/else. Furthermore, it > might be better to put the switch case in a logical order (e.g. same > as the suggested enums value order of #29a). I'm not sure whether switch case is better than if/else here. But if more people prefer, I can change this. > 23. src/backend/replication/logical/launcher.c - logicalrep_worker_launch > > + nparallelapplyworkers = logicalrep_parallel_apply_worker_count(subid); > + > + /* > + * Return silently if the number of parallel apply workers reached the > + * limit per subscription. > + */ > + if (is_subworker && nparallelapplyworkers >= > max_parallel_apply_workers_per_subscription) > + { > + LWLockRelease(LogicalRepWorkerLock); > + return false; > } > I’m not sure if this is a good idea to be so silent. How will the user > know if they should increase the GUC parameter or not if it never > tells them that the value is too low ? It's like what we do for table sync worker. Besides, I think user is likely to intentionally limit the parallel apply worker number to leave free workers for other purposes. And we do report a WARNING later if there is no free worker slots errmsg("out of logical replication worker slots"). > 41. 
src/backend/replication/logical/worker.c - store_flush_position > > + /* Skip if not the leader apply worker */ > + if (am_parallel_apply_worker()) > + return; > + > > Code might be better to implement/use a new function so it can check > something like !am_leader_apply_worker() Based on the existing code, both leader and table sync worker could enter this function. Using !am_leader_apply_worker() seems will disallow table sync worker to enter this function which might be not good although . > 47. src/backend/storage/ipc/procsignal.c - procsignal_sigusr1_handler > > @@ -657,6 +658,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) > if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT)) > HandleLogMemoryContextInterrupt(); > > + if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE)) > + HandleParallelApplyMessageInterrupt(); > + > > I wasn’t sure about the placement of this new code because those > CheckProcSignal don’t seem to have any particular order. I think this > belongs adjacent to the PROCSIG_PARALLEL_MESSAGE since it has the most > in common with that one. I'm not very sure, I just followed the way we used to add new SignalReason (e.g. add the new reason at the last but before the Recovery conflict reasons). And the parallel apply is not very similar to parallel query in detail. > I thought it might be worthwhile to also add another function like > am_leader_apply_worker(). I noticed at
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 11:44 AM Tom Lane wrote: > > Amit Kapila writes: > > To avoid these confusions, we have disallowed adding a table if its > > schema is already part of the publication and vice-versa. > > Really? > > Is there logic in ALTER TABLE SET SCHEMA that rejects the command > dependent on the contents of the publication tables? > Yes, it has. For example, postgres=# create schema s1; CREATE SCHEMA postgres=# create table s1.t1(c1 int); CREATE TABLE postgres=# create schema s2; CREATE SCHEMA postgres=# create publication pub1 for all tables in schema s2, table s1.t1; CREATE PUBLICATION postgres=# Alter table s1.t1 set schema s2; ERROR: cannot move table "t1" to schema "s2" DETAIL: The schema "s2" and same schema's table "t1" cannot be part of the same publication "pub1". > If so, are > there locks taken in both ALTER TABLE SET SCHEMA and the > publication-modifying commands that are sufficient to prevent > race conditions in such changes? > Good point. I have checked it and found that ALTER TABLE SET SCHEMA takes AccessExclusiveLock on relation and AccessShareLock on the schema which it is going to set. The alter publication command takes ShareUpdateExclusiveLock on relation for dropping/adding a table to publication which will prevent any race condition with ALTER TABLE SET SCHEMA. However, the alter publication command takes AccessShareLock for dropping/adding schema which won't block with ALTER TABLE SET SCHEMA command. So, I think we need to change the lock mode for it in alter publication command. > This position sounds quite untenable from here, even if I found > your arguments-in-support convincing, which I don't really. > ISTM the rule should be along the lines of "table S.T should > be published either if schema S is published or S.T itself is". > There's no obvious need to interconnect the two conditions. > This rule is currently followed when a subscription has more than one publication. It is just that we didn't allow it in the same publication because of a fear that it may cause confusion for some of the users. The other thing to look at here is that the existing case of a "FOR ALL TABLES" publication also follows a similar rule such that it doesn't allow adding individual tables if the publication is for all tables. For example, postgres=# create publication pub1 for all tables; CREATE PUBLICATION postgres=# alter publication pub1 add table t1; ERROR: publication "pub1" is defined as FOR ALL TABLES DETAIL: Tables cannot be added to or dropped from FOR ALL TABLES publications. So, why shouldn't a "FOR ALL TABLES IN SCHEMA" publication follow a similar behavior? -- With Regards, Amit Kapila.
Re: making relfilenodes 56 bits
On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar wrote: > On a separate note, while reviewing the latest patch I see there is some risk > of using the unflushed relfilenumber in GetNewRelFileNumber() function. > Basically, in the current code, the flushing logic is tightly coupled with > the logging new relfilenumber logic and that might not work with all the > values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD. So the idea is we need to > keep the flushing logic separate from the logging, I am working on the idea > and I will post the patch soon. I have fixed the issue, so now we will track nextRelFileNumber, loggedRelFileNumber and flushedRelFileNumber. So whenever nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more relfilenumbers. And whenever nextRelFileNumber reaches the flushedRelFileNumber then we will do XlogFlush for WAL upto the last loggedRelFileNumber. Ideally flushedRelFileNumber should always be VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can avoid tracking the flushedRelFileNumber. But I feel keeping track of the flushedRelFileNumber looks cleaner and easier to understand. For more details refer to the code in GetNewRelFileNumber(). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com From d04432467d34ebdc95934ab85409b1fad037c6cd Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Fri, 26 Aug 2022 10:20:18 +0530 Subject: [PATCH v17] Widen relfilenumber from 32 bits to 56 bits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently relfilenumber is 32 bits wide and that has a risk of wraparound so the relfilenumber can be reused. And to guard against the relfilenumber reuse there is some complicated hack which leaves a 0-length tombstone file around until the next checkpoint. And when we allocate a new relfilenumber we also need to loop to check the on disk conflict. As part of this patch we are making the relfilenumber 56 bits wide and there will be no provision for wraparound. So after this change we will be able to get rid of the 0-length tombstone file and the loop for checking the on-disk conflict of the relfilenumbers. The reason behind making it 56 bits wide instead of directly making 64 bits wide is that if we make it 64 bits wide then the size of the BufferTag will be increased which will increase the memory usage and that may also impact the performance. So in order to avoid that, inside the buffer tag, we will use 8 bits for the fork number and 56 bits for the relfilenumber. 
--- contrib/pg_buffercache/Makefile| 4 +- .../pg_buffercache/pg_buffercache--1.3--1.4.sql| 30 contrib/pg_buffercache/pg_buffercache.control | 2 +- contrib/pg_buffercache/pg_buffercache_pages.c | 39 +++- contrib/pg_prewarm/autoprewarm.c | 4 +- contrib/pg_walinspect/expected/pg_walinspect.out | 4 +- contrib/pg_walinspect/sql/pg_walinspect.sql| 4 +- contrib/test_decoding/expected/rewrite.out | 2 +- contrib/test_decoding/sql/rewrite.sql | 2 +- doc/src/sgml/catalogs.sgml | 2 +- doc/src/sgml/pgbuffercache.sgml| 2 +- doc/src/sgml/storage.sgml | 5 +- src/backend/access/gin/ginxlog.c | 2 +- src/backend/access/rmgrdesc/gistdesc.c | 2 +- src/backend/access/rmgrdesc/heapdesc.c | 2 +- src/backend/access/rmgrdesc/nbtdesc.c | 2 +- src/backend/access/rmgrdesc/seqdesc.c | 2 +- src/backend/access/rmgrdesc/xlogdesc.c | 21 ++- src/backend/access/transam/README | 5 +- src/backend/access/transam/varsup.c| 199 - src/backend/access/transam/xlog.c | 50 ++ src/backend/access/transam/xlogprefetcher.c| 14 +- src/backend/access/transam/xlogrecovery.c | 6 +- src/backend/access/transam/xlogutils.c | 12 +- src/backend/backup/basebackup.c| 2 +- src/backend/catalog/catalog.c | 95 -- src/backend/catalog/heap.c | 25 +-- src/backend/catalog/index.c| 11 +- src/backend/catalog/storage.c | 8 + src/backend/commands/tablecmds.c | 12 +- src/backend/commands/tablespace.c | 2 +- src/backend/nodes/gen_node_support.pl | 4 +- src/backend/replication/logical/decode.c | 1 + src/backend/replication/logical/reorderbuffer.c| 2 +- src/backend/storage/file/reinit.c | 28 +-- src/backend/storage/freespace/fsmpage.c| 2 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/backend/storage/smgr/md.c | 7 + src/backend/storage/smgr/smg
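To restate the allocation scheme described above in pseudocode (an illustrative sketch only — the variable and macro names are taken from the description, the helper calls are not the actual patch code):

	/* running low on pre-logged values? WAL-log a new upper bound */
	if (loggedRelFileNumber - nextRelFileNumber <= VAR_RELNUMBER_NEW_XLOG_THRESHOLD)
	{
		loggedRelFileNumber += VAR_RELNUMBER_PER_XLOG;
		recptr = LogNextRelFileNumber(loggedRelFileNumber);
	}

	/* never hand out a value whose covering WAL record may not be on disk yet */
	if (nextRelFileNumber >= flushedRelFileNumber)
	{
		XLogFlush(recptr);
		flushedRelFileNumber = loggedRelFileNumber;
	}

	return nextRelFileNumber++;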
Re: [BUG] wrong FK constraint name when colliding name on ATTACH
On 2022-Sep-09, Amit Langote wrote: > Yes, the loop in GetForeignKeyActionTriggers() needs this: > > + /* Only ever look at "action" triggers on the PK side. */ > + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_PK) > + continue; > > Likewise, GetForeignKeyCheckTriggers() needs this: > > + /* Only ever look at "check" triggers on the FK side. */ > + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_FK) > + continue; > > We evidently missed this in f4566345cf40b0. Ouch. Thank you, pushed. > On Fri, Sep 9, 2022 at 4:54 AM Tom Lane wrote: > > I'd counsel dropping the early-exit optimization; it doesn't > > save much I expect, and it evidently hides bugs. Or maybe > > make it conditional on !USE_ASSERT_CHECKING. > > While neither of these functions are called in hot paths, I am > inclined to keep the early-exit bit in non-assert builds. I kept it that way. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Re: SI-read predicate locks on materialized views
On Tue, Jul 26, 2022 at 3:31 PM Richard Guo wrote: > > > On Tue, Jul 26, 2022 at 3:44 PM Yugo NAGATA wrote: >> >> If such two transactions run concurrently, a write skew anomaly occurs, >> and the result of order_summary refreshed in T1 will not contain the >> record inserted in T2. Yes we do have write skew anomaly. I think the patch looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Implementing Incremental View Maintenance
Hello huyajun, I'm sorry for the delay in my response. On Tue, 26 Jul 2022 12:00:26 +0800 huyajun wrote: > I read your patch and think this processing is great, but there is a risk of > deadlock. > Although I have not thought of a suitable processing method for the time > being, > it is also acceptable for truncate scenarios. The deadlock scene is as follows: > > Mv define is: select * from base_a,base_b; > S1: truncate base_a; ― only AccessExclusiveLock base_a and not run into after > trigger > S2: insert into base_b; ― The update has been completed and the incremental > refresh is started in the after trigger, RowExclusive on base_b and > ExclusiveLock on mv > S1: continue truncate mv, wait for AccessExclusiveLock on mv, wait for S2 > S2: continue refresh mv, wait for AccessShareLock on base_a, wait for S1 > So deadlock occurred Hmm, this deadlock scenario is possible, indeed. One idea to resolve it is to acquire RowExclusive locks on all base tables in the BEFORE trigger. If so, S2 cannot make progress because it waits for a RowExclusive lock on base_b, and it cannot acquire ExclusiveLock on mv before S1 finishes. > I also found some new issues that I would like to discuss with you Thank you so much for your massive bug reports! > 1. Concurrent DML causes imv data error, case like below > Setup: > Create table t( a int); > Insert into t select 1 from generate_series(1,3); > create incremental materialized view s as select count(*) from t; > > S1: begin;delete from t where ctid in (select ctid from t limit 1); > S2: begin;delete from t where ctid in (select ctid from t limit 1 offset 1); > S1: commit; > S2: commit; > > After this, the count data of s becomes 2 but the correct data is 1. > I found out that the problem is probably due to our use of ctid update > Consider user behavior unrelated to imv: > > Create table t( a int); > Insert into t select 1; > s1: BEGIN > s1: update t set a = 2 where ctid in (select ctid from t); -- UPDATE 1 > s2: BEGIN > s2: update t set a = 3 where ctid in (select ctid from t); -- wait row lock > s1: COMMIT > s2: -- UPDATE 0 -- ctid changed so can't UPDATE any rows > So we lost the s2 update > > 2. Sometimes it will crash when the columns of the created materialized view > do not match > Create table t( a int); > create incremental materialized view s(z) as select sum(1) as a, sum(1) as b > from t; > > The problem should be that colNames in rewriteQueryForIMMV does not consider > this situation > > 3. Sometimes no error when the columns of the created materialized view do > not match > Create table t( a int); > create incremental materialized view s(y,z) as select count(1) as b from t; > > But the hidden column of IMV is overwritten to z which will cause refresh > to fail. > > The problem should be that in checkRuleResultList we should only skip the imv hidden > columns check > > 4. A unique index should not be created in the case of a Cartesian product > > create table base_a (i int primary key, j varchar); > create table base_b (i int primary key, k varchar); > INSERT INTO base_a VALUES > (1,10), > (2,20), > (3,30), > (4,40), > (5,50); > INSERT INTO base_b VALUES > (1,101), > (2,102), > (3,103), > (4,104); > CREATE incremental MATERIALIZED VIEW s as > select base_a.i,base_a.j from base_a,base_b; ― create error because of unique > index I am working on the above issues (#1-#4) now, and I'll respond to each later. > 5. Besides, I would like to ask you if you have considered implementing an > IMV with delayed refresh? 
> The advantage of delayed refresh is that it will not have much impact on > write performance Yes, I've been thinking to implement deferred maintenance since the beginning of this IVM project. However, we've decided to start from immediate maintenance, and will plan to propose deferred maintenance to the core after the current patch is accepted. (I plan to implement this feature in pg_ivm extension module first, though.) > I probably have some ideas about it now, do you think it works? > 1. After the base table is updated, the delayed IMV's after trigger is used > to record the delta > information in another table similar to the incremental log of the base table > 2. When incremental refresh, use the data in the log instead of the data in > the trasient table > of the after trigger > 3. We need to merge the incremental information in advance to ensure that the > base_table > after transaction filtering UNION ALL old_delta is the state before the base > table is updated > Case like below: > Create table t( a int); > ―begin to record log > Insert into t select 1; ― newlog: 1 oldlog: empty > Delete from t; ―newlog:1, oldlog:1 > ― begin to incremental refresh > Select * from t where xmin < xid or (xmin = xid and cmin < cid); ― empty > So this union all oldlog is not equal to before the base table is updated > We need merge the incremental log in advance to make newlog: empty, oldlog: > empty > > If implemented, incre
Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~
On Fri, Sep 09, 2022 at 10:55:55AM +0900, Michael Paquier wrote: > Thanks. This set of simplifications is too good to let go, and I have > a window to look after the buildfarm today and tomorrow, which should > be enough to take action if need be. Hence, I have applied the > patch. Now, let's see what the buildfarm tells us ;) Based on what I can see, the Windows animals seem to have digested 47bd0b3 (cygwin, MinGW and MSVC), so I think that we are good. -- Michael signature.asc Description: PGP signature
Re: New docs chapter on Transaction Management and related changes
On Thu, 8 Sept 2022 at 08:42, Alvaro Herrera wrote: > > On 2022-Sep-06, Simon Riggs wrote: > > > On Tue, 6 Sept 2022 at 17:19, Erik Rijkers wrote: > > > > > > Op 06-09-2022 om 17:16 schreef Simon Riggs: > > > > New chapter on transaction management, plus a few related changes. > > > > > > > > Markup and links are not polished yet, so please comment initially on > > > > the topics, descriptions and wording. > > I think the concept of XID epoch should be described also, in or near > the paragraph that talks about wrapping around and switching between > int32 and int64. Right now it's a bit unclear why/how that works. Will do -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Doc fix and adjustment for MERGE command
On 2022-Sep-08, Vik Fearing wrote: > On 9/7/22 22:51, Vitaly Burovoy wrote: > > Hello hackers! > > > > Reading docs for the MERGE statement I've found a little error: a > > semicolon in middle of a statement and absence of a semicolon in the end > > of it. > > > > Key words in subqueries are written in uppercase everywhere in the docs > > but not in an example for MERGE. I think it should be adjusted too. > I agree with both of these patches (especially the semicolon part which is > not subjective). OK, pushed both together. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: [PATCH]Feature improvement for MERGE tab completion
On 2022-09-09 11:18, bt22kawamotok wrote: I created a patch for improving MERGE tab completion. Currently there is a problem that when "MERGE INTO dst as d Using src as s ON d.key = s.key WHEN " is typed, "MATCHED" and "NOT MATCHED" are not completed. There is also a problem that typing "MERGE INTO a AS " completes "USING". This patch solves the above problems. Thanks for the patch! else if (TailMatches("USING", MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); else if (TailMatches("USING", MatchAny, "AS", MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); else if (TailMatches("USING", MatchAny, MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); I thought it would be better to describe this section as follows, summarizing the conditions: else if (TailMatches("USING", MatchAny, "ON", MatchAny, "WHEN") || TailMatches("USING", MatchAny, "AS", MatchAny, "ON", MatchAny, "WHEN") || TailMatches("USING", MatchAny, MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); There are similar redundancies in the tab completion of the MERGE statement, so why not fix those as well? -- Regards, -- Shinya Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: MERGE and parsing with prepared statements
On 2022-Aug-12, Simon Riggs wrote: > Sorry, but I disagree with this chunk in the latest commit, > specifically, changing the MATCHED from after to before the NOT > MATCHED clause. > > The whole point of the second example was to demonstrate that the > order of the MATCHED/NOT MATCHED clauses made no difference. > > By changing the examples so they are the same, the sentence at line > 573 now makes no sense. Not sure how to fix this. We could add put the WHEN clauses in the order they were, and add another phrase to that paragraph to explain more explicitly that the order is not relevant (while at the same time explaining that if you have multiple WHEN MATCHED clauses, the relative order of those *is* relevant). -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "At least to kernel hackers, who really are human, despite occasional rumors to the contrary" (LWN.net)
Re: Doc fix and adjustment for MERGE command
On 2022-09-09 11:54Z, Alvaro Herrera wrote: On 2022-Sep-08, Vik Fearing wrote: On 9/7/22 22:51, Vitaly Burovoy wrote: Hello hackers! Reading docs for the MERGE statement I've found a little error: a semicolon in middle of a statement and absence of a semicolon in the end of it. Key words in subqueries are written in uppercase everywhere in the docs but not in an example for MERGE. I think it should be adjusted too. I agree with both of these patches (especially the semicolon part which is not subjective). OK, pushed both together. Thank you! =) -- Best regards, Vitaly Burovoy
Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.
On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma wrote: > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat > wrote: > > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma > > wrote: > > > > > > Hi All, > > > > > > The logically decoded data are sent to the logical subscriber at the time > > > of transaction commit, assuming that the data is small. However, before > > > the transaction commit is performed, the LSN representing the data that > > > is yet to be received by the logical subscriber appears in the > > > confirmed_flush_lsn column of pg_replication_slots catalog. Isn't the > > > information seen in the confirmed_flush_lsn column while the transaction > > > is in progress incorrect ? esp considering the description given in the > > > pg doc for this column. > > > > > > Actually, while the transaction is running, the publisher keeps on > > > sending keepalive messages containing LSN of the last decoded data saved > > > in reorder buffer and the subscriber responds with the same LSN as the > > > last received LSN which is then updated as confirmed_flush_lsn by the > > > publisher. I think the LSN that we are sending with the keepalive message > > > should be the one representing the transaction begin message, not the LSN > > > of the last decoded data which is yet to be sent. Please let me know if I > > > am missing something here. > > > > The transactions with commit lsn < confirmed_flush_lsn are confirmed > > to be received (and applied by the subscriber. Setting LSN > > corresponding to a WAL record within a transaction in progress as > > confirmed_flush should be ok. Since the transactions are interleaved > > in WAL stream, it's quite possible that LSNs of some WAL records of an > > inflight transaction are lesser than commit LSN of some another > > transaction. So setting commit LSN of another effectively same as > > setting it to any of the LSNs of any previous WAL record irrespective > > of the transaction that it belongs to. > > Thank you Ashutosh for the explanation. I still feel that the > documentation on confirmed_flush_lsn needs some improvement. It > actually claims that all the data before the confirmed_flush_lsn has > been received by the logical subscriber, but that's not the case. It > actually means that all the data belonging to the transactions with > commit lsn < confirmed_flush_lsn has been received and applied by the > subscriber. So setting confirmed_flush_lsn to the lsn of wal records > generated by running transaction might make people think that the wal > records belonging to previous data of the same running transaction has > already been received and applied by the subscriber node, but that's > not true. > Can you please point to the documentation. It's true that it needs to be clarified. But what you are saying may not be entirely true in case of streamed transaction. In that case we might send logically decoded changes of an ongoing transaction as well. They may even get applied but not necessarily committed. It's a bit complicated. :) -- Best Wishes, Ashutosh Bapat
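(For anyone following along, the value being discussed can be watched with an ordinary catalog query while a large transaction is being decoded; nothing here is specific to any patch:

	SELECT slot_name, confirmed_flush_lsn, restart_lsn
	FROM pg_replication_slots
	WHERE slot_type = 'logical';

Comparing confirmed_flush_lsn against pg_current_wal_lsn() while the transaction is still open shows the advancing behaviour described above.)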
Re: Summary function for pg_buffercache
Hi hackers, I also added documentation changes into the patch. You can find it attached. I would appreciate any feedback about this pg_buffercache_summary function. Best, Melih From 82e92d217dd240a9b6c1184cf29d4718343558b8 Mon Sep 17 00:00:00 2001 From: Melih Mutlu Date: Tue, 9 Aug 2022 16:42:23 +0300 Subject: [PATCH] Added pg_buffercache_summary function Adds pg_buffercache_summary() function into pg_buffercache extension for retrieving summary information about overall shared_buffer usage. --- contrib/pg_buffercache/Makefile | 3 +- .../expected/pg_buffercache.out | 9 ++ .../pg_buffercache--1.3--1.4.sql | 13 +++ contrib/pg_buffercache/pg_buffercache.control | 2 +- contrib/pg_buffercache/pg_buffercache_pages.c | 80 - contrib/pg_buffercache/sql/pg_buffercache.sql | 5 + doc/src/sgml/pgbuffercache.sgml | 110 +- 7 files changed, 215 insertions(+), 7 deletions(-) create mode 100644 contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile index d74b3e853c..d6b58d4da9 100644 --- a/contrib/pg_buffercache/Makefile +++ b/contrib/pg_buffercache/Makefile @@ -7,7 +7,8 @@ OBJS = \ EXTENSION = pg_buffercache DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \ - pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql + pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \ + pg_buffercache--1.3--1.4.sql PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time" REGRESS = pg_buffercache diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out index 138556efc9..6994761d0a 100644 --- a/contrib/pg_buffercache/expected/pg_buffercache.out +++ b/contrib/pg_buffercache/expected/pg_buffercache.out @@ -8,3 +8,12 @@ from pg_buffercache; t (1 row) +select used_buffers + unused_buffers > 0, +dirty_buffers < used_buffers, +pinned_buffers < used_buffers +from pg_buffercache_summary(); + ?column? | ?column? | ?column? +--+--+-- + t| t| t +(1 row) + diff --git a/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql b/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql new file mode 100644 index 00..02800dbf18 --- /dev/null +++ b/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql @@ -0,0 +1,13 @@ +/* contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.4'" to load this file. \quit + +CREATE FUNCTION pg_buffercache_summary() +RETURNS TABLE (used_buffers int4, unused_buffers int4, dirty_buffers int4, +pinned_buffers int4, avg_usagecount real) +AS 'MODULE_PATHNAME', 'pg_buffercache_summary' +LANGUAGE C PARALLEL SAFE; + +-- Don't want these to be available to public. 
+REVOKE ALL ON FUNCTION pg_buffercache_summary() FROM PUBLIC; diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control index 8c060ae9ab..a82ae5f9bb 100644 --- a/contrib/pg_buffercache/pg_buffercache.control +++ b/contrib/pg_buffercache/pg_buffercache.control @@ -1,5 +1,5 @@ # pg_buffercache extension comment = 'examine the shared buffer cache' -default_version = '1.3' +default_version = '1.4' module_pathname = '$libdir/pg_buffercache' relocatable = true diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index c5754ea9fa..89f8a2e834 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -17,6 +17,7 @@ #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 #define NUM_BUFFERCACHE_PAGES_ELEM 9 +#define NUM_BUFFERCACHE_SUMMARY_ELEM 5 PG_MODULE_MAGIC; @@ -43,7 +44,6 @@ typedef struct int32 pinning_backends; } BufferCachePagesRec; - /* * Function context for data persisting over repeated calls. */ @@ -53,12 +53,12 @@ typedef struct BufferCachePagesRec *record; } BufferCachePagesContext; - /* * Function returning data from the shared buffer cache - buffer number, * relation node/tablespace/database/blocknum and dirty indicator. */ PG_FUNCTION_INFO_V1(pg_buffercache_pages); +PG_FUNCTION_INFO_V1(pg_buffercache_summary); Datum pg_buffercache_pages(PG_FUNCTION_ARGS) @@ -237,3 +237,79 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) else SRF_RETURN_DONE(funcctx); } + +Datum +pg_buffercache_summary(PG_FUNCTION_ARGS) +{ + Datum result; + TupleDesc tupledesc; + HeapTuple tuple; + Datum values[NUM_BUFFERCACHE_SUMMARY_ELEM]; + bool nulls[NUM_BUFFERCACHE_SUMMARY_ELEM]; + + int32 used_buffers = 0; + int32 unused_buffers = 0; + int32 dirty_buffers = 0; + int32 pinned_buffers = 0; + float avg_usagecount = 0; + + /* Construct a tuple descriptor for the result rows. */ + tupledesc = CreateTemplateTupleDesc(NUM_BUFFERCACHE_SUMM
Avoid overhead with fprintf related functions
Based on work in [1]. According to https://cplusplus.com/reference/cstdio/fprintf/, fprintf is meant for generating a string based on a format, which should be something more than a plain "%s". fprintf has overhead from parsing the "format" parameter, plus all the trouble of checking the va_arg parameters, so I think this is one of the low-hanging fruits available and easy to reap. By replacing fprintf with its equivalents, fputs and fputc, we avoid that overhead and increase security [2] and [3]. The downside is a large amount of code churn, which unfortunately will occur. But, IMHO, the advantages are worth it. Note that behavior remains the same, since fputs and fputc do not change the expected behavior of fprintf. A small performance gain is expected, mainly for the client, since there are several occurrences in some critical places, such as (usr/src/fe_utils/print.c). Patch attached. This passes check-world. regards, Ranier Vilela [1] https://www.postgresql.org/message-id/CAApHDvp2THseLvCc%2BTcYFBC7FKHpHTs1JyYmd2JghtOVhb5WGA%40mail.gmail.com [2] https://stackoverflow.com/questions/20837989/fprintf-stack-buffer-overflow [3] https://bufferoverflows.net/format-string-vulnerability-what-when-and-how/ fprintf_fixes.patch Description: Binary data
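To illustrate the kind of mechanical replacement being proposed (an illustrative diff, not lines taken from the attached patch):

	- fprintf(fout, "%s", str);
	+ fputs(str, fout);

	- fprintf(fout, "\n");
	+ fputc('\n', fout);

The output is byte-for-byte the same; only the run-time parsing of the format string is skipped. Note that the argument order differs between fprintf (stream first) and fputs (stream last), which is where most of the mechanical churn comes from.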
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 5:21 AM Amit Kapila wrote: > So, why shouldn't a "FOR ALL TABLES IN SCHEMA" publication follow a > similar behavior? Surely that is not the same case at all. If you're publishing everything, there's no point in also having a specific list of things that you want published, but when you're publishing only some things, there is. If my wife tells me to wash everything in the laundry basket and also my nice pants, and I discover that my nice pants already happen to be in the laundry basket, I do not tell her: ERROR: my nice pants are already in the laundry basket It feels like a mistake to me that there's any catalog representation at all for a table that is published because it is part of a schema. The way a feature like this should work is that the schema should be labelled as published, and we should discover which tables are part of it at any given time as we go. We shouldn't need separate catalog entries for each table in the schema just because the schema is published. But if we do have such catalog entries, surely there should be a difference between the catalog entry that gets created when the table is individually published and the one that gets created when the containing schema is published. We have such tracking in other cases (coninhcount, conislocal; attinhcount, attislocal). In my opinion, this shouldn't have been committed working the way it does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: HOT chain validation in verify_heapam()
On Wed, Sep 7, 2022 at 2:49 AM Robert Haas wrote: > > But here's one random idea: add a successor[] array and an lp_valid[] > array. In the first loop, set lp_valid[offset] = true if it passes the > check_lp() checks, and set successor[A] = B if A redirects to B or has > a CTID link to B, without matching xmin/xmax. Then, in a second loop, > iterate over the successor[] array. If successor[A] = B && lp_valid[A] > && lp_valid[B], then check whether A.xmax = B.xmin; if so, then > complain if predecessor[B] is already set, else set predecessor[B] = > A. Then, in the third loop, iterate over the predecessor array just as > you're doing now. Then it's clear that we do the lp_valid checks > exactly once for every offset that might need them, and in order. And > it's also clear that the predecessor-based checks can never happen > unless the lp_valid checks passed for both of the offsets involved. > > > Approach of introducing a successor array is good but I see one overhead with having both successor and predecessor array, that is, we will traverse each offset on page thrice(one more for original loop on offset) and with each offset we have to retrieve /reach an ItemID(PageGetItemId) and Item(PageGetItem) itself. This is not much overhead as they are all preprocessors but there will be some overhead. How about having new array(initialised with LP_NOT_CHECKED) of enum LPStatus as below typedef enum LPStatus { LP_NOT_CHECKED, LP_VALID, LP_NOT_VALID }LPStatus; and validating and setting with proper status at three places 1) while validating Redirect Tuple 2) while validating populating predecessor array and 3) at original place of "sanity check" something like: " if (lpStatus[rdoffnum] == LP_NOT_CHECKED) { ctx.offnum = rdoffnum; if (!check_lp(&ctx, ItemIdGetLength(rditem), ItemIdGetOffset(rditem))) { lpStatus[rdoffnum] = LP_NOT_VALID; continue; } lpStatus[rdoffnum] = LP_VALID; } else if (lpStatus[rdoffnum] == LP_NOT_VALID) continue; " thoughts? -- Regards, Himanshu Upadhyaya EnterpriseDB: http://www.enterprisedb.com
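For comparison, a rough sketch of the successor[]/lp_valid[] scheme quoted at the top (illustrative pseudocode only, not verify_heapam() code; the helper names are made up):

	/* Pass 1: line-pointer sanity checks, remember raw successor links. */
	for (off = FirstOffsetNumber; off <= maxoff; off++)
	{
		lp_valid[off] = check_lp(&ctx, ...);
		if (item is redirected or its tuple has a t_ctid link)
			successor[off] = target_offset;
	}

	/* Pass 2: keep only links whose xmax matches the successor's xmin. */
	for (a = FirstOffsetNumber; a <= maxoff; a++)
	{
		b = successor[a];
		if (b == InvalidOffsetNumber || !lp_valid[a] || !lp_valid[b])
			continue;
		if (TransactionIdEquals(xmax_of(a), xmin_of(b)))
		{
			if (predecessor[b] != InvalidOffsetNumber)
				report_corruption(...);		/* two tuples claim the same successor */
			else
				predecessor[b] = a;
		}
	}

	/* Pass 3: the existing predecessor-based HOT chain checks. */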
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
"Drouvot, Bertrand" writes: > Agree that it seems unlikely but maybe we could add a new GUC to turn > the regex usage on the hba file on/off (and use off as the default)? I think that will just add useless complication. regards, tom lane
Re: [PATCH v1] fix potential memory leak in untransformRelOptions
On 2022-Sep-01, Tom Lane wrote: > Junwang Zhao writes: > > result = lappend(result, makeDefElem(pstrdup(s), val, -1)); > > + pfree(s); > > I wonder why it's pstrdup'ing s in the first place. Yeah, I think both the pstrdups in that function are useless. The DefElems can just point to the correct portion of the (already pstrdup'd by TextDatumGetCString) copy of optiondatums[i]. We modify that copy to install \0 in the place where the = is, and that copy is not freed anywhere. diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 609329bb21..0aa4b334ab 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -1357,9 +1357,9 @@ untransformRelOptions(Datum options) if (p) { *p++ = '\0'; - val = (Node *) makeString(pstrdup(p)); + val = (Node *) makeString(p); } - result = lappend(result, makeDefElem(pstrdup(s), val, -1)); + result = lappend(result, makeDefElem(s, val, -1)); } return result; I think these pstrdups were already not necessary when the function was added in 265f904d8f25, because textout() was already known to return a palloc'ed copy of its input; but later 220db7ccd8c8 made this contract even more explicit. Keeping 's' and removing the pstrdups better uses memory, because we have a single palloc'ed chunk per option rather than two. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
RE: why can't a table be part of the same publication as its schema
On Friday, September 9, 2022 9:57 PM Robert Haas wrote: > > On Fri, Sep 9, 2022 at 5:21 AM Amit Kapila wrote: > > So, why shouldn't a "FOR ALL TABLES IN SCHEMA" publication follow a > > similar behavior? Hi > > It feels like a mistake to me that there's any catalog representation at all for a > table that is published because it is part of a schema. > The way a feature like this should work is that the schema should be labelled as > published, and we should discover which tables are part of it at any given time as > we go. We shouldn't need separate catalog entries for each table in the schema > just because the schema is published. IIRC, the feature currently works almost the same as you described. It doesn't create entries for tables that are published via their schema; it only records the published schema and checks which tables are part of it. Sorry if I misunderstand your points or missed something. Best regards, Hou zj
Re: Introduce wait_for_subscription_sync for TAP tests
Amit Kapila writes: > Pushed. Recently a number of buildfarm animals have failed at the same place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: # Failed test '2x3000 rows in t' # at t/100_bugs.pl line 149. # got: '9000' # expected: '6000' # Looks like you failed 1 test of 7. [09:30:56] t/100_bugs.pl .. This was the last commit to touch that test script. I'm thinking maybe it wasn't adjusted quite correctly? On the other hand, since I can't find any similar failures before the last 48 hours, maybe there is some other more-recent commit to blame. Anyway, something is wrong there. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2022-09-09%2012%3A03%3A46 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2022-09-09%2011%3A16%3A36 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-09-09%2010%3A33%3A19 [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-09-08%2010%3A56%3A59
Re: Summary function for pg_buffercache
Hi Melih, > I would appreciate any feedback/comment on this change. Another benefit of pg_buffercache_summary() you didn't mention is that it allocates much less memory than pg_buffercache_pages() does. Here is v3 where I added this to the documentation. The patch didn't apply to the current master branch with the following error: ``` pg_buffercache_pages.c:286:19: error: no member named 'rlocator' in 'struct buftag' if (bufHdr->tag.rlocator.relNumber != InvalidOid) ~~~ ^ 1 error generated. ``` I fixed this too. Additionally, the patch was pgindent'ed and some typos were fixed. However I'm afraid you can't examine BufferDesc's without taking locks. This is explicitly stated in buf_internals.h: """ Buffer header lock (BM_LOCKED flag) must be held to EXAMINE or change TAG, state or wait_backend_pgprocno fields. """ Let's consider this code again (this is after my fix): ``` if (RelFileNumberIsValid(BufTagGetRelNumber(bufHdr))) { /* ... */ } ``` When somebody modifies relNumber concurrently (e.g. calls ClearBufferTag()) this will cause an undefined behaviour. I suggest we focus on saving the memory first and then think about the performance, if necessary. -- Best regards, Aleksander Alekseev v3-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
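A minimal sketch of how that loop could examine the header safely, using the existing helpers from buf_internals.h (this is an assumption about how the patch might be reworked, not code taken from it):

	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);
		uint32		buf_state = LockBufHdr(bufHdr);	/* needed before examining tag/state */

		if (buf_state & BM_VALID)
		{
			used_buffers++;
			if (buf_state & BM_DIRTY)
				dirty_buffers++;
			if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
				pinned_buffers++;
			usagecount_total += BUF_STATE_GET_USAGECOUNT(buf_state);
		}
		else
			unused_buffers++;

		UnlockBufHdr(bufHdr, buf_state);
	}

Whether taking the header spinlock for every buffer is acceptable here, or whether an intentionally unlocked (and documented as approximate) read is preferable for a summary view, is of course part of the question.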
Re: HOT chain validation in verify_heapam()
On Fri, Sep 9, 2022 at 10:00 AM Himanshu Upadhyaya wrote: > Approach of introducing a successor array is good but I see one overhead with > having both successor and predecessor array, that is, we will traverse each > offset on page thrice(one more for original loop on offset) and with each > offset we have to retrieve > /reach an ItemID(PageGetItemId) and Item(PageGetItem) itself. This is not > much overhead as they are all preprocessors but there will be some overhead. > How about having new array(initialised with LP_NOT_CHECKED) of enum LPStatus > as below > > typedef enum LPStatus > { > LP_NOT_CHECKED, > LP_VALID, > LP_NOT_VALID > }LPStatus; > > and validating and setting with proper status at three places > 1) while validating Redirect Tuple > 2) while validating populating predecessor array and > 3) at original place of "sanity check" Well, having to duplicate the logic in three places doesn't seem all that clean to me. Admittedly, I haven't tried implementing my proposal, so maybe that doesn't come out very clean either. I don't know. But I think having the code be clearly correct here is the most important thing, not shaving a few CPU cycles here or there. It's not even clear to me that your way would be cheaper, because an "if" statement is certainly not free, and in fact is probably more expensive than an extra call to PageGetItem() or PageGetItemId(). Branches are usually more expensive than math. But actually I doubt that it matters much either way. I think the way to figure out whether we have a performance problem is to write the code in the stylistically best way and then test it. There may be no problem at all, and if there is a problem, it may not be where we think it will be. In short, let's apply Knuth's optimization principle. -- Robert Haas EDB: http://www.enterprisedb.com
Re: configure --with-uuid=bsd fails on NetBSD
Hi, On 8/26/22 19:21, Tom Lane wrote: Nazir Bilal Yavuz writes: Based on these discussions, I attached a patch. I think the right fix is to call uuid_create and then actually check the version field of the result. This avoids breaking what need not be broken, and it'd also guard against comparable problems on other platforms (so don't blame NetBSD specifically in the message, either). I updated my patch. I checked version field in 'uuid_generate_internal' function instead of checking it in 'uuid_generate_v1' and 'uuid_generate_v1mc' functions, but I have some questions: 1 - Should it be checked only for '--with-uuid=bsd' option? 1.1 - If it is needed to be checked only for '--with-uuid=bsd', should just NetBSD be checked? 2 - Should it error out without including current UUID version in the error message? General error message could mask if the 'uuid_create' function starts to produce UUIDs other than version-4. Regards, Nazir Bilal Yavuz diff --git a/contrib/uuid-ossp/expected/uuid_ossp_1.out b/contrib/uuid-ossp/expected/uuid_ossp_1.out new file mode 100644 index 00..e9db9d9b1b --- /dev/null +++ b/contrib/uuid-ossp/expected/uuid_ossp_1.out @@ -0,0 +1,111 @@ +CREATE EXTENSION "uuid-ossp"; +SELECT uuid_nil(); + uuid_nil +-- + ---- +(1 row) + +SELECT uuid_ns_dns(); + uuid_ns_dns +-- + 6ba7b810-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +SELECT uuid_ns_url(); + uuid_ns_url +-- + 6ba7b811-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +SELECT uuid_ns_oid(); + uuid_ns_oid +-- + 6ba7b812-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +SELECT uuid_ns_x500(); + uuid_ns_x500 +-- + 6ba7b814-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +-- some quick and dirty field extraction functions +-- this is actually timestamp concatenated with clock sequence, per RFC 4122 +CREATE FUNCTION uuid_timestamp_bits(uuid) RETURNS varbit AS +$$ SELECT ('x' || substr($1::text, 15, 4) || substr($1::text, 10, 4) || + substr($1::text, 1, 8) || substr($1::text, 20, 4))::bit(80) + & x'0FFF3FFF' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_version_bits(uuid) RETURNS varbit AS +$$ SELECT ('x' || substr($1::text, 15, 2))::bit(8) & '' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_reserved_bits(uuid) RETURNS varbit AS +$$ SELECT ('x' || substr($1::text, 20, 2))::bit(8) & '1100' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_multicast_bit(uuid) RETURNS bool AS +$$ SELECT (('x' || substr($1::text, 25, 2))::bit(8) & '0001') != '' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_local_admin_bit(uuid) RETURNS bool AS +$$ SELECT (('x' || substr($1::text, 25, 2))::bit(8) & '0010') != '' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_node(uuid) RETURNS text AS +$$ SELECT substr($1::text, 25) $$ +LANGUAGE SQL STRICT IMMUTABLE; +-- Ideally, the multicast bit would never be set in V1 output, but the +-- UUID library may fall back to MC if it can't get the system MAC address. +-- Also, the local-admin bit might be set (if so, we're probably inside a VM). +-- So we can't test either bit here. +SELECT uuid_version_bits(uuid_generate_v1()), + uuid_reserved_bits(uuid_generate_v1()); +ERROR: uuid_create function is not generating version-1 UUIDs +-- Although RFC 4122 only requires the multicast bit to be set in V1MC style +-- UUIDs, our implementation always sets the local-admin bit as well. 
+SELECT uuid_version_bits(uuid_generate_v1mc()), + uuid_reserved_bits(uuid_generate_v1mc()), + uuid_multicast_bit(uuid_generate_v1mc()), + uuid_local_admin_bit(uuid_generate_v1mc()); +ERROR: uuid_create function is not generating version-1 UUIDs +-- timestamp+clock sequence should be monotonic increasing in v1 +SELECT uuid_timestamp_bits(uuid_generate_v1()) < uuid_timestamp_bits(uuid_generate_v1()); +ERROR: uuid_create function is not generating version-1 UUIDs +SELECT uuid_timestamp_bits(uuid_generate_v1mc()) < uuid_timestamp_bits(uuid_generate_v1mc()); +ERROR: uuid_create function is not generating version-1 UUIDs +-- Ideally, the node value is stable in V1 addresses, but OSSP UUID +-- falls back to V1MC behavior if it can't get the system MAC address. +SELECT CASE WHEN uuid_multicast_bit(uuid_generate_v1()) AND + uuid_local_admin_bit(uuid_generate_v1()) THEN + true -- punt, no test + ELSE + uuid_node(uuid_generate_v1()) = uuid_node(uuid_generate_v1()) + END; +ERROR: uuid_create function is not generating version-1 UUIDs +-- In any case, V1MC node addresses should be random. +SELECT uuid_node(uuid_generate_v1()) <> uuid_node(uuid_generate_v
Remove redundant code in pl_exec.c
Hi, hackers I found there are some redundant code in pl_exec.c, plpgsql_param_eval_generic_ro is same as plpgsql_param_eval_generic except it invokes MakeExpandedObjectReadOnly. IMO, we can invoke plpgsql_param_eval_generic in plpgsql_param_eval_generic_ro to avoid the redundant. Is there something I missed? Any thoughts? -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c index 7bd2a9fff1..543419d3da 100644 --- a/src/pl/plpgsql/src/pl_exec.c +++ b/src/pl/plpgsql/src/pl_exec.c @@ -6673,34 +6673,7 @@ static void plpgsql_param_eval_generic_ro(ExprState *state, ExprEvalStep *op, ExprContext *econtext) { - ParamListInfo params; - PLpgSQL_execstate *estate; - int dno = op->d.cparam.paramid - 1; - PLpgSQL_datum *datum; - Oid datumtype; - int32 datumtypmod; - - /* fetch back the hook data */ - params = econtext->ecxt_param_list_info; - estate = (PLpgSQL_execstate *) params->paramFetchArg; - Assert(dno >= 0 && dno < estate->ndatums); - - /* now we can access the target datum */ - datum = estate->datums[dno]; - - /* fetch datum's value */ - exec_eval_datum(estate, datum, - &datumtype, &datumtypmod, - op->resvalue, op->resnull); - - /* safety check -- needed for, eg, record fields */ - if (unlikely(datumtype != op->d.cparam.paramtype)) - ereport(ERROR, -(errcode(ERRCODE_DATATYPE_MISMATCH), - errmsg("type of parameter %d (%s) does not match that when preparing the plan (%s)", - op->d.cparam.paramid, - format_type_be(datumtype), - format_type_be(op->d.cparam.paramtype; + plpgsql_param_eval_generic(state, op, econtext); /* force the value to read-only */ *op->resvalue = MakeExpandedObjectReadOnly(*op->resvalue,
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 10:29 AM houzj.f...@fujitsu.com wrote: > IIRC, the feature currently works almost the same as you described. It doesn't > create entry for tables that are published via its schema level, it only > record > the published schema and check which tables are part of it. Oh, well if that's the case, that is great news. But then I don't understand Amit's comment from before: > Yes, because otherwise, there was confusion while dropping the objects > from publication. Consider in the above case, if we would have allowed > it and then the user performs ALTER PUBLICATION p1 DROP ALL TABLES IN > SCHEMA s1, then (a) shall we remove both schema s1 and a table that is > separately added (s1.t1) from that schema, or (b) just remove schema > s1? I believe that (b) is the correct behavior, so I assumed that this issue must be some difficulty in implementing it, like a funny catalog representation. Things might be clearer if we'd made the syntax "ALTER PUBLICATION p1 { ADD | DROP } { TABLE | SCHEMA } name". I don't understand why we used this ALL TABLES IN SCHEMA language. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Remove redundant code in pl_exec.c
Japin Li writes: > I found there are some redundant code in pl_exec.c, > plpgsql_param_eval_generic_ro is same as plpgsql_param_eval_generic > except it invokes MakeExpandedObjectReadOnly. Which is exactly why it's NOT redundant. > IMO, we can invoke plpgsql_param_eval_generic in plpgsql_param_eval_generic_ro > to avoid the redundant. I don't like this particularly --- it puts way too much premium on the happenstance that the MakeExpandedObjectReadOnly call is the very last step in the callback function. If that needed to change, we'd have a mess. regards, tom lane
Re: is_superuser is not documented
On Fri, Sep 9, 2022, at 2:28 AM, bt22kawamotok wrote: > is_superuser function checks whether a user is a superuser or not, and > is commonly used. However, is_superuser is not documented and is set to > UNGROUPED in guc.c. I think is_superuser should be added to the > documentation and set to PRESET OPTIONS.What are you thought on this? There is no such function. Are you referring to the GUC? I agree that it should be added to the documentation. The main reason is that it is reported by ParameterStatus along with some other GUCs described in the Preset Options section. postgres=# \dfS is_superuser List of functions Schema | Name | Result data type | Argument data types | Type +--+--+-+-- (0 rows) postgres=# SHOW is_superuser; is_superuser -- on (1 row) Do you mind writing a patch? -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Remove redundant code in pl_exec.c
On Fri, 09 Sep 2022 at 23:34, Tom Lane wrote: > Japin Li writes: >> IMO, we can invoke plpgsql_param_eval_generic in >> plpgsql_param_eval_generic_ro >> to avoid the redundant. > > I don't like this particularly --- it puts way too much premium on > the happenstance that the MakeExpandedObjectReadOnly call is the > very last step in the callback function. If that needed to change, > we'd have a mess. > Sorry, I don't quite follow. Could you explain it a bit more? Thanks in advance! -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: pg_upgrade generated files in subdir follow-up
On Wed, Aug 31, 2022 at 04:41:24PM +0200, Daniel Gustafsson wrote: > > On 31 Aug 2022, at 15:59, Tom Lane wrote: > > > > Daniel Gustafsson writes: > >> Commit 38bfae36526 moved the .txt files pg_upgrade generates to a separate > >> subdir, but there are a few left which are written to cwd. The thread > >> resulting in that patch doesn't discuss these files specifically so it > >> seems > >> they are just an oversight. Unless I'm missing something. > > > >> Should something the attached be applied to ensure all generated files are > >> placed in the subdirectory? > > > > It certainly looks inconsistent ATM. I wondered if maybe the plan was to > > put routine output into the log directory but problem-reporting files > > into cwd --- but that isn't what's happening now. The script files are intended to stay where they are, and the error files are intended to move under the subdir, to allow for their easy removal, per Tom's request. > Right, check_proper_datallowconn and check_for_isn_and_int8_passing_mismatch > and a few other check functions already place error reporting in the subdir. It looks like I may have grepped for fprintf or similar, and missed checking output_path. I updated your patch to put the logic inside check_for_data_types_usage(), which is shorter, and seems to simplify doing what's intended in the future. -- Justin >From 36aaf5470d5b0bb0ad2a322880e526b8c5f87066 Mon Sep 17 00:00:00 2001 From: Daniel Gustafsson Date: Wed, 31 Aug 2022 14:09:06 +0200 Subject: [PATCH 1/2] pg_upgrade generated files in subdir follow-up Commit 38bfae36526 moved the .txt files pg_upgrade generates to a separate subdir, but there are a few left which are written to cwd. The thread resulting in that patch doesn't discuss these files specifically so it seems they are just an oversight. Unless I'm missing something. Should something the attached be applied to ensure all generated files are placed in the subdirectory?
-- Daniel Gustafsson https://vmware.com/ --- src/bin/pg_upgrade/check.c | 12 +--- src/bin/pg_upgrade/version.c | 12 +--- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index f4969bcdad7..f1bc1e68868 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -1218,7 +1218,9 @@ check_for_composite_data_type_usage(ClusterInfo *cluster) prep_status("Checking for system-defined composite types in user tables"); - snprintf(output_path, sizeof(output_path), "tables_using_composite.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_composite.txt"); /* * Look for composite types that were made during initdb *or* belong to @@ -1275,7 +1277,9 @@ check_for_reg_data_type_usage(ClusterInfo *cluster) prep_status("Checking for reg* data types in user tables"); - snprintf(output_path, sizeof(output_path), "tables_using_reg.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_reg.txt"); /* * Note: older servers will not have all of these reg* types, so we have @@ -1328,7 +1332,9 @@ check_for_jsonb_9_4_usage(ClusterInfo *cluster) prep_status("Checking for incompatible \"jsonb\" data type"); - snprintf(output_path, sizeof(output_path), "tables_using_jsonb.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_jsonb.txt"); if (check_for_data_type_usage(cluster, "pg_catalog.jsonb", output_path)) { diff --git a/src/bin/pg_upgrade/version.c b/src/bin/pg_upgrade/version.c index 064d23797c5..dc19fc6ec84 100644 --- a/src/bin/pg_upgrade/version.c +++ b/src/bin/pg_upgrade/version.c @@ -183,7 +183,9 @@ old_9_3_check_for_line_data_type_usage(ClusterInfo *cluster) prep_status("Checking for incompatible \"line\" data type"); - snprintf(output_path, sizeof(output_path), "tables_using_line.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_line.txt"); if (check_for_data_type_usage(cluster, "pg_catalog.line", output_path)) { @@ -221,7 +223,9 @@ old_9_6_check_for_unknown_data_type_usage(ClusterInfo *cluster) prep_status("Checking for invalid \"unknown\" user columns"); - snprintf(output_path, sizeof(output_path), "tables_using_unknown.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_unknown.txt"); if (check_for_data_type_usage(cluster, "pg_catalog.unknown", output_path)) { @@ -364,7 +368,9 @@ old_11_check_for_sql_identifier_data_type_usage(ClusterInfo *cluster) prep_status("Checking for invalid \"sql_identifier\" user columns"); - snprintf(output_path, sizeof(output_path), "tables_using_sql_identifier.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_sql_identifier.txt"); if (check_for_data_type_usage(cluster, "information_schema.sql_ide
Re: Avoid overhead with fprintf related functions
On Fri, Sep 09, 2022 at 10:45:37AM -0300, Ranier Vilela wrote: > Based on work in [1]. > According to https://cplusplus.com/reference/cstdio/fprintf/ > The use of fprintf is related to the need to generate a string based on a > format, which should be different from "%s". > Since fprintf has overhead when parsing the "format" parameter, plus all > the trouble of checking the va_arg parameters. > I think this is one of the low fruits available and easy to reap. > By replacing fprintf with its equivalents, fputs and fputc, > we avoid overhead and increase security [2] and [3]. > > The downside is a huge big churm, which unfortunately will occur. > But, IHMO, I think the advantages are worth it. > Note that behavior remains the same, since fputs and fputc do not change > the expected behavior of fprintf. > > A small performance gain is expected, mainly for the client, since there > are several occurrences in some critical places, such as > (usr/src/fe_utils/print.c). I agree with David [0]. But if you can demonstrate a performance gain, perhaps it's worth considering a subset of these changes in hot paths. [0] https://postgr.es/m/CAApHDvp2THseLvCc%2BTcYFBC7FKHpHTs1JyYmd2JghtOVhb5WGA%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
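For illustration, the kind of substitution being proposed is along these lines (a plain sketch; the stream and string names are made up):

```c
#include <stdio.h>

static void
emit_footer(FILE *fout, const char *footer)
{
    /* Before: the format string is parsed even though it is only "%s". */
    fprintf(fout, "%s", footer);
    fprintf(fout, "\n");

    /* After: identical output, but no format parsing or varargs handling. */
    fputs(footer, fout);
    fputc('\n', fout);
}
```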
Re: Add tracking of backend memory allocated to pg_stat_activity
Greetings, * Drouvot, Bertrand (bdrou...@amazon.com) wrote: > On 9/1/22 3:28 AM, Kyotaro Horiguchi wrote: > >At Wed, 31 Aug 2022 12:05:55 -0500, Justin Pryzby > >wrote in > >>On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > >>>Attached is a patch to > >>>Add tracking of backend memory allocated > > Thanks for the patch. > > + 1 on the idea. Glad folks are in support of the general idea. > >>>+ proargmodes => > >>>'{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', > >>In the past, there was concern about making pg_stat_activity wider by > >>adding information that's less-essential than what's been there for > >>years. This is only an int64, so it's not "wide", but I wonder if > >>there's another way to expose this information? Like adding backends to > >The view looks already too wide to me. I don't want the numbers for > >metrics are added to the view. > > +1 for a dedicated view. A dedicated view with a single column in it hardly seems sensible. I'd also argue that this particular bit of information is extremely useful and therefore worthy of being put directly into pg_stat_activity. I could see a dedicated view possibly *also* being added later if/when we provide a more detailed break-down of how the memory is being used but that's a whole other thing and I'm not even 100% sure we'll ever actually get there, as you can already poke a backend and have it dump out the memory context-level information on an as-needed basis. > While we are at it, what do you think about also recording the max memory > allocated by a backend? (could be useful and would avoid sampling for which > there is no guarantee to sample the max anyway). What would you do with that information..? By itself, it doesn't strike me as useful. Perhaps it'd be interesting to grab the max required for a particular query in pg_stat_statements or such but again, that's a very different thing. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] Fix alter subscription concurrency errors
On 2022-Aug-26, Jelte Fennema wrote: > I indeed don't think this problem is unique to subscriptions, but it seems > better to at least have this problem in a few places less (not making perfect > the enemy of good). > > If someone has a more generic way of solving this for other commands too, > then that sounds great, but if not then slowly chipping away at these cases > seems better than keeping the status quo. > > Attached is a new patch where ALTER SUBSCRIPTION ... OWNER TO ... can > now also be executed concurrently with the other subscription commands. Would it work to use get_object_address() instead? That would save having to write a lookup-and-lock function with a retry loop for each object type. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
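A rough sketch of what that could look like for ALTER SUBSCRIPTION, assuming the existing get_object_address() signature (the surrounding statement handling is elided):

```c
ObjectAddress addr;
Relation    rel;

/*
 * Look up the subscription by name and lock it, so a concurrent
 * ALTER/DROP SUBSCRIPTION cannot change it under us.
 */
addr = get_object_address(OBJECT_SUBSCRIPTION,
                          (Node *) makeString(stmt->subname),
                          &rel,
                          AccessExclusiveLock,
                          false);   /* missing_ok */

/* addr.objectId is the subscription's OID, still valid while locked */
```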
Re: Add tracking of backend memory allocated to pg_stat_activity
Greetings, * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > At Tue, 06 Sep 2022 17:10:49 -0400, Reid Thompson > wrote in > > I'm open to guidance on testing for performance degradation. I did > > note some basic pgbench comparison numbers in the thread regarding > > limiting backend memory allocations. > > Yeah.. That sounds good.. > > (I have a patch that is stuck at benchmarking on slight possible > degradation caused by a branch (or indirect call) on a hot path > similary to this one. The test showed fluctuation that is not clearly > distinguishable between noise and degradation by running the target > functions in a busy loop..) Just to be clear- this path is (hopefully) not *super* hot as we're only tracking actual allocations (that is- malloc() calls), this isn't changing anything for palloc() calls that aren't also needing to do a malloc(), and we already try to reduce the amount of malloc() calls we're doing by allocating more and more each time we run out in a given context. While I'm generally supportive of doing some benchmarking around this, I don't think the bar is as high as it would be if we were actually changing the cost of routine palloc() or such calls. Thanks, Stephen signature.asc Description: PGP signature
Re: ICU for global collation
In pg14: |postgres=# create database a LC_COLLATE C LC_CTYPE C LOCALE C; |ERROR: conflicting or redundant options |DETAIL: LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE. In pg15: |postgres=# create database a LC_COLLATE "en_US.UTF-8" LC_CTYPE "en_US.UTF-8" LOCALE "en_US.UTF-8" ; |CREATE DATABASE f2553d430 actually relaxed the restriction by removing this check: - if (dlocale && (dcollate || dctype)) - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), -errmsg("conflicting or redundant options"), -errdetail("LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE."))); But isn't the right fix to do the corresponding thing in createdb (relaxing the frontend restriction rather than reverting its relaxation in the backend). diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c index e523e58b218..5b80e56dfd9 100644 --- a/src/bin/scripts/createdb.c +++ b/src/bin/scripts/createdb.c @@ -159,15 +159,10 @@ main(int argc, char *argv[]) exit(1); } - if (locale) - { - if (lc_ctype) - pg_fatal("only one of --locale and --lc-ctype can be specified"); - if (lc_collate) - pg_fatal("only one of --locale and --lc-collate can be specified"); + if (locale && !lc_ctype) lc_ctype = locale; + if (locale && !lc_collate) lc_collate = locale; - } if (encoding) { BTW it's somewhat crummy that it uses a string comparison, so if you write "UTF8" without a dash, it says this; it took me a few minutes to see the difference... postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE "en_US.UTF8" LOCALE "en_US.UTF8"; ERROR: new collation (en_US.UTF8) is incompatible with the collation of the template database (en_US.UTF-8)
Re: configure --with-uuid=bsd fails on NetBSD
Nazir Bilal Yavuz writes: > I updated my patch. I checked version field in 'uuid_generate_internal' > function instead of checking it in 'uuid_generate_v1' and > 'uuid_generate_v1mc' functions, but I have some questions: Yeah, that seems like the right place. I tweaked the code to check strbuf not str just so we aren't making unnecessary assumptions about the length of what is returned. strbuf[14] is guaranteed to exist, str[14] maybe not. > 1 - Should it be checked only for '--with-uuid=bsd' option? > 1.1 - If it is needed to be checked only for '--with-uuid=bsd', > should just NetBSD be checked? I don't see any reason not to check in the BSD code path --- it's a cheap enough test. On the other hand, in the other code paths there is no evidence that it's necessary, and we'd find out soon enough if it becomes necessary. > 2 - Should it error out without including current UUID version in the > error message? General error message could mask if the 'uuid_create' > function starts to produce UUIDs other than version-4. Yeah, I thought reporting the actual version digit was worth doing, and made it do so. Pushed with those changes and doc updates. I did not push the variant expected-file. I think the entire point here is that we are *not* deeming the new NetBSD implementation acceptable, so allowing it to pass regression tests is the wrong thing. regards, tom lane
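For reference, the check being described amounts to something like the following in uuid_generate_internal()'s BSD code path (a sketch only; the exact error wording and error code in the pushed commit may differ):

```c
/* strbuf[] holds the 36-character text form of the newly created UUID */
if (strbuf[14] != '1')
    ereport(ERROR,
            (errmsg("uuid_create() did not produce a version-1 UUID (got version %c)",
                    strbuf[14])));
```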
Re: Add the ability to limit the amount of memory that can be allocated to backends.
Greetings, * David Rowley (dgrowle...@gmail.com) wrote: > On Thu, 1 Sept 2022 at 04:52, Reid Thompson > wrote: > > Add the ability to limit the amount of memory that can be allocated to > > backends. > > Are you aware that relcache entries are stored in backend local memory > and that once we've added a relcache entry for a relation that we have > no current code which attempts to reduce the memory consumption used > by cache entries when there's memory pressure? Short answer to this is yes, and that's an issue, but it isn't this patch's problem to deal with- that's an issue that the relcache system needs to be changed to address. > It seems to me that if we had this feature as you propose that a > backend could hit the limit and stay there just from the memory > requirements of the relation cache after some number of tables have > been accessed from the given backend. It's not hard to imagine a > situation where the palloc() would start to fail during parse, which > might make it quite infuriating for anyone trying to do something > like: Agreed that this could happen but I don't imagine it to be super likely- and even if it does, this is probably a better position to be in as the backend could then be disconnected from and would then go away and its memory free'd, unlike the current OOM-killer situation where we crash and go through recovery. We should note this in the documentation though, sure, so that administrators understand how this can occur and can take action to address it. > I think a better solution to this problem would be to have "memory > grants", where we configure some amount of "pool" memory that backends > are allowed to use for queries. The planner would have to add the > expected number of work_mem that the given query is expected to use > and before that query starts, the executor would have to "checkout" > that amount of memory from the pool and return it when finished. If > there is not enough memory in the pool then the query would have to > wait until enough memory is available. This creates a deadlocking > hazard that the deadlock detector would need to be made aware of. Sure, that also sounds great and a query acceptance system would be wonderful. If someone is working on that with an expectation of it landing before v16, great. Otherwise, I don't see it as relevant to the question about if we should include this feature or not, and I'm not even sure that we'd refuse this feature even if we already had an acceptance system as a stop-gap should we guess wrong and not realize it until it's too late. Thanks, Stephen signature.asc Description: PGP signature
Re: is_superuser is not documented
"Euler Taveira" writes: > On Fri, Sep 9, 2022, at 2:28 AM, bt22kawamotok wrote: >> is_superuser function checks whether a user is a superuser or not, and >> is commonly used. However, is_superuser is not documented and is set to >> UNGROUPED in guc.c. I think is_superuser should be added to the >> documentation and set to PRESET OPTIONS.What are you thought on this? > There is no such function. Are you referring to the GUC? I agree that it > should > be added to the documentation. If you look at guc.c, it kind of seems intentional that it's undocumented: /* Not for general use --- used by SET SESSION AUTHORIZATION */ {"is_superuser", PGC_INTERNAL, UNGROUPED, gettext_noop("Shows whether the current user is a superuser."), NULL, GUC_REPORT | GUC_NO_SHOW_ALL | GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE }, &session_auth_is_superuser, false, NULL, NULL, NULL On the other hand, it seems pretty silly that it's GUC_REPORT if we want to consider it private. I've not checked the git history, but I bet that flag was added later with no thought about context. If we are going to document this then we should at least remove the GUC_NO_SHOW_ALL flag and rewrite the comment. I wonder whether the GUC_NO_RESET_ALL flag is needed either --- seems like the PGC_INTERNAL context protects it sufficiently. regards, tom lane
Re: Remove redundant code in pl_exec.c
Japin Li writes: > On Fri, 09 Sep 2022 at 23:34, Tom Lane wrote: >> I don't like this particularly --- it puts way too much premium on >> the happenstance that the MakeExpandedObjectReadOnly call is the >> very last step in the callback function. If that needed to change, >> we'd have a mess. > Sorry, I don't get your mind. Could you explain it more? Thanks in advance! This refactoring cannot support the situation where there is more code to execute after MakeExpandedObjectReadOnly. regards, tom lane
Re: Switching XLog source from archive to streaming when primary available
On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote: > On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi > wrote: >> At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart >> wrote in >> > My general point is that we should probably offer some basic preventative >> > measure against flipping back and forth between streaming and archive >> > recovery while making zero progress. As I noted, maybe that's as simple as >> > having WaitForWALToBecomeAvailable() attempt to restore a file from archive >> > at least once before the new parameter forces us to switch to streaming >> > replication. There might be other ways to handle this. >> >> +1. > > Hm. In that case, I think we can get rid of timeout based switching > mechanism and have this behaviour - the standby can attempt to switch > to streaming mode from archive, say, after fetching 1, 2 or a > configurable number of WAL files. In fact, this is the original idea > proposed by Satya in this thread. IMO the timeout approach would be more intuitive for users. When it comes to archive recovery, "WAL segment" isn't a standard unit of measure. WAL segment size can differ between clusters, and WAL files can have different amounts of data or take different amounts of time to replay. So I think it would be difficult for the end user to decide on a value. However, even the timeout approach has this sort of problem. If your parameter is set to 1 minute, but the current archive takes 5 minutes to recover, you won't really be testing streaming replication once a minute. That would likely need to be documented. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add tracking of backend memory allocated to pg_stat_activity
On Fri, Sep 09, 2022 at 12:34:15PM -0400, Stephen Frost wrote: > > While we are at it, what do you think about also recording the max memory > > allocated by a backend? (could be useful and would avoid sampling for which > > there is no guarantee to sample the max anyway). FYI, that's already kind-of available from getrusage: $ psql ts -c "SET log_executor_stats=on; SET client_min_messages=debug; SELECT a, COUNT(1) FROM generate_series(1,99) a GROUP BY 1;" |wc LOG: EXECUTOR STATISTICS ... ! 194568 kB max resident size Note that max rss counts things allocated outside postgres (like linked libraries). > What would you do with that information..? By itself, it doesn't strike > me as useful. Perhaps it'd be interesting to grab the max required for > a particular query in pg_stat_statements or such but again, that's a > very different thing. log_executor_stats is at level "debug", so it's not great to enable it for a single session, and painful to think about enabling it globally. This would be a lot friendlier. Storing the maxrss per backend somewhere would be useful (and avoid the issue of "sampling" with top), and I agree that it ought to be exposed in a view. For example, it might help to determine whether (and which!) backends are using a large multiple of work_mem, and then whether that can be increased. If/when we had a "memory budget allocator", this would help to determine how to set its GUCs, maybe to see "which backends are using the work_mem that are precluding this other backend from using an efficient query plan". I wonder if it's better to combine these two threads into one. The 0001 patch of course can be considered independently from the 0002 patch, as usual. Right now, there are different parties on both threads ... -- Justin
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On Sat, Sep 03, 2022 at 11:40:03PM -0400, Reid Thompson wrote: > > > + 0, 0, INT_MAX, > > > + NULL, NULL, NULL > > I think this needs a maximum like INT_MAX/1024/1024 > > Is this noting that we'd set a ceiling of 2048MB? The reason is that you're later multiplying it by 1024*1024, so you need to limit it to avoid overflowing. Compare with min_dynamic_shared_memory, Log_RotationSize, maintenance_work_mem, autovacuum_work_mem. typo: Explicitely +errmsg("request will exceed postgresql.conf defined max_total_backend_memory limit (%lu > %lu)", I wouldn't mention postgresql.conf - it could be in postgresql.auto.conf, or an include file, or a -c parameter. Suggest: allocation would exceed max_total_backend_memory limit... + ereport(LOG, errmsg("decrease reduces reported backend memory allocated below zero; setting reported to 0")); Suggest: deallocation would decrease backend memory below zero; + {"max_total_backend_memory", PGC_SIGHUP, RESOURCES_MEM, Should this be PGC_SU_BACKEND to allow a superuser to set a higher limit (or no limit)? There's a compilation warning under mingw cross compile due to sizeof(long). See d914eb347 and other recent commits which I guess is the current way to handle this. http://cfbot.cputube.org/reid-thompson.html For performance testing, you'd want to check what happens with a large number of max_connections (and maybe a large number of clients). TPS isn't the only thing that matters. For example, a utility command might sometimes do a lot of allocations (or deallocations), or a "parameterized nested loop" may loop over many outer tuples and reset for each. There are also a lot of places that reset to a "per-tuple" context. I started looking at its performance, but nothing to show yet. Would you keep people copied on your replies ("reply all")? Otherwise I (at least) may miss them. I think that's what's typical on these lists (and the list tool is smart enough not to send duplicates to people who are direct recipients). -- Justin
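To make the overflow point concrete, the idea is to bound the GUC so that the later megabyte-to-byte conversion cannot wrap (a sketch only; the backing variable name here is made up, and the GUC name follows the patch under review rather than anything final):

```c
{
    {"max_total_backend_memory", PGC_SIGHUP, RESOURCES_MEM,
        gettext_noop("Limits the total memory, in MB, available to backends."),
        NULL,
        GUC_UNIT_MB
    },
    &max_total_bkend_mem,
    0,
    0,
    INT_MAX / 1024 / 1024,  /* so (value * 1024 * 1024) cannot overflow an int */
    NULL, NULL, NULL
},
```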
Re: is_superuser is not documented
I wrote: > On the other hand, it seems pretty silly that it's GUC_REPORT if > we want to consider it private. I've not checked the git history, > but I bet that flag was added later with no thought about context. > > If we are going to document this then we should at least remove > the GUC_NO_SHOW_ALL flag and rewrite the comment. I wonder whether > the GUC_NO_RESET_ALL flag is needed either --- seems like the > PGC_INTERNAL context protects it sufficiently. BTW, "session_authorization" has a subset of these same issues: /* Not for general use --- used by SET SESSION AUTHORIZATION */ {"session_authorization", PGC_USERSET, UNGROUPED, gettext_noop("Sets the session user name."), NULL, GUC_IS_NAME | GUC_REPORT | GUC_NO_SHOW_ALL | GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE | GUC_NOT_WHILE_SEC_REST }, &session_authorization_string, NULL, check_session_authorization, assign_session_authorization, NULL I wonder why this one is marked USERSET where the other is not. You'd think both of them need similar special-casing about how to handle SET. regards, tom lane
Re: Possible crash on standby
On Fri, Sep 09, 2022 at 05:29:49PM +0900, Kyotaro Horiguchi wrote: > This is because WaitForWALToBecomeAvailable doesn't call > XLogSHutdownWalRcv() when walreceiver has been stopped before we reach > the WalRcvStreaming() call cited above. But we need to set > InstasllXLogFileSegmentActive to false even in that case, since no one > other than startup process does that. Nice find. > Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might > need to correct the dependencies between the flag and walreceiver > state, but it not mandatory because XLogShutdownWalRcv() is designed > so that it can be called even after walreceiver is stopped. I don't > have a clear memory about why we do that at the time, though, but > recovery check runs successfully with this. I suppose the alternative would be to set InstallXLogFileSegmentActive to false in an 'else' block, but that doesn't seem necessary if XLogShutdownWalRcv() is safe to call unconditionally. So, unless there is a bigger problem that I'm not seeing, +1 for your patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
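In code terms, the proposed change in WaitForWALToBecomeAvailable() is roughly this (a sketch, not the patch itself):

```c
/*
 * Before: InstallXLogFileSegmentActive could stay set if the walreceiver
 * had already exited, because the shutdown call was skipped.
 */
if (WalRcvStreaming())
    XLogShutdownWalRcv();

/*
 * After: call it unconditionally.  XLogShutdownWalRcv() is designed to be
 * safe to call when the walreceiver is already stopped, and it always
 * clears InstallXLogFileSegmentActive, which only the startup process
 * resets.
 */
XLogShutdownWalRcv();
```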
Re: Error "initial slot snapshot too large" in create replication slot
On Mon, Jul 18, 2022 at 10:55 PM Kyotaro Horiguchi wrote: > Thanks! Just rebased. Hi, I mentioned this patch to Andres in conversation, and he expressed a concern that there might be no guarantee that we retain enough CLOG to look up XIDs. Presumably this wouldn't be an issue when the snapshot doesn't get marked suboverflowed, but what if it does? Adding Andres in the hopes that he may comment further. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Possible crash on standby
On Fri, Sep 9, 2022 at 2:00 PM Kyotaro Horiguchi wrote: > > Hello. > > While I played with some patch, I met an assertion failure. > > #2 0x00b350e0 in ExceptionalCondition ( > conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()", > errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c", > lineNumber=4190) at assert.c:69 > #3 0x00586f9c in XLogFileRead (segno=61, emode=13, tli=1, > source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190 > #4 0x005871d2 in XLogFileReadAnyTLI (segno=61, emode=13, > source=XLOG_FROM_ANY) at xlogrecovery.c:4296 > #5 0x0058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360, > randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1, > replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727 > > This is replayable by the following steps. > > 1. insert a sleep(1) in WaitForWALToBecomeAvailable(). > >* WAL that we restore from archive. > >*/ > > + sleep(1); > > if (WalRcvStreaming()) > > XLogShutdownWalRcv(); > > 2. create a primary with archiving enabled. > > 3. create a standby with recovering from the primary's archive and > unconnectable primary_conninfo. > > 4. start the primary. > > 5. switch wal on the primary. > > 6. Kaboom. > > This is because WaitForWALToBecomeAvailable doesn't call > XLogSHutdownWalRcv() when walreceiver has been stopped before we reach > the WalRcvStreaming() call cited above. But we need to set > InstasllXLogFileSegmentActive to false even in that case, since no one > other than startup process does that. > > Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might > need to correct the dependencies between the flag and walreceiver > state, but it not mandatory because XLogShutdownWalRcv() is designed > so that it can be called even after walreceiver is stopped. I don't > have a clear memory about why we do that at the time, though, but > recovery check runs successfully with this. > > This code was introduced at PG12. I think it is a duplicate of [1]. I have tested the above use-case with the patch at [1] and it fixes the issue. [1] https://www.postgresql.org/message-id/CALj2ACXPn_xePphnh88qmoQqqW%2BE2KEOdxGL%2BD-o9o7_XNGkkw%40mail.gmail.com -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Summary function for pg_buffercache
On Fri, Sep 09, 2022 at 05:36:45PM +0300, Aleksander Alekseev wrote: > However I'm afraid you can't examine BufferDesc's without taking > locks. This is explicitly stated in buf_internals.h: Yeah, when I glanced at this patch earlier, I wondered about this. > I suggest we focus on saving the memory first and then think about the > performance, if necessary. +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Possible crash on standby
On Fri, Sep 09, 2022 at 10:51:10PM +0530, Bharath Rupireddy wrote: > I think it is a duplicate of [1]. I have tested the above use-case > with the patch at [1] and it fixes the issue. I added this thread to the existing commitfest entry. Thanks for pointing this out. https://commitfest.postgresql.org/39/3814 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Switching XLog source from archive to streaming when primary available
On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart wrote: > > On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote: > > On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi > > wrote: > >> At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart > >> wrote in > >> > My general point is that we should probably offer some basic preventative > >> > measure against flipping back and forth between streaming and archive > >> > recovery while making zero progress. As I noted, maybe that's as simple > >> > as > >> > having WaitForWALToBecomeAvailable() attempt to restore a file from > >> > archive > >> > at least once before the new parameter forces us to switch to streaming > >> > replication. There might be other ways to handle this. > >> > >> +1. > > > > Hm. In that case, I think we can get rid of timeout based switching > > mechanism and have this behaviour - the standby can attempt to switch > > to streaming mode from archive, say, after fetching 1, 2 or a > > configurable number of WAL files. In fact, this is the original idea > > proposed by Satya in this thread. > > IMO the timeout approach would be more intuitive for users. When it comes > to archive recovery, "WAL segment" isn't a standard unit of measure. WAL > segment size can differ between clusters, and WAL files can have different > amounts of data or take different amounts of time to replay. How about the amount of WAL bytes fetched from the archive after which a standby attempts to connect to primary or enter streaming mode? Of late, we've changed some GUCs to represent bytes instead of WAL files/segments, see [1]. > So I think it > would be difficult for the end user to decide on a value. However, even > the timeout approach has this sort of problem. If your parameter is set to > 1 minute, but the current archive takes 5 minutes to recover, you won't > really be testing streaming replication once a minute. That would likely > need to be documented. If we have configurable WAL bytes instead of timeout for standby WAL source switch from archive to primary, we don't have the above problem right? [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c3fe108c025e4a080315562d4c15ecbe3f00405e -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: START_REPLICATION SLOT causing a crash in an assert build
On Wed, Sep 07, 2022 at 12:39:08PM -0700, Andres Freund wrote: > Hi, > > On 2022-09-06 18:40:49 -0500, Jaime Casanova wrote: > > I'm not sure what is causing this, but I have seen this twice. The > > second time without activity after changing the set of tables in a > > PUBLICATION. > > Can you describe the steps to reproduce? > I'm still trying to determine that > Which git commit does this happen on? > 6e55ea79faa56db85a2b6c5bf94cee8acf8bfdb8 (Stamp 15beta4) > > > gdb says that debug_query_string contains: > > > > """ > > START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', > > publication_names '"pub_pgbench"')START_REPLICATION SLOT "sub_pgbench" > > LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"') > > """ > > > > attached the backtrace. > > > > > #2 0x5559bfd4f0ed in ExceptionalCondition ( > > conditionName=0x5559bff30e20 "namestrcmp(&statent->slotname, > > NameStr(slot->data.name)) == 0", errorType=0x5559bff30e0d > > "FailedAssertion", fileName=0x5559bff30dbb "pgstat_replslot.c", > > lineNumber=89) at assert.c:69 > > what are statent->slotname and slot->data.name? > slot->data.name seems to be the replication_slot record, and statent->slotname comes from the in shared memory stats for that slot. And the assert happens when &statent->slotname.data comes empty, which is not frequent but it happens from time to time btw, while I'm looking at this I found that we can drop a publication while there are active subscriptions pointing to it, is that something we should allow? anyway, that is not the cause of this because the replication slot actually exists. -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: small windows psqlrc re-wording
I wrote: > On testing that in HEAD, I read > Both the system-wide startup file and the user's personal startup file > can be made psql-version-specific by appending a dash and the > PostgreSQL major or minor release number to the file name, for example > ~/.psqlrc-16 or ~/.psqlrc-16devel. > That's a little confusing but it's actually accurate, because what > process_psqlrc_file appends is the string PG_VERSION, so in a devel > branch or beta release there's a non-numeric "minor release". > I'm inclined to go ahead and do it like that. I decided that what I found jarring about that was the use of "release number" with a non-numeric version, so I changed it to "release identifier" and pushed. regards, tom lane
Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition
On Sat, Aug 06, 2022 at 04:09:28PM -0700, Peter Geoghegan wrote: > What are the chances that both thresholds will be crossed at *exactly* > (not approximately) the same time in a real world case, where the > table isn't tiny (tiny relative to the "autovacuum_naptime quantum")? > This is a very narrow case. The threshold wouldn't need to be crossed within autovacuum_naptime, since all the workers might be busy. Consider autovacuum delay, analyze on long/wide tables, multiple extended stats objects, or autovacuum parameters which are changed off-hours by a cronjob. > It might make sense to *always* show how close we were to hitting each > of the thresholds, including the ones that we didn't end up hitting > (we may come pretty close quite often, which seems like it might > matter). But showing multiple conditions together just because the > planets aligned (we hit multiple thresholds together) emphasizes the > low-level mechanism, which is pretty far removed from anything that > matters. You might as well pick either threshold at random once this > happens -- even an expert won't be able to tell the difference. I don't have strong feelings about it; I'm just pointing out that the two of the conditions aren't actually exclusive. It seems like it could be less confusing to show both. Consider someone who is trying to *reduce* how often autovacuum runs, or give priority to some tables by raising the thresholds for other tables. -- Justin
Re: why can't a table be part of the same publication as its schema
> On Sep 9, 2022, at 8:18 AM, Robert Haas wrote: > > Things might be clearer if we'd made the syntax "ALTER PUBLICATION p1 > { ADD | DROP } { TABLE | SCHEMA } name". I don't understand why we > used this ALL TABLES IN SCHEMA language. The conversation, as I recall, was that "ADD SCHEMA foo" would only mean all tables in foo, until publication of other object types became supported, at which point "ADD SCHEMA foo" would suddenly mean more than it did before. People might find that surprising, so the "ALL TABLES IN" was intended to future-proof against surprising behavioral changes. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition
On Fri, Sep 9, 2022 at 11:11 AM Justin Pryzby wrote: > > It might make sense to *always* show how close we were to hitting each > > of the thresholds, including the ones that we didn't end up hitting > > (we may come pretty close quite often, which seems like it might > > matter). But showing multiple conditions together just because the > > planets aligned (we hit multiple thresholds together) emphasizes the > > low-level mechanism, which is pretty far removed from anything that > > matters. You might as well pick either threshold at random once this > > happens -- even an expert won't be able to tell the difference. > > I don't have strong feelings about it; I'm just pointing out that the > > two of the conditions aren't actually exclusive. Fair enough. I'm just pointing out that the cutoffs are continuous for all practical purposes, even if there are cases where they seem kind of discrete, due only to implementation details (e.g. autovacuum_naptime stuff). Displaying only one reason for triggering an autovacuum is consistent with the idea that the cutoffs are continuous. It's not literally true that they're continuous, but it might as well be. I think of it as similar to how it's not literally true that a coin toss is always either heads or tails, though it might as well be true. Sure, even a standard fair coin toss might result in the coin landing on its side. That'll probably never happen even once, but if it does: just flip the coin again! The physical coin toss was never the important part. > It seems like it could be less confusing to show both. Consider someone who > is > trying to *reduce* how often autovacuum runs, or give priority to some tables > by raising the thresholds for other tables. My objection to that sort of approach is that it suggests a difference in what each VACUUM actually does -- as if autovacuum.c actually decided on a particular runtime behavior for the VACUUM up front, based on its own considerations that come from statistics about the workload. I definitely want to avoid creating that false impression in the minds of users. -- Peter Geoghegan
Re: preserve timestamps when installing headers
On 11/01/2022 00:03, Tom Lane wrote: Peter Eisentraut writes: I don't think preserving timestamps should be the default behavior, but I would support organizing things so that additional options can be passed to "install" to make it do whatever the user prefers. But that won't work if some installations don't go through install. +1. We just bumped into this with Neon, where we have a build script that generates Rust bindings from the PostgreSQL header files. The build script runs "make install", and because that changes the mtime even if there were no changes to the headers, the bindings are also regenerated every time. So I fear we're optimizing for a case that stopped being mainstream a decade or more back. I could get behind switching the code back to using $(INSTALL) for this, and then offering some way to inject user-selected switches into the $(INSTALL) invocations. That wouldn't need much more than another gmake macro. (Does there need to be a way to inject such switches only into header installations, or is it OK to do it across the board?) Here's a patch to switch back to $(INSTALL). With that, you can do: ./configure INSTALL="/usr/bin/install -C" [ wanders away wondering how this'd affect the meson conversion project ] If anything, I guess this will help, by making the Makefile a bit less special. - HeikkiFrom c0305f950e45ed905cc8145879d0b48390c387fa Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 9 Sep 2022 21:25:26 +0300 Subject: [PATCH 1/1] Use normal install program to install server headers. Commit a7032690f9 replaced $(INSTALL) with plain "cp" for installing the server header files. It sped up "make install" significantly, because the old logic called $(INSTALL) separately for every header file, whereas plain "cp" could copy all the files in one command. However, we have long since made it a requirement that $(INSTALL) can also install multiple files in one command, see commit f1c5247563. Switch back to $(INSTALL). 
Discussion: https://www.postgresql.org/message-id/200503252305.j2PN52m23610%40candle.pha.pa.us Discussion: https://www.postgresql.org/message-id/2415283.1641852217%40sss.pgh.pa.us --- src/include/Makefile | 13 +++-- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/src/include/Makefile b/src/include/Makefile index 0b4cab9bb1..1e50400617 100644 --- a/src/include/Makefile +++ b/src/include/Makefile @@ -48,22 +48,15 @@ install: all installdirs $(INSTALL_DATA) utils/errcodes.h '$(DESTDIR)$(includedir_server)/utils' $(INSTALL_DATA) utils/fmgroids.h '$(DESTDIR)$(includedir_server)/utils' $(INSTALL_DATA) utils/fmgrprotos.h '$(DESTDIR)$(includedir_server)/utils' -# We don't use INSTALL_DATA for performance reasons --- there are a lot of files -# (in fact, we have to take some pains to avoid overlength shell commands here) - cp $(srcdir)/*.h '$(DESTDIR)$(includedir_server)'/ + $(INSTALL_DATA) $(srcdir)/*.h '$(DESTDIR)$(includedir_server)' for dir in $(SUBDIRS); do \ - cp $(srcdir)/$$dir/*.h '$(DESTDIR)$(includedir_server)'/$$dir/ || exit; \ + $(INSTALL_DATA) $(srcdir)/$$dir/*.h '$(DESTDIR)$(includedir_server)'/$$dir || exit; \ done ifeq ($(vpath_build),yes) for file in catalog/schemapg.h catalog/system_fk_info.h catalog/pg_*_d.h parser/gram.h storage/lwlocknames.h utils/probes.h; do \ - cp $$file '$(DESTDIR)$(includedir_server)'/$$file || exit; \ + $(INSTALL_DATA) $$file '$(DESTDIR)$(includedir_server)'/$$file || exit; \ done endif - cd '$(DESTDIR)$(includedir_server)' && chmod $(INSTALL_DATA_MODE) *.h - for dir in $(SUBDIRS); do \ - cd '$(DESTDIR)$(includedir_server)'/$$dir || exit; \ - chmod $(INSTALL_DATA_MODE) *.h || exit; \ - done installdirs: $(MKDIR_P) '$(DESTDIR)$(includedir)/libpq' '$(DESTDIR)$(includedir_internal)/libpq' -- 2.30.2
Re: [BUG] Storage declaration in ECPG
Andrey Sokolov writes: > [ v2-0001-Fix-storage-declaration-in-ECPG.patch ] Pushed. I didn't think a whole new test case was appropriate, either from the patch-footprint or test-runtime standpoint, so I just added a couple of declarations to preproc/variable.pgc. regards, tom lane
Re: Expand palloc/pg_malloc API
Peter Eisentraut writes: > I have updated this patch set to rename the _obj() functions to > _object(), and I have dropped the _ptrtype() variants. > I have also split the patch to put the new API and the example uses into > separate patches. This patch set seems fine to me, so I've marked it Ready for Committer. I think serious consideration should be given to back-patching the 0001 part (that is, addition of the macros). Otherwise we'll have to remember not to use these macros in code intended for back-patch, and that'll be mighty annoying once we are used to them. regards, tom lane
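For anyone catching up on the thread, the kind of usage the new macros enable looks like this (a sketch; MyStruct is a made-up type, and the macro names are those used in the patch set under review):

```c
/* old style: explicit sizeof and cast */
MyStruct   *one = (MyStruct *) palloc(sizeof(MyStruct));
MyStruct   *many = (MyStruct *) palloc0(n * sizeof(MyStruct));

/* new style: the size and result type are inferred from the type name */
MyStruct   *one2 = palloc_object(MyStruct);
MyStruct   *many2 = palloc0_array(MyStruct, n);

many2 = repalloc_array(many2, MyStruct, 2 * n);
```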
Re: Expand palloc/pg_malloc API
Robert Haas writes: > On Fri, Aug 12, 2022 at 3:31 AM Peter Eisentraut > wrote: >> (Personally, I have always been a bit suspicious about using the name >> palloc() without memory context semantics in frontend code, but I guess >> this is wide-spread now.) > I think it would be a good thing to add memory context support to the > frontend. We could just put everything in a single context for > starters, and then frontend utilities that wanted to create other > contexts could do so. Perhaps, but I think we should have at least one immediate use-case for multiple contexts in frontend. Otherwise it's just useless extra code. The whole point of memory contexts in the backend is that we have well-defined lifespans for certain types of allocations (executor state, function results, etc); but it's not very clear to me that the same concept will be helpful in any of our frontend programs. > There are difficulties, though. For instance, memory contexts are > nodes, and have a NodeTag. And I'm pretty sure we don't want frontend > code to know about all the backend node types. My suspicion is that > memory context types really shouldn't be node types, but right now, > they are. Whether that's the correct view or not, this kind of problem > means it's not a simple lift-and-shift to move the memory context code > into src/common. Someone would need to spend some time thinking about > how to engineer it. I don't really think that's much of an issue. We could replace the nodetag fields with some sort of magic number and have just as much wrong-pointer safety as in the backend. What I do take issue with is moving the code into src/common. I think we'd be better off just writing a distinct implementation for frontend. For one thing, it's not apparent to me that aset.c is a good allocator for frontend (and the other two surely are not). This is all pretty off-topic for Peter's patch, though. regards, tom lane
Re: Summary function for pg_buffercache
Hi hackers, > > I suggest we focus on saving the memory first and then think about the > > performance, if necessary. > > +1 I made a mistake in v3 that cfbot complained about. It should have been: ``` if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag))) ``` Here is the corrected patch. -- Best regards, Aleksander Alekseev v4-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: configure --with-uuid=bsd fails on NetBSD
Hi, On 2022-09-09 12:48:38 -0400, Tom Lane wrote: > Pushed with those changes and doc updates. I did not push the > variant expected-file. I think the entire point here is that > we are *not* deeming the new NetBSD implementation acceptable, > so allowing it to pass regression tests is the wrong thing. What do we gain from the regression test failing exactly this way, given that we know it's a problem? It just makes it harder to run tests. How about we add it as a variant file, but via the resultmap mechanism? That way we wouldn't silently accept the same bug on other platforms, but can still run the test without needing to manually filter out bogus netbsd results? Greetings, Andres Freund
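For reference, such a resultmap entry would look roughly like uuid_ossp:out:.*-.*-netbsd.*=uuid_ossp_1.out (assuming the variant expected-file were kept as uuid_ossp_1.out; the platform pattern is illustrative), so only NetBSD builds would compare against the variant output.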
Re: configure --with-uuid=bsd fails on NetBSD
Andres Freund writes: > On 2022-09-09 12:48:38 -0400, Tom Lane wrote: >> Pushed with those changes and doc updates. I did not push the >> variant expected-file. I think the entire point here is that >> we are *not* deeming the new NetBSD implementation acceptable, >> so allowing it to pass regression tests is the wrong thing. > What do we gain from the regression test failing exactly this way, given that > we know it's a problem? It tells people not to use --with-uuid=bsd on those NetBSD versions. They can either do without uuid-ossp, or install ossp or e2fs. ("Do without" is not much of a hardship, now that we have gen_random_uuid() in core.) IMV a substantial part of the point of the regression tests is to let end users and/or packagers verify that they have a non-broken installation. Hiding a problem by making the tests not fail basically breaks that use-case. If we had, say, a known openssl security bug that was exposed by our test cases, would you advocate dumbing down the tests to not expose the bug? > It just makes it harder to run tests. Harder for who? AFAICT there is nobody but me routinely running full tests on NetBSD, else we'd have found this problem much earlier. I've got my animals configured not to use --with-uuid (not much of a lift considering that's the buildfarm's default). End of problem. regards, tom lane
Re: Introduce wait_for_subscription_sync for TAP tests
On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: > > Amit Kapila writes: > > Pushed. > > Recently a number of buildfarm animals have failed at the same > place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > > # Failed test '2x3000 rows in t' > # at t/100_bugs.pl line 149. > # got: '9000' > # expected: '6000' > # Looks like you failed 1 test of 7. > [09:30:56] t/100_bugs.pl .. > > This was the last commit to touch that test script. I'm thinking > maybe it wasn't adjusted quite correctly? On the other hand, since > I can't find any similar failures before the last 48 hours, maybe > there is some other more-recent commit to blame. Anyway, something > is wrong there. It seems that this commit is innocent as it changed only how to wait. Rather, looking at the logs, the tablesync worker errored out at an interesting point: 2022-09-09 09:30:19.630 EDT [631b3feb.840:13] pg_16400_sync_16392_7141371862484106124 ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8 2022-09-09 09:30:19.630 EDT [631b3feb.840:14] pg_16400_sync_16392_7141371862484106124 STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16392_7141371862484106124" LOGICAL 0/0 (proto_version '3', origin 'any', publication_names '"testpub"') ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8 2022-09-09 09:30:19.631 EDT [631b3feb.26e8:2] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8 It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef is relevant. Regards, -- Masahiko Sawada
Re: Avoid overhead with fprintf related functions
On Fri, Sep 9, 2022 at 13:20, Nathan Bossart <nathandboss...@gmail.com> wrote: > On Fri, Sep 09, 2022 at 10:45:37AM -0300, Ranier Vilela wrote: > > Based on work in [1]. > > According to https://cplusplus.com/reference/cstdio/fprintf/ > > The use of fprintf is related to the need to generate a string based on a > > format, which should be different from "%s". > > Since fprintf has overhead when parsing the "format" parameter, plus all > > the trouble of checking the va_arg parameters. > > I think this is one of the low fruits available and easy to reap. > > By replacing fprintf with its equivalents, fputs and fputc, > > we avoid overhead and increase security [2] and [3]. > > > > The downside is a huge big churm, which unfortunately will occur. > > But, IHMO, I think the advantages are worth it. > > Note that behavior remains the same, since fputs and fputc do not change > > the expected behavior of fprintf. > > > > A small performance gain is expected, mainly for the client, since there > > are several occurrences in some critical places, such as > > (usr/src/fe_utils/print.c). > > I agree with David [0]. But if you can demonstrate a performance gain, > perhaps it's worth considering a subset of these changes in hot paths. > Simple benchmark with David's sort example. 1. make data create table t (a bigint not null, b bigint not null, c bigint not null, d bigint not null, e bigint not null, f bigint not null); insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB! vacuum freeze t; 2. client run \timing on \pset pager off select * from t limit 100; head: Time: 418,210 ms Time: 419,588 ms Time: 424,713 ms fprintf patch: Time: 416,919 ms Time: 416,246 ms Time: 416,237 ms regards, Ranier Vilela
Re: Introduce wait_for_subscription_sync for TAP tests
Masahiko Sawada writes: > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: >> Recently a number of buildfarm animals have failed at the same >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: >> >> # Failed test '2x3000 rows in t' >> # at t/100_bugs.pl line 149. >> # got: '9000' >> # expected: '6000' >> # Looks like you failed 1 test of 7. >> [09:30:56] t/100_bugs.pl .. >> >> This was the last commit to touch that test script. I'm thinking >> maybe it wasn't adjusted quite correctly? On the other hand, since >> I can't find any similar failures before the last 48 hours, maybe >> there is some other more-recent commit to blame. Anyway, something >> is wrong there. > It seems that this commit is innocent as it changed only how to wait. Yeah. I was wondering if it caused us to fail to wait somewhere, but I concur that's not all that likely. > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > is relevant. Noting that the errors have only appeared in the past couple of days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f (Fix recovery_prefetch with low maintenance_io_concurrency). regards, tom lane
Re: Avoid overhead with fprintf related functions
Em sex., 9 de set. de 2022 às 10:45, Ranier Vilela escreveu: > Based on work in [1]. > According to https://cplusplus.com/reference/cstdio/fprintf/ > The use of fprintf is related to the need to generate a string based on a > format, which should be different from "%s". > Since fprintf has overhead when parsing the "format" parameter, plus all > the trouble of checking the va_arg parameters. > I think this is one of the low fruits available and easy to reap. > By replacing fprintf with its equivalents, fputs and fputc, > we avoid overhead and increase security [2] and [3]. > > The downside is a huge big churm, which unfortunately will occur. > But, IHMO, I think the advantages are worth it. > Note that behavior remains the same, since fputs and fputc do not change > the expected behavior of fprintf. > > A small performance gain is expected, mainly for the client, since there > are several occurrences in some critical places, such as > (usr/src/fe_utils/print.c). > > Patch attached. > This pass check-world. > Rechecked for the hundredth time. One typo. regards, Ranier Vilela diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c index 32d5444831..4672d8b2a4 100644 --- a/contrib/oid2name/oid2name.c +++ b/contrib/oid2name/oid2name.c @@ -422,7 +422,7 @@ sql_exec(PGconn *conn, const char *todo, bool quiet) fprintf(stdout, "%*s", length[j] + 2, PQfname(res, j)); l += length[j] + 2; } - fprintf(stdout, "\n"); + fputc('\n', stdout); pad = (char *) pg_malloc(l + 1); memset(pad, '-', l); pad[l] = '\0'; @@ -435,7 +435,7 @@ sql_exec(PGconn *conn, const char *todo, bool quiet) { for (j = 0; j < nfields; j++) fprintf(stdout, "%*s", length[j] + 2, PQgetvalue(res, i, j)); - fprintf(stdout, "\n"); + fputc('\n', stdout); } /* cleanup */ diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c index 58d32ba946..ed1ab65940 100644 --- a/contrib/pg_trgm/trgm_regexp.c +++ b/contrib/pg_trgm/trgm_regexp.c @@ -2201,7 +2201,7 @@ printSourceNFA(regex_t *regex, TrgmColorInfo *colors, int ncolors) /* dot -Tpng -o /tmp/source.png < /tmp/source.gv */ FILE *fp = fopen("/tmp/source.gv", "w"); - fprintf(fp, "%s", buf.data); + fputs(buf.data, fp); fclose(fp); } @@ -2263,7 +2263,7 @@ printTrgmNFA(TrgmNFA *trgmNFA) /* dot -Tpng -o /tmp/transformed.png < /tmp/transformed.gv */ FILE *fp = fopen("/tmp/transformed.gv", "w"); - fprintf(fp, "%s", buf.data); + fputs(buf.data, fp); fclose(fp); } @@ -2354,7 +2354,7 @@ printTrgmPackedGraph(TrgmPackedGraph *packedGraph, TRGM *trigrams) /* dot -Tpng -o /tmp/packed.png < /tmp/packed.gv */ FILE *fp = fopen("/tmp/packed.gv", "w"); - fprintf(fp, "%s", buf.data); + fputs(buf.data, fp); fclose(fp); } diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c index 264b879bd3..e51ce8df57 100644 --- a/contrib/vacuumlo/vacuumlo.c +++ b/contrib/vacuumlo/vacuumlo.c @@ -133,7 +133,7 @@ vacuumlo(const char *database, const struct _param *param) { fprintf(stdout, "Connected to database \"%s\"\n", database); if (param->dry_run) - fprintf(stdout, "Test run: no large objects will be removed!\n"); + fputs("Test run: no large objects will be removed!\n", stdout); } res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7a710e6490..343555e095 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8641,7 +8641,7 @@ do_pg_backup_stop(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p) * Transfer remaining lines including label and start 
timeline to * history file. */ - fprintf(fp, "%s", remaining); + fputs(remaining, fp); fprintf(fp, "STOP TIME: %s\n", strfbuf); fprintf(fp, "STOP TIMELINE: %u\n", stoptli); if (fflush(fp) || ferror(fp) || FreeFile(fp)) diff --git a/src/backend/optimizer/geqo/geqo_misc.c b/src/backend/optimizer/geqo/geqo_misc.c index 890ac363e9..0d024f227f 100644 --- a/src/backend/optimizer/geqo/geqo_misc.c +++ b/src/backend/optimizer/geqo/geqo_misc.c @@ -114,17 +114,17 @@ print_edge_table(FILE *fp, Edge *edge_table, int num_gene) int i, j; - fprintf(fp, "\nEDGE TABLE\n"); + fputs("\nEDGE TABLE\n", fp); for (i = 1; i <= num_gene; i++) { fprintf(fp, "%d :", i); for (j = 0; j < edge_table[i].unused_edges; j++) fprintf(fp, " %d", edge_table[i].edge_list[j]); - fprintf(fp, "\n"); + fputc('\n', fp); } - fprintf(fp, "\n"); + fputc('\n', fp); fflush(fp); } diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index e75611fdd5..36808438c8 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -5601,10 +5601,10 @@ CreateOptsFile(int argc, char *argv[], char *fullprogname) return false; } - fprintf(fp, "%s", fullprogname); + fputs(fullprogname, fp); for (i = 1; i < argc; i++) fprintf(fp
Re: Avoid overhead with fprintf related functions
Ranier Vilela writes: > Em sex., 9 de set. de 2022 às 13:20, Nathan Bossart < > nathandboss...@gmail.com> escreveu: >> I agree with David [0]. But if you can demonstrate a performance gain, >> perhaps it's worth considering a subset of these changes in hot paths. > head: > Time: 418,210 ms > Time: 419,588 ms > Time: 424,713 ms > fprintf patch: > Time: 416,919 ms > Time: 416,246 ms > Time: 416,237 ms That is most certainly not enough gain to justify a large amount of code churn. In fact, given that this is probably pretty platform-dependent and you've checked only one platform, I don't think I'd call this a sufficient case for even a one-line change. regards, tom lane
Re: Introduce wait_for_subscription_sync for TAP tests
On Sat, Sep 10, 2022 at 6:45 AM Tom Lane wrote: > > Masahiko Sawada writes: > > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: > >> Recently a number of buildfarm animals have failed at the same > >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > >> > >> # Failed test '2x3000 rows in t' > >> # at t/100_bugs.pl line 149. > >> # got: '9000' > >> # expected: '6000' > >> # Looks like you failed 1 test of 7. > >> [09:30:56] t/100_bugs.pl .. > >> > >> This was the last commit to touch that test script. I'm thinking > >> maybe it wasn't adjusted quite correctly? On the other hand, since > >> I can't find any similar failures before the last 48 hours, maybe > >> there is some other more-recent commit to blame. Anyway, something > >> is wrong there. > > > It seems that this commit is innocent as it changed only how to wait. > > Yeah. I was wondering if it caused us to fail to wait somewhere, > but I concur that's not all that likely. > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > is relevant. > > Noting that the errors have only appeared in the past couple of > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > (Fix recovery_prefetch with low maintenance_io_concurrency). Probably I found the cause of this failure[1]. The commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef didn't fix the problem properly. Regards, [1] https://www.postgresql.org/message-id/CAD21AoAw0Oofi4kiDpJBOwpYyBBBkJj%3DsLUOn4Gd2GjUAKG-fw%40mail.gmail.com -- Masahiko Sawada
Re: Avoid overhead with fprintf related functions
On Fri, Sep 9, 2022 at 18:53, Tom Lane wrote: > Ranier Vilela writes: > > On Fri, Sep 9, 2022 at 13:20, Nathan Bossart < > > nathandboss...@gmail.com> wrote: > >> I agree with David [0]. But if you can demonstrate a performance gain, > >> perhaps it's worth considering a subset of these changes in hot paths. > > > head: > > Time: 418,210 ms > > Time: 419,588 ms > > Time: 424,713 ms > > > fprintf patch: > > Time: 416,919 ms > > Time: 416,246 ms > > Time: 416,237 ms > > That is most certainly not enough gain to justify a large amount > of code churn. In fact, given that this is probably pretty > platform-dependent and you've checked only one platform, I don't > think I'd call this a sufficient case for even a one-line change. > Of course, I based these changes not on performance gain, but on correct style and increased security. But outvoted is outvoted; case closed. Regards, Ranier Vilela
Re: Introduce wait_for_subscription_sync for TAP tests
On Sat, Sep 10, 2022 at 9:45 AM Tom Lane wrote: > Masahiko Sawada writes: > > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: > >> Recently a number of buildfarm animals have failed at the same > >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > >> > >> # Failed test '2x3000 rows in t' > >> # at t/100_bugs.pl line 149. > >> # got: '9000' > >> # expected: '6000' > >> # Looks like you failed 1 test of 7. > >> [09:30:56] t/100_bugs.pl .. > >> > >> This was the last commit to touch that test script. I'm thinking > >> maybe it wasn't adjusted quite correctly? On the other hand, since > >> I can't find any similar failures before the last 48 hours, maybe > >> there is some other more-recent commit to blame. Anyway, something > >> is wrong there. > > > It seems that this commit is innocent as it changed only how to wait. > > Yeah. I was wondering if it caused us to fail to wait somewhere, > but I concur that's not all that likely. > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > is relevant. > > Noting that the errors have only appeared in the past couple of > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > (Fix recovery_prefetch with low maintenance_io_concurrency). Yeah, I also just spotted the coincidence of those failures while monitoring the build farm. I'll look into this later today. My initial suspicion is that there was pre-existing code here that was (incorrectly?) relying on the lack of error reporting in that case. But maybe I misunderstood and it was incorrect to report the error for some reason that was not robustly covered with tests.
Re: configure --with-uuid=bsd fails on NetBSD
Hi, On 2022-09-09 17:31:40 -0400, Tom Lane wrote: > Harder for who? AFAICT there is nobody but me routinely running > full tests on NetBSD, else we'd have found this problem much earlier. Bilal's report came out of automating testing on netbsd (and openbsd) as well, as part of the meson stuff. I also occasionally run the tests in a VM. But as you say: > I've got my animals configured not to use --with-uuid (not much of > a lift considering that's the buildfarm's default). End of problem. Fair enough. Greetings, Andres Freund
Re: Switching XLog source from archive to streaming when primary available
On Fri, Sep 09, 2022 at 11:07:00PM +0530, Bharath Rupireddy wrote: > On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart > wrote: >> IMO the timeout approach would be more intuitive for users. When it comes >> to archive recovery, "WAL segment" isn't a standard unit of measure. WAL >> segment size can differ between clusters, and WAL files can have different >> amounts of data or take different amounts of time to replay. > > How about the amount of WAL bytes fetched from the archive after which > a standby attempts to connect to primary or enter streaming mode? Of > late, we've changed some GUCs to represent bytes instead of WAL > files/segments, see [1]. Well, for wal_keep_size, using bytes makes sense. Given you know how much disk space you have, you can set this parameter accordingly to avoid retaining too much of it for standby servers. For your proposed parameter, it's not so simple. The same setting could have wildly different timing behavior depending on the server. I still think that a timeout is the most intuitive. >> So I think it >> would be difficult for the end user to decide on a value. However, even >> the timeout approach has this sort of problem. If your parameter is set to >> 1 minute, but the current archive takes 5 minutes to recover, you won't >> really be testing streaming replication once a minute. That would likely >> need to be documented. > > If we have configurable WAL bytes instead of timeout for standby WAL > source switch from archive to primary, we don't have the above problem > right? If you are going to stop replaying in the middle of a WAL archive, then maybe. But I don't think I'd recommend that. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
On Thu, Sep 8, 2022 at 5:46 PM Tom Lane wrote: > Jacob Champion writes: > > I think you're going to have to address backwards compatibility > > concerns. Today, I can create a role named "/a", and I can put that > > into the HBA without quoting it. I'd be unamused if, after an upgrade, > > my rule suddenly matched any role name containing an 'a'. > > Meh ... that concern seems overblown to me. I guess it's possible > that somebody has an HBA entry that looks like that, but it doesn't > seem very plausible. Note that we made this exact same change in > pg_ident.conf years ago, and AFAIR we got zero complaints. What percentage of users actually use pg_ident maps? My assumption would be that a change to pg_hba would affect many more people, but then I don't have any proof that there are users with role names that look like that to begin with. I won't pound the table with it. > > Speaking of partial matches, should this feature allow them? Maybe > > rules should have to match the entire username instead, and sidestep > > the inevitable "I forgot to anchor my regex" problems? > > I think the pg_ident.conf precedent is binding on us here. If we > make this one work differently, nobody's going to thank us for it, > they're just going to wonder "did the left hand not know what the > right hand already did?" Hmm... yeah, I suppose. From the other direction, it'd be bad to train users that unanchored regexes are safe in pg_hba only to take those guardrails off in pg_ident. I will tuck that away as a potential behavior change, for a different thread. Thanks, --Jacob
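To make the anchoring concern concrete, here is a small stand-alone illustration (not PostgreSQL code; it only uses POSIX regex) showing that an unanchored pattern "a" matches any role name containing an 'a', while "^a$" matches only the role named exactly "a":

```
#include <regex.h>
#include <stdio.h>

int
main(void)
{
	const char *roles[] = {"a", "alice", "bob"};
	regex_t		unanchored;
	regex_t		anchored;

	regcomp(&unanchored, "a", REG_EXTENDED | REG_NOSUB);
	regcomp(&anchored, "^a$", REG_EXTENDED | REG_NOSUB);

	for (int i = 0; i < 3; i++)
		printf("%-6s unanchored: %-9s anchored: %s\n", roles[i],
			   regexec(&unanchored, roles[i], 0, NULL, 0) == 0 ? "match" : "no match",
			   regexec(&anchored, roles[i], 0, NULL, 0) == 0 ? "match" : "no match");

	regfree(&unanchored);
	regfree(&anchored);
	return 0;
}
```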
Re: predefined role(s) for VACUUM and ANALYZE
On Wed, Sep 07, 2022 at 04:15:23PM -0700, Mark Dilger wrote: > Ok, now I'm a bit lost. If I want to use Nathan's feature to create a role > to vacuum and analyze my database on a regular basis, how does per-relation > granularity help me? If somebody creates a new table and doesn't grant those > privileges to the role, doesn't that break the usage case? To me, > per-relation granularity sounds useful, but orthogonal, to this feature. I think there is room for both per-relation privileges and new predefined roles. My latest patch set [0] introduces both. [0] https://postgr.es/m/20220908055035.GA2100193%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Avoid overhead with fprintf related functions
On Fri, Sep 09, 2022 at 05:53:54PM -0400, Tom Lane wrote: > Ranier Vilela writes: >> Em sex., 9 de set. de 2022 às 13:20, Nathan Bossart < >> nathandboss...@gmail.com> escreveu: >>> I agree with David [0]. But if you can demonstrate a performance gain, >>> perhaps it's worth considering a subset of these changes in hot paths. > >> head: >> Time: 418,210 ms >> Time: 419,588 ms >> Time: 424,713 ms > >> fprintf patch: >> Time: 416,919 ms >> Time: 416,246 ms >> Time: 416,237 ms > > That is most certainly not enough gain to justify a large amount > of code churn. In fact, given that this is probably pretty > platform-dependent and you've checked only one platform, I don't > think I'd call this a sufficient case for even a one-line change. Agreed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
[Commitfest 2022-09] First week is over
Just a reminder that the first week of the September 2022 commitfest is over. As of now, there are 295 patches in total. Out of these 295 patches, 29 require committer attention, and 188 need review. I think we need more reviewers to bring that number down. I will keep sending reminder emails and updating the status of patch entries accordingly. Thanks to everyone who is participating in the commitfest. -- Ibrar Ahmed. Senior Software Engineer, PostgreSQL Consultant.
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
On 8/19/22 01:12, Drouvot, Bertrand wrote: > + wstr = palloc((strlen(tok->string + 1) + 1) * sizeof(pg_wchar)); > > + wlen = pg_mb2wchar_with_len(tok->string + 1, > > + wstr, strlen(tok->string + 1)); The (tok->string + 1) construction comes up often enough that I think it should be put in a `regex` variable or similar. That would help my eyes with the (strlen(tok->string + 1) + 1) construction, especially. I noticed that for pg_ident, we precompile the regexes per-line and reuse those in child processes. Whereas here we're compiling, using, and then discarding the regex for each check. I think the example set by the pg_ident code is probably the one to follow, unless you have a good reason not to. > +# Testing with regular expression for username > > +reset_pg_hba($node, '/^.*md.*$', 'password'); > > +test_role($node, 'md5_role', 'password from pgpass and regular expression > for username', 0); > + IMO the coverage for this patch needs to be filled out. Negative test cases are more important than positive ones for security-related code. Other than that, and Tom's note on potentially expanding this to other areas, I think this is a pretty straightforward patch. Thanks, --Jacob
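As a sketch of the first point (illustrative only; the declarations and the leading-'/' convention are taken from the quoted patch lines):

```
	const char *regex = tok->string + 1;	/* skip the leading '/' */
	pg_wchar   *wstr;
	int			wlen;

	wstr = palloc((strlen(regex) + 1) * sizeof(pg_wchar));
	wlen = pg_mb2wchar_with_len(regex, wstr, strlen(regex));
```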
Re: [RFC] building postgres with meson - v12
Hi, On 2022-08-31 11:11:54 -0700, Andres Freund wrote: > > If the above are addressed, I think this will be just about at the > > point where the above patches can be committed. > > Woo! There was a lot less progress over the last ~week than I had hoped. The reason is that I was trying to figure out the reason for the occasional failures of ecpg tests getting compiled when building on windows in CI, with msbuild. I went into many layers of rabbitholes while investigating. Wasting an absurd amount of time. The problem: Occasionally ecpg test files would fail to compile, exiting with -1073741819: C:\BuildTools\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5): error MSB8066: Custom build for 'C:\cirrus\build\meson-private\custom_target.rule' exited with code -1073741819. [c:\cirrus\build\src\interfaces\ecpg\test\sql\3701597@@twophase.c@cus.vcxproj] -1073741819 is 0xc005, which in turn is STATUS_ACCESS_VIOLATION, i.e. a segfault. This happens in roughly 1/3 of the builds, but with "streaks" of not happening and more frequently happening. However, despite our CI images having a JIT debugger configured (~coredump handler), no crash report was triggered. The problem never occurs in my windows VM. At first I thought that might be because it's an assertion failure or such, which only causes a dump when a bunch of magic is done (see main.c). But despite adding all the necessary magic to ecpg.exe, no dump. Unfortunately, adding debug output reduces the frequency of the issue substantially. Eventually I figured out that it's not actually ecpg.exe that is crashing. It is meson's python wrapper around built binaries as part of the build (for setting PATH, working directory, without running into cmd.exe issues). A modified meson wrapper showed that ecpg.exe completes successfully. The only thing the meson wrapper does after running the command is to call sys.exit(returncode), and I had printed out the returncode, which is 0. I looked through a lot of the python code, to see why no crashdump and no details are forthcoming. There weren't any relevant SetErrorMode(SEM_NOGPFAULTERRORBOX) calls. I tried to set PYTHONFAULTHANDLER, but still no stack trace. Next I suspected that cmd.exe might be crashing and causing the problem. Modified meson to add 'echo %ERRORLEVEL%' to the msbuild custombuild. Which indeed shows the STATUS_ACCESS_VIOLATION returncode after running python. So it's not cmd.exe. The problem even persisted when replacing meson's sys.exit() with os._exit(), which indeed just calls _exit(). I tried to reproduce the problem using a python with debugging enabled. The problem doesn't occur despite quite a few runs. I found scattered other reports of this problem happening on windows. Went down a few more rabbitholes. Too boring to repeat here. At this point I finally figured out that the reason the crash reports don't happen is that everythin started by cirrus-ci on windows has an errormode of SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX | SEM_NOOPENFILEERRORBOX. A good bit later I figured out that while cirrus-ci isn't intentionally setting that, golang does so *unconditionally* on windows: https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/signal_windows.go#L14 https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/os_windows.go#L553 Argh. I should have checked what the error mode is earlier, but this is just very sneaky. 
So I modified meson to change the errormode and tried to reproduce the issue again, to finally get a stackdump. And tried again. And again. Without a single relevant failure (I saw tests fail in ways that are discussed on the list, but that's irrelevant here). I've run this through enough attempts by now that I'm quite confident that the problem does not occur when the errormode does not include SEM_NOOPENFILEERRORBOX. I'll want a few more runs to be certain, but... Given that the problem appears to happen after _exit() is called, and only when SEM_NOOPENFILEERRORBOX is not set, it seems likely to be an OS / C runtime bug. Presumably it's related to something that python does first, but I don't see how anything could justify crashing only if SEM_NOOPENFILEERRORBOX is set (rather than the opposite). I have no idea how to debug this further, given that the problem is quite rare (can't attach a debugger and wait), only happens when crashdumps are prevented from happening (so no idea where it crashes) and is made less common by debug printfs. So for now the best way forward I can see is to change the error mode for CI runs. Which is likely a good idea anyway, so we can see crashdumps for binaries other than postgres.exe (which does SetErrorMode() internally). I managed to do so by setting CIRRUS_SHELL to a python wrapper around cmd.exe that does SetErrorMode(). I'm sure there's easier ways, but I couldn't figure out any. I'd like to reclaim my time. But I'm afraid nobody will be
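For context, here is a rough sketch of what such a shell wrapper has to accomplish (the real wrapper mentioned above is a Python script installed via CIRRUS_SHELL; this C version is only an illustration of the SetErrorMode() idea, not the actual code):

```
#include <windows.h>
#include <process.h>

int
main(int argc, char **argv)
{
	(void) argc;

	/*
	 * Restore the default error mode (dropping SEM_NOGPFAULTERRORBOX and
	 * SEM_NOOPENFILEERRORBOX inherited from the golang parent) so that
	 * crashing child processes can trigger crash reports again, then hand
	 * the command line over to cmd.exe.
	 */
	SetErrorMode(0);
	return (int) _spawnvp(_P_WAIT, "cmd.exe", (const char *const *) argv);
}
```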
Re: Summary function for pg_buffercache
Hi Aleksander and Nathan, Thanks for your comments. Aleksander Alekseev , 9 Eyl 2022 Cum, 17:36 tarihinde şunu yazdı: > However I'm afraid you can't examine BufferDesc's without taking > locks. This is explicitly stated in buf_internals.h: > > """ > Buffer header lock (BM_LOCKED flag) must be held to EXAMINE or change > TAG, state or wait_backend_pgprocno fields. > """ > I wasn't aware of this explanation. Thanks for pointing it out. When somebody modifies relNumber concurrently (e.g. calls > ClearBufferTag()) this will cause an undefined behaviour. > I thought that it wouldn't really be a problem even if relNumber is modified concurrently, since the function does not actually rely on the actual values. I'm not sure about what undefined behaviour could harm this badly. It seemed to me that it would read an invalid relNumber in the worst case scenario. But I'm not actually familiar with buffer related parts of the code, so I might be wrong. And I'm okay with taking header locks if necessary. In the attached patch, I added buffer header locks just before examining tag as follows: + buf_state = LockBufHdr(bufHdr); > + > + /* Invalid RelFileNumber means the buffer is unused */ > + if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag))) > + { > ... > + } > ... > + UnlockBufHdr(bufHdr, buf_state); > > > I suggest we focus on saving the memory first and then think about the > > > performance, if necessary. > > > > +1 > I again did the same quick benchmarking, here are the numbers with locks. postgres=# show shared_buffers; shared_buffers 16GB (1 row) postgres=# SELECT relfilenode <> 0 AS is_valid, isdirty, count(*) FROM pg_buffercache GROUP BY relfilenode <> 0, isdirty; is_valid | isdirty | count --+-+- t| f | 256 | | 2096876 t| t | 20 (3 rows) Time: 1024.456 ms (00:01.024) postgres=# select * from pg_buffercache_summary(); used_buffers | unused_buffers | dirty_buffers | pinned_buffers | avg_usagecount --++---++ 282 |2096870 |20 | 0 | 3.4574468 (1 row) Time: 33.074 ms Yes, locks slowed pg_buffercache_summary down. But there is still quite a bit of performance improvement, plus memory saving as you mentioned. > Here is the corrected patch. > Also thanks for corrections. Best, Melih v5-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: cataloguing NOT NULL constraints
Hi, w.r.t. the while loop in findNotNullConstraintAttnum(): + if (multiple == NULL) + break; I think `pfree(arr)` should be called before breaking. + if (constraint->cooked_expr != NULL) + return tryExtractNotNullFromNode(stringToNode(constraint->cooked_expr), rel); + else + return tryExtractNotNullFromNode(constraint->raw_expr, rel); nit: the `else` keyword is not needed. + if (isnull) + elog(ERROR, "null conbin for constraint %u", conForm->oid); It would be better to expand `conbin` so that the user can better understand the error. Cheers >
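To make the two suggestions concrete, a sketch of how the quoted fragments might end up (function and variable names come from the quoted patch; everything else is an assumption):

```
	/* release the scratch array before leaving the loop early */
	if (multiple == NULL)
	{
		pfree(arr);
		break;
	}

	/* the "else" is unnecessary after an unconditional return */
	if (constraint->cooked_expr != NULL)
		return tryExtractNotNullFromNode(stringToNode(constraint->cooked_expr), rel);

	return tryExtractNotNullFromNode(constraint->raw_expr, rel);
```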
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 8:48 PM Robert Haas wrote: > > On Fri, Sep 9, 2022 at 10:29 AM houzj.f...@fujitsu.com > wrote: > > IIRC, the feature currently works almost the same as you described. It > > doesn't > > create entry for tables that are published via its schema level, it only > > record > > the published schema and check which tables are part of it. > > Oh, well if that's the case, that is great news. > Yes, the feature works as you and Hou-San have mentioned. > But then I don't > understand Amit's comment from before: > > > Yes, because otherwise, there was confusion while dropping the objects > > from publication. Consider in the above case, if we would have allowed > > it and then the user performs ALTER PUBLICATION p1 DROP ALL TABLES IN > > SCHEMA s1, then (a) shall we remove both schema s1 and a table that is > > separately added (s1.t1) from that schema, or (b) just remove schema > > s1? > > I believe that (b) is the correct behavior, so I assumed that this > issue must be some difficulty in implementing it, like a funny catalog > representation. > No, it was because of syntax. IIRC, during development, Greg Nancarrow raised a point [1] that a user can expect the individually added tables for a schema which is also part of the publication to also get dropped when she specifies DROP ALL TABLES IN SCHEMA. IIRC, originally, the patch had a behavior (b) but then changed due to discussion around this point. But now that it seems you and others don't feel that was right, we can change back to (b) as I think that shouldn't be difficult to achieve. > Things might be clearer if we'd made the syntax "ALTER PUBLICATION p1 > { ADD | DROP } { TABLE | SCHEMA } name". I don't understand why we > used this ALL TABLES IN SCHEMA language. > It was exactly due to the reason Mark had mentioned in the email [2]. [1] - https://www.postgresql.org/message-id/CAJcOf-fTRZ3HiA5xU0-O-PT390A7wuUUkjP8uX3aQJLBsJNVmw%40mail.gmail.com [2] - https://www.postgresql.org/message-id/596EA671-66DF-4285-8560-0270DC062353%40enterprisedb.com -- With Regards, Amit Kapila.
pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
The sequence of events leading up to this: 0) Yesterday I upgraded an internal VM to pg15b4 using PGDG RPMs; It's the same VM that hit the prefetch_recovery bug which was fixed by adb466150. I don't think that should've left it in a weird state (since recovery was sucessful when prefetch was disabled, and the cluser worked fine until now). 1) This evening, I started running sqlsmith connected to the postgres DB that has some logging tables in it; 2) There was a lot of swapping, and the backend was finally killed due to OOM when sqlsmith tried to run a terrible query involving database_to_xml(); 3) lots of backends started crashing with SIGABRT; 4) I was simultaneously compiling pg14b4 to run with with -DRELCACHE_FORCE_RELEASE and installing it into /usr/local. I don't *think* running libraries would've been overwritten, and that shouldn't have affected the running instance anyway... I got a lot (100+) of SIGABRT. I suspect all the backtraces are the same. My hypothesis is that the OOM crash caused bad xmax to be written (or rather, recovery didn't cause it to be written correctly?). We may have hit a race condition due to heavy swapping. (gdb) bt #0 0x7fb8a22f31f7 in raise () from /lib64/libc.so.6 #1 0x7fb8a22f48e8 in abort () from /lib64/libc.so.6 #2 0x0098f9be in ExceptionalCondition (conditionName=conditionName@entry=0x9fada4 "TransactionIdIsValid(xmax)", errorType=errorType@entry=0x9ed217 "FailedAssertion", fileName=fileName@entry=0x9fad90 "heapam_visibility.c", lineNumber=lineNumber@entry=1353) at assert.c:69 #3 0x004fd4d7 in HeapTupleSatisfiesVacuumHorizon (htup=htup@entry=0x7ffc225a87e0, buffer=buffer@entry=5100, dead_after=dead_after@entry=0x7ffc225a87d0) at heapam_visibility.c:1353 #4 0x00501702 in heap_prune_satisfies_vacuum (buffer=5100, tup=0x7ffc225a87e0, prstate=0x7ffc225a8a50) at pruneheap.c:504 #5 heap_page_prune (relation=relation@entry=0x7fb8a50c3438, buffer=buffer@entry=5100, vistest=vistest@entry=0xec7890 , old_snap_xmin=, old_snap_ts=, nnewlpdead=nnewlpdead@entry=0x7ffc225a964c, off_loc=off_loc@entry=0x0) at pruneheap.c:351 #6 0x00502326 in heap_page_prune_opt (relation=0x7fb8a50c3438, buffer=buffer@entry=5100) at pruneheap.c:209 #7 0x004f3ae3 in heapgetpage (sscan=sscan@entry=0x199b1d0, page=page@entry=2892) at heapam.c:415 #8 0x004f44c2 in heapgettup_pagemode (scan=scan@entry=0x199b1d0, dir=, nkeys=0, key=0x0) at heapam.c:1120 #9 0x004f5abe in heap_getnextslot (sscan=0x199b1d0, direction=, slot=0x1967be8) at heapam.c:1352 #10 0x006de16b in table_scan_getnextslot (slot=0x1967be8, direction=ForwardScanDirection, sscan=) at ../../../src/include/access/tableam.h:1046 #11 SeqNext (node=node@entry=0x1967a38) at nodeSeqscan.c:80 #12 0x006b109a in ExecScanFetch (recheckMtd=0x6de0f0 , accessMtd=0x6de100 , node=0x1967a38) at execScan.c:133 #13 ExecScan (node=0x1967a38, accessMtd=0x6de100 , recheckMtd=0x6de0f0 ) at execScan.c:199 #14 0x006add88 in ExecProcNodeInstr (node=0x1967a38) at execProcnode.c:479 #15 0x006a6182 in ExecProcNode (node=0x1967a38) at ../../../src/include/executor/executor.h:259 #16 ExecutePlan (execute_once=, dest=0x1988448, direction=, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, planstate=0x1967a38, estate=0x19677a0) at execMain.c:1636 #17 standard_ExecutorRun (queryDesc=0x1996e80, direction=, count=0, execute_once=) at execMain.c:363 #18 0x7fb8960913bd in pgss_ExecutorRun (queryDesc=0x1996e80, direction=ForwardScanDirection, count=0, execute_once=) at pg_stat_statements.c:1010 #19 0x7fb895c6f781 in explain_ExecutorRun 
(queryDesc=0x1996e80, direction=ForwardScanDirection, count=0, execute_once=) at auto_explain.c:320 #20 0x0084976e in PortalRunSelect (portal=portal@entry=0x18fed30, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x1988448) at pquery.c:924 #21 0x0084af4f in PortalRun (portal=0x18fed30, count=9223372036854775807, isTopLevel=, run_once=, dest=0x1988448, altdest=0x1988448, qc=0x7ffc225a9ce0) at pquery.c:768 #22 0x0084679b in exec_simple_query (query_string=0x186d8a0 "SELECT alarm_id, alarm_disregard FROM alarms WHERE alarm_ack_time IS NULL AND alarm_clear_time IS NULL AND alarm_office = 'ETHERNET'") at postgres.c:1250 #23 0x0084848a in PostgresMain (dbname=, username=) at postgres.c:4581 #24 0x00495afe in BackendRun (port=, port=) at postmaster.c:4504 #25 BackendStartup (port=0x1894250) at postmaster.c:4232 #26 ServerLoop () at postmaster.c:1806 #27 0x007b0c60 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1868280) at postmaster.c:1478 #28 0x004976a6 in main (argc=3, argv=0x1868280) at main.c:202 < 2022-09-09 19:44:03.329 CDT >LOG: server process (PID 8949) was terminated by signal 6: Aborted < 2022-09-09 19:44:03.329 CDT >DETA
Re: Assertion failure in WaitForWALToBecomeAvailable state machine
On Mon, Aug 15, 2022 at 11:30 AM Bharath Rupireddy wrote: > > On Thu, Aug 11, 2022 at 10:06 PM Bharath Rupireddy > wrote: > > > > Today I encountered the assertion failure [2] twice while working on > > another patch [1]. The pattern seems to be that the walreceiver got > > killed or crashed and set its state to WALRCV_STOPPING or > > WALRCV_STOPPED by the time the WAL state machine moves to archive and > > hence the following XLogShutdownWalRcv() code will not get hit: > > > > /* > > * Before we leave XLOG_FROM_STREAM state, make sure > > that > > * walreceiver is not active, so that it won't overwrite > > * WAL that we restore from archive. > > */ > > if (WalRcvStreaming()) > > ShutdownWalRcv(); > > > > I agree with Kyotaro-san to reset InstallXLogFileSegmentActive before > > entering into the archive mode. Hence I tweaked the code introduced by > > the following commit a bit, the result v1 patch is attached herewith. > > Please review it. > > I added it to the current commitfest to not lose track of it: > https://commitfest.postgresql.org/39/3814/. Today I spent some more time on this issue and modified the v1 patch posted upthread a bit: it now resets InstallXLogFileSegmentActive only when the WAL source has switched to archive, not every time in archive mode. I'm attaching the v2 patch herewith; please review it further. Just for the record - there's another report of the assertion failure at [1]; many thanks to Kyotaro-san for providing consistently reproducible steps. [1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v2-0001-Avoid-race-condition-in-resetting-XLogCtl-Install.patch Description: Binary data
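For readers following along, clearing the flag looks roughly like the fragment below, which mirrors how xlog.c already clears it when the walreceiver shuts down; where exactly the attached v2 patch places this inside the WaitForWALToBecomeAvailable() state machine is not reproduced here:

```
	/* sketch only; not the attached patch itself */
	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
	XLogCtl->InstallXLogFileSegmentActive = false;
	LWLockRelease(ControlFileLock);
```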
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
On Fri, Sep 09, 2022 at 09:06:37PM -0500, Justin Pryzby wrote: > #0 0x7fb8a22f31f7 in raise () from /lib64/libc.so.6 > #1 0x7fb8a22f48e8 in abort () from /lib64/libc.so.6 > #2 0x0098f9be in ExceptionalCondition > (conditionName=conditionName@entry=0x9fada4 "TransactionIdIsValid(xmax)", > errorType=errorType@entry=0x9ed217 "FailedAssertion", > fileName=fileName@entry=0x9fad90 "heapam_visibility.c", > lineNumber=lineNumber@entry=1353) at assert.c:69 > #3 0x004fd4d7 in HeapTupleSatisfiesVacuumHorizon > (htup=htup@entry=0x7ffc225a87e0, buffer=buffer@entry=5100, > dead_after=dead_after@entry=0x7ffc225a87d0) at heapam_visibility.c:1353 > #4 0x00501702 in heap_prune_satisfies_vacuum (buffer=5100, > tup=0x7ffc225a87e0, prstate=0x7ffc225a8a50) at pruneheap.c:504 > #5 heap_page_prune (relation=relation@entry=0x7fb8a50c3438, > buffer=buffer@entry=5100, vistest=vistest@entry=0xec7890 , > old_snap_xmin=, old_snap_ts=, > nnewlpdead=nnewlpdead@entry=0x7ffc225a964c, off_loc=off_loc@entry=0x0) at > pruneheap.c:351 > #6 0x00502326 in heap_page_prune_opt (relation=0x7fb8a50c3438, > buffer=buffer@entry=5100) at pruneheap.c:209 > #7 0x004f3ae3 in heapgetpage (sscan=sscan@entry=0x199b1d0, > page=page@entry=2892) at heapam.c:415 > #8 0x004f44c2 in heapgettup_pagemode (scan=scan@entry=0x199b1d0, > dir=, nkeys=0, key=0x0) at heapam.c:1120 > #9 0x004f5abe in heap_getnextslot (sscan=0x199b1d0, > direction=, slot=0x1967be8) at heapam.c:1352 > #10 0x006de16b in table_scan_getnextslot (slot=0x1967be8, > direction=ForwardScanDirection, sscan=) at > ../../../src/include/access/tableam.h:1046 > #11 SeqNext (node=node@entry=0x1967a38) at nodeSeqscan.c:80 > #12 0x006b109a in ExecScanFetch (recheckMtd=0x6de0f0 , > accessMtd=0x6de100 , node=0x1967a38) at execScan.c:133 > #13 ExecScan (node=0x1967a38, accessMtd=0x6de100 , > recheckMtd=0x6de0f0 ) at execScan.c:199 > #14 0x006add88 in ExecProcNodeInstr (node=0x1967a38) at > execProcnode.c:479 > #15 0x006a6182 in ExecProcNode (node=0x1967a38) at > ../../../src/include/executor/executor.h:259 > #16 ExecutePlan (execute_once=, dest=0x1988448, > direction=, numberTuples=0, sendTuples=true, > operation=CMD_SELECT, use_parallel_mode=, planstate=0x1967a38, > estate=0x19677a0) > at execMain.c:1636 > #17 standard_ExecutorRun (queryDesc=0x1996e80, direction=, > count=0, execute_once=) at execMain.c:363 > #18 0x7fb8960913bd in pgss_ExecutorRun (queryDesc=0x1996e80, > direction=ForwardScanDirection, count=0, execute_once=) at > pg_stat_statements.c:1010 > #19 0x7fb895c6f781 in explain_ExecutorRun (queryDesc=0x1996e80, > direction=ForwardScanDirection, count=0, execute_once=) at > auto_explain.c:320 > #20 0x0084976e in PortalRunSelect (portal=portal@entry=0x18fed30, > forward=forward@entry=true, count=0, count@entry=9223372036854775807, > dest=dest@entry=0x1988448) at pquery.c:924 > #21 0x0084af4f in PortalRun (portal=0x18fed30, > count=9223372036854775807, isTopLevel=, run_once= out>, dest=0x1988448, altdest=0x1988448, qc=0x7ffc225a9ce0) at pquery.c:768 > #22 0x0084679b in exec_simple_query (query_string=0x186d8a0 "SELECT > alarm_id, alarm_disregard FROM alarms WHERE alarm_ack_time IS NULL AND > alarm_clear_time IS NULL AND alarm_office = 'ETHERNET'") at postgres.c:1250 > #23 0x0084848a in PostgresMain (dbname=, > username=) at postgres.c:4581 > #24 0x00495afe in BackendRun (port=, port= out>) at postmaster.c:4504 > #25 BackendStartup (port=0x1894250) at postmaster.c:4232 > #26 ServerLoop () at postmaster.c:1806 > #27 0x007b0c60 in PostmasterMain (argc=argc@entry=3, > 
argv=argv@entry=0x1868280) at postmaster.c:1478 > #28 0x004976a6 in main (argc=3, argv=0x1868280) at main.c:202 > > I saved a copy of the data dir, but I'm going to have to bring the DB back > online soon. I don't know if I can get a page image easily since GDB is a bit > busted (won't show source code) and I can't remember how I fixed that last > time... Actually gdb seems to be fine, except it doesn't work well on the corefile. Breakpoint 4, HeapTupleSatisfiesVacuumHorizon (htup=htup@entry=0x7ffcbdc89970, buffer=buffer@entry=690, dead_after=dead_after@entry=0x7ffcbdc898fc) at heapam_visibility.c:1196 1196{ (gdb) n 1197HeapTupleHeader tuple = htup->t_data; (gdb) p *tuple $21 = {t_choice = {t_heap = {t_xmin = 3779887563, t_xmax = 133553150, t_field3 = {t_cid = 103, t_xvac = 103}}, t_datum = {datum_len_ = -515079733, datum_typmod = 133553150, datum_typeid = 103}}, t_ctid = {ip_blkid = { bi_hi = 0, bi_lo = 30348}, ip_posid = 11}, t_infomask2 = 8225, t_infomask = 4423, t_hoff = 32 ' ', t_bits = 0x2aaab37ebe0f "\303\377\367\377\001"} (gdb) n 1350Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)); (gdb) p xmax $20 = 0 (gdb) n 1353Assert(TransactionIdIsValid(xmax
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
Hi, That’s interesting, dig into it for a while but not too much progress. Maybe we could add some logs to print MultiXactMembers’ xid and status if xid is 0. Inside MultiXactIdGetUpdateXid() ``` nmembers = GetMultiXactIdMembers(xmax, &members, false, false); if (nmembers > 0) { int i; for (i = 0; i < nmembers; i++) { /* Ignore lockers */ if (!ISUPDATE_from_mxstatus(members[i].status)) continue; /* there can be at most one updater */ Assert(update_xact == InvalidTransactionId); update_xact = members[i].xid; // log here if xid is invalid #ifndef USE_ASSERT_CHECKING /* * in an assert-enabled build, walk the whole array to ensure * there's no other updater. */ break; #endif } pfree(members); } // and here if didn’t update update_xact at all (it shouldn’t happen as designed) return update_xact; ``` That will help a little if we can reproduce it. And could we see multixact reply in logs if db does recover? Regards, Zhang Mingli
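If someone wants to try the suggested logging, it might look roughly like this at the spot marked "log here if xid is invalid" (the elog level and wording are assumptions, not something from the thread):

```
	if (!TransactionIdIsValid(members[i].xid))
		elog(LOG, "MultiXactId %u: member %d has invalid xid %u (status %d)",
			 xmax, i, members[i].xid, (int) members[i].status);
```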
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
On Sat, Sep 10, 2022 at 12:07:30PM +0800, Zhang Mingli wrote: > That’s interesting, dig into it for a while but not too much progress. > > Maybe we could add some logs to print MultiXactMembers’ xid and status if xid > is 0. > > Inside MultiXactIdGetUpdateXid() > > ``` > nmembers = GetMultiXactIdMembers(xmax, &members, false, false); > > if (nmembers > 0) > { > int i; > > for (i = 0; i < nmembers; i++) > { > /* Ignore lockers */ > if (!ISUPDATE_from_mxstatus(members[i].status)) > continue; > > /* there can be at most one updater */ > Assert(update_xact == InvalidTransactionId); > update_xact = members[i].xid; > > // log here if xid is invalid > #ifndef USE_ASSERT_CHECKING > > /* > * in an assert-enabled build, walk the whole array to ensure > * there's no other updater. > */ > break; > #endif > } > > pfree(members); > } > // and here if didn’t update update_xact at all (it shouldn’t happen as > designed) Yeah. I added assertions for the above case inside the loop, and for this one, and this fails right before "return". TRAP: FailedAssertion("update_xact != InvalidTransactionId", File: "src/backend/access/heap/heapam.c", Line: 6939, PID: 4743) It looks like nmembers==2, both of which are lockers and being ignored. > And could we see multixact reply in logs if db does recover? Do you mean waldump or ?? BTW, after a number of sigabrt's, I started seeing these during recovery: < 2022-09-09 19:44:04.180 CDT >LOG: unexpected pageaddr 1214/AF0FE000 in log segment 0001121400B4, offset 1040384 < 2022-09-09 23:20:50.830 CDT >LOG: unexpected pageaddr 1214/CF65C000 in log segment 0001121400D8, offset 6668288 -- Justin
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
The OOM was at: < 2022-09-09 19:34:24.043 CDT >LOG: server process (PID 14841) was terminated by signal 9: Killed The first SIGABRT was at: < 2022-09-09 19:37:31.650 CDT >LOG: server process (PID 7363) was terminated by signal 6: Aborted And I've just found a bunch of "interesting" logs between the two: < 2022-09-09 19:36:48.505 CDT telsasoft >ERROR: MultiXactId 133553150 has not been created yet -- apparent wraparound < 2022-09-09 19:36:48.505 CDT telsasoft >STATEMENT: SELECT alarm_id, alarm_disregard FROM alarms WHERE alarm_ack_time IS NULL AND alarm_clear_time IS NULL AND alarm_office = 'ETHERNET' < 2022-09-09 19:36:48.788 CDT telsasoft >ERROR: could not access status of transaction 3779944583 < 2022-09-09 19:36:48.788 CDT telsasoft >DETAIL: Could not read from file "pg_subtrans/E14D" at offset 98304: read too few bytes. ... < 2022-09-09 19:37:08.550 CDT telsasoft >ERROR: MultiXactId 133553156 has not been created yet -- apparent wraparound ... < 2022-09-09 19:37:13.792 CDT telsasoft >ERROR: could not access status of transaction 3779946306 < 2022-09-09 19:37:13.792 CDT telsasoft >DETAIL: Could not read from file "pg_subtrans/E14D" at offset 98304: read too few bytes. ... < 2022-09-09 19:37:19.682 CDT telsasoft >ERROR: could not access status of transaction 3779946306 < 2022-09-09 19:37:19.682 CDT telsasoft >DETAIL: Could not read from file "pg_subtrans/E14D" at offset 98304: read too few bytes. < 2022-09-09 19:37:19.682 CDT telsasoft >CONTEXT: while locking tuple (11755,5) in relation "alarms_null" ... < 2022-09-09 19:37:25.835 CDT telsasoft >ERROR: MultiXactId 133553154 has not been created yet -- apparent wraparound BTW, if I load the datadir backup to crash it, I see: t_infomask = 4423, which is: ; 0x1000+0x0100+0x0040+0x0004+0x0002+0x0001 4423 src/include/access/htup_details.h-#define HEAP_HASNULL 0x0001 /* has null attribute(s) */ src/include/access/htup_details.h-#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */ src/include/access/htup_details.h-#define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */ src/include/access/htup_details.h-#define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */ src/include/access/htup_details.h-#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */ src/include/access/htup_details.h:#define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */ I was I could say what autovacuum had been doing during that period, but unfortunately I have "log_autovacuum_min_duration = 9s"...
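For anyone who wants to repeat the t_infomask decode above, a tiny stand-alone program does it (the flag values are copied from the htup_details.h lines quoted in that message):

```
#include <stdio.h>

#define HEAP_HASNULL			0x0001
#define HEAP_HASVARWIDTH		0x0002
#define HEAP_HASEXTERNAL		0x0004
#define HEAP_XMAX_EXCL_LOCK		0x0040
#define HEAP_XMIN_COMMITTED		0x0100
#define HEAP_XMAX_IS_MULTI		0x1000

int
main(void)
{
	unsigned int infomask = 4423;	/* 0x1147, as seen in the page image */

	printf("0x%04x:%s%s%s%s%s%s\n", infomask,
		   (infomask & HEAP_HASNULL) ? " HASNULL" : "",
		   (infomask & HEAP_HASVARWIDTH) ? " HASVARWIDTH" : "",
		   (infomask & HEAP_HASEXTERNAL) ? " HASEXTERNAL" : "",
		   (infomask & HEAP_XMAX_EXCL_LOCK) ? " XMAX_EXCL_LOCK" : "",
		   (infomask & HEAP_XMIN_COMMITTED) ? " XMIN_COMMITTED" : "",
		   (infomask & HEAP_XMAX_IS_MULTI) ? " XMAX_IS_MULTI" : "");
	return 0;
}
```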