Re: Syncrep and improving latency due to WAL throttling
On Sat, Jan 28, 2023 at 6:06 AM Tomas Vondra wrote:
> > > > > That's not the sole goal, from my end: I'd like to avoid writing out + > > flushing the WAL in too small chunks. Imagine a few concurrent vacuums or > > COPYs or such - if we're unlucky they'd each end up exceeding their > > "private" > > limit close to each other, leading to a number of small writes of the > > WAL. Which could end up increasing local commit latency / iops. > > > > If we instead decide to only ever flush up to something like > > last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ > > > > we'd make sure that the throttling mechanism won't cause a lot of small > > writes. > > > > I'm not saying we shouldn't do this, but I still don't see how this > could make a measurable difference. At least assuming a sensible value > of the throttling limit (say, more than 256kB per backend), and OLTP > workload running concurrently. That means ~64 extra flushes/writes per > 16MB segment (at most). Yeah, a particular page might get unlucky and be > flushed by multiple backends, but the average still holds. Meanwhile, > the OLTP transactions will generate (at least) an order of magnitude > more flushes.

I think measuring the number of WAL flushes postgres generates with and without this feature is a good way to know its effect on IOPS. Probably it's even better with variations in synchronous_commit_flush_wal_after or the throttling limits.

On Sat, Jan 28, 2023 at 6:08 AM Tomas Vondra wrote:
> > On 1/27/23 22:19, Andres Freund wrote: > > Hi, > > > > On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote: > >> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund wrote: > >> > >>> Huh? Why did you remove the GUC? > >> > >> After reading previous threads, my optimism level of getting it ever > >> in shape of being widely accepted degraded significantly (mainly due > >> to the discussion of wider category of 'WAL I/O throttling' especially > >> in async case, RPO targets in async case and potentially calculating > >> global bandwidth). > > > > I think it's quite reasonable to limit this to a smaller scope. Particularly > > because those other goals are pretty vague but ambitious goals. IMO the > > problem with a lot of the threads is precisely that that they aimed at a > > level > > of generallity that isn't achievable in one step. > > > > +1 to that

Okay, I agree to limit the scope first to synchronous replication. Although v2 is a WIP patch, I have some comments:

1. Coding style doesn't conform to the Postgres standard:

+/*
+ * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section
+ */
80-line char limit

+void HandleXLogDelayPending()
A new line missing between function return type and function name

+elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
+backendWalInserted,
+LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
+LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
Indentation issue - space needed in the next lines after elog(WARNING,..

+elog(WARNING, "throttling WAL down on this session - end (%f ms)", timediff);
80-line char limit, timediff); be on the new line.

+//RESUME_INTERRUPTS();
+//HOLD_INTERRUPTS();
Multi-line comments are used elsewhere in the code. Better to run pgindent on the patch.

2. It'd be better to add a TAP test hitting the WAL throttling.

3. We generally don't need timings to be calculated explicitly when we emit before and after log messages. If needed, one can calculate the wait time from the timestamps of the log messages. If it still needs an explicit time difference, which I don't think is required because we have a special event and before/after log messages, I think it needs to be under track_wal_io_timing to keep things simple.

4. XLogInsertRecord() can be a hot path for high-write workloads, so the effects of adding anything in it need to be measured. So, it's better to run benchmarks with this feature enabled and disabled.

5. Missing documentation of this feature and the GUC. I think we can describe extensively why we'd need a feature of this kind in the documentation for better adoption or usability.

6. Shouldn't the max limit be MAX_KILOBYTES?
+&synchronous_commit_flush_wal_after,
+0, 0, 1024L*1024L,

7. Can this GUC name be something like synchronous_commit_wal_throttle_threshold to better reflect what it does for a backend?
+{"synchronous_commit_flush_wal_after", PGC_USERSET, REPLICATION_SENDING,

8. A typo - s/confiration/confirmation
+gettext_noop("Sets the maximum logged WAL in kbytes, after which wait for sync commit confiration even without commit "),

9. This "Sets the maximum logged WAL in kbytes, after which wait for sync commit confiration even without commit " would read better as something like: "Sets the maximum amount of WAL in kilobytes a backend generates after which it waits for synchronous commit confirmation even without commit".
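For reference, a minimal sketch of what the guc_tables.c entry could look like if the parameter is renamed as in point 7 and capped at MAX_KILOBYTES as in point 6. The variable name, description string and default here are illustrative assumptions, not taken from the posted patch:

```c
/* Hypothetical entry in ConfigureNamesInt[]; a value of 0 disables throttling. */
{
	{"synchronous_commit_wal_throttle_threshold", PGC_USERSET, REPLICATION_SENDING,
		gettext_noop("Sets the maximum amount of WAL in kilobytes a backend generates "
					 "after which it waits for synchronous commit confirmation "
					 "even without a commit."),
		NULL,
		GUC_UNIT_KB
	},
	&synchronous_commit_wal_throttle_threshold,
	0, 0, MAX_KILOBYTES,
	NULL, NULL, NULL
},
```

Using MAX_KILOBYTES (INT_MAX / 1024) keeps the setting consistent with other size-in-kB GUCs and avoids overflow when the value is converted to bytes.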
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
On Mon, Jan 30, 2023 at 6:36 PM Andres Freund wrote: > On 2023-01-30 15:22:34 +1300, Thomas Munro wrote: > > On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro wrote: > > > out-of-order hazard > > > > I've been trying to understand how that could happen, but my CPU-fu is > > weak. Let me try to write an argument for why it can't happen, so > > that later I can look back at how stupid and naive I was. We have A > > B, and if the CPU sees no dependency and decides to execute B A > > (pipelined), shouldn't an interrupt either wait for the whole > > schemozzle to commit first (if not in a hurry), or nuke it, handle the > > IPI and restart, or something? > > In a core local view, yes, I think so. But I don't think that's how it can > work on multi-core, and even more so, multi-socket machines. Imagine how it'd > influence latency if every interrupt on any CPU would prevent all out-of-order > execution on any CPU. Good. Yeah, I was talking only about a single thread/core. > > After an hour of reviewing randoma > > slides from classes on out-of-order execution and reorder buffers and > > the like, I think the term for making sure that interrupts run with > > the illusion of in-order execution maintained is called "precise > > interrupts", and it is expected in all modern architectures, after the > > early OoO pioneers lost their minds trying to program without it. I > > guess generally you want that because it would otherwise run your > > interrupt handler in a completely uncertain environment, and > > specifically in this case it would reach our signal handler which > > reads A's output (waiting) and writes to B's input (is_set), so B IPI > > A surely shouldn't be allowed? > > Userspace signals aren't delivered synchronously during hardware interrupts > afaik - and I don't think they even possibly could be (after all the process > possibly isn't scheduled). Yeah, they're not synchronous and the target might not even be running. BUT if a suitable thread is running then AFAICT an IPI is delivered to that sucker to get it running the handler ASAP, at least on the three OSes I looked at. (See breadcrumbs below). > I think what you're talking about with precise interrupts above is purely > about the single-core view, and mostly about hardware interrupts for faults > etc. The CPU will unwind state from speculatively executed code etc on > interrupt, sure - but I think that's separate from guaranteeing that you can't > have stale cache contents *due to work by another CPU*. Yeah. I get the cache problem, a separate issue that does indeed look pretty dodgy. I guess I wrote my email out-of-order: at the end I speculated that cache coherency probably can't explain this failure at least in THAT bit of the source, because of that funky extra self-SetLatch(). I just got spooked by the mention of out-of-order execution and I wanted to chase it down and straighten out my understanding. > I'm not even sure that userspace signals are generally delivered via an > immediate hardware interrupt, or whether they're processed at the next > scheduler tick. After all, we know that multiple signals are coalesced, which > certainly isn't compatible with synchronous execution. But it could be that > that just happens when the target of a signal is not currently scheduled. FreeBSD: By default, they are when possible, eg if the process is currently running a suitable thread. 
You can set sysctl kern.smp.forward_signal_enabled=0 to turn that off, and then it works more like the way you imagined (checking for pending signals at various arbitrary times, not sure). See tdsigwakeup() -> forward_signal() -> ipi_cpu(). Linux: Well it certainly smells approximately similar. See signal_wake_up_state() -> kick_process() -> smp_send_reschedule() -> smp_cross_call() -> __ipi_send_mask(). The comment for kick_process() explains that it's using the scheduler IPI to get signals handled ASAP. Darwin: ... -> cpu_signal() -> something that talks about IPIs Coalescing is happening not only at the pending signal level (an invention of the OS), and then for the inter-processor wakeups there is also interrupt coalescing. It's latches all the way down.
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Mon, Jan 30, 2023 at 12:38 PM Kyotaro Horiguchi wrote: > > At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila > wrote in > > > > > > The GUC is not stored in a catalog, but.. oh... it is multiplied by > > > 1000. > > > > Which part of the patch you are referring to here? Isn't the check in > > Where recovery_min_apply_delay is used. It is allowed to be set up to > INT_MAX but it is used as: > > > delayUntil = TimestampTzPlusMilliseconds(xtime, > > recovery_min_apply_delay); > > Where the macro is defined as: > > > #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000)) > > Which can lead to overflow, which is practically harmless. > But here tz is always TimestampTz (which is int64), so do, we need to worry? > > the function defGetMinApplyDelay() sufficient to ensure that the > > 'delay' value stored in the catalog will always be lesser than > > INT_MAX? > > I'm concerned about cases where INT_MAX is wider than int32. If we > don't assume such cases, I'm fine with INT_MAX there. > I am not aware of such cases. Anyway, if any such case is discovered then we need to change the checks in defGetMinApplyDelay(), right? If so, then I think it is better to keep it as it is unless we know that this could be an issue on some platform. -- With Regards, Amit Kapila.
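To make the overflow question above concrete, here is a small standalone illustration (the typedef and macro are copied only so the example compiles on its own; xtime is an arbitrary value, not from the thread): because the (int64) cast promotes the multiplication to 64 bits, even the maximum allowed delay of INT_MAX milliseconds stays far below the int64 range.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t TimestampTz;	/* microseconds, as in PostgreSQL */

#define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64_t) 1000))

int
main(void)
{
	TimestampTz xtime = 742212000000000LL;	/* some arbitrary commit timestamp */
	int			delay_ms = INT_MAX;			/* largest value the option allows */

	/* INT_MAX * 1000 is about 2.1e12, comfortably inside the int64 range */
	printf("%lld\n", (long long) TimestampTzPlusMilliseconds(xtime, delay_ms));
	return 0;
}
```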
Re: Check lateral references within PHVs for memoize cache keys
On Tue, Jan 24, 2023 at 10:07 AM David Rowley wrote:
> I'm surprised to see that it's only Memoize that ever makes use of > lateral_vars. I'd need a bit more time to process your patch, but one > additional thought I had was that I wonder if the following code is > still needed in nodeMemoize.c > > if (bms_nonempty_difference(outerPlan->chgParam, node->keyparamids)) > cache_purge_all(node); > > Ideally, that would be an Assert failure, but possibly we should > probably still call cache_purge_all(node) after Assert(false) so that > at least we'd not start returning wrong results if we've happened to > miss other cache keys. I thought maybe something like:

Hmm, I think this code is still needed because the parameter contained in the subplan below a Memoize node may come from parent plan, as in the test query added in 411137a42.

EXPLAIN (COSTS OFF)
SELECT unique1 FROM tenk1 t0
WHERE unique1 < 3
  AND EXISTS (
    SELECT 1 FROM tenk1 t1
    INNER JOIN tenk1 t2 ON t1.unique1 = t2.hundred
    WHERE t0.ten = t1.twenty AND t0.two <> t2.four
    OFFSET 0);

                         QUERY PLAN
 Index Scan using tenk1_unique1 on tenk1 t0
   Index Cond: (unique1 < 3)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Nested Loop
           ->  Index Scan using tenk1_hundred on tenk1 t2
                 Filter: (t0.two <> four)
           ->  Memoize
                 Cache Key: t2.hundred
                 Cache Mode: logical
                 ->  Index Scan using tenk1_unique1 on tenk1 t1
                       Index Cond: (unique1 = t2.hundred)
                       Filter: (t0.ten = twenty)
(13 rows)

Currently we don't have a way to add Params of uplevel vars to Memoize cache keys. So I think we still need to call cache_purge_all() each time uplevel Params change.

Thanks
Richard
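For context, the shape David appears to be describing (his quoted mail is cut off right after "something like:") would presumably be along these lines in nodeMemoize.c. This is a guess for illustration only, not code from any posted patch:

```c
PlanState  *outerPlan = outerPlanState(node);

if (bms_nonempty_difference(outerPlan->chgParam, node->keyparamids))
{
	/* Changed params that are not cache keys "shouldn't happen" ... */
	Assert(false);
	/* ... but purge anyway so a missed cache key can't return stale results */
	cache_purge_all(node);
}
```

Richard's point is that, as long as uplevel Params (t0.ten and t0.two above) can change without being cache keys, the purge has to stay unconditional, so the Assert variant would fire spuriously.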
Re: Deadlock between logrep apply worker and tablesync worker
On Mon, 30 Jan 2023 at 13:00, houzj.f...@fujitsu.com wrote: > > On Monday, January 30, 2023 2:32 PM Amit Kapila > wrote: > > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C wrote: > > > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila wrote: > > > > > > > > One thing that looks a bit odd is that we will anyway have a similar > > > > check in replorigin_drop_guts() which is a static function and > > > > called from only one place, so, will it be required to check at both > > > > places? > > > > > > There is a possibility that the initial check to verify if replication > > > origin exists in replorigin_drop_by_name was successful but later one > > > of either table sync worker or apply worker process might have dropped > > > the replication origin, > > > > > > > Won't locking on the particular origin prevent concurrent drops? IIUC, the > > drop happens after the patch acquires the lock on the origin. > > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we > already lock the origin before that. I think the check in replorigin_drop_guts > is a custom check after calling SearchSysCache1 to get the tuple, but the > error > should not happen as no concurrent drop can be performed. This scenario is possible while creating subscription, apply worker will try to drop the replication origin if the state is SUBREL_STATE_SYNCDONE. Table sync worker will set the state to SUBREL_STATE_SYNCDONE and update the relation state before calling replorigin_drop_by_name. Since the transaction is committed by table sync worker, the state is visible to apply worker, now apply worker will parallelly try to drop the replication origin in this case. There is a race condition in this case, one of the process table sync worker or apply worker will acquire the lock and drop the replication origin, the other process will get the lock after the process drops the origin and commits the transaction. Now the other process will try to drop the replication origin once it acquires the lock and get the error(from replorigin_drop_guts): cache lookup failed for replication origin with ID. Concurrent drop is possible in this case. Regards, Vignesh
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi Dmitry, On 1/27/23 16:18, Dmitry Dolgov wrote: As I've noted off-list, there was noticeable difference in the dumped bitcode, which I haven't noticed since we were talking mostly about differences between executions of the same query. Thanks for the clarification and also thanks for helping with this effort. -- David Geier (ServiceNow)
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
> On 30/01/23 11:01, John Naylor wrote: > Since David referred to L3 size as the starting point of a possible configuration parameter, that's actually cache-conscious. Okay, makes sense. I am correcting error on my part. > I'm not close enough to this thread to guess at the right direction (although I hope related work will help), but I have a couple general remarks: > 1. In my experience, academic papers like to test sorting with register-sized numbers or strings. Our sorttuples are bigger, have complex comparators, and can fall back to fields in the full tuple. > 2. That paper is over 20 years old. If it demonstrated something genuinely useful, some of those concepts would likely be implemented in the real-world somewhere. Looking for evidence of that might be a good exercise. > 3. 20 year-old results may not carry over to modern hardware. > 4. Open source software is more widespread in the academic world now than 20 years ago, so papers with code (maybe even the author's github) are much more useful to us in my view. > 5. It's actually not terribly hard to make sorting faster for some specific cases -- the hard part is keeping other inputs of interest from regressing. > 6. The bigger the change, the bigger the risk of regressing somewhere. Thanks John, these inputs are actually what I was looking for. I will do more research based on these inputs and build up my understanding. Regards, Ankit
RE: Logical replication timeout problem
On Mon, Jan 30, 2023 at 14:55 PM Amit Kapila wrote: > On Mon, Jan 30, 2023 at 10:36 AM wangw.f...@fujitsu.com > wrote: > > > > On Mon, Jan 30, 2023 11:37 AM Shi, Yu/侍 雨 > wrote: > > > On Sun, Jan 29, 2023 3:41 PM wangw.f...@fujitsu.com > > > wrote: > > > > Yes, I think you are right. > > Fixed this problem. > > > > + /* > + * Trying to send keepalive message after every change has some > + * overhead, but testing showed there is no noticeable overhead if > + * we do it after every ~100 changes. > + */ > +#define CHANGES_THRESHOLD 100 > + > + if (++changes_count < CHANGES_THRESHOLD) > + return; > ... > + changes_count = 0; > > I think it is better to have this threshold-related code in that > caller as we have in the previous version. Also, let's modify the > comment as follows:" > It is possible that the data is not sent to downstream for a long time > either because the output plugin filtered it or there is a DDL that > generates a lot of data that is not processed by the plugin. So, in > such cases, the downstream can timeout. To avoid that we try to send a > keepalive message if required. Trying to send a keepalive message > after every change has some overhead, but testing showed there is no > noticeable overhead if we do it after every ~100 changes." Changed as suggested. I also removed the comment atop the function update_progress_txn_cb_wrapper to be consistent with the nearby *_cb_wrapper functions. Attach the new patch. Regards, Wang Wei v10-0001-Fix-the-logical-replication-timeout-during-proce.patch Description: v10-0001-Fix-the-logical-replication-timeout-during-proce.patch
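To make the suggestion concrete, the caller-side shape being asked for is roughly the following. This is only a sketch: the wrapper name comes from the discussion, but its exact signature and the way it reaches the output plugin in the attached v10 patch may differ.

```c
/*
 * It is possible that the data is not sent to downstream for a long time
 * either because the output plugin filtered it or there is a DDL that
 * generates a lot of data that is not processed by the plugin.  So, in
 * such cases, the downstream can timeout.  To avoid that we try to send a
 * keepalive message if required.  Trying to send a keepalive message after
 * every change has some overhead, but testing showed there is no
 * noticeable overhead if we do it after every ~100 changes.
 */
#define CHANGES_THRESHOLD 100

static void
update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
							   XLogRecPtr lsn)
{
	LogicalDecodingContext *ctx = cache->private_data;
	static int	changes_count = 0;

	if (++changes_count < CHANGES_THRESHOLD)
		return;

	ctx->write_location = lsn;
	OutputPluginUpdateProgress(ctx, false);	/* may send a keepalive */
	changes_count = 0;
}
```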
Re: Making Vars outer-join aware
On Tue, Jan 24, 2023 at 4:38 AM Tom Lane wrote: > Richard, are you planning to review this any more? I'm getting > a little antsy to get it committed. For such a large patch, > it's surprising it's had so few conflicts to date. Sorry for the delayed reply. I don't have any more review comments at the moment, except a nitpicking one. In optimizer/README at line 729 there is a query as SELECT * FROM a LEFT JOIN (SELECT * FROM b WHERE b.z = 1) ss ON (a.x = b.y) WHERE a.x = 1; I think it should be SELECT * FROM a LEFT JOIN (SELECT * FROM b WHERE b.z = 1) ss ON (a.x = ss.y) WHERE a.x = 1; I have no objection to get it committed. Thanks Richard
RE: Time delayed LR (WAS Re: logical replication restrictions)
On Saturday, January 28, 2023 1:28 PM I wrote: > Attached the updated patch v24. Hi, I've conducted the rebase affected by the commit(1e8b61735c) by renaming the GUC to logical_replication_mode accordingly, because it's utilized in the TAP test of this time-delayed LR feature. There is no other change for this version. Kindly have a look at the attached v25. Best Regards, Takamichi Osumi v25-0001-Time-delayed-logical-replication-subscriber.patch Description: v25-0001-Time-delayed-logical-replication-subscriber.patch
MacOS: xsltproc fails with "warning: failed to load external entity"
Hi hackers,

I'm having some difficulties building the documentation on MacOS. I'm using ./full-build.sh script from [1] repository. It worked just fine for many years but since recently it started to fail like this:

```
/usr/bin/xsltproc --path . --stringparam pg.version '16devel' /Users/eax/projects/c/pgscripts/../postgresql/doc/src/sgml/stylesheet.xsl postgres-full.xml
error : Unknown IO error
warning: failed to load external entity "http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl";
compilation error: file /Users/eax/projects/c/pgscripts/../postgresql/doc/src/sgml/stylesheet.xsl line 6 element import
xsl:import : unable to load http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
error : Unknown IO error
/Users/eax/projects/c/postgresql/doc/src/sgml/stylesheet-html-common.xsl:4: warning: failed to load external entity "http://docbook.sourceforge.net/release/xsl/current/common/entities.ent";
%common.entities;
^
Entity: line 1:
%common.entities;
^
[...]
```

This is not a network problem. I can download chunk.xsl with wget and also build the documentation on my Linux laptop. I've tried `brew reinstall` and also:

```
./configure ... XMLLINT="xmllint --nonet" XSLTPROC="xsltproc --nonet"
```

... as suggested by the documentation [2] but it didn't change anything. I checked the archive of pgsql-hackers@ but was unable to find anything relevant. I'm using MacOS Monterey 12.6.2.

```
$ brew info docbook
==> docbook: stable 5.1 (bottled)
...
$ brew info docbook-xsl
==> docbook-xsl: stable 1.79.2 (bottled)
...
```

At this point I could use a friendly piece of advice from the community.

[1]: https://github.com/afiskon/pgscripts/
[2]: https://www.postgresql.org/docs/15/docguide-toolsets.html

-- Best regards, Aleksander Alekseev
Re: old_snapshot_threshold bottleneck on replica
On Fri, 27 Jan 2023 at 18:18, Robert Haas wrote:
> Interesting, but it's still not entirely clear to me from reading the
> comments why we should think that this is safe.

Overall, I think this is safe, because we do not change the algorithm here. More specifically, threshold_timestamp is only used in a few cases:

1). To get the latest value by calling GetOldSnapshotThresholdTimestamp. This will work, since we only change the sync type here from the spinlock to an atomic.
2). In TransactionIdLimitedForOldSnapshots, but no change in behaviour is made here. The synchronization model stays the same as before the patch.
3). In SnapshotTooOldMagicForTest, which is a stub to make old_snapshot_threshold tests appear "working". There is no interplay with the threshold_xid here.

So, we have two use cases for the threshold_timestamp:

a). When the threshold_timestamp is used in conjunction with the threshold_xid. We must use the spinlock to sync.
b). When the threshold_timestamp is used independently of the threshold_xid. In this case, we use atomic values.

-- Best regards, Maxim Orlov.
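To illustrate case b), here is a sketch of the idea only, not the posted patch: if threshold_timestamp were kept in a pg_atomic_uint64, the read-only consumer could become a plain atomic read.

```c
/*
 * Hypothetical atomic variant of GetOldSnapshotThresholdTimestamp():
 * cases 1)/b) above only need the latest published value, so an
 * unordered 64-bit atomic read can replace taking the spinlock.
 */
TimestampTz
GetOldSnapshotThresholdTimestamp(void)
{
	return (TimestampTz) pg_atomic_read_u64(&oldSnapshotControl->threshold_timestamp);
}
```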
Re: Add SHELL_EXIT_CODE to psql
Unfortunately, there is a failure on FreeBSD: https://cirrus-ci.com/task/6466749487382528

Maybe this patch needs to be rebased?

-- Best regards, Maxim Orlov.
Re: Considering additional sort specialisation functions for PG16
I wrote: > On Thu, Jan 26, 2023 at 7:15 PM David Rowley wrote: > > I think the slower sorts I found in [2] could also be partially caused > > by the current sort specialisation comparators re-comparing the > > leading column during a tie-break. I've not gotten around to disabling > > the sort specialisations to see if and how much this is a factor for > > that test. > > Right, that's worth addressing independently of the window function consideration. I'm still swapping this area back in my head, but I believe one issue is that state->base.onlyKey signals two things: "one sortkey, not abbreviated". We could add a separate branch for "first key unabbreviated, nkeys>1" -- I don't think we'd need to specialize, just branch -- and instead of state->base.comparetup, call a set of analogous functions that only handle keys 2 and above (comparetup_tail_* ? or possibly just add a boolean parameter compare_first). That would not pose a huge challenge, I think, since they're already written like this: > > /* Compare the leading sort key */ > compare = ApplySortComparator(...); > if (compare != 0) > return compare; > > /* Compare additional sort keys */ > ... > > The null/non-null separation would eliminate a bunch of branches in inlined comparators, so we could afford to add another branch for number of keys. I gave this a go, and it turns out we don't need any extra branches in the inlined comparators -- the new fallbacks are naturally written to account for the "!onlyKey" case. If the first sortkey was abbreviated, call its full comparator, otherwise skip to the next sortkey (if there isn't one, we shouldn't have gotten here). The existing comparetup functions try the simple case and then call the fallback (which could be inlined for them but I haven't looked). Tests pass, but I'm not sure yet if we need more tests. I don't have a purpose-built benchmark at the moment, but I'll see if any of my existing tests exercise this code path. I can also try the window function case unless someone beats me to it. -- John Naylor EDB: http://www.enterprisedb.com From 65b94223f8bdb726b97d695f0040c12197d5e65d Mon Sep 17 00:00:00 2001 From: John Naylor Date: Mon, 30 Jan 2023 17:10:00 +0700 Subject: [PATCH v1] Split out fallback functionality from comparetup* functions Previously, if a specialized comparator find equal datum1 keys, the comparetup function would call the full comparator on the datum before proceeding with either the unabbreviated first key or the second key. Add a comparetup_fallback field for these that call special fallback functions. The existing comparetup functions just call ApplySortComparator where we can, than call our new fallback. 
--- src/backend/utils/sort/tuplesort.c | 6 +- src/backend/utils/sort/tuplesortvariants.c | 114 - src/include/utils/tuplesort.h | 6 ++ 3 files changed, 96 insertions(+), 30 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 9ca9835aab..54f64d6c2d 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -513,7 +513,7 @@ qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) if (state->base.onlyKey != NULL) return 0; - return state->base.comparetup(a, b, state); + return state->base.comparetup_fallback(a, b, state); } #if SIZEOF_DATUM >= 8 @@ -537,7 +537,7 @@ qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) if (state->base.onlyKey != NULL) return 0; - return state->base.comparetup(a, b, state); + return state->base.comparetup_fallback(a, b, state); } #endif @@ -561,7 +561,7 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) if (state->base.onlyKey != NULL) return 0; - return state->base.comparetup(a, b, state); + return state->base.comparetup_fallback(a, b, state); } /* diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c index eb6cfcfd00..e65ac49d9d 100644 --- a/src/backend/utils/sort/tuplesortvariants.c +++ b/src/backend/utils/sort/tuplesortvariants.c @@ -47,18 +47,24 @@ static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups, int count); static int comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state); +static int comparetup_heap_fallback(const SortTuple *a, const SortTuple *b, + Tuplesortstate *state); static void writetup_heap(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup); static void readtup_heap(Tuplesortstate *state, SortTuple *stup, LogicalTape *tape, unsigned int len); static int comparetup_cluster(const SortTuple *a, const SortTuple *b, Tuplesortstate *state); +static int comparetup_cluster_fallback(const SortTuple *a, const SortTuple *b, + Tuplesortstate *state); static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape, S
Re: Deadlock between logrep apply worker and tablesync worker
On Mon, 30 Jan 2023 at 13:00, houzj.f...@fujitsu.com wrote: > > On Monday, January 30, 2023 2:32 PM Amit Kapila > wrote: > > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C wrote: > > > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila wrote: > > > > > > > > One thing that looks a bit odd is that we will anyway have a similar > > > > check in replorigin_drop_guts() which is a static function and > > > > called from only one place, so, will it be required to check at both > > > > places? > > > > > > There is a possibility that the initial check to verify if replication > > > origin exists in replorigin_drop_by_name was successful but later one > > > of either table sync worker or apply worker process might have dropped > > > the replication origin, > > > > > > > Won't locking on the particular origin prevent concurrent drops? IIUC, the > > drop happens after the patch acquires the lock on the origin. > > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we > already lock the origin before that. I think the check in replorigin_drop_guts > is a custom check after calling SearchSysCache1 to get the tuple, but the > error > should not happen as no concurrent drop can be performed. > > To make it simpler, one idea is to move the code that getting the tuple from > system cache to the replorigin_drop_by_name(). After locking the origin, we > can try to get the tuple and do the existence check, and we can reuse > this tuple to perform origin delete. In this approach we only need to check > origin existence once after locking. BTW, if we do this, then we'd better > rename the > replorigin_drop_guts() to something like replorigin_state_clear() as the > function > only clear the in-memory information after that. > > The code could be like: > > --- > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) > ... > /* > * Lock the origin to prevent concurrent drops. We keep the lock > until the > * end of transaction. > */ > LockSharedObject(ReplicationOriginRelationId, roident, 0, > AccessExclusiveLock); > > tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); > if (!HeapTupleIsValid(tuple)) > { > if (!missing_ok) > elog(ERROR, "cache lookup failed for replication > origin with ID %d", > roident); > > return; > } > > replorigin_state_clear(rel, roident, nowait); > > /* > * Now, we can delete the catalog entry. > */ > CatalogTupleDelete(rel, &tuple->t_self); > ReleaseSysCache(tuple); > > CommandCounterIncrement(); > ... +1 for this change as it removes the redundant check which is not required. I will post an updated version for this. Regards, Vignesh
RE: Time delayed LR (WAS Re: logical replication restrictions)
On Monday, January 30, 2023 12:02 PM Kyotaro Horiguchi wrote: > At Sat, 28 Jan 2023 04:28:29 +, "Takamichi Osumi (Fujitsu)" > wrote in > > On Friday, January 27, 2023 8:00 PM Amit Kapila > wrote: > > > So, you have changed min_apply_delay from int64 to int32, but you > > > haven't mentioned the reason for the same? We use 'int' for the > > > similar parameter recovery_min_apply_delay, so, ideally, it makes > > > sense but still better to tell your reason explicitly. > > Yes. It's because I thought I need to make this feature consistent with the > recovery_min_apply_delay. > > This feature handles the range same as the recovery_min_apply delay > > from 0 to INT_MAX now so should be adjusted to match it. > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where that > actually happens, though.) We can use PG_INT32_MAX instead. > > IMHO, I think we don't use int as a catalog column and I agree that > int32 is sufficient since I don't think more than 49 days delay is practical. > On > the other hand, maybe I wouldn't want to use int32 for intermediate > calculations. Hi, Horiguchi-san. Thanks for your comments ! IIUC, in the last sentence, you proposed the type of SubOpts min_apply_delay should be change to "int". But I couldn't find actual harm of the current codes, because we anyway insert the SubOpts value to the catalog after holding it in SubOpts. Also, it seems there is no explicit rule where we should use "int" local variables for "int32" system catalog values internally. I had a look at other variables for int32 system catalog members and either looked fine. So, I'd like to keep the current code as it is, until actual harm is found. The latest patch can be seen in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373E26884C385EFFFB8965FEDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Re: Generating code for query jumbling through gen_node_support.pl
On Fri, Jan 27, 2023 at 11:59:47AM +0900, Michael Paquier wrote:
> Using that, I can compile the following results for various cases (-O2
> and compute_query_id=on):
>           query           |  mode  | iterations | avg_runtime_ns | avg_jumble_ns
> --------------------------+--------+------------+----------------+---------------
>  begin                    | string |       5000 |        4.53116 |          4.54
>  begin                    | jumble |       5000 |       30.94578 |         30.94
>  commit                   | string |       5000 |        4.76004 |          4.74
>  commit                   | jumble |       5000 |        31.4791 |         31.48
>  create table 1 column    | string |       5000 |        7.22836 |          7.08
>  create table 1 column    | jumble |       5000 |      152.10852 |        151.96
>  create table 5 columns   | string |       5000 |       12.43412 |         12.28
>  create table 5 columns   | jumble |       5000 |      352.88976 |         349.1
>  create table 20 columns  | string |        500 |         49.591 |          48.2
>  create table 20 columns  | jumble |        500 |      2272.4066 |          2271
>  drop table 1 column      | string |       5000 |        6.70538 |          6.56
>  drop table 1 column      | jumble |       5000 |          50.38 |         50.24
>  drop table 5 columns     | string |       5000 |        6.88256 |          6.74
>  drop table 5 columns     | jumble |       5000 |       50.02898 |          49.9
>  SET work_mem             | string |       5000 |        7.28752 |          7.28
>  SET work_mem             | jumble |       5000 |       91.66588 |         91.64
> (16 rows)

Just to close the loop here, I have done more measurements to compare the jumble done for some DMLs and some SELECTs between HEAD and the patch (forgot to post some last Friday). Both methods show comparable results:

        query         |  mode  | iterations | avg_runtime_ns | avg_jumble_ns
----------------------+--------+------------+----------------+---------------
 insert table 10 cols | master |       5000 |      377.17878 |        377.04
 insert table 10 cols | jumble |       5000 |      409.47924 |        409.34
 insert table 20 cols | master |       5000 |      692.94924 |         692.8
 insert table 20 cols | jumble |       5000 |       710.0901 |        709.96
 insert table 5 cols  | master |       5000 |      232.44308 |         232.3
 insert table 5 cols  | jumble |       5000 |      253.49854 |        253.36
 select 10 cols       | master |       5000 |      449.13608 |        383.36
 select 10 cols       | jumble |       5000 |      491.61912 |        323.86
 select 5 cols        | master |       5000 |        277.477 |        277.46
 select 5 cols        | jumble |       5000 |      323.88152 |        323.86
(10 rows)

The averages are in ns, so the difference does not bother me much. There may be some noise mixed in that ;) (Attached is the tweak I have applied on HEAD to get some numbers.)
-- Michael diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index 16084842a3..c66090bde3 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -101,10 +101,14 @@ CleanQuerytext(const char *query, int *location, int *len) JumbleState * JumbleQuery(Query *query, const char *querytext) { +#define MAX_QUERY_COMPUTE 5000 + JumbleState *jstate = NULL; Assert(IsQueryIdEnabled()); + elog(WARNING, "start JumbleQuery"); + if (query->utilityStmt) { query->queryId = compute_utility_query_id(querytext, @@ -114,18 +118,26 @@ JumbleQuery(Query *query, const char *querytext) else { jstate = (JumbleState *) palloc(sizeof(JumbleState)); - - /* Set up workspace for query jumbling */ jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE); - jstate->jumble_len = 0; jstate->clocations_buf_size = 32; jstate->clocations = (LocationLen *) palloc(jstate->clocations_buf_size * sizeof(LocationLen)); - jstate->clocations_count = 0; - jstate->highest_extern_param_id = 0; - /* Compute query ID and mark the Query node with it */ - JumbleQueryInternal(jstate, query); + for (int i = 0; i < MAX_QUERY_COMPUTE ; i++) + { + /* Set up workspace for query jumbling */ + memset(jstate->jumble, 0, sizeof(JUMBLE_SIZE)); + jstate->jumble_len = 0; + jstate->clocations_buf_size = 32; + memset(jstate->clocations, 0, + jstate->clocations_buf_size * sizeof(LocationLen)); + + jstate->clocations_count = 0; + jstate->highest_extern_p
Re: [DOCS] Stats views and functions not in order?
On 30.01.23 07:12, Peter Smith wrote: Meanwhile, this pagination topic has strayed far away from the original $SUBJECT, so I guess since there is nothing else pending this thread's CF entry [1] can just be marked as "Committed" now? done
Re: Generating code for query jumbling through gen_node_support.pl
On 27.01.23 04:07, Michael Paquier wrote: On Thu, Jan 26, 2023 at 09:39:05AM +0100, Peter Eisentraut wrote: There are a couple of repetitive comments, like "typmod and collation information are irrelevant for the query jumbling". This applies to all nodes, so we don't need to repeat it for a number of nodes (and then not mention it for other nodes). Maybe there should be a central place somewhere that describes "these kinds of fields should normally be ignored". The place that would make the most sense to me to centralize this knowlegde is nodes.h itself, because that's where the node attributes are defined? Either that or src/backend/nodes/README.
Re: Generating code for query jumbling through gen_node_support.pl
On 27.01.23 03:59, Michael Paquier wrote: At the end, that would be unnoticeable for the average user, I guess, but here are the numbers I get on my laptop :) Personally, I think we do not want the two jumble methods in parallel. Maybe there are other opinions. I'm going to set this thread as "Ready for Committer". Either wait a bit for more feedback on this topic, or just go ahead with either solution. We can leave it as a semi-open item for reconsideration later.
Re: Allow logical replication to copy tables in binary format
Hi Bharath, Thanks for reviewing. Bharath Rupireddy , 18 Oca 2023 Çar, 10:17 tarihinde şunu yazdı: > On Thu, Jan 12, 2023 at 1:53 PM Melih Mutlu > wrote: > 1. The performance numbers posted upthread [1] look impressive for the > use-case tried, that's a 2.25X improvement or 55.6% reduction in > execution times. However, it'll be great to run a few more varied > tests to confirm the benefit. > Sure, do you have any specific test case or suggestion in mind? > 2. It'll be great to capture the perf report to see if the time spent > in copy_table() is reduced with the patch. > Will provide that in another email soon. > 3. I think blending initial table sync's binary copy option with > data-type level binary send/receive is not good. Moreover, data-type > level binary send/receive has its own restrictions per 9de77b5453. > IMO, it'll be good to have a new option, say copy_data_format synonyms > with COPY command's FORMAT option but with only text (default) and > binary values. > Added a "copy_format" option for subscriptions with text as default value. So it would be possible to create a binary subscription but copy tables in text format to avoid restrictions that you're concerned about. > 4. Why to add tests to existing 002_types.pl? Can we add a new file > with all the data types covered? > Since 002_types.pl is where the replication of data types are covered. I thought it would be okay to test replication with the binary option in that file. Sure, we can add a new file with different data types for testing subscriptions with binary option. But then I feel like it would have too many duplicated lines with 002_types.pl. If you think that 002_types.pl lacks some data types needs to be tested, then we should add those into 002_types.pl too whether we test subscription with binary option in that file or some other place, right? > 5. It's not clear to me as to why these rows were removed in the patch? > -(1, '{1, 2, 3}'), > -(2, '{2, 3, 1}'), > (3, '{3, 2, 1}'), > (4, '{4, 3, 2}'), > (5, '{5, NULL, 3}'); > > -- test_tbl_arrays > INSERT INTO tst_arrays (a, b, c, d) VALUES > -('{1, 2, 3}', '{"a", "b", "c"}', '{1.1, 2.2, 3.3}', '{"1 > day", "2 days", "3 days"}'), > -('{2, 3, 1}', '{"b", "c", "a"}', '{2.2, 3.3, 1.1}', '{"2 > minutes", "3 minutes", "1 minute"}'), > Previously, it wasn't actually testing the initial table sync since all tables were empty when subscription was created. I just simply split the data initially inserted to test initial table sync. With this patch, it inserts the first two rows for all data types before subscriptions get created. You can see these lines: > +# Insert initial test data > +$node_publisher->safe_psql( > + 'postgres', qq( > + -- test_tbl_one_array_col > + INSERT INTO tst_one_array (a, b) VALUES > + (1, '{1, 2, 3}'), > + (2, '{2, 3, 1}'); > + > + -- test_tbl_arrays > + INSERT INTO tst_arrays (a, b, c, d) VALUES > 6. BTW, the subbinary description is missing in pg_subscription docs > https://www.postgresql.org/docs/devel/catalog-pg-subscription.html? > - Specifies whether the subscription will request the publisher to > - send the data in binary format (as opposed to text). > + Specifies whether the subscription will copy the initial data to > + synchronize relations in binary format and also request the > publisher > + to send the data in binary format too (as opposed to text). > Done. > 7. A nitpick - space is needed after >= before 16. > +if (walrcv_server_version(LogRepWorkerWalRcvConn) >=16 && > Done. > 8. 
Note that the COPY binary format isn't portable across platforms > (Windows to Linux for instance) or major versions > https://www.postgresql.org/docs/devel/sql-copy.html, whereas, logical > replication is > https://www.postgresql.org/docs/devel/logical-replication.html. > I don't see any handling of such cases in copy_table but only a check > for the publisher version. I think we need to account for all the > cases - allow binary COPY only when publisher and subscriber are of > same versions, architectures, platforms. The trick here is how we > identify if the subscriber is of the same type and taste > (architectures and platforms) as the publisher. Looking for > PG_VERSION_STR of publisher and subscriber might be naive, but I'm not > sure if there's a better way to do it. > I think having the "copy_format" option with text as default, like I replied to your 3rd review above, will keep things as they are now. The patch now will only allow users to choose binary copy as well, if they want it and acknowledge the limitations that come with binary copy. COPY command's portability issues shouldn't be an issue right now, since the patch still supports text format. Right? > Also, the COPY binary format is very data type specific - per the docs > "for example it will not
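For what it's worth, the mechanism under discussion boils down to a small change in how copy_table() in tablesync.c builds the COPY command it sends to the publisher. A rough, simplified sketch follows; the copyformat field and the LOGICALREP_COPY_AS_BINARY constant are assumptions for illustration, not necessarily what the patch uses, and details such as column lists and row filters are omitted:

```c
StringInfoData cmd;

/* Build the outbound COPY command for the initial table sync. */
initStringInfo(&cmd);
appendStringInfo(&cmd, "COPY %s TO STDOUT",
				 quote_qualified_identifier(lrel.nspname, lrel.relname));

/* Hypothetical: honour the proposed copy_format subscription option. */
if (MySubscription->copyformat == LOGICALREP_COPY_AS_BINARY)
	appendStringInfoString(&cmd, " WITH (FORMAT binary)");
```

Keeping text as the default means the portability caveats of the binary COPY format only apply when the user explicitly opts in.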
Re: MacOS: xsltproc fails with "warning: failed to load external entity"
Hi hackers,

> At this point I could use a friendly piece of advice from the community.

I've found a solution:

```
export SGML_CATALOG_FILES=/usr/local/etc/xml/catalog
export XMLLINT="xmllint --catalogs"
export XSLTPROC="xsltproc --catalogs"
```

I will submit a patch for the documentation in a bit, after I'll check it properly.

-- Best regards, Aleksander Alekseev
Re: Assertion failure in SnapBuildInitialSnapshot()
On Thu, Dec 8, 2022 at 8:17 AM Masahiko Sawada wrote: > > On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila wrote: > > One idea to fix this issue is that in > ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin > while holding both ProcArrayLock and ReplicationSlotControlLock, and > release only ReplicationSlotsControlLock before updating the > replication_slot_xmin. I'm concerned it will increase the contention > on ProcArrayLock but I've attached the patch for discussion. > But what kind of workload are you worried about? This will be called while processing XLOG_RUNNING_XACTS to update procArray->replication_slot_xmin/procArray->replication_slot_catalog_xmin only when required. So, if we want we can test some concurrent workloads along with walsenders doing the decoding to check if it impacts performance. What other way we can fix this? Do you think we can try to avoid retreating xmin values in ProcArraySetReplicationSlotXmin() to avoid this problem? Personally, I think taking the lock as proposed by your patch is a better idea. BTW, this problem seems to be only logical replication specific, so if we are too worried then we can change this locking only for logical replication. -- With Regards, Amit Kapila.
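To make the proposed locking order concrete, a conceptual sketch of what ReplicationSlotsComputeRequiredXmin() would do under the change being discussed (simplified, not the attached patch):

```c
/*
 * Compute the aggregate xmin while ProcArrayLock is held, so that computing
 * the minimum and publishing it cannot be interleaved with another backend
 * doing the same, or with readers such as
 * GetOldestSafeDecodingTransactionId().
 */
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

/* ... scan ReplicationSlotCtl->replication_slots and compute
 *     agg_xmin / agg_catalog_xmin ... */

LWLockRelease(ReplicationSlotControlLock);

/* Still under ProcArrayLock: publish the minima. */
ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, true);

LWLockRelease(ProcArrayLock);
```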
Re: Assertion failure in SnapBuildInitialSnapshot()
On Mon, Jan 30, 2023 at 11:34 AM Amit Kapila wrote:
> > I have reproduced it manually. For this, I had to manually make the > debugger call ReplicationSlotsComputeRequiredXmin(false) via path > SnapBuildProcessRunningXacts()->LogicalIncreaseXminForSlot()->LogicalConfirmReceivedLocation() > ->ReplicationSlotsComputeRequiredXmin(false) for the apply worker. The > sequence of events is something like (a) the replication_slot_xmin for > tablesync worker is overridden by apply worker as zero as explained in > Sawada-San's email, (b) another transaction happened on the publisher > that will increase the value of ShmemVariableCache->nextXid (c) > tablesync worker invokes > SnapBuildInitialSnapshot()->GetOldestSafeDecodingTransactionId() which > will return an oldestSafeXid which is higher than snapshot's xmin. > This happens because replication_slot_xmin has an InvalidTransactionId > value and we won't consider replication_slot_catalog_xmin because > catalogOnly flag is false and there is no other open running > transaction. I think we should try to get a simplified test to > reproduce this problem if possible.

Here are steps to reproduce it manually with the help of a debugger:

Session-1
=========
select pg_create_logical_replication_slot('s', 'test_decoding');
create table t2(c1 int);
select pg_replication_slot_advance('s', pg_current_wal_lsn()); -- Debug this statement. Stop before taking procarraylock in ProcArraySetReplicationSlotXmin.

Session-2
=========
psql -d postgres
Begin;

Session-3
=========
psql -d "dbname=postgres replication=database"
begin transaction isolation level repeatable read read only;
CREATE_REPLICATION_SLOT slot1 LOGICAL test_decoding USE_SNAPSHOT; -- Debug this statement. Stop in SnapBuildInitialSnapshot() before taking procarraylock.

Session-1
=========
Continue debugging and finish execution of ProcArraySetReplicationSlotXmin. Verify procArray->replication_slot_xmin is zero.

Session-2
=========
Select txid_current();
Commit;

Session-3
=========
Continue debugging. Verify that safeXid follows snap->xmin. This leads to assertion (in back branches) or error (in HEAD).

-- With Regards, Amit Kapila.
Re: Assertion failure in SnapBuildInitialSnapshot()
On Fri, Jan 27, 2023 at 4:31 PM Hayato Kuroda (Fujitsu) wrote: > > Thank you for making the patch! I'm still considering whether this approach is > correct, but I can put a comment to your patch anyway. > > ``` > - Assert(!already_locked || LWLockHeldByMe(ProcArrayLock)); > - > - if (!already_locked) > - LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); > + Assert(LWLockHeldByMe(ProcArrayLock)); > ``` > > In this function, we regard that the ProcArrayLock has been already acquired > as > exclusive mode and modify data. I think LWLockHeldByMeInMode() should be used > instead of LWLockHeldByMe(). > Right, this is even evident from the comments atop ReplicationSlotsComputeRequiredXmin("If already_locked is true, ProcArrayLock has already been acquired exclusively.". But, I am not sure if it is a good idea to remove 'already_locked' parameter, especially in back branches as this is an exposed API. -- With Regards, Amit Kapila.
Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl
> I kind of think this is a lot of unnecessary work. The case that is > problematic is a GUC that's marked GUC_NO_SHOW_ALL but not marked > GUC_NOT_IN_SAMPLE. There aren't any of those, and I don't think there > are likely to be any in future, because it doesn't make a lot of sense. > Why don't we just make a policy against doing that, and enforce it > with an assertion somewhere in GUC initialization? > > Looking at guc.sql, I think that these is a second mistake: the test > checks for (no_show_all AND !no_reset_all) but this won't work > because NO_SHOW_ALL GUCs cannot be scanned via SQL. There are two > parameters that include this combination of flags: default_with_oids > and ssl_renegotiation_limit, so the check would not do what it > should. I think that this part should be removed. Thanks Michael for identifying a new mistake. I am a little confused here. I don't understand why GUC_NO_SHOW_ALL depends on other GUCs like GUC_NOT_IN_SAMPLE or GUC_NO_RESET_ALL. Looks like the dependency between GUC_NO_RESET_ALL and GUC_NO_SHOW_ALL is removed in the above patch since we have these combinations now. Similarly why can't we have a GUC marked as GUC_NO_SHOW_ALL but not GUC_NOT_IN_CONFIG. For me it makes sense if a GUC is marked as NO_SHOW_ALL and it can be present in a sample file. It's up to the author/developer to choose which all flags are applicable to the newly introduced GUCs. Please share your thoughts. Thanks & Regards, Nitin Jadhav On Mon, Jan 30, 2023 at 10:36 AM Michael Paquier wrote: > > On Sun, Jan 29, 2023 at 01:05:07PM -0500, Tom Lane wrote: > > I kind of think this is a lot of unnecessary work. The case that is > > problematic is a GUC that's marked GUC_NO_SHOW_ALL but not marked > > GUC_NOT_IN_SAMPLE. There aren't any of those, and I don't think there > > are likely to be any in future, because it doesn't make a lot of sense. > > Why don't we just make a policy against doing that, and enforce it > > with an assertion somewhere in GUC initialization? > > [..thinks..] > > Looking at guc.sql, I think that these is a second mistake: the test > checks for (no_show_all AND !no_reset_all) but this won't work > because NO_SHOW_ALL GUCs cannot be scanned via SQL. There are two > parameters that include this combination of flags: default_with_oids > and ssl_renegotiation_limit, so the check would not do what it > should. I think that this part should be removed. > > For the second part to prevent GUCs to be marked as NO_SHOW_ALL && > !NOT_IN_SAMPLE, check_GUC_init() looks like the right place to me, > because this routine has been designed exactly for this purpose. > > So, what do you think about the attached? > -- > Michael
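As a concrete illustration of the policy Tom and Michael describe, the check in check_GUC_init() could be as simple as the following sketch (the wording and exact reporting style are assumptions, not the actual patch):

```c
/*
 * A parameter hidden from SHOW ALL has no business being listed in
 * postgresql.conf.sample, so insist that GUC_NO_SHOW_ALL implies
 * GUC_NOT_IN_SAMPLE.
 */
if ((gconf->flags & GUC_NO_SHOW_ALL) != 0 &&
	(gconf->flags & GUC_NOT_IN_SAMPLE) == 0)
{
	elog(LOG, "GUC %s is marked NO_SHOW_ALL but not NOT_IN_SAMPLE",
		 gconf->name);
	return false;
}
```

Since no existing GUC uses that combination, such a check only guards against future mistakes rather than changing any current behaviour.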
Re: MacOS: xsltproc fails with "warning: failed to load external entity"
Hi hackers, > I've found a solution: > > ``` > export SGML_CATALOG_FILES=/usr/local/etc/xml/catalog > export XMLLINT="xmllint --catalogs" > export XSLTPROC="xsltproc --catalogs" > ``` > > I will submit a patch for the documentation in a bit, after I'll check > it properly. PFA the patch. I don't have a strong opinion regarding any particular wording and would like to ask the committer to change it as he sees fit. -- Best regards, Aleksander Alekseev v1-0001-Document-the-workaround-for-xsltproc-when-buildin.patch Description: Binary data
Re: Deadlock between logrep apply worker and tablesync worker
On Mon, 30 Jan 2023 at 13:00, houzj.f...@fujitsu.com wrote: > > On Monday, January 30, 2023 2:32 PM Amit Kapila > wrote: > > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C wrote: > > > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila wrote: > > > > > > > > One thing that looks a bit odd is that we will anyway have a similar > > > > check in replorigin_drop_guts() which is a static function and > > > > called from only one place, so, will it be required to check at both > > > > places? > > > > > > There is a possibility that the initial check to verify if replication > > > origin exists in replorigin_drop_by_name was successful but later one > > > of either table sync worker or apply worker process might have dropped > > > the replication origin, > > > > > > > Won't locking on the particular origin prevent concurrent drops? IIUC, the > > drop happens after the patch acquires the lock on the origin. > > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we > already lock the origin before that. I think the check in replorigin_drop_guts > is a custom check after calling SearchSysCache1 to get the tuple, but the > error > should not happen as no concurrent drop can be performed. > > To make it simpler, one idea is to move the code that getting the tuple from > system cache to the replorigin_drop_by_name(). After locking the origin, we > can try to get the tuple and do the existence check, and we can reuse > this tuple to perform origin delete. In this approach we only need to check > origin existence once after locking. BTW, if we do this, then we'd better > rename the > replorigin_drop_guts() to something like replorigin_state_clear() as the > function > only clear the in-memory information after that. > > The code could be like: > > --- > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) > ... > /* > * Lock the origin to prevent concurrent drops. We keep the lock > until the > * end of transaction. > */ > LockSharedObject(ReplicationOriginRelationId, roident, 0, > AccessExclusiveLock); > > tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); > if (!HeapTupleIsValid(tuple)) > { > if (!missing_ok) > elog(ERROR, "cache lookup failed for replication > origin with ID %d", > roident); > > return; > } > > replorigin_state_clear(rel, roident, nowait); > > /* > * Now, we can delete the catalog entry. > */ > CatalogTupleDelete(rel, &tuple->t_self); > ReleaseSysCache(tuple); > > CommandCounterIncrement(); The attached updated patch has the changes to handle the same. Regards, Vignesh From dc2e9acc6a55772896117cf3b88e4189f994a82d Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Fri, 27 Jan 2023 15:17:09 +0530 Subject: [PATCH v2] Lock the replication origin record instead of locking the pg_replication_origin relation. Lock the replication origin record instead of locking the pg_replication_origin relation. --- src/backend/replication/logical/origin.c | 56 1 file changed, 27 insertions(+), 29 deletions(-) diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c index b754c43840..3e360cf41e 100644 --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -338,16 +338,14 @@ replorigin_create(const char *roname) * Helper function to drop a replication origin. 
*/ static void -replorigin_drop_guts(Relation rel, RepOriginId roident, bool nowait) +replorigin_state_clear(RepOriginId roident, bool nowait) { - HeapTuple tuple; int i; /* - * First, clean up the slot state info, if there is any matching slot. + * Clean up the slot state info, if there is any matching slot. */ restart: - tuple = NULL; LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE); for (i = 0; i < max_replication_slots; i++) @@ -402,19 +400,6 @@ restart: } LWLockRelease(ReplicationOriginLock); ConditionVariableCancelSleep(); - - /* - * Now, we can delete the catalog entry. - */ - tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); - if (!HeapTupleIsValid(tuple)) - elog(ERROR, "cache lookup failed for replication origin with ID %d", - roident); - - CatalogTupleDelete(rel, &tuple->t_self); - ReleaseSysCache(tuple); - - CommandCounterIncrement(); } /* @@ -427,24 +412,37 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) { RepOriginId roident; Relation rel; + HeapTuple tuple; Assert(IsTransactionState()); - /* - * To interlock against concurrent drops, we hold ExclusiveLock on - * pg_replication_origin till xact commit. - * - * XXX We can optimize this by acquiring the lock on a specific origin by - * using LockSharedObject if required
Re: generate_series for timestamptz and time zone problem
Gurjeet Singh wrote on 30.01.2023 08:18: On Sat, Nov 12, 2022 at 10:44 AM Tom Lane wrote: However, what it shouldn't be is part of this patch. It's worth pushing it separately to have a record of that decision. I've now done that, so you'll need to rebase to remove that delta. I looked over the v5 patch very briefly, and have two main complaints: * There's no documentation additions. You can't add a user-visible function without adding an appropriate entry to func.sgml. * I'm pretty unimpressed with the whole truncate-to-interval thing and would recommend you drop it. I don't think it's adding much useful functionality beyond what we can already do with the existing date_trunc variants; and the definition seems excessively messy (too many restrictions and special cases). Please see attached v6 of the patch. The changes since v5 are: .) Rebased and resolved conflicts caused by commit 533e02e92. .) Removed code and tests related to new date_trunc() functions, as suggested by Tom. .) Added 3 more variants to accompany with date_add(tstz, interval, zone). date_add(tstz, interval) date_subtract(tstz, interval) date_subtract(tstz, interval, zone) .) Eliminate duplication of code; use common function to implement generate_series_timestamptz[_at_zone]() functions. .) Fixed bug where in one of the new code paths, generate_series_timestamptz_with_zone(), did not perform TIMESTAMP_NOT_FINITE() check. .) Replaced some DirectFunctionCall?() with direct calls to the relevant *_internal() function; should be better for performance. .) Added documentation all 5 functions (2 date_add(), 2 date_subtract(), 1 overloaded version of generate_series()). Other work distracted me from this patch. I looked at your update v6 and it looks ok. For me the date_trunc function is important and I still have some corner cases. Now I will continue working with data_trunc in a separate patch. I'm not sure of the convention around authorship. But since this was not an insignificant amount of work, would this patch be considered as co-authored by Przemyslaw and myself? Should I add myself to Authors field in the Commitfest app? I see no obstacles for us to be co-authors. Hi Przemyslaw, I started working on this patch based on Tom's review a few days ago, since you hadn't responded in a while, and I presumed you're not working on this anymore. I should've consulted with/notified you of my intent before starting to work on it, to avoid duplication of work. Sorry if this submission obviates any work you have in progress. Please feel free to provide your feedback on the v6 of the patch. I propose to get the approval of the current truncated version of the patch. As I wrote above, I will continue work on date_trunc later and as a separate patch. -- Przemysław Sztoch | Mobile +48 509 99 00 66
Re: Allow logical replication to copy tables in binary format
On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu wrote: > Thanks for providing an updated patch. >> On Thu, Jan 12, 2023 at 1:53 PM Melih Mutlu wrote: >> 1. The performance numbers posted upthread [1] look impressive for the >> use-case tried, that's a 2.25X improvement or 55.6% reduction in >> execution times. However, it'll be great to run a few more varied >> tests to confirm the benefit. > > Sure, do you have any specific test case or suggestion in mind? Have a huge amount of pre-existing data in the publisher's tables (with a mix of columns like int, text, double, bytea and so on) so that the subscriber's table sync workers have to do a "good" amount of work to copy, then measure the copy_table time with and without the patch. >> 2. It'll be great to capture the perf report to see if the time spent >> in copy_table() is reduced with the patch. > > Will provide that in another email soon. Thanks. >> 4. Why add tests to the existing 002_types.pl? Can we add a new file >> with all the data types covered? > > Since 002_types.pl is where the replication of data types is covered, I > thought it would be okay to test replication with the binary option in that > file. > Sure, we can add a new file with different data types for testing > subscriptions with the binary option. But then I feel like it would have too many > duplicated lines with 002_types.pl. > If you think that 002_types.pl lacks some data types that need to be tested, then > we should add those to 002_types.pl too, whether we test subscription with > the binary option in that file or some other place, right? > > Previously, it wasn't actually testing the initial table sync since all > tables were empty when the subscription was created. > I simply split the initially inserted data to test initial table sync. > > With this patch, it inserts the first two rows for all data types before > subscriptions get created. > You can see these lines: It'd be better and clearer to have a separate TAP test file IMO since the focus of the feature here isn't to just test for data types. With separate tests, you can verify "ERROR: incorrect binary data format in logical replication column 1" cases. >> 8. Note that the COPY binary format isn't portable across platforms >> (Windows to Linux for instance) or major versions >> https://www.postgresql.org/docs/devel/sql-copy.html, whereas logical >> replication is >> https://www.postgresql.org/docs/devel/logical-replication.html. >> I don't see any handling of such cases in copy_table but only a check >> for the publisher version. I think we need to account for all the >> cases - allow binary COPY only when publisher and subscriber are of the >> same versions, architectures, and platforms. The trick here is how we >> identify if the subscriber is of the same type and taste >> (architectures and platforms) as the publisher. Looking for >> PG_VERSION_STR of publisher and subscriber might be naive, but I'm not >> sure if there's a better way to do it. > > I think having the "copy_format" option with text as default, like I replied > to your 3rd review above, will keep things as they are now. > The patch now only allows users to also choose binary copy, if they > want it and acknowledge the limitations that come with binary copy. > The COPY command's portability issues shouldn't be a problem right now, since the > patch still supports the text format. Right? With the above said, do you think checking for publisher versions is needed? The user can decide to enable/disable binary COPY based on the publisher's version, no?
+/* If the publisher is v16 or later, specify the format to copy data. */ +if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 16) +{ Few more comments on v7: 1. + Specifies the format in which pre-existing data on the publisher will + be copied to the subscriber. Supported formats are + text and binary. The default is + text. It'll be good for the documentation to call out the cases where copy_format can be enabled and where it needs to be disabled. 2. + errmsg("%s value should be either \"text\" or \"binary\"", How about "value must be either "? 3. +if (!opts->binary && +opts->copy_format == LOGICALREP_COPY_AS_BINARY) +{ +ereport(ERROR, +(errcode(ERRCODE_SYNTAX_ERROR), + errmsg("%s and %s are mutually exclusive options", +"binary = false", "copy_format = binary"))); +"CREATE SUBSCRIPTION tap_sub_binary CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_binary_slot, binary = true, copy_format = 'binary')" Why should the subscription's binary option and copy_format option be tied at all? Tying these two options hurts usability. Is there a fundamental reason? I think they both are/must be independent. One deals with data types and the other deals with how initial table data is copied. -- Bharath Rupireddy PostgreSQL Contributors Team
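For context, the subscription definition under discussion looks roughly like this; copy_format is the option proposed by the patch rather than an existing parameter, and the connection string is a placeholder.

```sql
CREATE SUBSCRIPTION tap_sub_binary
    CONNECTION 'host=publisher.example dbname=postgres'
    PUBLICATION tap_pub
    WITH (binary = true, copy_format = 'binary');
```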
Re: Assertion failure in SnapBuildInitialSnapshot()
On Mon, Jan 30, 2023 at 8:30 PM Amit Kapila wrote: > > On Fri, Jan 27, 2023 at 4:31 PM Hayato Kuroda (Fujitsu) > wrote: > > > > Thank you for making the patch! I'm still considering whether this approach > > is > > correct, but I can put a comment to your patch anyway. > > > > ``` > > - Assert(!already_locked || LWLockHeldByMe(ProcArrayLock)); > > - > > - if (!already_locked) > > - LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); > > + Assert(LWLockHeldByMe(ProcArrayLock)); > > ``` > > > > In this function, we regard that the ProcArrayLock has been already > > acquired as > > exclusive mode and modify data. I think LWLockHeldByMeInMode() should be > > used > > instead of LWLockHeldByMe(). > > > > Right, this is even evident from the comments atop > ReplicationSlotsComputeRequiredXmin("If already_locked is true, > ProcArrayLock has already been acquired exclusively.". Agreed, will fix in the next version patch. > But, I am not > sure if it is a good idea to remove 'already_locked' parameter, > especially in back branches as this is an exposed API. Yes, we should not remove the already_locked parameter in backbranches. So I was thinking of keeping it on back branches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
RE: Perform streaming logical transactions by background workers and parallel apply
Dear Hou, Thank you for updating the patch! I checked your replies and the new patch, and it looks good. Currently I have no comments. Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Fri, Jan 27, 2023 at 6:32 PM Önder Kalacı wrote: >> >> I suppose the options are: >> 1. use regular planner uniformly >> 2. use regular planner only when there's no replica identity (or >> configurable?) >> 3. only use low-level functions >> 4. keep using sequential scans for every single updated row >> 5. introduce a hidden logical row identifier in the heap that is guaranteed >> unique within a table and can be used as a replica identity when no unique >> index exists > > > One other option I considered was to explicitly ask the user for an index on the > subscriber side when REPLICA IDENTITY is FULL. But, it is a > pretty hard choice for any user; even the planner sometimes fails to pick the > right index :) Also, it is probably controversial to change any of the APIs > for this purpose? > I agree that it won't be a very convenient option for the user, but how about this: along with asking the user for an index, when the user didn't provide one we also allow making use of any unique index over a subset of the transmitted columns, and if there's more than one candidate index we pick any one. Additionally, we can allow disabling the use of an index scan for this particular case. If we are too worried about an API change for allowing users to specify the index then we can do that later or as a separate patch. > I'd be happy to hear from more experienced hackers on the trade-offs for the > above, and I'd be open to working on that if there is a clear winner. For me (3) > is a decent solution for the problem. From the discussion above it is not very clear that adding maintenance costs in this area is worth it, even though that can give better results as far as this feature is concerned. -- With Regards, Amit Kapila.
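For context, the scenario being discussed looks roughly like the following (table and index names are made up); whether the apply worker may use the subscriber-side index instead of a per-row sequential scan is exactly what the patch proposes.

```sql
-- Publisher: no unique index, so the whole row serves as the identity.
CREATE TABLE orders (order_id int, customer_id int, note text);
ALTER TABLE orders REPLICA IDENTITY FULL;

-- Subscriber: same table, plus a non-unique index that the patch would let
-- the apply worker use when locating rows for UPDATE/DELETE.
CREATE TABLE orders (order_id int, customer_id int, note text);
CREATE INDEX orders_order_id_idx ON orders (order_id);
```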
Re: JSONPath Child Operator?
Hi David, On 2022-11-10 21:55, David E. Wheeler wrote: My question: Are there plans to support square bracket syntax for JSON object field name strings like this? Or to update to follow the standard as it’s finalized? This syntax is a part of "jsonpath syntax extensions" patchset: https://www.postgresql.org/message-id/e0fe4f7b-da0b-471c-b3da-d8adaf314357%40postgrespro.ru -- Ph.
Re: dynamic result sets support in extended query protocol
On 2022-Nov-22, Peter Eisentraut wrote: > I added tests using the new psql \bind command to test this functionality in > the extended query protocol, which showed that this got broken since I first > wrote this patch. This "blame" is on the pipeline mode in libpq patch > (acb7e4eb6b1c614c68a62fb3a6a5bba1af0a2659). I need to spend more time on > this and figure out how to repair it. Applying this patch, your test queries seem to work correctly. This is quite WIP, especially because there's a couple of scenarios uncovered by tests that I'd like to ensure correctness about, but if you would like to continue adding tests for extended query and dynamic result sets, it may be helpful. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/) diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index ec62550e38..b530c51ccd 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -2109,19 +2109,19 @@ PQgetResult(PGconn *conn) break; case PGASYNC_READY: + res = pqPrepareAsyncResult(conn); /* -* For any query type other than simple query protocol, we advance -* the command queue here. This is because for simple query -* protocol we can get the READY state multiple times before the -* command is actually complete, since the command string can -* contain many queries. In simple query protocol, the queue -* advance is done by fe-protocol3 when it receives ReadyForQuery. +* When an error has occurred, we consume one command from the +* queue for each result we return. (Normally, the command would +* be consumed as each result is read from the server.) */ if (conn->cmd_queue_head && - conn->cmd_queue_head->queryclass != PGQUERY_SIMPLE) + (conn->error_result || +(conn->result != NULL && + conn->result->resultStatus == PGRES_FATAL_ERROR))) pqCommandQueueAdvance(conn); - res = pqPrepareAsyncResult(conn); + if (conn->pipelineStatus != PQ_PIPELINE_OFF) { /* diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c index 8ab6a88416..2ed74aa0f1 100644 --- a/src/interfaces/libpq/fe-protocol3.c +++ b/src/interfaces/libpq/fe-protocol3.c @@ -205,6 +205,10 @@ pqParseInput3(PGconn *conn) pqSaveErrorResult(conn); } } + if (conn->cmd_queue_head && + conn->cmd_queue_head->queryclass != PGQUERY_SIMPLE) + pqCommandQueueAdvance(conn); + if (conn->result) strlcpy(conn->result->cmdStatus, conn->workBuffer.data, CMDSTATUS_LEN); @@ -231,6 +235,7 @@ pqParseInput3(PGconn *conn) else { conn->pipelineStatus = PQ_PIPELINE_ON; + pqCommandQueueAdvance(conn); conn->asyncStatus = PGASYNC_READY; } } @@ -257,6 +262,7 @@ pqParseInput3(PGconn *conn) pqSaveErrorResult(conn); } } + /* XXX should we advance the command queue here? */ conn->asyncStatus = PGASYNC_READY; break; case '1': /* Parse Complete */ @@ -274,6 +280,7 @@ pqParseInput3(PGconn *conn) pqSaveErrorResult(conn); } } +
Re: HOT chain validation in verify_heapam()
Hi Hackers, On Sun, Jan 22, 2023 at 8:48 PM Himanshu Upadhyaya < upadhyaya.himan...@gmail.com> wrote: > > The test if (pred_in_progress || TransactionIdDidCommit(curr_xmin)) >> seems wrong to me. Shouldn't it be &&? Has this code been tested at >> all? It doesn't seem to have a test case. Some of these other errors >> don't, either. Maybe there's some that we can't easily test in an >> automated way, but we should test what we can. I guess maybe casual >> testing wouldn't reveal the problem here because of the recheck, but >> it's worrying to find logic that doesn't look right with no >> corresponding comments or test cases. >> >> This is totally my mistake, apologies for that. I will fix this in my > next patch. Regarding the missing test cases, I need one in-progress > transaction for these test cases to be included in 004_verify_heapam.pl, > but I can't find a clear way to have an in-progress transaction (given the > design of 004_verify_heapam.pl) that I can use in the test cases. I will > be doing more research on a solution to add these missing test cases. > >> >> I am trying to add test cases related to in-progress transactions in 004_verify_heapam.pl but I am not able to find a proper way to achieve this. We have logic that manually corrupts each tuple. Please refer to the code just after the below comment in 004_verify_heapam.pl "# Corrupt the tuples, one type of corruption per tuple. Some types of # corruption cause verify_heapam to skip to the next tuple without # performing any remaining checks, so we can't exercise the system properly if # we focus all our corruption on a single tuple." Before this we stop the node with "$node->stop;" and only then proceed to manual corruption. This aborts all running/in-progress transactions. So, if we create an in-progress transaction and comment out "$node->stop;", then somehow all the code that we have for manual corruption does not work. I think it is required to stop the server before proceeding with manual corruption? If this is the case then please suggest whether there is a way to get an in-progress transaction that we can use for manual corruption. -- Regards, Himanshu Upadhyaya EnterpriseDB: http://www.enterprisedb.com
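One possible avenue, offered only as a hedged sketch (it is not something settled in the thread, and the table and transaction names below are made up): a prepared transaction survives a clean shutdown and restart, so it could provide an xmin that is still in progress after $node->stop / $node->start, provided max_prepared_transactions > 0 in the test cluster. Whether that is close enough to the in-progress case the HOT-chain checks need to exercise would have to be confirmed by the patch author.

```sql
BEGIN;
INSERT INTO test_inprogress VALUES (1, 'tuple with in-progress xmin');
PREPARE TRANSACTION 'heapam_inprogress_xact';
-- After a clean restart, the inserted tuple's xmin remains in progress
-- until COMMIT PREPARED or ROLLBACK PREPARED is issued.
```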
RE: Time delayed LR (WAS Re: logical replication restrictions)
On Monday, January 30, 2023 7:05 PM I wrote: > On Saturday, January 28, 2023 1:28 PM I wrote: > > Attached the updated patch v24. > I've rebased on top of commit 1e8b61735c by renaming > the GUC to logical_replication_mode accordingly, because it's used in the > TAP test of this time-delayed LR feature. > There is no other change for this version. > > Kindly have a look at the attached v25. Hi, The v25 patch caused a failure on Windows in cfbot [1]. But the failure happened in the pg_upgrade tests, and the failure message looks the same as the one reported in the ongoing discussion at [2]. So it's an issue independent of v25. [1] - https://cirrus-ci.com/task/5484559622471680 [2] - https://www.postgresql.org/message-id/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de Best Regards, Takamichi Osumi
Re: Amcheck verification of GiST and GIN
Hi Andrey, > Thanks! I also found out that there was a CI complaint about amcheck.h > not including some necessary stuff. Here's a version with a fix for > that. Thanks for the updated patchset. One little nitpick I have is that the tests cover only cases where all the checks pass successfully. The tests don't show that the checks will fail if the indexes are corrupted. Usually we check this as well; see bd807be6 and other amcheck-related patches and commits. -- Best regards, Aleksander Alekseev
Re: Perform streaming logical transactions by background workers and parallel apply
On Mon, Jan 30, 2023 at 3:23 PM houzj.f...@fujitsu.com wrote: > > On Monday, January 30, 2023 12:13 PM Peter Smith > wrote: > > > > Here are my review comments for v88-0002. > > Thanks for your comments. > > > > > == > > General > > > > 1. > > The test cases are checking the log content but they are not checking for > > debug logs or untranslated elogs -- they are expecting a normal ereport LOG > > that might be translated. I’m not sure if that is OK, or if it is a > > potential problem. > > We have tests that check the ereport ERROR and ereport WARNING message(by > search for the ERROR or WARNING keyword for all the tap tests), so I think > checking the LOG should be fine. > > > == > > doc/src/sgml/config.sgml > > > > 2. > > On the publisher side, logical_replication_mode allows allows streaming or > > serializing changes immediately in logical decoding. When set to immediate, > > stream each change if streaming option (see optional parameters set by > > CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change. When set > > to buffered, the decoding will stream or serialize changes when > > logical_decoding_work_mem is reached. > > > > 2a. > > typo "allows allows" (Kuroda-san reported same) > > > > 2b. > > "if streaming option" --> "if the streaming option" > > Changed. > > > ~~~ > > > > 3. > > On the subscriber side, if streaming option is set to parallel, > > logical_replication_mode also allows the leader apply worker to send changes > > to the shared memory queue or to serialize changes. > > > > SUGGESTION > > On the subscriber side, if the streaming option is set to parallel, > > logical_replication_mode can be used to direct the leader apply worker to > > send changes to the shared memory queue or to serialize changes. > > Changed. > > > == > > src/backend/utils/misc/guc_tables.c > > > > 4. > > { > > {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS, > > - gettext_noop("Controls when to replicate each change."), > > - gettext_noop("On the publisher, it allows streaming or serializing each > > change in logical decoding."), > > + gettext_noop("Controls the internal behavior of logical replication > > publisher and subscriber"), > > + gettext_noop("On the publisher, it allows streaming or " > > + "serializing each change in logical decoding. On the " > > + "subscriber, in parallel streaming mode, it allows " > > + "the leader apply worker to serialize changes to " > > + "files and notifies the parallel apply workers to " > > + "read and apply them at the end of the transaction."), > > GUC_NOT_IN_SAMPLE > > }, > > Suggest re-wording the long description (subscriber part) to be more like > > the > > documentation text. > > > > BEFORE > > On the subscriber, in parallel streaming mode, it allows the leader apply > > worker > > to serialize changes to files and notifies the parallel apply workers to > > read and > > apply them at the end of the transaction. > > > > SUGGESTION > > On the subscriber, if the streaming option is set to parallel, it directs > > the leader > > apply worker to send changes to the shared memory queue or to serialize > > changes and apply them at the end of the transaction. > > > > Changed. > > Attach the new version patch which addressed all comments so far (the v88-0001 > has been committed, so we only have one remaining patch this time). > I have one comment on v89 patch: + /* +* Using 'immediate' mode returns false to cause a switch to +* PARTIAL_SERIALIZE mode so that the remaining changes will be serialized. 
+*/ + if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE) + return false; + Probably we might want to add unlikely() here since we could pass through this path very frequently? The rest looks good to me. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 30, 2023 at 1:31 PM John Naylor wrote: > > On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada wrote: > > > > On Sat, Jan 28, 2023 at 8:33 PM John Naylor > > wrote: > > > > The first implementation should be simple, easy to test/verify, easy to > > > understand, and easy to replace. As much as possible anyway. > > > > Yes, but if a concurrent writer waits for another process to finish > > the iteration, it ends up waiting on a lwlock, which is not > > interruptible. > > > > > > > > > So the idea is that we set iter_active to true (with the > > > > lock in exclusive mode), and prevent concurrent updates when the flag > > > > is true. > > > > > > ...by throwing elog(ERROR)? I'm not so sure users of this API would > > > prefer that to waiting. > > > > Right. I think if we want to wait rather than an ERROR, the waiter > > should wait in an interruptible way, for example, a condition > > variable. I did a simpler way in the v22 patch. > > > > ...but looking at dshash.c, dshash_seq_next() seems to return an entry > > while holding a lwlock on the partition. My assumption might be wrong. > > Using partitions there makes holding a lock less painful on average, I > imagine, but I don't know the details there. > > If we make it clear that the first committed version is not (yet) designed > for high concurrency with mixed read-write workloads, I think waiting (as a > protocol) is fine. If waiting is a problem for some use case, at that point > we should just go all the way and replace the locking entirely. In fact, it > might be good to spell this out in the top-level comment and include a link > to the second ART paper. Agreed. Will update the comments. > > > > [thinks some more...] Is there an API-level assumption that hasn't been > > > spelled out? Would it help to have a parameter for whether the iteration > > > function wants to reserve the privilege to perform writes? It could take > > > the appropriate lock at the start, and there could then be multiple > > > read-only iterators, but only one read/write iterator. Note, I'm just > > > guessing here, and I don't want to make things more difficult for future > > > improvements. > > > > Seems a good idea. Given the use case for parallel heap vacuum, it > > would be a good idea to support having multiple read-only writers. The > > iteration of the v22 is read-only, so if we want to support read-write > > iterator, we would need to support a function that modifies the > > current key-value returned by the iteration. > > Okay, so updating during iteration is not currently supported. It could in > the future, but I'd say that can also wait for fine-grained concurrency > support. Intermediate-term, we should at least make it straightforward to > support: > > 1) parallel heap vacuum -> multiple read-only iterators > 2) parallel heap pruning -> multiple writers > > It may or may not be worth it for someone to actually start either of those > projects, and there are other ways to improve vacuum that may be more > pressing. That said, it seems the tid store with global locking would > certainly work fine for #1 and maybe "not too bad" for #2. #2 can also > mitigate waiting by using larger batching, or the leader process could > "pre-warm" the tid store with zero-values using block numbers from the > visibility map. True. Using a larger batching method seems to be worth testing when we implement the parallel heap pruning. In the next version patch, I'm going to update the locking support part and incorporate other comments I got. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: cutting down the TODO list thread
On Mon, Jan 30, 2023 at 01:13:45PM +0700, John Naylor wrote: > On Tue, Jan 24, 2023 at 11:57 PM Bruce Momjian wrote: > > I think we just point them at the TODO list and they will read the top > > of the list first, no? I think you are right that we updated the top of > > the TODO but didn't update the places that link to it. I am thinking we > > should just trim down the text linking to it and let the top of the TODO > > list do its job. > > Okay. How about: > > "It's worth checking if the feature of interest is found in the TODO list on > our wiki: http://wiki.postgresql.org/wiki/TODO. The entries there often have > additional information about the feature and may point to reasons why it > hasn't > been implemented yet." Good. > > Wow, I would not send a new person to the SQL standard docs. ;-) I am > > thinking we just don't have a good answer to this so let's say less. > > Do I understand right that we could just remove this entire section "What > areas > need work?"? Yes, I think so. > > > 2) > > > from: > > > "What do I do after choosing an item to work on? > > > > > > Send an email to pgsql-hackers with a proposal for what you want to do > > > (assuming your contribution is not trivial). Working in isolation is not > > > advisable because others might be working on the same TODO item, or you > might > > > have misunderstood the TODO item. In the email, discuss both the internal > > > implementation method you plan to use, and any user-visible changes (new > > > syntax, etc)." > > > > > > to: > > > "What do I do after choosing an area to work on? > > > > > > Send an email to pgsql-hackers with a proposal for what you want to do > > > (assuming your contribution is not trivial). Working in isolation is not > > > > Can new people identify trivial? > > I'd say they have some idea about that, since we do regularly get typo fixes > and doc clarifications. Sure there is some grey area, but I don't think the > dividing point is important. The important thing is, we also sometimes get > large and invasive patches without design discussion, which we want to > discourage. Agreed. > > I can now see that just removing the [E] label totally is the right > > answer. Yes, there might be an easy item on there, but the fact we have > > three labeled and they are not easy makes me think [E] is causing more > > problems than it solves. > > Okay, having heard no objections I'll remove it. Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: Non-superuser subscription owners
On Fri, Jan 27, 2023 at 5:56 PM Mark Dilger wrote: > If the owner cannot modify the subscription, then the owner degenerates into > a mere "run-as" user. Note that this isn't how things work now, and even if > we disallowed owners from modifying the connection string, there would still > be other attributes the owner could modify, such as the set of publications > subscribed. The proposed patch blocks every form of ALTER SUBSCRIPTION if password_required false is set and you aren't a superuser. Is there some other DML command that could be used to modify the set of publications subscribed? > More generally, my thinking on this thread is that there need to be two > non-superuser roles: A higher privileged role which can create a subscription, > and a lower privileged role serving the "run-as" function. Those shouldn't > be the same, because the "run-as" concept doesn't logically need to have > subscription creation power, and likely *shouldn't* have that power. > Depending on which sorts of attributes a subscription object has, such as the > connection string, the answer differs for whether the owner/"run-as" user > should get to change those attributes. One advantage of Jeff's idea of using > a server object rather than a string is that it becomes more plausibly safe > to allow the subscription owner to make changes to that attribute of the > subscription. There's some question in my mind about what these different mechanisms are intended to accomplish. On a technical level, I think that the idea of having a separate object for the connection string vs. the subscription itself is perfectly sound, and to repeat what I said earlier, if someone wants to implement that, cool. I also agree that it has the advantage that you specify, namely, that someone can have rights to modify one of those objects but not the other. What that lets you do is define a short list of known systems and say, hey, you can replicate whatever tables you want with whatever options you want, but only between these systems. I'm not quite sure what problem that solves, though. From my point of view, the two things that the superuser is most likely to want to do are (1) control the replication setup themselves and delegate nothing to any non-superuser or (2) give a non-superuser pretty much complete control over replication with just enough restrictions to avoid letting them do things that would compromise security, such as hacking the local superuser account. In other words, I expect that delegation of the logical replication configuration is usually going to be all or nothing. Jeff's system allows for a situation where you want to delegate some stuff but not everything, and specifically where you want to delegate control over the subscription options and the tables being replicated, but not the connection strings. To me, that feels like a bit of an awkward configuration; I don't really understand in what situation that division of responsibility would be particularly useful. I trust that Jeff is proposing it because he knows of such a situation, but I don't know what it is. I feel like, even if I wanted to let people use some connection strings and not others, I'd probably want that control in some form other than listing a specific list of allowable connection strings -- I'd want to say things like "you have to use SSL" or "no connecting back to the local host," because that lets me enforce some general organizational policy without having to care specifically about how each subscription is being set up.
Unfortunately, I have even less of an idea about what the run-as concept is supposed to accomplish. I mean, at one level, I see it quite clearly: the user creating the subscription wants replication to have restricted privileges when it's running, and so they make the run-as user some role with fewer privileges than their own. Brilliant. But then I get stuck: against what kind of attack does that actually protect us? If I'm a high privilege user, perhaps even a superuser, and it's not safe to have logical replication running as me, then it seems like the security model of logical replication is fundamentally busted and we need to fix that. It can't be right to say that if you have 263 users in a database and you want to replicate the whole database to some other node, you need 263 different subscriptions with a different run-as user for each. You need to be able to run all of that logical replication as the superuser or some other high-privilege user and not end up with a security compromise. And if we suppose that that already works and is safe, well then what's the case where I do need a run-as user? -- Robert Haas EDB: http://www.enterprisedb.com
Re: JSONPath Child Operator?
On Jan 30, 2023, at 08:17, Filipp Krylov wrote: >> My question: Are there plans to support square bracket syntax for JSON >> object field name strings like this? Or to update to follow the standard as >> it’s finalized? > > This syntax is a part of "jsonpath syntax extensions" patchset: > https://www.postgresql.org/message-id/e0fe4f7b-da0b-471c-b3da-d8adaf314357%40postgrespro.ru Nice, thanks. I learned since sending this email that SQL/JSON Path is not at all the same as plain JSON Path, so now I’m less concerned about it. I like the new object subscript syntax, though; I’ve been thinking about this myself. D
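For reference, the accessors defined by standard SQL/JSON path already work in current releases, while the bracket-quoted member form is only part of the extensions patchset linked above.

```sql
-- Standard member and subscript accessors, available today:
SELECT jsonb_path_query('{"a": {"b": [10, 20]}}', '$.a.b[1]');   -- returns 20

-- Proposed extension syntax from the patchset above (illustration only):
-- SELECT jsonb_path_query('{"a b": 1}', '$["a b"]');
```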
Re: Timeline ID hexadecimal format
On 27/01/2023 15:55, Peter Eisentraut wrote: On 27.01.23 14:52, Sébastien Lardière wrote: The attached patch proposes to change the format of timelineid from %u to %X. I think your complaint has merit. But note that if we did a change like this, then log files or reports from different versions would have different meaning without a visual difference, which is kind of what you complained about in the first place. At least we should do something like 0x%X. Hi, Here's the patch with the suggested format; plus, I added a note in the documentation about recovery_target_timeline, because I don't get how strtoul(), with the special base-0 parameter, can work without the 0x prefix; I suppose that nobody uses it. I also changed pg_controldata and the usage of its output by pg_upgrade. I left internal usages unchanged: the content of the backup manifest and the content of the history file. Should I open a commitfest entry, or is it too soon? regards, -- Sébastien

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index be05a33205..7e26b51031 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1332,7 +1332,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
 you like, add comments to a history file to record your own notes about
 how and why this particular timeline was created. Such comments will be
 especially valuable when you have a thicket of different timelines as
-a result of experimentation.
+a result of experimentation. In both WAL segment file names and history files,
+the timeline ID number is expressed in hexadecimal.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cf53c74ea..508774cfee 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4110,7 +4110,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
 current when the base backup was taken. The
 value latest recovers
 to the latest timeline found in the archive, which is useful in
-a standby server. latest is the default.
+a standby server. A numerical value expressed in hexadecimal must be
+prefixed with 0x, for example 0x11.
+latest is the default.

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..bdbe993877 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 CheckPoint *checkpoint = (CheckPoint *) rec;

 appendStringInfo(buf, "redo %X/%X; "
- "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
+ "tli 0x%X; prev tli 0x%X; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; "
 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
 "oldest/newest commit timestamp xid: %u/%u; "
 "oldest running xid %u; %s",
@@ -135,7 +135,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 xl_end_of_recovery xlrec;

 memcpy(&xlrec, rec, sizeof(xl_end_of_recovery));
- appendStringInfo(buf, "tli %u; prev tli %u; time %s",
+ appendStringInfo(buf, "tli 0x%X; prev tli 0x%X; time %s",
 xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
 timestamptz_to_str(xlrec.end_time));
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bde..c22cf4b2a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7819,7 +7819,7 @@ xlog_redo(XLogReaderState *record)
 (void) GetCurrentReplayRecPtr(&replayTLI);
 if (checkPoint.ThisTimeLineID != replayTLI)
 ereport(PANIC,
- (errmsg("unexpected timeline ID %u (should be %u) in shutdown checkpoint record",
+ (errmsg("unexpected timeline ID 0x%X (should be 0x%X) in shutdown checkpoint record",
 checkPoint.ThisTimeLineID, replayTLI)));

 RecoveryRestartPoint(&checkPoint, record);
@@ -7906,7 +7906,7 @@ xlog_redo(XLogReaderState *record)
 (void) GetCurrentReplayRecPtr(&replayTLI);
 if (xlrec.ThisTimeLineID != replayTLI)
 ereport(PANIC,
- (errmsg("unexpected timeline ID %u (should be %u) in end-of-recovery record",
+ (errmsg("unexpected timeline ID 0x%X (should be 0x%X) in end-of-recovery record",
 xlrec.ThisTimeLineID, replayTLI)));
 }
 else if (info == XLOG_NOOP)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
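For anyone who wants to compare the two representations on an existing cluster, the timeline ID is already queryable, and to_hex() gives the form used in WAL segment file names; under the proposal a hexadecimal recovery_target_timeline would carry the 0x prefix.

```sql
-- Current timeline in decimal and in the hexadecimal form used by WAL file names.
SELECT timeline_id, to_hex(timeline_id) AS timeline_hex
FROM pg_control_checkpoint();

-- With the proposed patch (not current behavior):
--   recovery_target_timeline = '0x11'
```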
Re: Non-superuser subscription owners
> On Jan 30, 2023, at 7:44 AM, Robert Haas wrote: > > And if we suppose that > that already works and is safe, well then what's the case where I do > need a run-as user? A) Alice publishes tables, and occasionally adds new tables to existing publications. B) Bob manages subscriptions, and periodically runs "refresh publication". Bob also creates new subscriptions for people when a row is inserted into the "please create a subscription for me" table which Bob owns, using a trigger that Bob created on that table. C) Alice creates a "please create a subscription for me" table on the publishing database, adds lots of malicious requests, and adds that table to the publication. D) Bob replicates the table, fires the trigger, creates the malicious subscriptions, and starts replicating all that stuff, too. I think that having Charlie, not Bob, as the "run-as" user helps somewhere right around (D). — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: MacOS: xsltproc fails with "warning: failed to load external entity"
Aleksander Alekseev writes: >> I've found a solution: >> >> ``` >> export SGML_CATALOG_FILES=/usr/local/etc/xml/catalog >> export XMLLINT="xmllint --catalogs" >> export XSLTPROC="xsltproc --catalogs" >> ``` Hmm, there is no such directory on my Mac, and indeed this recipe does not work here. I tried to transpose it to MacPorts by substituting /opt/local/etc/xml/catalog, which does exist --- but the recipe still doesn't work. I believe what is actually failing is that http://docbook.sourceforge.net now redirects to https:, and the ancient xsltproc version provided by Apple doesn't do https. What you need to do if you want to use their xsltproc is install a local copy of the SGML catalog files and stylesheets, preferably in the place that xsltproc would look by default (/etc/xml/catalog seems to be the standard one). It would be good to document how to do that, but this patch doesn't do so. What we do actually have already is a recommendation to install appropriate MacPorts or Homebrew packages: https://www.postgresql.org/docs/devel/docguide-toolsets.html#DOCGUIDE-TOOLSETS-INST-MACOS and it works okay for me as long as I use MacPorts' version of xsltproc. regards, tom lane
Re: Making Vars outer-join aware
Richard Guo writes: > Sorry for the delayed reply. I don't have any more review comments at > the moment, except a nitpicking one. > In optimizer/README at line 729 there is a query as > SELECT * FROM a > LEFT JOIN (SELECT * FROM b WHERE b.z = 1) ss ON (a.x = b.y) > WHERE a.x = 1; > I think it should be > SELECT * FROM a > LEFT JOIN (SELECT * FROM b WHERE b.z = 1) ss ON (a.x = ss.y) > WHERE a.x = 1; Oh, good catch, thanks. > I have no objection to get it committed. I'll push forward then. Thanks for reviewing! regards, tom lane
Re: pub/sub - specifying optional parameters without values.
Peter Smith writes: > The v3 patch LGTM (just for the logical replication commands). Pushed then. regards, tom lane
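For context, the behavior in question is that boolean options in these commands may be written without a value, in which case the value is taken as true; for example (the connection string is a placeholder):

```sql
-- Equivalent to WITH (binary = true, streaming = true)
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher.example dbname=postgres'
    PUBLICATION mypub
    WITH (binary, streaming);
```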
Re: Deadlock between logrep apply worker and tablesync worker
On Mon, 30 Jan 2023 at 17:30, vignesh C wrote: > > On Mon, 30 Jan 2023 at 13:00, houzj.f...@fujitsu.com > wrote: > > > > On Monday, January 30, 2023 2:32 PM Amit Kapila > > wrote: > > > > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C wrote: > > > > > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila > > > > wrote: > > > > > > > > > > One thing that looks a bit odd is that we will anyway have a similar > > > > > check in replorigin_drop_guts() which is a static function and > > > > > called from only one place, so, will it be required to check at both > > > > > places? > > > > > > > > There is a possibility that the initial check to verify if replication > > > > origin exists in replorigin_drop_by_name was successful but later one > > > > of either table sync worker or apply worker process might have dropped > > > > the replication origin, > > > > > > > > > > Won't locking on the particular origin prevent concurrent drops? IIUC, the > > > drop happens after the patch acquires the lock on the origin. > > > > Yes, I think the existence check in replorigin_drop_guts is unnecessary as > > we > > already lock the origin before that. I think the check in > > replorigin_drop_guts > > is a custom check after calling SearchSysCache1 to get the tuple, but the > > error > > should not happen as no concurrent drop can be performed. > > > > To make it simpler, one idea is to move the code that getting the tuple from > > system cache to the replorigin_drop_by_name(). After locking the origin, we > > can try to get the tuple and do the existence check, and we can reuse > > this tuple to perform origin delete. In this approach we only need to check > > origin existence once after locking. BTW, if we do this, then we'd better > > rename the > > replorigin_drop_guts() to something like replorigin_state_clear() as the > > function > > only clear the in-memory information after that. > > > > The code could be like: > > > > --- > > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) > > ... > > /* > > * Lock the origin to prevent concurrent drops. We keep the lock > > until the > > * end of transaction. > > */ > > LockSharedObject(ReplicationOriginRelationId, roident, 0, > > AccessExclusiveLock); > > > > tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); > > if (!HeapTupleIsValid(tuple)) > > { > > if (!missing_ok) > > elog(ERROR, "cache lookup failed for replication > > origin with ID %d", > > roident); > > > > return; > > } > > > > replorigin_state_clear(rel, roident, nowait); > > > > /* > > * Now, we can delete the catalog entry. > > */ > > CatalogTupleDelete(rel, &tuple->t_self); > > ReleaseSysCache(tuple); > > > > CommandCounterIncrement(); > > The attached updated patch has the changes to handle the same. I had not merged one of the local changes that was present, please find the updated patch including that change. Sorry for missing that change. Regards, Vignesh From 5ed769a1ed3b25e0a19ad4b235df8a4140870635 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Fri, 27 Jan 2023 15:17:09 +0530 Subject: [PATCH v3] Lock the replication origin record instead of locking the pg_replication_origin relation. Lock the replication origin record instead of locking the pg_replication_origin relation. 
--- src/backend/replication/logical/origin.c | 57 1 file changed, 28 insertions(+), 29 deletions(-) diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c index b754c43840..5837818ecf 100644 --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -338,16 +338,14 @@ replorigin_create(const char *roname) * Helper function to drop a replication origin. */ static void -replorigin_drop_guts(Relation rel, RepOriginId roident, bool nowait) +replorigin_state_clear(RepOriginId roident, bool nowait) { - HeapTuple tuple; int i; /* - * First, clean up the slot state info, if there is any matching slot. + * Clean up the slot state info, if there is any matching slot. */ restart: - tuple = NULL; LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE); for (i = 0; i < max_replication_slots; i++) @@ -402,19 +400,6 @@ restart: } LWLockRelease(ReplicationOriginLock); ConditionVariableCancelSleep(); - - /* - * Now, we can delete the catalog entry. - */ - tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); - if (!HeapTupleIsValid(tuple)) - elog(ERROR, "cache lookup failed for replication origin with ID %d", - roident); - - CatalogTupleDelete(rel, &tuple->t_self); - ReleaseSysCache(tuple); - - CommandCounterIncrement(); } /* @@ -427,24 +412,38 @@ replorigin_drop_by_name(const char *name, boo
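For anyone wanting to exercise the reworked code path from SQL, the function below ends up in replorigin_drop_by_name(); as I understand the thread, the conflicting drops in the reported deadlock come from the apply and tablesync workers doing the equivalent internally.

```sql
SELECT pg_replication_origin_create('test_origin');
-- With the patch, a concurrent drop of the same origin waits on the
-- per-origin lock rather than on a lock on the whole pg_replication_origin
-- catalog.
SELECT pg_replication_origin_drop('test_origin');
```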
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
> What compiler / version / flags / OS did you try? I am running experiments on a machine with: - Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz - Ubuntu 18.04.6 LTS - LLVM/Clang 15.0.6 (built from source) These are the flags I am using: CFLAGS = -O3 -fuse-ld=lld -gline-tables-only -fprofile-instr-generate LDFLAGS = -fuse-ld=lld -Wl,-q > FWIW, I've experimented with LTO and PGO a bunch, both with gcc and clang. I > did hit a crash in gcc, but that did turn out to be a compiler bug, and > actually reduced to something not even needing LTO. Good to hear that it works. I just need to figure out what is going wrong on my end then. > I saw quite substantial speedups with PGO, but I only tested very specific > workloads. IIRC it was >15% gain in concurrent readonly pgbench. I successfully applied PGO alone and obtained similar gains with TPC-C & TPC-H workloads. > I dimly recall failing to get some benefit out of bolt for some reason that I > unfortunately don't even vaguely recall. I got gains similar to, or slightly higher than, PGO with BOLT, but not for all queries in TPC-H. In fact, I observed small (2-4%) regressions with BOLT. -- João Paulo L. de Carvalho Ph.D Computer Science | IC-UNICAMP | Campinas, SP - Brazil Postdoctoral Research Fellow | University of Alberta | Edmonton, AB - Canada joao.carva...@ic.unicamp.br joao.carva...@ualberta.ca
Re: Non-superuser subscription owners
On Mon, Jan 30, 2023 at 11:11 AM Mark Dilger wrote: > > On Jan 30, 2023, at 7:44 AM, Robert Haas wrote: > > > > And if we suppose that > > that already works and is safe, well then what's the case where I do > > need a run-as user? > > A) Alice publishes tables, and occasionally adds new tables to existing > publications. > > B) Bob manages subscriptions, and periodically runs "refresh publication". > Bob also creates new subscriptions for people when a row is inserted into the > "please create a subscription for me" table which Bob owns, using a trigger > that Bob created on that table. > > C) Alice creates a "please create a subscription for me" table on the > publishing database, adds lots of malicious requests, and adds that table to > the publication. > > D) Bob replicates the table, fires the trigger, creates the malicious > subscriptions, and starts replicating all that stuff, too. > > I think that having Charlie, not Bob, as the "run-as" user helps somewhere > right around (D). I suppose it does, but I have some complaints. First, it doesn't seem to make a lot of sense to have one person managing the publications and someone else managing the subscriptions, and especially if those parties are mutually untrusting. I can't think of any real reason to set things up that way. Sure, you could, but why would you? You could, equally, decide that one member of your household was going to decide what's for dinner every night, and some other member of your household was going to decide what gets purchased at the grocery store each week. If those two people exercise their responsibilities without tight coordination, or with hostile intent toward each other, things are going to go badly, but that's not an argument for putting a combination lock on the flour canister. It's an argument for getting along better, or not having such a dumb system in the first place. I don't quite see how the situation you postulate in (A) and (B) is any different. Publications and subscriptions are as closely connected as food purchases and meals. The point of a publication is for it to connect up to a subscription. In what circumstances would be it be reasonable to give responsibility for those objects to different and especially mutually untrusting users? Second, in step (B), we may ask why Bob is doing this with a trigger. If he's willing to create any subscription for which Alice asks, we could have just given Alice the authority to do those actions herself. Presumably, therefore, Bob is willing to create some subscriptions for which Alice may ask and not others. Perhaps this whole arrangement is just a workaround for the lack of a sensible system for controlling which connection strings Alice can use, in which case what is really needed here might be something like the separate connection object which Jeff postulated or my idea of a reverse pg_hba.conf. That kind of approach would give a better user interface to Alice, who wouldn't have to rephrase all of her CREATE SUBSCRIPTION commands as insert statements. Conversely, if Alice and Bob are truly dedicated to this convoluted system of creating subscriptions, then Bob needs to put logic into his trigger that's smart enough to block any malicious requests that Alice may make. He really brought this problem on himself by not doing that. Third, in step (C), it seems to me that whoever set up Alice's permissions has really messed up. 
Either the schema Bob is using for his create-me-a-subscription table exists on the primary and Alice has permission to create tables in that schema, or else that schema does not exist on the primary and Alice has permission to create it. Either way, that's a bad setup. Bob's table should be located in a schema for which Alice has only USAGE permissions and shouldn't have excess permissions on the table, either. Then this step can't happen. This step could also be blocked if, instead of using a table with a trigger, Bob wrote a security definer function or procedure and granted EXECUTE permission on that function or procedure to Alice. He's still going to need sanity checks, though, and if the function or procedure inserts into a logging table or something, he'd better make sure that table is adequately secured rather than being, say, a table owned by Alice with malicious triggers on it. So basically this doesn't really feel like a valid scenario to me. We're supposed to believe that Alice is hostile to Bob, but the superuser doesn't seem to have thought very carefully about how Bob is supposed to defend himself against Alice, and Bob doesn't even seem to be trying. Maybe we should rename the users to Samson and Delilah? :-) -- Robert Haas EDB: http://www.enterprisedb.com
Re: meson: Optionally disable installation of test modules
Hi, On 2023-01-30 08:37:42 +0100, Peter Eisentraut wrote: > One open issue (IMO) with the meson build system is that it installs the > test modules under src/test/modules/ as part of a normal installation. This > is because there is no way to set up the build system to install extra > things only when told. I think we still need a way to disable this somehow, > so that building a production installation doesn't end up with a bunch of > extra files. > > The attached simple patch is a starting point for discussion. It just > disables the subdirectory src/test/modules/ based on some Boolean setting. > This could be some new top-level option, or maybe part of PG_TEST_EXTRA, or > something else? With this, I get an identical set of installed files from > meson. I imagine this option would be false by default and developers would > enable it. Bilal, with a bit of help from me, worked on an alternative approach to this. It's a lot more verbose in the initial change, but wouldn't increase the amount of work/lines for new test modules. The main advantage is that we wouldn't have to disable the modules by default, which I think would be quite likely to result in plenty of people not running the tests. Sending a link instead of attaching, in case you already registered a cfbot entry: https://github.com/anarazel/postgres/commit/d1d192a860da39af9aa63d7edf643eed07c4 Probably worth adding an install-test-modules target for manual use. Greetings, Andres Freund
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
Hi, On 2023-01-30 10:24:02 -0700, João Paulo Labegalini de Carvalho wrote: > > What compiler / version / flags / OS did you try? > > > > I am running experiment on a machine with: > >- Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz >- Ubuntu 18.04.6 LTS >- LLVM/Clang 15.0.6 (build from source) > > These are the flags I am using: > > CFLAGS = -O3 -fuse-ld=lld -gline-tables-only -fprofile-instr-generate > LDFLAGS = -fuse-ld=lld -Wl,-q For some reason my notes for using LTO include changing RANLIB to point to gcc/llvm-ranlib of the appropriate version. Won't even be used on HEAD, but before that it can make a difference. Depending on how you built clang, it could be that the above recipe ends up using the system lld, which might be too old. What are the crashes you're getting? Greetings, Andres Freund
Re: Add n_tup_newpage_upd to pg_stat table views
On Fri, Jan 27, 2023 at 6:55 PM Andres Freund wrote: > Hi, > > On 2023-01-27 18:23:39 -0500, Corey Huinker wrote: > > This patch adds the n_tup_newpage_upd to all the table stat views. > > > > Just as we currently track HOT updates, it should be beneficial to track > > updates where the new tuple cannot fit on the existing page and must go to > > a different one. > > I like that idea. > > > I wonder if it's quite detailed enough - we can be forced to do out-of-page > updates because we actually are out of space, or because we reach the max > number of line pointers we allow in a page. HOT pruning can't remove dead > line > pointers, so that doesn't necessarily help. > I must be missing something; I only see the check for running out of space, not the check for exhausting line pointers. I agree dividing them would be interesting. > Similarly, it's a bit sad that we can't distinguish between the number of > potential-HOT out-of-page updates and the other out-of-page updates. But > that'd mean even more counters. > I wondered that too, but the combinations of "would have been HOT but there was no space" and "key update suggested not-HOT but it was id=id so today's your lucky HOT" started to get away from me. I wondered if there was interest in knowing whether the tuple had to get TOASTed, and further wondered if we would be interested in the number of updates that had to wait on a lock. Again, more counters.
Re: Add n_tup_newpage_upd to pg_stat table views
Hi, On 2023-01-30 13:40:15 -0500, Corey Huinker wrote: > I must be missing something, I only see the check for running out of space, > not the check for exhausting line pointers. I agree dividing them would be > interesting. See PageGetHeapFreeSpace(), particularly the header comment and the MaxHeapTuplesPerPage check. > > Similarly, it's a bit sad that we can't distinguish between the number of > > potential-HOT out-of-page updates and the other out-of-page updates. But > > that'd mean even more counters. > > I wondered that too, but the combinations of "would have been HOT but not > no space" and "key update suggested not-HOT but it was id=id so today's > your lucky HOT" combinations started to get away from me. Not sure I follow the second part. Are you just worried about explaining when a HOT update is possible? > I wondered if there was interest in knowing if the tuple had to get > TOASTed, and further wondered if we would be interested in the number of > updates that had to wait on a lock. Again, more counters. Those seem a lot less actionable / related to the topic at hand to me. Greetings, Andres Freund
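Assuming the counter keeps the name proposed in the patch, monitoring it would look just like the existing HOT-update counters:

```sql
-- n_tup_newpage_upd is the proposed column; the other columns exist today.
SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_newpage_upd
FROM pg_stat_user_tables
ORDER BY n_tup_newpage_upd DESC
LIMIT 10;
```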
Re: Non-superuser subscription owners
> On Jan 30, 2023, at 9:26 AM, Robert Haas wrote: > > First, it doesn't seem to make a lot of sense to have one person > managing the publications and someone else managing the subscriptions, > and especially if those parties are mutually untrusting. I can't think > of any real reason to set things up that way. Sure, you could, but why > would you? You could, equally, decide that one member of your > household was going to decide what's for dinner every night, and some > other member of your household was going to decide what gets purchased > at the grocery store each week. If those two people exercise their > responsibilities without tight coordination, or with hostile intent > toward each other, things are going to go badly, but that's not an > argument for putting a combination lock on the flour canister. It's an > argument for getting along better, or not having such a dumb system in > the first place. I don't quite see how the situation you postulate in > (A) and (B) is any different. Publications and subscriptions are as > closely connected as food purchases and meals. The point of a > publication is for it to connect up to a subscription. I have a grim view of the requirement that publishers and subscribers trust each other. Even when they do trust each other, they can firewall attacks by acting as if they do not. > In what > circumstances would be it be reasonable to give responsibility for > those objects to different and especially mutually untrusting users? When public repositories of data, such as the IANA whois database, publish their data via postgres publications. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Introduce "log_connection_stages" setting.
Hello, hackers Thank you for the reviews. I've modified the patch to incorporate your suggestions: + flag bits are now used to encode different connection stages + failing tests are now fixed. It was not a keyword issue but rather "check_log_connection_messages" not allocating memory properly in the special case log_connection_messages = 'all' + the GUC option is now only GUC_LIST_INPUT + typo fixes and line rewrapping in the docs Regards, Sergey From 4bf99bccb4b278188dbc679f00d506cd35b025f5 Mon Sep 17 00:00:00 2001 From: Sergey Dudoladov Date: Tue, 8 Nov 2022 18:56:26 +0100 Subject: [PATCH] Introduce 'log_connection_messages' This patch removes 'log_connections' and 'log_disconnections' in favor of 'log_connection_messages', thereby introducing incompatibility Author: Sergey Dudoladov Reviewed-by: Discussion: https://www.postgresql.org/message-id/flat/CAA8Fd-qCB96uwfgMKrzfNs32mqqysi53yZFNVaRNJ6xDthZEgA%40mail.gmail.com --- doc/src/sgml/config.sgml | 89 +-- src/backend/commands/variable.c | 77 src/backend/libpq/auth.c | 5 +- src/backend/postmaster/postmaster.c | 5 +- src/backend/tcop/postgres.c | 11 ++- src/backend/utils/init/postinit.c | 2 +- src/backend/utils/misc/guc_tables.c | 29 +++--- src/backend/utils/misc/postgresql.conf.sample | 5 +- src/include/postgres.h| 9 ++ src/include/postmaster/postmaster.h | 1 - src/include/utils/guc_hooks.h | 2 + src/test/authentication/t/001_password.pl | 2 +- src/test/authentication/t/003_peer.pl | 2 +- src/test/kerberos/t/001_auth.pl | 2 +- src/test/ldap/t/001_auth.pl | 2 +- src/test/recovery/t/013_crash_restart.pl | 2 +- src/test/recovery/t/022_crash_temp_files.pl | 2 +- src/test/recovery/t/032_relfilenode_reuse.pl | 2 +- src/test/ssl/t/SSL/Server.pm | 2 +- src/tools/ci/pg_ci_base.conf | 3 +- 20 files changed, 181 insertions(+), 73 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 1cf53c74ea..8b60e814e9 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -140,7 +140,7 @@ An example of what this file might look like is: # This is a comment -log_connections = yes +log_connection_messages = all log_destination = 'syslog' search_path = '"$user", public' shared_buffers = 128MB @@ -335,7 +335,7 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter passed to the postgres command via the -c command-line parameter. For example, -postgres -c log_connections=yes -c log_destination='syslog' +postgres -c log_connection_messages=all -c log_destination='syslog' Settings provided in this way override those set via postgresql.conf or ALTER SYSTEM, @@ -7085,23 +7085,74 @@ local0.*/var/log/postgresql - - log_connections (boolean) + + log_connection_messages (string) - log_connections configuration parameter + log_connection_messages configuration parameter -Causes each attempted connection to the server to be logged, -as well as successful completion of both client authentication (if -necessary) and authorization. +Causes stages of each attempted connection to the server to be logged. Example: authorized,disconnected. +The default is the empty string meaning nothing is logged. Only superusers and users with the appropriate SET privilege can change this parameter at session start, and it cannot be changed at all within a session. -The default is off. + + Connection messages + + + + + +Name +Description + + + + +received +Logs receipt of a connection. At this point a connection has been +received, but no further work has been done: Postgres is about to start +interacting with a client. 
+ + + +authenticated +Logs the original identity that an authentication method employs +to identify a user. In most cases, the identity string matches the +PostgreSQL username, but some third-party authentication methods may +alter the original user identifier before the server stores it. Failed +authentication is always logged regardless of the value of this setting. + + + + +authorized +Logs successful completion of authorization. At this point the +connection has been fully established. + + + + +disconnected +Logs session termination. The log output provides information +similar to authorized, plus the duration of the
Re: MacOS: xsltproc fails with "warning: failed to load external entity"
Hi Tom, Thanks for the feedback. > Hmm, there is no such directory on my Mac, and indeed this recipe > does not work here. I tried to transpose it to MacPorts by > substituting /opt/local/etc/xml/catalog, which does exist --- but > the recipe still doesn't work. Well, that's a bummer. > What we do actually have already is a recommendation to install > appropriate MacPorts or Homebrew packages: > > https://www.postgresql.org/docs/devel/docguide-toolsets.html#DOCGUIDE-TOOLSETS-INST-MACOS > > and it works okay for me as long as I use MacPorts' version of xsltproc. Unfortunately it doesn't work for Homebrew anymore and there seems to be only one xsltproc in the system. > I believe what is actually failing is that http://docbook.sourceforge.net > now redirects to https:, and the ancient xsltproc version provided by > Apple doesn't do https. What you need to do if you want to use their > xsltproc is install a local copy of the SGML catalog files and > stylesheets, preferably in the place that xsltproc would look by default > (/etc/xml/catalog seems to be the standard one). It would be good to > document how to do that, but this patch doesn't do so. Fair enough. I would appreciate it if you could help figuring out how to do this for MacPorts, since I'm not a MacPorts user. I'll figure out how to do this for Homebrew. Does something like: ``` ln -s /opt/local/etc/xml/catalog /etc/xml/catalog ``` ... work for you? Does your: ``` xsltproc --help ``` ... also say that it uses /etc/xml/catalog path by default? -- Best regards, Aleksander Alekseev
Re: Non-superuser subscription owners
> On Jan 30, 2023, at 9:26 AM, Robert Haas wrote: > > So basically this doesn't really feel like a valid scenario to me. > We're supposed to believe that Alice is hostile to Bob, but the > superuser doesn't seem to have thought very carefully about how Bob is > supposed to defend himself against Alice, and Bob doesn't even seem to > be trying. Maybe we should rename the users to Samson and Delilah? :-) No, Atahualpa and Pizarro. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote: > After a tiny bit further polishing, and after separately pushing a resource > leak fix for walrcv_connect(), I pushed this. My colleague Robins Tharakan (CC'd) noticed crashes when testing recent commits, and he traced it back to e460248. From this information, I discovered the following typo: diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 8982d623d3..78a8bcee6e 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS) else { if (pconn->conn) -libpqsrv_disconnect(conn); +libpqsrv_disconnect(pconn->conn); pconn->conn = conn; } -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Non-superuser subscription owners
On Mon, Jan 30, 2023 at 1:46 PM Mark Dilger wrote: > I have a grim view of the requirement that publishers and subscribers trust > each other. Even when they do trust each other, they can firewall attacks by > acting as if they do not. I think it's OK if the CREATE PUBLICATION user doesn't particularly trust the CREATE SUBSCRIPTION user, because the publication is just a grouping of tables to which somebody can pay attention or not. The CREATE PUBLICATION user isn't compromised either way. But, at least as things stand, I don't see how the CREATE SUBSCRIPTION user can get away with not trusting the CREATE PUBLICATION user. CREATE SUBSCRIPTION provides no tools at all for filtering the data that the publisher chooses to send. Now that can be changed, I suppose, and a run-as user would be one way to make progress in that direction. But I'm not sure how viable that is, because... > > In what > > circumstances would it be reasonable to give responsibility for > > those objects to different and especially mutually untrusting users? > > When public repositories of data, such as the IANA whois database, publish > their data via postgres publications. ... for that to work, IANA would need to set up the database so that untrusted parties can create logical replication slots on their PostgreSQL server. And I think that granting REPLICATION privilege on your database to random people on the Internet is not really viable, nor intended to be viable. As the CREATE ROLE documentation says, "A role having the REPLICATION attribute is a very highly privileged role." Concretely, this kind of setup would have the problem that you could kill the IANA database by just creating a replication slot and then not using it (or replicating from it only very very slowly). Eventually, the replication slot would either hold back xmin enough that you got a lot of bloat, or cause enough WAL to be retained that you ran out of disk space. Maybe you could protect yourself against that kind of problem by cutting off users who get too far behind, but that also cuts off people who just have an outage for longer than your cutoff. Also, anyone who can connect to a replication slot can also connect to any other replication slot, and drop any replication slot. So if IANA did grant REPLICATION privilege to random people on the Internet, one of them could jump into the system and screw things up for all the others. This kind of setup just doesn't seem viable to me. -- Robert Haas EDB: http://www.enterprisedb.com
Re: recovery modules
On Mon, Jan 30, 2023 at 04:51:38PM +0900, Michael Paquier wrote: > Now, I find this part, where we use a double pointer to allow the > module initialization to create and give back a private area, rather > confusing, and I think that this could be bug-prone, as well. Once > you incorporate some data within the set of callbacks, isn't this > stuff close to a "state" data, or just something that we could call > only an "ArchiveModule"? Could it make more sense to have > _PG_archive_module_init return a structure with everything rather than > a separate in/out argument? Here is the idea, simply: > typedef struct ArchiveModule { > ArchiveCallbacks *routines; > void *private_data; > /* Potentially more here, like some flags? */ > } ArchiveModule; Yeah, we could probably invent an ArchiveModuleContext struct. I think this is similar to how LogicalDecodingContext is used. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Making background psql nicer to use in tap tests
Hi, Plenty tap tests require a background psql. But they're pretty annoying to use. I think the biggest improvement would be an easy way to run a single query and get the result of that query. Manually having to pump_until() is awkward and often leads to hangs/timeouts, instead of test failures, because one needs to specify a match pattern to pump_until(), which on mismatch leads to trying to keep pumping forever. It's annoyingly hard to wait for the result of a query in a generic way with background_psql(), and more generally for psql. background_psql() uses -XAtq, which means that we'll not get "status" output (like "BEGIN" or "(1 row)"), and that queries not returning anything are completely invisible. A second annoyance is that issuing a query requires a trailing newline, otherwise psql won't process it. The best way I can see is to have a helper that issues the query, followed by a trailing newline, an \echo with a recognizable separator, and then uses pump_until() to wait for that separator. Another area worthy of improvement is that background_psql() requires passing in many variables externally - without a recognizable benefit afaict. What's the point in 'stdin', 'stdout', 'timer' being passed in? stdin/stdout need to point to empty strings, so we know what's needed - in fact we'll even reset them if they're passed in. The timer is always going to be PostgreSQL::Test::Utils::timeout_default, so again, what's the point? I think it'd be far more usable if we made background_psql() return a hash with the relevant variables. The 031_recovery_conflict.pl test has: my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default); my %psql_standby = ('stdin' => '', 'stdout' => ''); $psql_standby{run} = $node_standby->background_psql($test_db, \$psql_standby{stdin}, \$psql_standby{stdout}, $psql_timeout); $psql_standby{stdout} = ''; How about just returning a reference to a hash like that? Except that I'd also make stderr available, which one can't currently access. The $psql_standby{stdout} = ''; is needed because background_psql() leaves a banner in the output, which it shouldn't, but we probably should just fix that. Brought to you by: Trying to write a test for vacuum_defer_cleanup_age. - Andres
Re: recovery modules
Hi, On 2023-01-30 16:51:38 +0900, Michael Paquier wrote: > On Fri, Jan 27, 2023 at 10:27:29PM -0800, Nathan Bossart wrote: > > Here is a work-in-progress patch set for adjusting the archive modules > > interface. Is this roughly what you had in mind? > > I have been catching up with what is happening here. I can get > behind the idea to use the term "callbacks" vs "context" for clarity, > to move the callback definitions into their own header, and to add > extra arguments to the callback functions for some private data. > > -void > -_PG_archive_module_init(ArchiveModuleCallbacks *cb) > +const ArchiveModuleCallbacks * > +_PG_archive_module_init(void **arg) > { > AssertVariableIsOfType(&_PG_archive_module_init, ArchiveModuleInit); > > - cb->check_configured_cb = basic_archive_configured; > - cb->archive_file_cb = basic_archive_file; > + (*arg) = (void *) AllocSetContextCreate(TopMemoryContext, > + "basic_archive", > + ALLOCSET_DEFAULT_SIZES); > + > + return &basic_archive_callbacks; > Now, I find this part, where we use a double pointer to allow the > module initialization to create and give back a private area, rather > confusing, and I think that this could be bug-prone, as well. I don't think _PG_archive_module_init() should actually allocate a memory context and do other similar initializations. Instead it should just return 'const ArchiveModuleCallbacks*', typically a single line. Allocations etc should happen in one of the callbacks. That way we can actually have multiple instances of a module. > Once > you incorporate some data within the set of callbacks, isn't this > stuff close to a "state" data, or just something that we could call > only an "ArchiveModule"? Could it make more sense to have > _PG_archive_module_init return a structure with everything rather than > a separate in/out argument? Here is the idea, simply: > typedef struct ArchiveModule { > ArchiveCallbacks *routines; > void *private_data; > /* Potentially more here, like some flags? */ > } ArchiveModule; I don't like this much. This still basically ends up with the module callbacks not being sufficient to instantiate an archive module. Greetings, Andres Freund
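To make the shape Andres is suggesting concrete, here is a minimal sketch of a module following that pattern. It is illustrative only: the my_* names are hypothetical, the header path assumes the proposed move of the callback definitions into their own file, and the callbacks are shown without the private-data argument still under discussion.
```c
#include "postgres.h"
#include "fmgr.h"
#include "archive/archive_module.h"	/* assumed location of the callback definitions */

PG_MODULE_MAGIC;

static bool
my_check_configured(void)
{
	/* a real module would verify its GUCs here */
	return true;
}

static bool
my_archive_file(const char *file, const char *path)
{
	/* a real module would durably copy the WAL segment at "path" here */
	return true;
}

/* All callbacks are known at compile time, so the struct can be const. */
static const ArchiveModuleCallbacks my_callbacks = {
	.check_configured_cb = my_check_configured,
	.archive_file_cb = my_archive_file,
};

/* The init function becomes a single statement with no side effects. */
const ArchiveModuleCallbacks *
_PG_archive_module_init(void)
{
	return &my_callbacks;
}
```
Keeping _PG_archive_module_init free of allocations and other side effects is what leaves the door open to multiple instances of the same module, which is the property being argued for above.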
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
Hi, On 2023-01-30 11:30:08 -0800, Nathan Bossart wrote: > On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote: > > After a tiny bit further polishing, and after separately pushing a resource > > leak fix for walrcv_connect(), I pushed this. > > My colleague Robins Tharakan (CC'd) noticed crashes when testing recent > commits, and he traced it back to e460248. From this information, I > discovered the following typo: > > diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c > index 8982d623d3..78a8bcee6e 100644 > --- a/contrib/dblink/dblink.c > +++ b/contrib/dblink/dblink.c > @@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS) > else > { > if (pconn->conn) > -libpqsrv_disconnect(conn); > +libpqsrv_disconnect(pconn->conn); > pconn->conn = conn; > } Ugh. Good catch. Why don't the dblink tests catch this? Any chance you or Robins could prepare a patch with fix and test, given that you know how to trigger this? Greetings, Andres Freund
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote: > Why don't the dblink tests catch this? Any chance you or Robins could prepare > a patch with fix and test, given that you know how to trigger this? It's trivially reproducible by calling 1-argument dblink_connect() multiple times and then calling dblink_disconnect(). Here's a patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 8982d623d3..78a8bcee6e 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS) else { if (pconn->conn) - libpqsrv_disconnect(conn); + libpqsrv_disconnect(pconn->conn); pconn->conn = conn; } diff --git a/contrib/dblink/expected/dblink.out b/contrib/dblink/expected/dblink.out index 14d015e4d5..0f5050b409 100644 --- a/contrib/dblink/expected/dblink.out +++ b/contrib/dblink/expected/dblink.out @@ -938,6 +938,25 @@ DROP SERVER fdtest; ALTER FOREIGN DATA WRAPPER dblink_fdw OPTIONS (nonexistent 'fdw'); ERROR: invalid option "nonexistent" HINT: There are no valid options in this context. +-- test repeated calls to dblink_connect +SELECT dblink_connect(connection_parameters()); + dblink_connect + + OK +(1 row) + +SELECT dblink_connect(connection_parameters()); + dblink_connect + + OK +(1 row) + +SELECT dblink_disconnect(); + dblink_disconnect +--- + OK +(1 row) + -- test asynchronous notifications SELECT dblink_connect(connection_parameters()); dblink_connect diff --git a/contrib/dblink/sql/dblink.sql b/contrib/dblink/sql/dblink.sql index e560260bfc..7870ce5d5a 100644 --- a/contrib/dblink/sql/dblink.sql +++ b/contrib/dblink/sql/dblink.sql @@ -489,6 +489,11 @@ DROP SERVER fdtest; -- should fail ALTER FOREIGN DATA WRAPPER dblink_fdw OPTIONS (nonexistent 'fdw'); +-- test repeated calls to dblink_connect +SELECT dblink_connect(connection_parameters()); +SELECT dblink_connect(connection_parameters()); +SELECT dblink_disconnect(); + -- test asynchronous notifications SELECT dblink_connect(connection_parameters());
Re: recovery modules
On Mon, Jan 30, 2023 at 11:48:10AM -0800, Andres Freund wrote: > I don't think _PG_archive_module_init() should actually allocate a memory > context and do other similar initializations. Instead it should just return > 'const ArchiveModuleCallbacks*', typically a single line. > > Allocations etc should happen in one of the callbacks. That way we can > actually have multiple instances of a module. I think we'd need to invent a startup callback for archive modules for this to work, but that's easy enough. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
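For illustration, a rough sketch of what such a startup callback might look like; ArchiveModuleState, startup_cb, and the my_* names are assumptions for the example, not a settled interface. The allocation that the posted patch performed in _PG_archive_module_init would move here instead:
```c
#include "postgres.h"
#include "utils/memutils.h"

/* Hypothetical per-instance state passed to every archive callback. */
typedef struct ArchiveModuleState
{
	void	   *private_data;	/* module-specific state, e.g. a memory context */
} ArchiveModuleState;

/* Hypothetical startup callback, run once when an archiver instance starts. */
static void
my_startup(ArchiveModuleState *state)
{
	state->private_data = AllocSetContextCreate(TopMemoryContext,
												"my_archive_module",
												ALLOCSET_DEFAULT_SIZES);
}

/*
 * The const callbacks struct from the earlier sketch would then simply gain
 * a startup_cb member:
 *
 *     static const ArchiveModuleCallbacks my_callbacks = {
 *         .startup_cb = my_startup,
 *         .check_configured_cb = my_check_configured,
 *         .archive_file_cb = my_archive_file,
 *     };
 */
```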
Re: Making background psql nicer to use in tap tests
Andres Freund writes: > It's annoyingly hard to wait for the result of a query in a generic way with > background_psql(), and more generally for psql. background_psql() uses -XAtq, > which means that we'll not get "status" output (like "BEGIN" or "(1 row)"), > and that queries not returning anything are completely invisible. Yeah, the empty-query-result problem was giving me fits recently. +1 for wrapping this into something more convenient to use. regards, tom lane
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Hi, On 2023-01-10 21:32:54 +0100, Matthias van de Meent wrote: > On Tue, 10 Jan 2023 at 20:14, Andres Freund wrote: > > On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote: > > What precisely do you mean with "skew" here? Do you just mean that it'd > > take a > > long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds > > like > > you might mean more than that? > > h->oldest_considered_running can be extremely old due to the global > nature of the value and the potential existence of a snapshot in > another database that started in parallel to a very old running > transaction. Here's a version that, I think, does not have that issue. In an earlier, not posted, version I had a vacuum_defer_cleanup_age-specific helper function for this, but it seems likely we'll need it in other places too. So I named it TransactionIdRetreatSafely(). I made it accept the xid by pointer, as the line lengths / repetition otherwise end up making it hard to read the code. For now I have TransactionIdRetreatSafely() be private to procarray.c, but I expect we'll have to change that eventually. Not sure I like TransactionIdRetreatSafely() as a name. Maybe TransactionIdRetreatClamped() is better? I've been working on a test for vacuum_defer_cleanup_age. It does catch the corruption at hand, but not much more. It's quite painful to write, right now. Some of the reasons: https://postgr.es/m/20230130194350.zj5v467x4jgqt3d6%40awork3.anarazel.de > > I'm tempted to go with reinterpreting 64bit xids as signed. Except that it > > seems like a mighty invasive change to backpatch. > > I'm not sure either. Protecting against underflow by halving the > effective valid value space is quite the intervention, but if it is > necessary to make this work in a performant manner, it would be worth > it. Maybe someone else with more experience can provide their opinion > here. The attached assertions just remove 1/2**32 of the space, by reserving the xid range with the upper 32 bits set as something that shouldn't be reachable. Still requires us to change the input routines to reject that range, but I think that's a worthy tradeoff. I didn't find the existing limits for the type to be documented anywhere. Obviously something like that could only go into HEAD.
Greetings, Andres Freund >From 8be634900796630a772c0131925f38f136fed599 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 9 Jan 2023 10:23:10 -0800 Subject: [PATCH v3 1/6] WIP: Fix corruption due to vacuum_defer_cleanup_age underflowing 64bit xids Author: Reviewed-by: Discussion: https://postgr.es/m/20230108002923.cyoser3ttmt63...@awork3.anarazel.de Backpatch: --- src/backend/storage/ipc/procarray.c | 85 + 1 file changed, 73 insertions(+), 12 deletions(-) diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 4340bf96416..64d0896b23b 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -367,6 +367,9 @@ static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId l static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid); static void MaintainLatestCompletedXid(TransactionId latestXid); static void MaintainLatestCompletedXidRecovery(TransactionId latestXid); +static void TransactionIdRetreatSafely(TransactionId *xid, + int retreat_by, + FullTransactionId rel); static inline FullTransactionId FullXidRelativeTo(FullTransactionId rel, TransactionId xid); @@ -1888,17 +1891,35 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h) * so guc.c should limit it to no more than the xidStopLimit threshold * in varsup.c. Also note that we intentionally don't apply * vacuum_defer_cleanup_age on standby servers. + * + * Need to use TransactionIdRetreatSafely() instead of open-coding the + * subtraction, to prevent creating an xid before + * FirstNormalTransactionId. */ - h->oldest_considered_running = - TransactionIdRetreatedBy(h->oldest_considered_running, - vacuum_defer_cleanup_age); - h->shared_oldest_nonremovable = - TransactionIdRetreatedBy(h->shared_oldest_nonremovable, - vacuum_defer_cleanup_age); - h->data_oldest_nonremovable = - TransactionIdRetreatedBy(h->data_oldest_nonremovable, - vacuum_defer_cleanup_age); - /* defer doesn't apply to temp relations */ + Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running, + h->shared_oldest_nonremovable)); + Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable, + h->data_oldest_nonremovable)); + + if (vacuum_defer_cleanup_age > 0) + { + TransactionIdRetreatSafely(&h->oldest_considered_running, + vacuum_defer_cleanup_age, + h->latest_completed); + TransactionIdRetreatSafely(&h->shared_oldest_nonremovable, + vacuum_defer_cleanup_age, + h->latest_completed); + Transact
Re: Add SHELL_EXIT_CODE to psql
> > > Unfortunately, there is a fail in FreeBSD > https://cirrus-ci.com/task/6466749487382528 > > Maybe, this patch is need to be rebased? > > That failure is in postgres_fdw, which this code doesn't touch. I'm not able to get to /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/regression.out - It's not in either of the browseable zip files and the 2nd zip file isn't downloadable, so it's hard to see what's going on. I rebased, but there are no code differences. From e99c7b8aa3b81725773c03583bcff50a88a58a38 Mon Sep 17 00:00:00 2001 From: coreyhuinker Date: Wed, 11 Jan 2023 17:27:23 -0500 Subject: [PATCH v8 1/3] Add PG_OS_TARGET environment variable to enable OS-specific changes to regression tests. --- src/test/regress/pg_regress.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c index 6cd5998b9d..bf0ada4560 100644 --- a/src/test/regress/pg_regress.c +++ b/src/test/regress/pg_regress.c @@ -594,6 +594,17 @@ initialize_environment(void) #endif } + /* + * Regression tests that invoke OS commands must occasionally use + * OS-specific syntax (example: NUL vs /dev/null). This env var allows + * those tests to adjust commands accordingly. + */ +#if defined(WIN32) + setenv("PG_OS_TARGET", "WIN32", 1); +#else + setenv("PG_OS_TARGET", "OTHER", 1); +#endif + /* * Set translation-related settings to English; otherwise psql will * produce translated messages and produce diffs. (XXX If we ever support -- 2.39.1 From 496a84ccf758ed34d2c9bff0e9b458bef5fa9e58 Mon Sep 17 00:00:00 2001 From: coreyhuinker Date: Thu, 12 Jan 2023 23:04:31 -0500 Subject: [PATCH v8 2/3] Add wait_result_to_exit_code(). --- src/common/wait_error.c | 15 +++ src/include/port.h | 1 + 2 files changed, 16 insertions(+) diff --git a/src/common/wait_error.c b/src/common/wait_error.c index 4a3c3c61af..1eb7ff3fa2 100644 --- a/src/common/wait_error.c +++ b/src/common/wait_error.c @@ -127,3 +127,18 @@ wait_result_is_any_signal(int exit_status, bool include_command_not_found) return true; return false; } + +/* + * Return the POSIX exit code (0 to 255) that corresponds to the argument. + * The argument is a return code returned by wait(2) or waitpid(2), which + * also applies to pclose(3) and system(3). + */ +int +wait_result_to_exit_code(int exit_status) +{ + if (WIFEXITED(exit_status)) + return WEXITSTATUS(exit_status); + if (WIFSIGNALED(exit_status)) + return WTERMSIG(exit_status); + return 0; +} diff --git a/src/include/port.h b/src/include/port.h index e66193bed9..1b119a3c56 100644 --- a/src/include/port.h +++ b/src/include/port.h @@ -495,6 +495,7 @@ extern char *escape_single_quotes_ascii(const char *src); extern char *wait_result_to_str(int exitstatus); extern bool wait_result_is_signal(int exit_status, int signum); extern bool wait_result_is_any_signal(int exit_status, bool include_command_not_found); +extern int wait_result_to_exit_code(int exit_status); /* * Interfaces that we assume all Unix system have. We retain individual macros -- 2.39.1 From e3dc0f2117b52f9661a75e79097741ed6c0f3b20 Mon Sep 17 00:00:00 2001 From: coreyhuinker Date: Mon, 23 Jan 2023 16:46:16 -0500 Subject: [PATCH v8 3/3] Add psql variables SHELL_ERROR and SHELL_EXIT_CODE which report the success or failure of the most recent psql OS-command executed via the \! command or `backticks`. These variables are roughly analogous to ERROR and SQLSTATE, but for OS level operations instead of SQL commands. 
--- doc/src/sgml/ref/psql-ref.sgml | 21 ++ src/bin/psql/command.c | 14 src/bin/psql/help.c| 4 src/bin/psql/psqlscanslash.l | 35 +- src/test/regress/expected/psql.out | 29 + src/test/regress/sql/psql.sql | 25 + 6 files changed, 122 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index dc6528dc11..3d8a7353b3 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -4264,6 +4264,27 @@ bar + + SHELL_ERROR + + + true if the last shell command failed, false if + it succeeded. This applies to shell commands invoked via either \! + or `. See also SHELL_EXIT_CODE. + + + + + + SHELL_EXIT_CODE + + + The exit code return by the last shell command, invoked via either \! or `. + 0 means no error. See also SHELL_ERROR. + + + + SHOW_ALL_RESULTS diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index b5201edf55..ea79d4fe57 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -5032,6 +5032,20 @@ do_shell(const char *command) else result = system(command); + if (result == 0) + { + Se
Allow an extension to be updated without a script
Hi hackers, I propose to add a new option "updates_without_script" to extension's control file which a list of updates that do not need update script. This enables to update an extension by ALTER EXTENSION even if the extension module doesn't provide the update script. Currently, even when we don't need to execute any command to update an extension from one version to the next, we need to provide an update script that doesn't contain any command. Preparing such meaningless files are sometimes annoying. The attached patch introduces a new option "updates_without_script" into extension control file. This specifies a list of such updates following the pattern 'old_version--target_version'. For example, updates_without_script = '1.1--1.2, 1.3--1.4' means that updates from version 1.1 to version 1.2 and from version 1.3 to version 1.4 don't need an update script. In this case, users don't need to prepare update scripts extension--1.1--1.2.sql and extension--1.3--1.4.sql if it is not necessary to execute any commands. The updated path of an extension is determined based on both the filenames of update scripts and the list of updates specified in updates_without_script. Presence of update script has higher priority than the option. Therefore, if an update script is provided, the script will be executed even if this update is specified in updates_without_script. What do you think of this feature? Any feedback would be appreciated. Regards, Yugo Nagata -- Yugo NAGATA >From 938e15b28f66015044b559e4c523fc74590691fc Mon Sep 17 00:00:00 2001 From: Yugo Nagata Date: Mon, 30 Jan 2023 17:36:55 +0900 Subject: [PATCH] Allow an extention to be updated without a script When we don't need to execute any command to update an extension from one version to the next, we can specify a list of such updates following the pattern 'old_version--target_version' into a new option updates_without_script. For example, specifying '1.1--1.2, 1.3--1.4' means updates from version 1.1 to version 1.2, and from version 1.3 to version 1.4 don't need an update script. User doesn't need to provide an update script that doesn't contain any command for such updates. The updated path is determined based on both the names of update scripts and the list in updates_without_script. If an update script is provided, the script will be executed even if this update is specified in updates_without_script. --- doc/src/sgml/extend.sgml | 30 +++- src/backend/commands/extension.c | 161 +++--- src/test/modules/test_extensions/Makefile | 3 +- .../expected/test_extensions.out | 6 + .../test_extensions/sql/test_extensions.sql | 8 + .../test_extensions/test_ext9--1.0.sql| 0 .../test_extensions/test_ext9--2.0--3.0.sql | 0 .../modules/test_extensions/test_ext9.control | 5 + 8 files changed, 182 insertions(+), 31 deletions(-) create mode 100644 src/test/modules/test_extensions/test_ext9--1.0.sql create mode 100644 src/test/modules/test_extensions/test_ext9--2.0--3.0.sql create mode 100644 src/test/modules/test_extensions/test_ext9.control diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml index 46e873a166..1c4c978264 100644 --- a/doc/src/sgml/extend.sgml +++ b/doc/src/sgml/extend.sgml @@ -807,6 +807,17 @@ RETURNS anycompatible AS ... + + + updates_without_script (string) + + +A list of updates that do not need update scripts following the pattern +old_version--target_version, +for example updates_without_script = '1.1--1.2, 1.3--1.4'. + + + @@ -818,8 +829,9 @@ RETURNS anycompatible AS ... 
Secondary control files follow the same format as the primary control file. Any parameters set in a secondary control file override the primary control file when installing or updating to that version of - the extension. However, the parameters directory and - default_version cannot be set in a secondary control file. + the extension. However, the parameters directory, + default_version, and updates_without_script + cannot be set in a secondary control file. @@ -1092,6 +1104,20 @@ SELECT pg_catalog.pg_extension_config_dump('my_config', 'WHERE NOT standard_entr objects, they are automatically dissociated from the extension. + + If you don't need to execute any command to update an extension from one + version to the next, provide an update script that doesn't contain + any command or specify a list of such updates following the pattern + old_version--target_version + into updates_without_script. For example, + updates_without_script = '1.1--1.2, 1.3--1.4' + means updates from version 1.1 to version 1.2 + and from version 1.3 to version 1.4 + don't need an update script. Note that even if an update is specified in + updates_without_script, i
Re: Non-superuser subscription owners
> On Jan 30, 2023, at 11:30 AM, Robert Haas wrote: > > CREATE SUBSCRIPTION > provides no tools at all for filtering the data that the subscriber > chooses to send. > > Now that can be changed, I suppose, and a run-as user would be one way > to make progress in that direction. But I'm not sure how viable that > is, because... > >>> In what >>> circumstances would be it be reasonable to give responsibility for >>> those objects to different and especially mutually untrusting users? >> >> When public repositories of data, such as the IANA whois database, publish >> their data via postgres publications. > > ... for that to work, IANA would need to set up the database so that > untrusted parties can create logical replication slots on their > PostgreSQL server. And I think that granting REPLICATION privilege on > your database to random people on the Internet is not really viable, > nor intended to be viable. That was an aspirational example in which there's infinite daylight between the publisher and subscriber. I, too, doubt that's ever going to be possible. But I still think we should aspire to some extra daylight between the two. Perhaps IANA doesn't publish to the whole world, but instead publishes only to subscribers who have a contract in place, and have agreed to monetary penalties should they abuse the publishing server. Whatever. There's going to be some amount of daylight possible if we design for it, and none otherwise. My real argument here isn't against your goal of having non-superusers who can create subscriptions. That part seems fine to me. Given that my work last year made it possible for subscriptions to run as somebody other than the subscription creator, it annoys me that you now want the subscription creator's privileges to be what the subscription runs as. That seems to undo what I worked on. In my mental model of a (superuser-creator, non-superuser-owner) pair, it seems you're logically only touching the lefthand side, so you should then have a (nonsuperuser-creator, nonsuperuser-owner) pair. But you don't. You go the apparently needless extra step of just squashing them together. I just don't see why it needs to be like that. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Non-superuser subscription owners
On Fri, Jan 27, 2023 at 5:00 PM Andres Freund wrote: > > Or, another thought, maybe this should be checking for CREATE on the > > current database + also pg_create_subscription. That seems like it > > might be the right idea, actually. > > Yes, that seems like a good idea. Done in this version. I also changed check_conninfo to take an extra argument instead of returning a Boolean, as per your suggestion. I had a long think about what to do with ALTER SUBSCRIPTION ... OWNER TO in terms of permissions checks. The previous version required that the new owner have permissions of pg_create_subscription, but there seems to be no particular reason for that rule except that it happens to be what I made the code do. So I changed it to say that the current owner must have CREATE privilege on the database, and must be able to SET ROLE to the new owner. This matches the rule for CREATE SCHEMA. Possibly we should *additionally* require that the person performing the rename still have pg_create_subscription, but that shouldn't be the only requirement. This change means that you can't just randomly give your subscription to the superuser (with or without concurrently attempting some other change as per your other comments) which is good because you can't do that with other object types either. There seems to be a good deal of inconsistency here. If you want to give someone a schema, YOU need CREATE on the database. But if you want to give someone a table, THEY need CREATE on the containing schema. It makes sense that we check permissions on the containing object, which could be a database or a schema depending on what you're renaming, but it's unclear to me why we sometimes check on the person performing the ALTER command and at other times on the recipient. It's also somewhat unclear to me why we are checking CREATE in the first place, especially on the donor. It might make sense to have a rule that you can't own an object in a place where you couldn't have created it, but there is no such rule, because you can give someone CREATE on a schema, they can create an object, and then you can take CREATE away and they still own an object there. So it kind of looks to me like we made it up as we went along and that the result isn't very consistent, but I'm inclined to follow CREATE SCHEMA here unless there's some reason to do otherwise. Another question around ALTER SUBSCRIPTION ... OWNER TO and also ALTER SUBSCRIPTION .. RENAME is whether they ought to fail if you're not a superuser and password_required false is set. They are separate code paths from the rest of the ALTER SUBSCRIPTION cases, so if we want that to be a rule we need dedicated code for it. I'm not quite sure what's right. There's no comparable case for ALTER USER MAPPING because a user mapping doesn't have an owner and so can't be reassigned to a new owner. I don't see what the harm is, especially for RENAME, but I might be missing something, and it certainly seems arguable. > > I'm not entirely clear whether there's a hazard there. > > I'm not at all either. It's just a code pattern that makes me anxious - I > suspect there's a few places it makes us more vulnerable. It looks likely to me that it was cut down from the CREATE SCHEMA code, FWIW. > > If there is, I think we could fix it by moving the LockSharedObject call up > > higher, above object_ownercheck. The only problem with that is it lets you > > lock an object on which you have no permissions: see > > 2ad36c4e44c8b513f6155656e1b7a8d26715bb94.
To really fix that, we'd need an > > analogue of RangeVarGetRelidExtended. > > Yea, we really should have something like RangeVarGetRelidExtended() for other > kinds of objects. It'd take a fair bit of work / time to use it widely, but > it'll take even longer if we start in 5 years ;) We actually have something sort of like that in the form of get_object_address(). It doesn't allow for a callback, but it does have a retry loop. -- Robert Haas EDB: http://www.enterprisedb.com v3-0001-Add-new-predefined-role-pg_create_subscriptions.patch Description: Binary data
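As an aside, the locking hazard discussed above has a fairly mechanical shape. Here is a hypothetical sketch (not the posted patch; the helper name and lock level are assumptions) of the "lock first, then check ownership" ordering mentioned for ALTER SUBSCRIPTION:
```c
#include "postgres.h"
#include "catalog/pg_subscription.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/acl.h"

/*
 * Hypothetical helper: take the lock before the ownership check so that a
 * concurrent ALTER SUBSCRIPTION ... OWNER TO cannot invalidate the check,
 * at the cost of letting a user with no privileges briefly lock the object.
 */
static void
lock_and_check_subscription_owner(Oid subid, const char *subname)
{
	LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

	if (!object_ownercheck(SubscriptionRelationId, subid, GetUserId()))
		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_SUBSCRIPTION, subname);
}
```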
Re: run pgindent on a regular basis / scripted manner
On Sat, Jan 28, 2023 at 05:06:03PM -0800, Noah Misch wrote: > On Tue, Jan 24, 2023 at 02:04:02PM -0500, Bruce Momjian wrote: > > On Tue, Jan 24, 2023 at 09:54:57AM -0500, Tom Lane wrote: > > > As another example, the mechanisms we use to create the typedefs list > > > in the first place are pretty squishy/leaky: they depend on which > > > buildfarm animals are running the typedef-generation step, and on > > > whether anything's broken lately in that code --- which happens on > > > a fairly regular basis (eg [1]). Maybe that could be improved, > > > but I don't see an easy way to capture the set of system-defined > > > typedefs that are in use on platforms other than your own. I > > > definitely do not want to go over to hand maintenance of that list. > > > > As I now understand it, we would need to standardize on a typedef list > > at the beginning of each major development cycle, and then only allow > > additions, > > Not to my knowledge. There's no particular obstacle to updating the list more > frequently or removing entries. We would need to re-pgindent the tree each time, I think, which would cause disruption if we did it too frequently. > > and the addition would have to include any pgindent affects > > of the addition. > > Yes, a hook intended to enforce pgindent cleanliness should run tree-wide > pgindent when the given commit(s) change the typedef list. typedef list > changes essentially become another kind of refactoring that can yield merge > conflicts. If your commit passed the pgindent check, rebasing it onto a new > typedefs list may require further indentation changes. New typedefs don't > tend to change a lot of old code, so I would expect this sort of conflict to > be minor, compared to all the other sources of conflicts. Agreed. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: heapgettup() with NoMovementScanDirection unused in core?
v2 attached On Fri, Jan 27, 2023 at 6:28 PM David Rowley wrote: > > On Sat, 28 Jan 2023 at 12:15, Tom Lane wrote: > > /* > > * Determine the net effect of two direction specifications. > > * This relies on having ForwardScanDirection = +1, BackwardScanDirection = > > -1, > > * and will probably not do what you want if applied to any other values. > > */ > > #define CombineScanDirections(a, b) ((a) * (b)) > > > > The main thing this'd buy us is being able to grep for uses of the > > trick. If it's written as just multiplication, good luck being > > able to find what's depending on that, should you ever need to. > > Yeah, I think the multiplication macro is a good way of doing it. > Having the definition of it close to the ScanDirection enum's > definition is likely a very good idea so that anyone adjusting the > enum values is more likely to notice that it'll cause an issue. A > small note on the enum declaration about the -1, +1 values being > exploited in various places might be a good idea too. I see v6-0006 in > [1] further exploits this, so that's further reason to document that. > > My personal preference would have been to call it > ScanDirectionCombine, so the naming is more aligned to the 4 other > macro names that start with ScanDirection in sdir.h, but I'm not going > to fuss over it. I've gone with this macro name. I've also updated comments Tom mentioned and removed the test. As for the asserts, I was at a bit of a loss as to where to put an assert which will make it clear that heapgettup() and heapgettup_pagemode() do not handle NoMovementScanDirection but was at a higher level of the executor. Do we not have to accommodate the direction changing from tuple to tuple? If we don't expect the plan node direction to change during execution, then why recalculate estate->es_direction for each invocation of Index/SeqNext()? As such, in this version I've put the asserts in heapgettup() and heapgettup_pagemode(). I also realized that it doesn't really make sense to assert about the index scan direction in ExecInitIndexOnlyScan() and ExecInitIndexScan() -- so I've moved the assertion to planner when we make the index plan from the path. I'm not sure if it is needed. - Melanie From d6aae3a73864aaaf84b4dc00ff299c2e8b4a5729 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 30 Jan 2023 09:49:54 -0500 Subject: [PATCH v2] remove nomovement scandir --- src/backend/access/heap/heapam.c | 71 ++-- src/backend/commands/explain.c | 3 - src/backend/executor/nodeIndexonlyscan.c | 11 +--- src/backend/executor/nodeIndexscan.c | 11 +--- src/backend/optimizer/path/indxpath.c| 8 +-- src/backend/optimizer/plan/createplan.c | 14 + src/backend/optimizer/util/pathnode.c| 5 +- src/include/access/sdir.h| 11 +++- src/include/nodes/pathnodes.h| 6 +- 9 files changed, 41 insertions(+), 99 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index e6024a980b..10aaeb14aa 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -490,9 +490,6 @@ heapgetpage(TableScanDesc sscan, BlockNumber block) * tuple as indicated by "dir"; return the next tuple in scan->rs_ctup, * or set scan->rs_ctup.t_data = NULL if no more tuples. * - * dir == NoMovementScanDirection means "re-fetch the tuple indicated - * by scan->rs_ctup". - * * Note: the reason nkeys/key are passed separately, even though they are * kept in the scan descriptor, is that the caller may not want us to check * the scankeys. 
@@ -523,6 +520,8 @@ heapgettup(HeapScanDesc scan, int linesleft; ItemId lpp; + Assert(dir != NoMovementScanDirection); + /* * calculate next starting lineoff, given scan direction */ @@ -583,7 +582,7 @@ heapgettup(HeapScanDesc scan, linesleft = lines - lineoff + 1; } - else if (backward) + else { /* backward parallel scan not supported */ Assert(scan->rs_base.rs_parallel == NULL); @@ -653,34 +652,6 @@ heapgettup(HeapScanDesc scan, linesleft = lineoff; } - else - { - /* - * ``no movement'' scan direction: refetch prior tuple - */ - if (!scan->rs_inited) - { - Assert(!BufferIsValid(scan->rs_cbuf)); - tuple->t_data = NULL; - return; - } - - block = ItemPointerGetBlockNumber(&(tuple->t_self)); - if (block != scan->rs_cblock) - heapgetpage((TableScanDesc) scan, block); - - /* Since the tuple was previously fetched, needn't lock page here */ - page = BufferGetPage(scan->rs_cbuf); - TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, page); - lineoff = ItemPointerGetOffsetNumber(&(tuple->t_self)); - lpp = PageGetItemId(page, lineoff); - Assert(ItemIdIsNormal(lpp)); - - tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp); - tuple->t_len = ItemIdGetLength(lpp); - - return; - } /* * advance the scan until we find a qualifying tuple or run out of stuff @@ -861,6 +832,8 @@ heapgettup_pagemod
Re: Authentication fails for md5 connections if ~/.postgresql/postgresql.{crt and key} exist
On Fri, Jan 27, 2023 at 12:13 PM Cary Huang wrote: > > (Eventually I'd like to teach the server not to ask for a client > > certificate if it's not going to use it.) > > If clientcert is not requested by the server, but yet the client still > sends the certificate, the server will still verify it. This is the case > in this discussion. I think this is maybe conflating the application-level behavior with the protocol-level behavior. A client certificate is requested by the server if ssl_ca_file is set, whether clientcert is set in the HBA or not. It's this disconnect between the intuitive behavior and the actual behavior that I'd like to eventually improve. --Jacob
Re: MacOS: xsltproc fails with "warning: failed to load external entity"
Aleksander Alekseev writes: >> What we do actually have already is a recommendation to install >> appropriate MacPorts or Homebrew packages: >> https://www.postgresql.org/docs/devel/docguide-toolsets.html#DOCGUIDE-TOOLSETS-INST-MACOS >> and it works okay for me as long as I use MacPorts' version of xsltproc. > Unfortunately it doesn't work for Homebrew anymore and there seems to > be only one xsltproc in the system. Hmm. Seems unlikely that Homebrew would have dropped the package(s) altogether. But ... poking at this, I discovered that there are inaccuracies in our docs for MacPorts: * /opt/local/bin/xsltproc is provided by libxslt, and /opt/local/bin/xmllint is provided by libxml2, neither of which will be installed by our recipe as given. You might have pulled those ports in already to build Postgres with, but if you didn't, the recipe will fail. I wonder if the Homebrew recipe has the same bug. * At some point MacPorts renamed docbook-xsl to docbook-xsl-nons. This is harmless at the moment, because if you ask for docbook-xsl it will automatically install docbook-xsl-nons instead. I wonder if that'll be true indefinitely, though. I also wonder whether we shouldn't point at the meta-package docbook-xml instead of naming a particular version here (and having to update that from time to time). The extra disk space to install all the DTD versions is entirely insignificant (< 2MB). > Does your: > xsltproc --help > ... also say that it uses /etc/xml/catalog path by default? Both /usr/bin/xsltproc and /opt/local/bin/xsltproc say --catalogs : use SGML catalogs from $SGML_CATALOG_FILES otherwise XML Catalogs starting from file:///etc/xml/catalog are activated by default However, this appears to be a lie for /opt/local/bin/xsltproc; what it's apparently *actually* using is /opt/local/etc/xml/catalog, which is what MacPorts provides. I repeated the test I did this morning, and this time using --catalogs with SGML_CATALOG_FILES set to /opt/local/etc/xml/catalog worked for me, using either copy of xsltproc. I must've fat-fingered it somehow before. Nonetheless, I doubt that that recipe is worth recommending to MacPorts users: if they pull in the DTD packages they might as well pull in libxml2 and libxslt, and then they don't need to adjust anything. In short, I think we need to update J.2.4 to say this for MacPorts: sudo port install libxml2 libxslt docbook-xml docbook-xsl-nons fop and I strongly suspect that the Homebrew recipe has a similar oversight. regards, tom lane
Re: Authentication fails for md5 connections if ~/.postgresql/postgresql.{crt and key} exist
On Sun, Jan 29, 2023 at 5:02 AM Jim Jones wrote: > On 27.01.23 21:13, Cary Huang wrote: > > But, if the server does request clientcert but client uses > "sslcertmode=disable" to connect and not give a certificate, it would > also result in authentication failure. In this case, we actually would > want to ignore "sslcertmode=disable" and send default certificates if > found. > > I'm just wondering if this is really necessary. If the server asks for a > certificate and the user explicitly says "I don't want to send it", > shouldn't it be ok for the server return an authentication failure? I > mean, wouldn't it defeat the purpose of "sslcertmode=disable"? +1. In my opinion, if I tell libpq not to share my certificate with the server, and it then fails to authenticate, that's intended and useful behavior. (I don't really want libpq to try to find more ways to authenticate me; that causes other security issues [1, 2].) --Jacob [1] https://www.postgresql.org/message-id/0adf992619e7bf138eb4119622d37e3efb6515d5.camel%40j-davis.com [2] https://www.postgresql.org/message-id/46562.1637695110%40sss.pgh.pa.us
Re: Allow an extension to be updated without a script
Yugo NAGATA writes: > Currently, even when we don't need to execute any command to update an > extension from one version to the next, we need to provide an update > script that doesn't contain any command. Preparing such meaningless > files are sometimes annoying. If you have no update script, why call it a new version? The point of extension versions is to distinguish different states of the extension's SQL objects. We do not consider mods in underlying C code to justify a new version. > The attached patch introduces a new option "updates_without_script" > into extension control file. This specifies a list of such updates > following the pattern 'old_version--target_version'. This seems completely unnecessary and confusing. regards, tom lane
Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security
On Fri, Jan 27, 2023 at 1:08 PM Robert Haas wrote: > > 1) Forwarding the original ambient context along with the request, so > > the server can check it too. > > Right, so a protocol extension. Reasonable idea, but a big lift. Not > only do you need everyone to be running a new enough version of > PostgreSQL, but existing proxies like pgpool and pgbouncer need > updates, too. Right. > > 2) Explicitly combining the request with the proof of authority needed > > to make it, as in capability-based security [3]. > > As far as I can see, that link doesn't address how you'd make this > approach work across a network. The CSRF-token example I gave is one. But that's HTTP-specific (stateless, server-driven) and probably doesn't make a lot of sense for our case. For our case, assuming that connections have side effects, one solution could be for the client to signal to the server that the connection should use in-band authentication only; i.e. fail the connection if the credentials provided aren't good enough by themselves to authenticate the client. (This has some overlap with SASL negotiation, maybe.) But that still requires server support. I don't know if there's a clever way to tie the authentication to the request on the client side only, using existing server implementations. (If connections don't have side effects, require_auth should be sufficient.) > > 3) Dropping as many implicitly-held privileges as possible before making > > a request. This doesn't solve the problem but may considerably reduce > > the practical attack surface. > > Right. I definitely don't object to this kind of approach, but I don't > think it can ever be sufficient by itself. I agree. (But for the record, I think that an outbound proxy filter is also insufficient. Someone, somewhere, is going to want to safely proxy through localhost _and_ have peer authentication set up.) > > I think this style focuses on absolute configuration flexibility at the > > expense of usability. It obfuscates the common use cases. (I have the > > exact same complaint about our HBA and ident configs, so I may be > > fighting uphill.) > > That's probably somewhat true, but on the other hand, it also is more > powerful than what you're describing. In your system, is there some > way the DBA can say "hey, you can connect to any of the machines on > this list of subnets, but nothing else"? Or equally, "hey, you may NOT > connect to any machine on this list of subnets, but anything else is > fine"? Or "you can connect to these subnets without SSL, but if you > want to talk to anything else, you need to use SSL"? I guess I didn't call it out explicitly, so it was fair to assume that it did not. I don't think we should ignore those cases. But if we let the configuration focus on policies instead, and simultaneously improve the confused-deputy problem, then any IP/host filter functionality that we provide becomes an additional safety measure instead of your only viable line of defense. "I screwed up our IP filter, but we're still safe because the proxy refused to forward its client cert to the backend." Or, "this other local application requires peer authentication, but it's okay because the proxy disallows those connections by default." > Your idea > seems to rely on us being able to identify all of the policies that a > user is likely to want and give names to each one, and I don't feel > very confident that that's realistic. But maybe I'm misinterpreting > your idea? No, that's pretty accurate. 
But I'm used to systems that provide a ridiculous number of policies [1, 2] via what's basically a scoped property bag. "Turn off option 1 and 2 globally. For host A and IP address B, turn on option 1 as an exception." And I don't really expect us to need as many options as those systems do. I think that configuration style evolves well, it focuses on the right things, and it can still handle IP lists intuitively [3], if that's the way a DBA really wants to set up policies. --Jacob [1] https://httpd.apache.org/docs/2.4/mod/mod_proxy.html [2] https://www.haproxy.com/documentation/hapee/latest/onepage/#4 [3] https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-proxied-tcp/
Re: Non-superuser subscription owners
On Mon, Jan 30, 2023 at 3:27 PM Mark Dilger wrote: > That was an aspirational example in which there's infinite daylight between > the publisher and subscriber. I, too, doubt that's ever going to be > possible. But I still think we should aspire to some extra daylight between > the two. Perhaps IANA doesn't publish to the whole world, but instead > publishes only to subscribers who have a contract in place, and have agreed > to monetary penalties should they abuse the publishing server. Whatever. > There's going to be some amount of daylight possible if we design for it, and > none otherwise. > > My real argument here isn't against your goal of having non-superusers who > can create subscriptions. That part seems fine to me. > > Given that my work last year made it possible for subscriptions to run as > somebody other than the subscription creator, it annoys me that you now want > the subscription creator's privileges to be what the subscription runs as. > That seems to undo what I worked on. In my mental model of a > (superuser-creator, non-superuser-owner) pair, it seems you're logically only > touching the lefthand side, so you should then have a (nonsuperuser-creator, > nonsuperuser-owner) pair. But you don't. You go the apparently needless > extra step of just squashing them together. I just don't see why it needs to > be like that. I feel like you're accusing me of removing functionality that has never existed. A subscription doesn't run as the subscription creator. It runs as the subscription owner. If you or anyone else had added the capability for it to run as someone other than the subscription owner, I certainly wouldn't be trying to back that capability out as part of this patch, and because there isn't, I'm not proposing to add that as part of this patch. I don't see how that makes me guilty of squashing anything together. The current state of affairs, where the run-as user is taken from pg_subscription.subowner, the same field that is updated by ALTER SUBSCRIPTION ... OWNER TO, is the result of your work, not anything that I have done or am proposing to do. I also *emphatically* disagree with the idea that this undoes what you worked on. My patch would be *impossible* without your work. Prior to your work, the run-as user was always, basically, the superuser, and so the idea of allowing anyone other than a superuser to execute CREATE SUBSCRIPTION would be flat-out nuts. Because of your work, that's now a thing that we may be able to reasonably allow, if we can work through the remaining issues. So I'm grateful to you, and also sorry to hear that you're annoyed with me. But I still don't think that the fact that the division you want doesn't exist is somehow my fault. I'm kind of curious why you *didn't* make this distinction at the time that you did the other work in this area. Maybe my memory is playing tricks on me again, but I seem to recall talking about the idea with you at the time, and I seem to recall thinking that it sounded like an OK idea. I seem to vaguely recall us discussing hazards like: well, what if replication causes code to get executed as the subscription owner that causes something bad to happen? But I think the only way that happens is if they put triggers on the tables that are being replicated, which is their choice, and they can avoid installing problematic code there if they want. I think there might have been some other scenarios, too, but I just can't remember. In any case, I don't think the idea is completely without merit.
I think it could very well be something that we want to have for one reason or another. But I don't currently understand exactly what those reasons are, and I don't see any reason why one patch should both split owner from run-as user and also allow the owner to be a non-superuser. That seems like two different efforts to me. -- Robert Haas EDB: http://www.enterprisedb.com
Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security
On Mon, Jan 30, 2023 at 4:12 PM Jacob Champion wrote: > For our case, assuming that connections have side effects, one > solution could be for the client to signal to the server that the > connection should use in-band authentication only; i.e. fail the > connection if the credentials provided aren't good enough by > themselves to authenticate the client. (This has some overlap with > SASL negotiation, maybe.) I'm not an expert on this stuff, but to me that feels like a weak and fuzzy concept. If the client is going to tell the server something, I'd much rather have it say something like "i'm proxying a request from my local user rhaas, who authenticated using such and such a method and connected from such and such an IP yadda yadda". That feels to me like really clear communication that the server can then be configured to something about via pg_hba.conf or similar. Saying "use in-band authentication only", to me, feels much murkier. As the recipient of that message, I don't know exactly what to do about it, and it feels like whatever heuristic I adopt might end up being wrong and something bad happens anyway. > I agree. (But for the record, I think that an outbound proxy filter is > also insufficient. Someone, somewhere, is going to want to safely > proxy through localhost _and_ have peer authentication set up.) Well then they're indeed going to need some way to distinguish a proxied connection from a non-proxied one. You can't send identical connection requests in different scenarios and get different results > I guess I didn't call it out explicitly, so it was fair to assume that > it did not. I don't think we should ignore those cases. OK, cool. > But if we let the configuration focus on policies instead, and > simultaneously improve the confused-deputy problem, then any IP/host > filter functionality that we provide becomes an additional safety > measure instead of your only viable line of defense. "I screwed up our > IP filter, but we're still safe because the proxy refused to forward > its client cert to the backend." Or, "this other local application > requires peer authentication, but it's okay because the proxy > disallows those connections by default." Defense in depth is good. > > Your idea > > seems to rely on us being able to identify all of the policies that a > > user is likely to want and give names to each one, and I don't feel > > very confident that that's realistic. But maybe I'm misinterpreting > > your idea? > > No, that's pretty accurate. But I'm used to systems that provide a > ridiculous number of policies [1, 2] via what's basically a scoped > property bag. "Turn off option 1 and 2 globally. For host A and IP > address B, turn on option 1 as an exception." And I don't really > expect us to need as many options as those systems do. > > I think that configuration style evolves well, it focuses on the right > things, and it can still handle IP lists intuitively [3], if that's > the way a DBA really wants to set up policies. I think what we really need here is an example or three of a proposed configuration file syntax. I think it would be good if we could pick a syntax that doesn't require a super-complicated parser, and that maybe has something in common with our existing configuration file syntaxes. But if we have to invent something new, then we can do that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Transparent column encryption
On Wed, Jan 25, 2023 at 11:00 AM Peter Eisentraut wrote: > > When writing the paragraph on RSA-OAEP I was reminded that we didn't > > really dig into the asymmetric/symmetric discussion. Assuming that most > > first-time users will pick the builtin CMK encryption method, do we > > still want to have an asymmetric scheme implemented first instead of a > > symmetric keywrap? I'm still concerned about that public key, since it > > can't really be made public. > > I had started coding that, but one problem was that the openssl CLI > doesn't really provide any means to work with those kinds of keys. The > "openssl enc" command always wants to mix in a password. Without that, > there is no way to write a test case, and more crucially no way for > users to set up these kinds of keys. Unless we write our own tooling > for this, which, you know, the patch just passed 400k in size. Arrgh: https://github.com/openssl/openssl/issues/10605 > > The column encryption algorithm is set per-column -- but isn't it > > tightly coupled to the CEK, since the key length has to match? From a > > layperson perspective, using the same key to encrypt the same plaintext > > under two different algorithms (if they happen to have the same key > > length) seems like it might be cryptographically risky. Is there a > > reason I should be encouraged to do that? > > Not really. I was also initially confused by this setup, but that's how > other similar systems are set up, so I thought it would be confusing to > do it differently. Which systems let you mix and match keys and algorithms this way? I'd like to take a look at them. > > With the loss of \gencr it looks like we also lost a potential way to > > force encryption from within psql. Any plans to add that for v1? > > \gencr didn't do that either. We could do it. The libpq API supports > it. We just need to come up with some syntax for psql. Do you think people would rather set encryption for all parameters at once -- something like \encbind -- or have the ability to mix encrypted and unencrypted parameters? > > Are there plans to document client-side implementation requirements, to > > ensure cross-client compatibility? Things like the "PG\x00\x01" > > associated data are buried at the moment (or else I've missed them in > > the docs). If you're holding off until the feature is more finalized, > > that's fine too. > > This is documented in the protocol chapter, which I thought was the > right place. Did you want more documentation, or in a different place? I just missed it; sorry. > > Speaking of cross-client compatibility, I'm still disconcerted by the > > ability to write the value "hello world" into an encrypted integer > > column. Should clients be required to validate the text format, using > > the attrealtypid? > > Well, we can ask them to, but we can't really require them, in a > cryptographic sense. I'm not sure what more we can do. Right -- I just mean that clients need to pay more attention to it now, whereas before they may have delegated correctness to the server. The problem is documented in the context of deterministic encryption, but I think it applies to randomized as well. More concretely: should psql allow you to push arbitrary text into an encrypted \bind parameter, like it does now? > > It occurred to me when looking at the "unspecified" CMK scheme that the > > CEK doesn't really have to be an encryption key at all. In that case it > > can function more like a (possibly signed?) 
cookie for lookup, or even > > be ignored altogether if you don't want to use a wrapping scheme > > (similar to JWE's "direct" mode, maybe?). So now you have three ways to > > look up or determine a column encryption key (CMK realm, CMK name, CEK > > cookie)... is that a concept worth exploring in v1 and/or the documentation? > > I don't completely follow this. Yeah, I'm not expressing it very well. My feeling is that the organization system here -- a realm "contains" multiple CMKs, a CMK encrypts multiple CEKs -- is so general and flexible that it may need some suggested guardrails for people to use it sanely. I just don't know what those guardrails should be. I was motivated by the realization that CEKs don't even need to be keys. Thanks, --Jacob
Re: pub/sub - specifying optional parameters without values.
On Tue, Jan 31, 2023 at 4:00 AM Tom Lane wrote: > > Peter Smith writes: > > The v3 patch LGTM (just for the logical replication commands). > > Pushed then. > Thanks for pushing the v3 patch. I'd forgotten about the 'streaming' option -- AFAIK this was previously a boolean parameter and so its [= value] part can also be omitted. However, in PG16 streaming became an enum type (on/off/parallel), and the value can still be omitted but that is not really being covered by the new generic text note about booleans added by yesterday's patch. e.g. The enum 'streaming' value part can still be omitted. test_sub=# create subscription sub1 connection 'host=localhost dbname=test_pub' publication pub1 with (streaming); Perhaps a small top-up patch to CREATE SUBSCRIPTION is needed to describe this special case? PSA. (I thought mentioning this special streaming case again for ALTER SUBSCRIPTION might be overkill) -- Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Added-note-for-streaming-parameter-with-value-par.patch
Re: pub/sub - specifying optional parameters without values.
Peter Smith writes: > I'd forgotten about the 'streaming' option -- AFAIK this was > previously a boolean parameter and so its [= value] part can also be > omitted. However, in PG16 streaming became an enum type > (on/off/parallel), and the value can still be omitted but that is not > really being covered by the new generic text note about booleans added > by yesterday's patch. Hmph. I generally think that options defined like this (it's a boolean, except it isn't) are a bad idea, and would prefer to see that API rethought while we still can. regards, tom lane
Re: recovery modules
On Mon, Jan 30, 2023 at 12:04:22PM -0800, Nathan Bossart wrote: > On Mon, Jan 30, 2023 at 11:48:10AM -0800, Andres Freund wrote: >> I don't think _PG_archive_module_init() should actually allocate a memory >> context and do other similar initializations. Instead it should just return >> 'const ArchiveModuleCallbacks*', typically a single line. >> >> Allocations etc should happen in one of the callbacks. That way we can >> actually have multiple instances of a module. > > I think we'd need to invent a startup callback for archive modules for this > to work, but that's easy enough. If you don't return some (void *) pointing to a private area that would be stored by the backend, allocated as part of the loading path, I agree that an extra callback is what makes the most sense, presumably called around the beginning of PgArchiverMain(). Doing this kind of one-time action in the file callback would be weird. -- Michael
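To make the shape of that suggestion concrete, a minimal sketch of an archive module under the proposed interface could look like the following. This is only an illustration: the ArchiveModuleState type, the startup_cb member, and the basic_archive_* names are assumptions about where the API might land, not the committed interface.

/* Sketch only: assumes a per-module state struct and a startup callback. */
typedef struct BasicArchiveData
{
    MemoryContext context;
} BasicArchiveData;

static void
basic_archive_startup(ArchiveModuleState *state)
{
    BasicArchiveData *data = palloc0(sizeof(BasicArchiveData));

    /* one-time allocations happen here, not in _PG_archive_module_init() */
    data->context = AllocSetContextCreate(TopMemoryContext,
                                          "basic_archive",
                                          ALLOCSET_DEFAULT_SIZES);
    state->private_data = data;
}

static const ArchiveModuleCallbacks basic_archive_callbacks = {
    .startup_cb = basic_archive_startup,
    .check_configured_cb = basic_archive_configured,   /* defined elsewhere */
    .archive_file_cb = basic_archive_file,             /* defined elsewhere */
    .shutdown_cb = basic_archive_shutdown              /* defined elsewhere */
};

const ArchiveModuleCallbacks *
_PG_archive_module_init(void)
{
    /* just return the callback table; no allocations here */
    return &basic_archive_callbacks;
}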
Re: heapgettup refactoring
v7 attached On Fri, Jan 27, 2023 at 10:34 PM David Rowley wrote: > > On Wed, 25 Jan 2023 at 10:17, Melanie Plageman > wrote: > > I've added a comment but I didn't include the function name in it -- I > > find it repetitive when the comments above functions do that -- however, > > I'm not strongly attached to that. > > I think the general format for header comments is: > > /* > * <function name> > *\t\t<one-line description of what it does> > * > * [Further details] > */ > > We've certainly got places that don't follow that, but I don't think > that's any reason to have no comment or invent some new format. > > heapam.c seems to have some other format where we do: "<function name> > - <description>". I generally just try to copy > the style from the surrounding code. I think generally, people won't > argue if you follow the style from the surrounding code, but there'd > be exceptions to that, I'm sure. I have followed the same convention as the other functions in heapam.c in the various helper functions comments I've added in this version. > v6-0002: > > 1. heapgettup_initial_block() needs a header comment to mention what > it does and what it returns. It would be good to make it obvious that > it returns InvalidBlockNumber when there are no blocks to scan. I've done this. > > 2. After heapgettup_initial_block(), you're checking "if (block == > InvalidBlockNumber)". It might be worth a mention something like > > /* > * Check if we got to the end of the scan already. This could happen for > * an empty relation or if parallel workers scanned everything before we > * got a chance. > */ > > the backward scan comment wouldn't mention parallel workers. I've done this as well. > > v6-0003: > > 3. Can you explain why you removed the snapshot local variable in > heapgettup()? In the subsequent commit, the helpers I add call TestForOldSnapshot(), and I didn't want to pass in the snapshot as a separate parameter since I already need to pass the scan descriptor. I thought it was confusing to have a local variable (snapshot) used in some places and the one in the scan used in others. This "streamlining" commit also reduces the number of times the snapshot variable is used, making it less necessary to have a local variable. I didn't remove the snapshot local variable in the same commit as adding the helpers because I thought it made the diff harder to understand (for review, the final commit should likely not be separate patches). > 4. I think it might be a good idea to use unlikely() in if > (!scan->rs_inited). The idea is to help coax the compiler into moving > that code off to a cold path. That's likely especially important if > heapgettup_initial_block is inlined, which I see it is marked as. I've gone ahead and added unlikely. However, should I perhaps skip inlining the heapgettup_initial_block() function? > v6-0004: > > 5. heapgettup_start_page() and heapgettup_continue_page() both need a > header comment to explain what they do and what the inputs and output > arguments are. I've added these. I've also removed an unused parameter to both, block. > > 6. I'm not too sure what the following comment means: > > /* block and lineoff now reference the physically next tid */ > > "block" is just a parameter to the function and its value is not > adjusted. The comment makes it sound like something was changed. > > (I think these might be just not well updated from having split this > out from the 0006 patch as the same comment makes more sense in 0006) Yes, that is true. I've updated it to just mention lineoff. > v6-0005: > > 7. heapgettup_advance_block() needs a header comment. > > 8.
Is there a reason why heapgettup_advance_block() handle backward > scans first? I'd expect you should just follow the lead of the other > functions and do ScanDirectionIsForward first. The reason I do this is that backwards scans cannot be parallel, so handling backwards scans first let me return, then handle parallel scans, then forward scans. This reduced the level of nesting/if statements for all of the code in this function. - Melanie From 62416fee879a1ea6a7ca7ec132227701d35d7468 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 9 Jan 2023 17:33:43 -0500 Subject: [PATCH v7 2/6] Add heapgettup_initial_block() helper --- src/backend/access/heap/heapam.c | 226 ++- 1 file changed, 102 insertions(+), 124 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 75f68b4e79..ff2e64822a 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -484,6 +484,67 @@ heapgetpage(TableScanDesc sscan, BlockNumber block) } +/* + * heapgettup_initial_block - helper for heapgettup() and heapgettup_pagemode() + * + * Determines and returns the first block to scan, given the scan direction and + * whether or not it is parallel. If the relation is empty or the current + * parallel worker finds the scan has already been completed by other workers, + * return InvalidBlockNumber. + * + * Note th
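As an aside for readers following along with review point 4 above: the unlikely() hint simply marks the not-yet-initialized branch as a cold path so the compiler can move it out of line. A rough sketch of the shape (not the actual patch hunk; the end-of-scan handling here is simplified):

    if (unlikely(!scan->rs_inited))
    {
        /* first call for this scan: choose the starting block (cold path) */
        block = heapgettup_initial_block(scan, dir);

        /*
         * Check if we got to the end of the scan already.  This could happen
         * for an empty relation or if parallel workers scanned everything
         * before we got a chance.
         */
        if (block == InvalidBlockNumber)
            return;             /* nothing to scan */

        scan->rs_inited = true;
    }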
monitoring usage count distribution
My colleague Jeremy Schneider (CC'd) was recently looking into usage count distributions for various workloads, and he mentioned that it would be nice to have an easy way to do $SUBJECT. I've attached a patch that adds a pg_buffercache_usage_counts() function. This function returns a row per possible usage count with some basic information about the corresponding buffers. postgres=# SELECT * FROM pg_buffercache_usage_counts(); usage_count | buffers | dirty | pinned -+-+---+ 0 | 0 | 0 | 0 1 |1436 | 671 | 0 2 | 102 |88 | 0 3 | 23 |21 | 0 4 | 9 | 7 | 0 5 | 164 | 106 | 0 (6 rows) This new function provides essentially the same information as pg_buffercache_summary(), but pg_buffercache_summary() only shows the average usage count for the buffers in use. If there is interest in this idea, another approach to consider could be to alter pg_buffercache_summary() instead. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From b2fd87696185537d2cbb611cf70e743d7e197406 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 27 Jan 2023 16:39:43 -0800 Subject: [PATCH v1 1/1] introduce pg_buffercache_usage_counts() --- .../expected/pg_buffercache.out | 14 +++ .../pg_buffercache--1.3--1.4.sql | 13 +++ contrib/pg_buffercache/pg_buffercache_pages.c | 46 contrib/pg_buffercache/sql/pg_buffercache.sql | 4 + doc/src/sgml/pgbuffercache.sgml | 101 +- 5 files changed, 176 insertions(+), 2 deletions(-) diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out index 635f01e3b2..b745dc69ea 100644 --- a/contrib/pg_buffercache/expected/pg_buffercache.out +++ b/contrib/pg_buffercache/expected/pg_buffercache.out @@ -17,6 +17,12 @@ from pg_buffercache_summary(); t| t| t (1 row) +SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0; + ?column? +-- + t +(1 row) + -- Check that the functions / views can't be accessed by default. To avoid -- having to create a dedicated user, use the pg_database_owner pseudo-role. SET ROLE pg_database_owner; @@ -26,6 +32,8 @@ SELECT * FROM pg_buffercache_pages() AS p (wrong int); ERROR: permission denied for function pg_buffercache_pages SELECT * FROM pg_buffercache_summary(); ERROR: permission denied for function pg_buffercache_summary +SELECT * FROM pg_buffercache_usage_counts(); +ERROR: permission denied for function pg_buffercache_usage_counts RESET role; -- Check that pg_monitor is allowed to query view / function SET ROLE pg_monitor; @@ -41,3 +49,9 @@ SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary(); t (1 row) +SELECT count(*) > 0 FROM pg_buffercache_usage_counts(); + ?column? +-- + t +(1 row) + diff --git a/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql b/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql index 8f212dc5e9..f4702e4b4b 100644 --- a/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql +++ b/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql @@ -15,3 +15,16 @@ LANGUAGE C PARALLEL SAFE; -- Don't want these to be available to public. REVOKE ALL ON FUNCTION pg_buffercache_summary() FROM PUBLIC; GRANT EXECUTE ON FUNCTION pg_buffercache_summary() TO pg_monitor; + +CREATE FUNCTION pg_buffercache_usage_counts( +OUT usage_count int4, +OUT buffers int4, +OUT dirty int4, +OUT pinned int4) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'pg_buffercache_usage_counts' +LANGUAGE C PARALLEL SAFE; + +-- Don't want these to be available to public. 
+REVOKE ALL ON FUNCTION pg_buffercache_usage_counts() FROM PUBLIC; +GRANT EXECUTE ON FUNCTION pg_buffercache_usage_counts() TO pg_monitor; diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 1c6a2f22ca..f333967c51 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -18,6 +18,7 @@ #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 #define NUM_BUFFERCACHE_PAGES_ELEM 9 #define NUM_BUFFERCACHE_SUMMARY_ELEM 5 +#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4 PG_MODULE_MAGIC; @@ -61,6 +62,7 @@ typedef struct */ PG_FUNCTION_INFO_V1(pg_buffercache_pages); PG_FUNCTION_INFO_V1(pg_buffercache_summary); +PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); Datum pg_buffercache_pages(PG_FUNCTION_ARGS) @@ -304,3 +306,47 @@ pg_buffercache_summary(PG_FUNCTION_ARGS) PG_RETURN_DATUM(result); } + +Datum +pg_buffercache_usage_counts(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + int usage_counts[BM_MAX_USAGE_COUNT + 1] = {0}; + int dirty[BM_MAX_USAGE_COUNT + 1] = {0}; + int pinned[BM_MAX_USAGE_COUNT + 1] = {0}; +
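For comparison with the pg_buffercache_usage_counts() proposal above, a similar per-usage-count breakdown can already be computed from the existing pg_buffercache view, at the cost of materializing one row per buffer; note that unused buffers show up there with a NULL usagecount rather than as a separate bucket:

SELECT usagecount AS usage_count,
       count(*) AS buffers,
       count(*) FILTER (WHERE isdirty) AS dirty,
       count(*) FILTER (WHERE pinning_backends > 0) AS pinned
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;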
Re: Making background psql nicer to use in tap tests
Hi, On 2023-01-30 15:06:46 -0500, Tom Lane wrote: > Andres Freund writes: > > It's annoyingly hard to wait for the result of a query in a generic way with > > background_psql(), and more generally for psql. background_psql() uses > > -XAtq, > > which means that we'll not get "status" output (like "BEGIN" or "(1 row)"), > > and that queries not returning anything are completely invisible. > > Yeah, the empty-query-result problem was giving me fits recently. > +1 for wrapping this into something more convenient to use. I've hacked some on this. I first tried to just introduce a few helper functions in Cluster.pm, but that ended up being awkward. So I bit the bullet and introduced a new class (in BackgroundPsql.pm), and made background_psql() and interactive_psql() return an instance of it. This is just a rough prototype. Several function names don't seem great, it needs POD documentation, etc. The main convenience things it has over the old interface: - $node->background_psql('dbname') is enough - $psql->query(), which returns the query results as a string, is a lot easier to use than having to pump, identify query boundaries via regex etc. - $psql->query_safe(), which dies if any query fails (detected via stderr) - $psql->query_until() is a helper that makes it a bit easier to start queries that won't finish until a later point I don't quite like the new interface yet: - It's somewhat common to want to know if there was a failure, but also get the query result, not sure what the best function signature for that is in perl. - query_until() sounds a bit too much like $node->poll_query_until(). Maybe query_wait_until() is better? OTOH, the other function has poll in the name, so maybe it's ok. - right now there's a bit too much logic in background_psql() / interactive_psql() for my taste Those points aside, I think it already makes the tests a good bit more readable. My WIP vacuum_defer_cleanup_age patch shrunk by half with it. I think with a bit more polish it's easy enough to use that we could avoid a good number of those one-off psql's that we do all over. I didn't really know what this, in src/test/subscription/t/015_stream.pl, is about: $h->finish;# errors make the next test fail, so ignore them here There's no further test? I'm somewhat surprised it doesn't cause problems in another ->finish later on, where we then afterwards just use $h again. Apparently IPC::Run just automagically restarts psql?
Greetings, Andres Freund >From fb0e9fceead01aaee0bdd05c3ad6c67814f6b820 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 30 Jan 2023 15:39:08 -0800 Subject: [PATCH v1] WIP: test: Introduce BackgroundPsql class Discussion: https://postgr.es/m/882122.1675109...@sss.pgh.pa.us --- src/bin/psql/t/010_tab_completion.pl | 27 +-- contrib/amcheck/t/003_cic_2pc.pl | 66 +++ .../perl/PostgreSQL/Test/BackgroundPsql.pm| 165 ++ src/test/perl/PostgreSQL/Test/Cluster.pm | 83 ++--- src/test/recovery/t/031_recovery_conflict.pl | 101 +++ src/test/subscription/t/015_stream.pl | 52 ++ 6 files changed, 252 insertions(+), 242 deletions(-) create mode 100644 src/test/perl/PostgreSQL/Test/BackgroundPsql.pm diff --git a/src/bin/psql/t/010_tab_completion.pl b/src/bin/psql/t/010_tab_completion.pl index 7746c75e0c9..7c382b98f8f 100644 --- a/src/bin/psql/t/010_tab_completion.pl +++ b/src/bin/psql/t/010_tab_completion.pl @@ -92,14 +92,7 @@ print $FH "other stuff\n"; close $FH; # fire up an interactive psql session -my $in = ''; -my $out = ''; - -my $timer = timer($PostgreSQL::Test::Utils::timeout_default); - -my $h = $node->interactive_psql('postgres', \$in, \$out, $timer); - -like($out, qr/psql/, "print startup banner"); +my $h = $node->interactive_psql('postgres'); # Simple test case: type something and see if psql responds as expected sub check_completion @@ -109,15 +102,12 @@ sub check_completion # report test failures from caller location local $Test::Builder::Level = $Test::Builder::Level + 1; - # reset output collector - $out = ""; # restart per-command timer - $timer->start($PostgreSQL::Test::Utils::timeout_default); - # send the data to be sent - $in .= $send; - # wait ... - pump $h until ($out =~ $pattern || $timer->is_expired); - my $okay = ($out =~ $pattern && !$timer->is_expired); + $h->{timeout}->start($PostgreSQL::Test::Utils::timeout_default); + + # send the data to be sent and wait for its result + my $out = $h->query_until($pattern, $send); + my $okay = ($out =~ $pattern && !$h->{timeout}->is_expired); ok($okay, $annotation); # for debugging, log actual output if it didn't match local $Data::Dumper::Terse = 1; @@ -443,10 +433,7 @@ check_completion("blarg \t\t", qr//, "check completion failure path"); clear_query(); # send psql an explicit \q to shut it down, else pty won't close properly -$timer->start($PostgreSQL::Test::Utils::timeout_default); -$in .= "\\q\n"; -fin
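For a feel of what the proposed interface buys in practice, a test using it might look roughly like this (method names follow the WIP patch; the surrounding scaffolding is illustrative only):

# Sketch of a TAP test using the proposed BackgroundPsql API.
use PostgreSQL::Test::Cluster;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->start;

# one long-lived psql session instead of many one-off psql invocations
my $psql = $node->background_psql('postgres');

# dies if the statement reports anything on stderr
$psql->query_safe('CREATE TABLE t AS SELECT generate_series(1, 10) AS i');

# query results come back as a plain string, no pumping or regexes needed
my $count = $psql->query('SELECT count(*) FROM t');
is($count, '10', 'background session returned the expected result');

$node->stop;
done_testing();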
Re: generate_series for timestamptz and time zone problem
Gurjeet Singh writes: > [ generate_series_with_timezone.v6.patch ] The cfbot isn't terribly happy with this. It looks like UBSan is detecting some undefined behavior. Possibly an uninitialized variable? regards, tom lane
Re: Perform streaming logical transactions by background workers and parallel apply
On Mon, Jan 30, 2023 at 5:23 PM houzj.f...@fujitsu.com wrote: > > On Monday, January 30, 2023 12:13 PM Peter Smith > wrote: > > > > Here are my review comments for v88-0002. > > Thanks for your comments. > > > > > == > > General > > > > 1. > > The test cases are checking the log content but they are not checking for > > debug logs or untranslated elogs -- they are expecting a normal ereport LOG > > that might be translated. I’m not sure if that is OK, or if it is a > > potential problem. > > We have tests that check the ereport ERROR and ereport WARNING message(by > search for the ERROR or WARNING keyword for all the tap tests), so I think > checking the LOG should be fine. > > > == > > doc/src/sgml/config.sgml > > > > 2. > > On the publisher side, logical_replication_mode allows allows streaming or > > serializing changes immediately in logical decoding. When set to immediate, > > stream each change if streaming option (see optional parameters set by > > CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change. When set > > to buffered, the decoding will stream or serialize changes when > > logical_decoding_work_mem is reached. > > > > 2a. > > typo "allows allows" (Kuroda-san reported same) > > > > 2b. > > "if streaming option" --> "if the streaming option" > > Changed. Although you replied "Changed" for the above, AFAICT my review comment #2b. was accidentally missed. Otherwise, the patch LGTM. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Introduce "log_connection_stages" setting.
Thanks for updating the patch. It's currently failing check-world, due to a test that was added on January 23 (a9dc7f941): http://cfbot.cputube.org/sergey-dudoladov.html [19:15:57.101] Summary of Failures: [19:15:57.101] [19:15:57.101] 250/251 postgresql:ldap / ldap/002_bindpasswd ERROR 1.38s 2023-01-30 19:15:52.427 GMT [56038] LOG: unrecognized configuration parameter "log_connections" in file "/tmp/cirrus-ci-build/build/testrun/ldap/002_bindpasswd/data/t_002_bindpasswd_node_data/pgdata/postgresql.conf" line 839 > +received, but no further work has been done: Postgres is about to > start say "PostgreSQL" to match the rest of the docs. > + GUC_check_errmsg("Invalid value '%s'", stage); This message shouldn't be uppercased. > + GUC_check_errdetail("Valid values to use in the list > are 'received', 'authenticated', 'authorized', 'disconnected', and 'all'." > + "If 'all' is present, it must be the only value in the > list."); Maybe "all" should be first ? There's no spaces before "If": | 2023-01-31 00:17:48.906 GMT [5676] DETALLE: Valid values to use in the list are 'received', 'authenticated', 'authorized', 'disconnected', and 'all'.If 'all' is present, it must be the only value in the list. > +/* flags for logging information about session state */ > +int Log_connection_messages = LOG_CONNECTION_ALL; The initial value here is overwritten by the GUC default during startup. For consistency, the integer should be initialized to 0. > +extern PGDLLIMPORT int Log_connection_messages; > + > +/* Bitmap for logging connection messages */ > +#define LOG_CONNECTION_RECEIVED 1 > +#define LOG_CONNECTION_AUTHENTICATED 2 > +#define LOG_CONNECTION_AUTHORIZED 4 > +#define LOG_CONNECTION_DISCONNECTED 8 > +#define LOG_CONNECTION_ALL15 Maybe the integers should be written like (1<<0).. And maybe ALL should be 0x ? More nitpicks: > +Causes stages of each attempted connection to the server to be > logged. Example: authorized,disconnected. "Causes the specified stages of each connection attempt .." > +The default is the empty string meaning nothing is logged. ".. string, which means that nothing is logged" > +Logs the original identity that an authentication method > employs > +to identify a user. In most cases, the identity string matches the ".. original identity used by an authentication method ..' -- Justin
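Spelled out, the flag style suggested above would be something like the following (a sketch of the review suggestion, not the patch itself; whether "all" is an explicit OR of the bits or a catch-all mask such as 0xffff is a separate style choice):

/* Bitmask for logging connection messages, written as shifts */
#define LOG_CONNECTION_RECEIVED        (1 << 0)
#define LOG_CONNECTION_AUTHENTICATED   (1 << 1)
#define LOG_CONNECTION_AUTHORIZED      (1 << 2)
#define LOG_CONNECTION_DISCONNECTED    (1 << 3)

#define LOG_CONNECTION_ALL \
    (LOG_CONNECTION_RECEIVED | LOG_CONNECTION_AUTHENTICATED | \
     LOG_CONNECTION_AUTHORIZED | LOG_CONNECTION_DISCONNECTED)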
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
Hi Andres, On Tue, 31 Jan 2023 at 06:31, Nathan Bossart wrote: > > On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote: > > Why don't the dblink tests catch this? Any chance you or Robins could > > prepare > > a patch with fix and test, given that you know how to trigger this? > > It's trivially reproducible by calling 1-argument dblink_connect() multiple > times and then calling dblink_disconnect(). Here's a patch. > My test instance has been running Nathan's patch for the past few hours, and it looks like this should resolve the issue. A little bit of background, since the past few days I was noticing frequent crashes on a test instance. They're not simple to reproduce manually, but if left running I can reliably see crashes on an ~hourly basis. In trying to trace the origin, I ran a multiple-hour test for each commit going back a few commits and noticed that the crashes stopped prior to commit e4602483e9, which is when the issue became visible. The point is now moot, but let me know if full backtraces still help (I was concerned looking at the excerpts from some of the crashes): "double free or corruption (out)" "munmap_chunk(): invalid pointer" "free(): invalid pointer" str->data[0] = '\0'; Some backtraces ### Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u21 postgres 127.0.0.1(45334) SELECT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102 #0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102 #1 0x7fc0062dfefd in pqDropConnection (conn=0x564bb3e1a080, flushInput=true) at fe-connect.c:495 #2 0x7fc0062e5cb3 in closePGconn (conn=0x564bb3e1a080) at fe-connect.c:4112 #3 0x7fc0062e5d55 in PQfinish (conn=0x564bb3e1a080) at fe-connect.c:4134 #4 0x7fc0061a442b in libpqsrv_disconnect (conn=0x564bb3e1a080) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 #5 0x7fc0061a4df1 in dblink_disconnect (fcinfo=0x564bb3d980f0) at dblink.c:357 #6 0x564bb0e70aa7 in ExecInterpExpr (state=0x564bb3d98018, econtext=0x564bb3d979a0, isnull=0x7ffd60824b0f) at execExprInterp.c:728 #7 0x564bb0e72f36 in ExecInterpExprStillValid (state=0x564bb3d98018, econtext=0x564bb3d979a0, isNull=0x7ffd60824b0f) at execExprInterp.c:1838 Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u52 postgres 127.0.0.1(58778) SELECT '. Program terminated with signal SIGABRT, Aborted. #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x7fc021792859 in __GI_abort () at abort.c:79 #2 0x7fc0217fd26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fc021927298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155 #3 0x7fc0218052fc in malloc_printerr ( str=str@entry=0x7fc021929670 "double free or corruption (out)") at malloc.c:5347 #4 0x7fc021806fa0 in _int_free (av=0x7fc02195cb80 , p=0x7fc02195cbf0 , have_lock=) at malloc.c:4314 #5 0x7fc0062e16ed in freePGconn (conn=0x564bb6e36b80) at fe-connect.c:3977 #6 0x7fc0062e1d61 in PQfinish (conn=0x564bb6e36b80) at fe-connect.c:4135 #7 0x7fc0062a142b in libpqsrv_disconnect (conn=0x564bb6e36b80) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 #8 0x7fc0062a1df1 in dblink_disconnect (fcinfo=0x564bb5647b58) at dblink.c:357 Program terminated with signal SIGSEGV, Segmentation fault. 
#0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153 153 str->data[0] = '\0'; #0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153 #1 0x7f2240b0a876 in PQexecStart (conn=0x559d3af0e410) at fe-exec.c:2320 #2 0x7f2240b0a688 in PQexec (conn=0x559d3af0e410, query=0x559d56fb8ee8 "p3") at fe-exec.c:2227 #3 0x7f223ba8c7e4 in dblink_exec (fcinfo=0x559d3b101f58) at dblink.c:1432 #4 0x559d2f003c82 in ExecInterpExpr (state=0x559d3b101ac0, econtext=0x559d34e76578, isnull=0x7ffe3d590fdf) at execExprInterp.c:752 Core was generated by `postgres: 728f86fec6@(HEAD detached at 728f86fec6)@sqith: u19 postgres 127.0.0.'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x7f96f5fc6e99 in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1 #0 0x7f96f5fc6e99 in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1 #1 0x7f96da56027a in pgtls_close (conn=0x55d919752fb0) at fe-secure-openssl.c:1555 #2 0x7f96da558e41 in pqsecure_close (conn=0x55d919752fb0) at fe-secure.c:192 #3 0x7f96da53dd12 in pqDropConnection (conn=0x55d919752fb0, flushInput=true) at fe-connect.c:449 #4 0x7f96da543cb3 in closePGconn (conn=0x55d919752fb0) at fe-connect.c:4112 #5 0x7f96da543d55 in PQfinish (conn=0x55d919752fb0) at fe-connect.c:
Re: pg_upgrade test failure
On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro wrote: > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund wrote: > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote: > > > So [1] on its own didn't fix this. My next guess is that the attached > > > might help. > > > What is our plan here? This afaict is the most common "false positive" for > > cfbot in the last weeks. I pushed the rmtree() change. Let's see if that helps, or tells us something new. Michael: There were some questions from Andres above. FWIW I think if you wanted to investigate this properly on a local Windows system to chase down who's got the file open (shutdown sequence problem or whatever), you'd probably have to install Server 2019, or maybe use an old 8.1 VM if you still have such a thing, based on the suspicion that typical 10 and 11 systems won't exhibit the problem. But then I could be wrong about what's going on...
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Hello! On 24.11.2022 04:02, Thomas Munro wrote: On Thu, Nov 24, 2022 at 11:05 AM Tom Lane wrote: Thomas Munro writes: ERROR: calculated CRC checksum does not match value stored in file The attached draft patch fixes it. Tried to catch this error on my PC, but failed to do it within a reasonable time. The 1s interval is probably too long for me. It seems there is a more reliable way to reproduce this bug with the 0001 patch applied: At the first backend: do $$ begin loop perform pg_update_control_file(); end loop; end; $$; At the second one: do $$ begin loop perform pg_control_system(); end loop; end; $$; It fails almost immediately with: "ERROR: calculated CRC checksum does not match value stored in file" both with fsync = off and fsync = on. Checked it out for master and REL_11_STABLE. Also checked for a few hours that the patch 0002 fixes this error, but there are some questions about its logical structure. The equality between the previous and newly calculated crc is checked only if the last crc calculation was wrong, i.e. not equal to the value stored in the file. It is very unlikely that in this case the previous and new crc can match, so, in fact, the loop will spin until the crc is calculated correctly. In other words, this algorithm is practically equivalent to an infinite loop of reading from the file and calculating the crc while(EQ_CRC32C(crc, ControlFile->crc) != true). But then it can be simplified significantly by removing the checksum equality checks and the bool first_try, and by limiting the maximum number of iterations with some constant, e.g. in a for loop, to avoid a theoretically possible freeze. Or maybe use the variant suggested by Tom Lane, i.e., as far as I understand, repeat the file_open-read-close-calculate_crc sequence twice without a pause between them and check both calculated crcs for equality. If they match, exit and return the bool result of comparing the last calculation with the value from the file; if not, take a pause and repeat everything from the beginning. In this case, there is no need to check *crc_ok_p inside get_controlfile() as in the present version; I think it's more logically correct since this variable is intended for top-level functions and the logic inside get_controlfile() should not depend on its value. Also found a warning in the 0001 patch for the master branch. On my PC gcc gives: xlog.c:2507:1: warning: no previous prototype for ‘pg_update_control_file’ [-Wmissing-prototypes] 2507 | pg_update_control_file() Fixed it by adding #include "utils/fmgrprotos.h" to xlog.c and adding PG_FUNCTION_ARGS to pg_update_control_file(). With the best wishes, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
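For clarity, the read-twice variant described in the last part of the message boils down to something like the following; all of the names here (read_and_checksum() in particular) are made up for illustration and do not come from the 0001/0002 patches:

/* Sketch of the "read twice, compare, retry" approach (names invented). */
static bool
read_controlfile_with_retry(const char *datadir, ControlFileData *control)
{
    for (int attempt = 0; attempt < 10; attempt++)
    {
        pg_crc32c   crc1 = read_and_checksum(datadir, control);
        pg_crc32c   crc2 = read_and_checksum(datadir, control);

        /* two consecutive reads agree: the read was not torn */
        if (EQ_CRC32C(crc1, crc2))
            return EQ_CRC32C(crc2, control->crc);

        pg_usleep(10 * 1000L);  /* likely a torn read: back off and retry */
    }

    return false;               /* give up and report a mismatch */
}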
Re: Assertion failure in SnapBuildInitialSnapshot()
On Mon, Jan 30, 2023 at 8:24 PM Amit Kapila wrote: > > On Thu, Dec 8, 2022 at 8:17 AM Masahiko Sawada wrote: > > > > On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila wrote: > > > > One idea to fix this issue is that in > > ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin > > while holding both ProcArrayLock and ReplicationSlotControlLock, and > > release only ReplicationSlotsControlLock before updating the > > replication_slot_xmin. I'm concerned it will increase the contention > > on ProcArrayLock but I've attached the patch for discussion. > > > > But what kind of workload are you worried about? This will be called > while processing XLOG_RUNNING_XACTS to update > procArray->replication_slot_xmin/procArray->replication_slot_catalog_xmin > only when required. So, if we want we can test some concurrent > workloads along with walsenders doing the decoding to check if it > impacts performance. > I was slightly concerned about holding ProcArrayLock while iterating over replication slots, especially when there are many replication slots in the system. But you're right; we need it only when processing XLOG_RUNNING_XACTS, and that's not frequent. So it introduces no visible overhead, or at most a negligible one. > What other way we can fix this? Do you think we can try to avoid > retreating xmin values in ProcArraySetReplicationSlotXmin() to avoid > this problem? Personally, I think taking the lock as proposed by your > patch is a better idea. Agreed. > BTW, this problem seems to be only logical > replication specific, so if we are too worried then we can change this > locking only for logical replication. Yes, but I agree that there won't be a big overhead by this fix. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
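Schematically, the locking change under discussion looks like this; the loop body is elided and only the lock ordering is the point (ProcArraySetReplicationSlotXmin() does already take an already_locked flag, the rest is a sketch rather than the attached patch):

    TransactionId  agg_xmin = InvalidTransactionId;
    TransactionId  agg_catalog_xmin = InvalidTransactionId;

    /* hold ProcArrayLock across both the computation and the update */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    for (int i = 0; i < max_replication_slots; i++)
    {
        /* fold each in-use slot's effective_xmin / effective_catalog_xmin
         * into agg_xmin / agg_catalog_xmin (details elided) */
    }

    LWLockRelease(ReplicationSlotControlLock);

    /* still holding ProcArrayLock, hence already_locked = true */
    ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, true);

    LWLockRelease(ProcArrayLock);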
Scan buffercache for a table
Hi, I am looking for function calls to scan the buffer cache for a table and find the cached pages. I want to find out which pages are cached and which of them are dirty. Having the relation id, how can I do that? I have gone through bufmgr.c and relcache.c, but could not find a way to get relation-specific pages from the buffer cache. Thank you!
Re: Allow an extention to be updated without a script
Hi, Thank you for your comment. On Mon, 30 Jan 2023 16:05:52 -0500 Tom Lane wrote: > Yugo NAGATA writes: > > Currently, even when we don't need to execute any command to update an > > extension from one version to the next, we need to provide an update > > script that doesn't contain any command. Preparing such meaningless > > files is sometimes annoying. > > If you have no update script, why call it a new version? The point > of extension versions is to distinguish different states of the > extension's SQL objects. We do not consider mods in underlying C code > to justify a new version. Although, as you say, the extension manager doesn't track changes in C code functions, a new version could be released with changes only in C functions that implement a new feature or fix some bugs, because it looks like a new version from the user's view. Actually, there are several extensions that provide empty update scripts in order to allow users to install such new versions, for example; - pglogical (https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical--2.4.1--2.4.2.sql) - hll (https://github.com/citusdata/postgresql-hll/blob/master/update/hll--2.16--2.17.sql) - orafce (https://github.com/orafce/orafce/blob/master/orafce--3.12--3.13.sql) - hypopg (https://github.com/HypoPG/hypopg/blob/REL1_STABLE/hypopg--1.3.1--1.3.2.sql) - timescaledb (https://github.com/timescale/timescaledb/blob/main/sql/updates/2.9.2--2.9.1.sql) -- Yugo NAGATA
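For reference, the workaround those extensions use is just an upgrade script with no commands in it, shipped alongside a bumped default_version; with a hypothetical extension foo it looks like this:

-- foo.control bumps the version even though no SQL objects changed:
--     default_version = '1.1'
--
-- foo--1.0--1.1.sql then contains nothing but the usual psql guard:

/* complain if script is sourced in psql, rather than via ALTER EXTENSION */
\echo Use "ALTER EXTENSION foo UPDATE TO '1.1'" to load this file. \quit

-- no commands follow: only the extension's C code changed between 1.0 and 1.1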
Re: Scan buffercache for a table
On Mon, Jan 30, 2023 at 06:01:08PM -0800, Amin wrote: > Hi, > > I am looking for function calls to scan the buffer cache for a table and > find the cached pages. I want to find out which pages are cached and which > of them are dirty. Having the relation id, how can I do that? I have gone > through bufmgr.c and relcache.c, but could not find a way to get > relation-specific pages from the buffer cache. This looks like a re-post of the question you asked on Jan 13: caf-ka8_axsmpqw1scotnaqx8nfhgmjc6l87qzao3jezlibu...@mail.gmail.com It'd be better not to start a new thread (or if you do that, it'd be better to mention the old one and include its participants). On Fri, Jan 13, 2023 at 05:28:31PM -0800, Amin wrote: > Hi, > > Before scanning a relation, in the planner stage, I want to make a > call to > retrieve information about how many pages will be a hit for a specific > relation. The module pg_buffercache seems to be doing a similar thing. The planner is a *model* which (among other things) tries to guess how many pages will be read/hit. It's not expected to be anywhere near accurate. pg_buffercache only checks for pages within postgres' own buffer cache. It doesn't look for pages which are in the OS page cache, which require a system call to access (but don't require device I/O). Read about pgfincore for introspection of the OS page cache. > Also, pg_statio_all_tables seems to be having that information, but it > is updated after execution. However, I want the information before > execution. Also not sure how pg_statio_all_tables is created and how > I can access it in the code. But the view isn't omniscient. When you execute a plan, you don't know how it's going to end. If you did, you wouldn't need to run it - you could just print the answer. Note that planning and execution are separate and independent. It's possible to plan a query without ever running it, or to plan it once and run it multiple times. The view reflects I/O requested by postgres; the I/O normally comes primarily from execution. You can look at how the view is defined: \sv pg_statio_all_tables And then you can look at how the functions that it calls are implemented (\df+). Same for pg_buffercache. It seems like you'll want to learn how to navigate the source code to find how things are connected. -- Justin
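To answer the original question at the SQL level rather than in C: with the pg_buffercache extension installed, the cached (and dirty) pages of one relation in the current database can be listed like this (my_table is a placeholder):

SELECT b.relforknumber, b.relblocknumber, b.isdirty
FROM pg_buffercache b
WHERE b.relfilenode = pg_relation_filenode('my_table')
  AND b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
ORDER BY b.relforknumber, b.relblocknumber;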
RE: Perform streaming logical transactions by background workers and parallel apply
On Monday, January 30, 2023 10:20 PM Masahiko Sawada wrote: > > > I have one comment on v89 patch: > > + /* > +* Using 'immediate' mode returns false to cause a switch to > +* PARTIAL_SERIALIZE mode so that the remaining changes will > be serialized. > +*/ > + if (logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE) > + return false; > + > > Probably we might want to add unlikely() here since we could pass through this > path very frequently? I think your comment makes sense, thanks. I updated the patch for the same. Best Regards, Hou zj v90-0001-Extend-the-logical_replication_mode-to-test-the-.patch