Re: RISC-V animals sporadically produce weird memory-related failures
Hello Tom,

17.11.2024 20:28, Tom Turelinckx wrote:
> I have now done just that, but on a new HiFive Premier P550 board [2]. It is running Ubuntu 24.04 LTS with a board-specific kernel, currently 6.6.21-9-premier (2024-11-09). The buildfarm client is executing within a Debian Trixie container created from the official Debian repo. This stack is a lot more recent, should be more future-proof, and the board is significantly faster too. Boomslang has already built all branches and copperhead is currently going through them.

Thank you for upgrading these machines! Could you please take a look at the new failures produced by copperhead recently?

[1]
2024-11-30 19:34:53.302 CET [13395:4] LOG:  server process (PID 13439) was terminated by signal 11: Segmentation fault
2024-11-30 19:34:53.302 CET [13395:5] DETAIL:  Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, BOOLTBL2.* FROM BOOLTBL1, BOOLTBL2 WHERE BOOLTBL2.f1 <> BOOLTBL1.f1;

[2]
2024-11-30 19:54:11.478 CET [27560:15] LOG:  server process (PID 28459) was terminated by signal 11: Segmentation fault
2024-11-30 19:54:11.478 CET [27560:16] DETAIL:  Failed process was running: SELECT count(*) FROM test_tsvector WHERE a @@ any ('{wr,qh}');

These crashes are hardly related to code changes, so maybe there are still platform-specific issues... I've run 100 iterations of `make check` for REL_13_STABLE on trixie/sid 6.8.12-riscv64 (gcc 14.2.0), emulated with qemu-system-riscv64, with no failures.

Unfortunately, the saved log files don't include coredump information, maybe because of an inappropriate core_pattern. (Previously, a stack trace was extracted in case of a crash: [3].)

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A16%3A37
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A35%3A17
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-09-03%2016%3A38%3A46

Best regards,
Alexander
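For the record, the usual knobs for getting core files written on such a machine look like this (the paths and values here are illustrative, not what copperhead actually uses):

```shell
# Allow core files to be written at all; the buildfarm client must
# inherit this limit from the shell/service that starts it.
ulimit -c unlimited

# Have the kernel write cores to a known directory with informative
# names (%e = executable, %p = pid); requires root.
mkdir -p /var/tmp/cores
sysctl -w kernel.core_pattern=/var/tmp/cores/core.%e.%p
```

With a core_pattern like this in place, the buildfarm client's gdb-based crash reporting can find the core files again.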
Re: CREATE SCHEMA ... CREATE DOMAIN support
On 02/12/2024 17:56, Tom Lane wrote:
> Vik Fearing writes:
>> On 02/12/2024 03:15, Tom Lane wrote:
>>> Also, if SQL intended to constrain the search path for unqualified identifiers to be only the new schema, they'd hardly need a concept of <schema path specification> at all.
>> I looked up the original paper (MUN-051) that introduced the <schema path specification> and it says, "The paper is proposing the use of paths only to resolve unqualified routine invocations."
> Interesting. The standard actually does say that that is what it is for. Section 11.1 SR 8: "The <schema name list> of the explicit or implicit <schema path specification> is used as the SQL-path of the schema. The SQL-path is used to effectively qualify unqualified <routine name>s that are immediately contained in <routine invocation>s that are contained in the <schema definition>."
> But still, the spec allows <schema routine>s within <schema definition>s, so even that narrow interpretation opens them to the is-this-an-external-reference-or-a-forward-reference problem.

Surely that is determined by the placement of the schema in its own SQL-path.

> For us, that's clouded further for functions by our overloading rules. If foo(bigint) exists in the search path, and we have a view or whatever that references foo() with an int argument, and there is a CREATE FUNCTION for foo(float8) later in the <schema definition>, what are we supposed to think is the user's intent? (Just to save people doing the experiment: we'd prefer foo(float8) if both are visible, but foo(bigint) would be perfectly acceptable if not. Other choices of the argument types would yield different results, and none of them seem especially open-and-shut to me.)

My answer is the same as above, for unqualified names. However, since there is nothing that says anything either way about forward references, my preference is to just execute them all in the order written. In your example, that would mean choosing otherschema.foo(bigint) over thisschema.foo(float8) if the latter hasn't been created yet.

> I don't know offhand if the spec allows function overloading in the same way.
Feature T-321 has a note saying, "Support for overloaded functions and procedures is not part of Core SQL." -- Vik Fearing
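The ambiguity under discussion is easy to see in a sketch (the schema and function names are hypothetical, and PostgreSQL does not currently accept CREATE FUNCTION as a schema element, so this is spec-style pseudocode rather than working syntax):

```sql
-- Assume otherschema.foo(bigint) already exists and otherschema is on
-- the session's search path when this statement runs.
CREATE SCHEMA thisschema
    CREATE VIEW v AS
        SELECT foo(1)            -- external reference to otherschema.foo(bigint),
                                 -- or forward reference to the foo defined below?
    CREATE FUNCTION foo(float8) RETURNS float8
        RETURN $1;
```

Under the execute-in-order-written reading above, v would bind to otherschema.foo(bigint), because thisschema.foo(float8) does not exist yet when the view is created.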
Re: Proposal: Role Sandboxing for Secure Impersonation
On 12/2/24 08:41, Eric Hanson wrote:
> Hi all,
>
> I'd like to revisit a previously discussed feature [1]: PostgreSQL could benefit from a "role sandbox", a feature that would build on SET [LOCAL] ROLE and prevent or restrict RESET ROLE.
>
> Rationale:
>
> Connection pooling is widely used to optimize database performance by reducing use of memory, process creation, etc. However, connection pools typically operate on a "pool-per-role" basis, because each connection is bound to a single role and can't be reused by another role. For systems that make use of many roles, this limits the effectiveness of connection pooling because each role has its own "pool space" and max_connections puts a hard limit on how many connections can exist.
>
> To work around this, projects (e.g. PostgREST) employ the "user impersonation" pattern:
>
> - All connections use a shared "authenticator" role
> - When a user (e.g. Alice) sends a request to the connection pooler, it temporarily sets the role using: SET [LOCAL] ROLE alice;
> - After processing Alice's request, the session resets the role back to the "authenticator" role by either issuing a "RESET ROLE" or ending the "local" transaction.
>
> This approach works well in theory, but poses a significant security concern: RESET ROLE allows a client to reset the role back to the "authenticator" role, *before* handing the session back to the pooler. Any SQL injection vulnerability or anything else that allows arbitrary SQL allows the client to issue a `RESET ROLE; SET ROLE anybody_else;`, bypassing authentication. Depending on the privileges of the "authenticator" role, the client can become any other user, or worse.
>
> Proposal:
>
> What if PostgreSQL had a "role sandbox", a state where RESET ROLE was prohibited or restricted? If PostgreSQL could guarantee that RESET ROLE was not allowed, even SQL injection vulnerabilities would not allow a client to bypass database privileges and RLS when using user impersonation. Systems with many roles could safely and efficiently use many roles in parallel with connection pooling. The feature probably has other applications as well.
>
> Sandboxing could happen at the session level, or the transaction level; both seem to have benefits. Here are some syntax ideas floating around:
>
> SET ROLE IDEAS
>
> a) Transaction ("local") Sandbox:
>    - SET LOCAL ROLE alice NO RESET;
>    - SET LOCAL ROLE alice WITHOUT RESET;
>    - BEGIN AS ROLE alice;
>
> Transaction-level sandboxes have the benefit that a pooler can simply start a new sandboxed transaction for each request and never have to worry about resetting or reusing them.
>
> b) Session Sandbox:
>    - SET ROLE alice NO RESET;
>    - SET ROLE alice WITHOUT RESET;
>    - SET UNRESETTABLE ROLE alice; --veto
>
> Session-level sandboxes have the benefit that they can do things that can't be done inside a transaction (e.g. create extensions, vacuum, analyze, etc.). It's a fully functional session. However, if RESET ROLE is prohibited for the rest of the session, a connection pooler couldn't reuse it.
>
> c) "Guarded" Transaction/Session:
>    - SET [LOCAL] ROLE alice GUARDED BY reset_token;
>    - RESET ROLE WITH TOKEN reset_token;
>
> Guarded sandboxes are nice because the session can also exit the sandbox if it has the token.
>
> Another aspect of this is SET SESSION AUTHORIZATION. I don't see preventing reset as particularly useful, at least for connection poolers, since the session then couldn't be reused. However, the GUARDED BY token idea would make it restricted but not prevented, which could be useful.
>
> I'd love to hear your thoughts on this feature.

I am very much in favor of functionality of this sort being built in to the core database. Very similar functionality is available in an extension I wrote years ago (without the SQL grammar support) -- see https://github.com/pgaudit/set_user

I have never proposed it (or maybe I did years ago, don't actually remember) because I did not think the community was interested in this approach, but perhaps the time is ripe to discuss it.
-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: revamp row-security tracking
On Fri, Nov 29, 2024 at 10:01:41AM +, Dean Rasheed wrote: > On Thu, 21 Nov 2024 at 18:00, Nathan Bossart wrote: >> The attached patch accomplishes this by establishing a global queue of >> row-security "nest levels" that the aforementioned higher-level callers can >> use. > > I'm not convinced that this is an improvement. Thanks for reviewing. > The code in check_sql_fn_retval() is building a Query struct from > scratch, so it seems perfectly natural for it to be responsible for > setting all the required fields, based on the information it has > available. With this patch, check_sql_fn_retval() is returning a > potentially incorrectly marked Query at the end of the querytree list, > which the caller is responsible for fixing up, which doesn't seem > ideal. While it is indeed natural for the code that builds a Query to be responsible for setting it correctly, unfortunately there's no backstop if someone forgets to do so (as was the case in the recent CVE). I don't think my v1 patch would necessarily prevent all such problems, but I do think it would help prevent some. > There is exactly one place where RLS policies are applied, and it > seems much more natural for it to have responsibility for setting this > flag. I think that a slightly neater way for it to handle that would > be to modify fireRIRrules(), adding an extra parameter "bool > *hasRowSecurity" that it would set to true if RLS is enabled for the > query it is rewriting. Doing that forces all callers to think about > whether or not that affects some outer query. For example, > ApplyRetrieveRule() would then do: > > rule_action = fireRIRrules(rule_action, activeRIRs, >&parsetree->hasRowSecurity); > > rather than having a separate second step to update the flag on > "parsetree", and similarly for fireRIRrules()'s recursive calls to > itself. 
> If, in the future, it becomes necessary to invoke fireRIRrules() on more
> parts of a Query, it's then much more likely that the new code won't
> forget to update the parent query's flag.

I've attempted this in the attached v2 patch. I do think this is an improvement over the status quo, but I worry that it doesn't go far enough.

-- 
nathan

>From 0c81e04c3e62cd178dff9fc5b5f671e41fd39f28 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Mon, 2 Dec 2024 10:24:49 -0600
Subject: [PATCH v2 1/1] revamp row security tracking

---
 src/backend/rewrite/rewriteHandler.c | 42 +---
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index ab2e2cd647..c9ad429703 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -94,7 +94,7 @@ static void markQueryForLocking(Query *qry, Node *jtnode, bool pushedDown);
 static List *matchLocks(CmdType event, Relation relation,
                         int varno, Query *parsetree, bool *hasUpdate);
-static Query *fireRIRrules(Query *parsetree, List *activeRIRs);
+static Query *fireRIRrules(Query *parsetree, List *activeRIRs, bool *hasRowSecurity);
 static Bitmapset *adjust_view_column_set(Bitmapset *cols, List *targetlist);
@@ -1730,6 +1730,7 @@ ApplyRetrieveRule(Query *parsetree,
     RangeTblEntry *rte;
     RowMarkClause *rc;
     int         numCols;
+    bool        hasRowSecurity = false;

     if (list_length(rule->actions) != 1)
         elog(ERROR, "expected just one rule action");
@@ -1843,13 +1844,13 @@ ApplyRetrieveRule(Query *parsetree,
     /*
      * Recursively expand any view references inside the view.
      */
-    rule_action = fireRIRrules(rule_action, activeRIRs);
+    rule_action = fireRIRrules(rule_action, activeRIRs, &hasRowSecurity);

     /*
      * Make sure the query is marked as having row security if the view query
      * does.
      */
-    parsetree->hasRowSecurity |= rule_action->hasRowSecurity;
+    parsetree->hasRowSecurity |= hasRowSecurity;

     /*
      * Now, plug the view query in as a subselect, converting the relation's
@@ -1971,15 +1972,17 @@ fireRIRonSubLink(Node *node, fireRIRonSubLink_context *context)
     if (IsA(node, SubLink))
     {
         SubLink    *sub = (SubLink *) node;
+        bool        hasRowSecurity = false;

         /* Do what we came for */
         sub->subselect = (Node *) fireRIRrules((Query *) sub->subselect,
-                                               context->activeRIRs);
+                                               context->activeRIRs,
+                                               &hasRowSecurity);

         /*
          * Remember if any of
Re: Consider pipeline implicit transaction as a transaction block
On Thu, Nov 28, 2024 at 2:53 AM Anthonin Bonnefoy wrote:
> On Thu, Nov 28, 2024 at 12:26 AM Michael Paquier wrote:
> > I don't mind being more careful here based on your concerns, so I'll
> > go remove that in the stable branches.
>
> Sorry about that. I didn't have a strong need for this to be
> backpatched and should have made this clearer.

FWIW, I don't think you did anything wrong. To me, the thread reads like you just submitted this as a normal patch and Michael decided to back-patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 12:11 PM Tom Lane wrote:
> > Freezing a page, and setting a page all-visible are orthogonal.
>
> Sorry, sloppy wording on my part.

Freezing doesn't affect the contents of the visibility map in any way that seems relevant. The executor only cares about the all-visible bit (and never the all-frozen bit), and the rules around when and how VACUUM sets the all-visible bit (and how everybody else unsets it) haven't changed in forever. So I just can't see it.

I guess it's natural to suspect more recent work -- commit 7c70996e is about 6 years old. But I think the race condition that I suspect is at play here is very narrow. It's pretty unlikely that there'll be a dead-to-all TID returned to a scan (not just dead to our MVCC snapshot, dead to everybody's) that is subsequently concurrently removed from the index, and then set LP_UNUSED in the heap. It's probably impossible if you don't have a small table -- VACUUM just isn't going to be fast enough to get to the leaf page after the bitmap index scan, but still be able to get to the heap before its corresponding bitmap heap scan (that uses the VM as an optimization) can do the relevant visibility checks (while it could happen with a large table and a slow bitmap scan, the chances of the VACUUM being precisely aligned with the bitmap scan, in just the wrong way, seem remote in the extreme). Finally, none of this will happen if some other factor hinders VACUUM from setting the relevant heap page all-visible.

AFAICT this is only a problem because of the involvement of the VM, specifically -- an MVCC snapshot *is* generally sufficient to make bitmap index scans safe from the dangers of concurrent TID recycling, as explained in "62.4. Index Locking Considerations". That only ceases to be true when the visibility map becomes involved (the VM lacks the granular visibility information required to make all this safe). This is essentially the same VM race issue that nbtree's _bt_drop_lock_and_maybe_pin protects against during conventional index-only scans.

-- 
Peter Geoghegan
Re: Partition-wise join with whole row vars
> On Tue, Oct 08, 2024 at 09:24:15AM GMT, Alexander Pyhalov wrote:
> > Attaching rebased patches.

Just to let you know: it looks like the CFBot tests are red again, but this time there are some unexpected differences in a test query plan.
Proposal: Role Sandboxing for Secure Impersonation
Hi all,

I'd like to revisit a previously discussed feature [1]: PostgreSQL could benefit from a "role sandbox", a feature that would build on SET [LOCAL] ROLE and prevent or restrict RESET ROLE.

Rationale:

Connection pooling is widely used to optimize database performance by reducing use of memory, process creation, etc. However, connection pools typically operate on a "pool-per-role" basis, because each connection is bound to a single role and can't be reused by another role. For systems that make use of many roles, this limits the effectiveness of connection pooling because each role has its own "pool space" and max_connections puts a hard limit on how many connections can exist.

To work around this, projects (e.g. PostgREST) employ the "user impersonation" pattern:

- All connections use a shared "authenticator" role
- When a user (e.g. Alice) sends a request to the connection pooler, it temporarily sets the role using: SET [LOCAL] ROLE alice;
- After processing Alice's request, the session resets the role back to the "authenticator" role by either issuing a "RESET ROLE" or ending the "local" transaction.

This approach works well in theory, but poses a significant security concern: RESET ROLE allows a client to reset the role back to the "authenticator" role, *before* handing the session back to the pooler. Any SQL injection vulnerability or anything else that allows arbitrary SQL allows the client to issue a `RESET ROLE; SET ROLE anybody_else;`, bypassing authentication. Depending on the privileges of the "authenticator" role, the client can become any other user, or worse.

Proposal:

What if PostgreSQL had a "role sandbox", a state where RESET ROLE was prohibited or restricted? If PostgreSQL could guarantee that RESET ROLE was not allowed, even SQL injection vulnerabilities would not allow a client to bypass database privileges and RLS when using user impersonation. Systems with many roles could safely and efficiently use many roles in parallel with connection pooling. The feature probably has other applications as well.

Sandboxing could happen at the session level, or the transaction level; both seem to have benefits. Here are some syntax ideas floating around:

SET ROLE IDEAS

a) Transaction ("local") Sandbox:
   - SET LOCAL ROLE alice NO RESET;
   - SET LOCAL ROLE alice WITHOUT RESET;
   - BEGIN AS ROLE alice;

Transaction-level sandboxes have the benefit that a pooler can simply start a new sandboxed transaction for each request and never have to worry about resetting or reusing them.

b) Session Sandbox:
   - SET ROLE alice NO RESET;
   - SET ROLE alice WITHOUT RESET;
   - SET UNRESETTABLE ROLE alice; --veto

Session-level sandboxes have the benefit that they can do things that can't be done inside a transaction (e.g. create extensions, vacuum, analyze, etc.). It's a fully functional session. However, if RESET ROLE is prohibited for the rest of the session, a connection pooler couldn't reuse it.

c) "Guarded" Transaction/Session:
   - SET [LOCAL] ROLE alice GUARDED BY reset_token;
   - RESET ROLE WITH TOKEN reset_token;

Guarded sandboxes are nice because the session can also exit the sandbox if it has the token.

Another aspect of this is SET SESSION AUTHORIZATION. I don't see preventing reset as particularly useful, at least for connection poolers, since the session then couldn't be reused. However, the GUARDED BY token idea would make it restricted but not prevented, which could be useful.

I'd love to hear your thoughts on this feature. If we can finalize the design, I would be willing to try implementing this. I haven't coded C for years though, so I will probably need some help depending on how complex it is. SET ROLE is intertwined with the rest of the SET variable grammar but doesn't seem too hard to extend, if we go that route. Steve Chavez of PostgREST said he'd be willing to help, and could use the feature in PostgREST if it existed. I think other poolers could benefit from it as well.

Thanks,
Eric

[1] https://postgr.es/m/flat/CACA6kxgdzt-oForijaxfXHHhnZ1WBoVGMXVwFrJqUu-Hg3C-jA%40mail.gmail.com
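To make the attack and the fix concrete, here is a sketch of the pooler's per-request cycle; the NO RESET and GUARDED BY clauses are the proposed (not yet existing) syntax from above, so this is illustrative pseudocode rather than something PostgreSQL accepts today:

```sql
-- Today's impersonation pattern (vulnerable):
BEGIN;
SET LOCAL ROLE alice;           -- pooler impersonates Alice for one transaction
-- ... run Alice's request; SQL injection here could issue:
--   RESET ROLE; SET ROLE bob;  -- become anyone the authenticator can become
COMMIT;                         -- transaction end implicitly restores the role

-- With the proposed transaction sandbox (hypothetical syntax):
BEGIN;
SET LOCAL ROLE alice NO RESET;  -- RESET ROLE / SET ROLE now fail until COMMIT
-- ... an injected "RESET ROLE" raises an error instead of escalating
COMMIT;

-- With the proposed guarded variant (hypothetical syntax), the pooler
-- keeps a secret token and can still exit the sandbox itself:
SET ROLE alice GUARDED BY 's3cret-token';
-- client-controlled SQL cannot reset without the token...
RESET ROLE WITH TOKEN 's3cret-token';   -- ...but the pooler can
```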
Re: Consider the number of columns in the sort cost model
Hi folks,

Just wanted to mention that it looks like some CFBot tests are failing, something around level_tracking in pgss.
Re: Drop back the redundant "Lock" suffix from LWLock wait event names
On 2024-Dec-02, Bertrand Drouvot wrote:
> Hi hackers,
>
> da952b415f unintentionally added back the "Lock" suffix into the LWLock wait
> event names:
>
> - "added back" because the "Lock" suffix was removed in 14a9101091
> - "unintentionally" because there is nothing in the thread [2] that explicitly
>   mentions that the idea was also to revert 14a9101091

Oh, you're right, this was unintentional and unnoticed. I'll push this shortly, to both 17 and master.

-- 
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Using read stream in autoprewarm
> On 2 Dec 2024, at 16:16, Kirill Reshke wrote:
>
> I feel like we are ready to mark this as RFC, WDYT?

+1

Best regards, Andrey Borodin.
Re: Using read stream in autoprewarm
Hi,

On Mon, 2 Dec 2024 at 16:30, Andrey M. Borodin wrote:
> > On 2 Dec 2024, at 16:16, Kirill Reshke wrote:
> >
> > I feel like we are ready to mark this as RFC, WDYT?
>
> +1

Done.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent wrote:
> The running theory is that bitmap executor nodes incorrectly assume
> that the rows contained in the bitmap all are still present in the
> index, and thus assume they're allowed to only check the visibility
> map to see if the reference contained in the bitmap is visible.
> However, this seems incorrect: Note that index AMs must hold at least
> pins on the index pages that contain their results when those results
> are returned by amgettuple() [0], and that amgetbitmap() doesn't do
> that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
> from the index (and later, heap) that are still present in the bitmap
> used in the scan.

This theory seems very believable. We hold onto a leaf page buffer pin for index-only scans as an interlock against concurrent TID recycling. If we assume for the sake of argument that the optimization from commit 7c70996e is correct, then why do we even bother with holding onto the pin during index-only scans? In theory we should either do the "buffer pin interlock against TID recycling" thing everywhere, or nowhere -- how could bitmap scans possibly be different here?

> I think this might be an oversight when the feature was originally
> committed in 7c70996e (PG11): we don't know when the VM bit was set,
> and the bitmap we're scanning may thus be out-of-date (and should've
> had TIDs removed if it had been an index), so I propose disabling this
> optimization for now, as attached.

I have a hard time imagining any alternative fix that is suitable for backpatch.

Can we save the optimization on the master branch? Clearly it would be wildly impractical to do the "buffer pin interlock against TID recycling" thing in the context of bitmap scans. The only thing that I can think of that might work is a scheme that establishes a "safe LSN" for a given MVCC snapshot. If the VM page's LSN is later than the "safe LSN", it's not okay to trust its all-visible bits. At least not in the context of bitmap index scans that use the optimization from 7c70996e.

-- 
Peter Geoghegan
Re: CREATE SCHEMA ... CREATE DOMAIN support
On 02/12/2024 03:15, Tom Lane wrote:
> Michael Paquier writes:
>> If I'm parsing the spec right, the doc mentions in its 5)~6) of the syntax rules in CREATE SCHEMA that non-schema-qualified objects should use the new schema name defined in the CREATE SCHEMA query. So that pretty much settles the rules to use when having a new object that has a reference to a non-qualified object created in the same CREATE SCHEMA query?
> I don't see where you're getting that from? DB2 says that unqualified reference names (not to be confused with unqualified creation-target names) are taken to be in the new schema, but I don't see any corresponding restriction in the spec. What I do see (11.1 SR 6 in SQL:2021) is:
>
>     If <schema path specification> is not specified, then a <schema path
>     specification> containing an implementation-defined <schema name list>
>     that contains the <schema name> of the schema is implicit.
>
> What I read this as is that the "search path" during schema-element creation must include at least the new schema, but can also include some other schemas as defined by the implementation. That makes our behavior compliant, because we can define the other schemas as those in the session's prevailing search_path. (DB2's behavior is also compliant, but they're defining the path as containing only the new schema.)
>
> Also, if SQL intended to constrain the search path for unqualified identifiers to be only the new schema, they'd hardly need a concept of <schema path specification> at all.

I looked up the original paper (MUN-051) that introduced the <schema path specification> and it says, "The paper is proposing the use of paths only to resolve unqualified routine invocations."

That doesn't seem to have been explained much by the rest of the spec, but it is visible in the definition of <path specification>, which says, "Specify an order for searching for an SQL-invoked routine."

I can find nowhere that says that the path can or cannot be used for other objects.
-- 
Vik Fearing
Re: Vacuum statistics
In my opinion, the patches are semantically correct. However, not all dead code has been removed - I'm referring to pgstat_update_snapshot(). Also, the tests need to be fixed. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 11:31:48 -0500, Andres Freund wrote:
> I think it'd be good if we added a test that shows the failure mechanism so
> that we don't re-introduce this in the future. Evidently this failure isn't
> immediately obvious...

Attached is an isolationtest that reliably shows wrong query results.

Greetings,

Andres Freund

>From a666e6da7af9a0af31ae506de8c2c229b713f8a6 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 2 Dec 2024 14:38:11 -0500
Subject: [PATCH v1] isolationtester showing broken index-only bitmap heap scan

---
 .../expected/index-only-bitmapscan.out | 28 ++
 src/test/isolation/isolation_schedule  |  1 +
 .../specs/index-only-bitmapscan.spec   | 85 +++
 3 files changed, 114 insertions(+)
 create mode 100644 src/test/isolation/expected/index-only-bitmapscan.out
 create mode 100644 src/test/isolation/specs/index-only-bitmapscan.spec

diff --git a/src/test/isolation/expected/index-only-bitmapscan.out b/src/test/isolation/expected/index-only-bitmapscan.out
new file mode 100644
index 000..132ff1bda70
--- /dev/null
+++ b/src/test/isolation/expected/index-only-bitmapscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+  DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+    DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+
+step s1_fetch_1:
+    FETCH FROM foo;
+
+row_number
+----------
+         1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+    FETCH ALL FROM foo;
+
+row_number
+----------
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..e3c669a29c7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
 test: two-ids
 test: multiple-row-versions
 test: index-only-scan
+test: index-only-bitmapscan
 test: predicate-lock-hot-tuple
 test: update-conflict-out
 test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-bitmapscan.spec b/src/test/isolation/specs/index-only-bitmapscan.spec
new file mode 100644
index 000..9962b8dc531
--- /dev/null
+++ b/src/test/isolation/specs/index-only-bitmapscan.spec
@@ -0,0 +1,85 @@
+# index-only-bitmapscan test showing wrong results
+#
+setup
+{
+    -- by using a low fillfactor and a wide tuple we can get multiple blocks
+    -- with just few rows
+    CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+    WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+    INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+    CREATE INDEX ios_bitmap_a ON ios_bitmap(a);
+    CREATE INDEX ios_bitmap_b ON ios_bitmap(b);
+}
+
+teardown
+{
+    DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+    SET enable_seqscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+
+# The test query uses an or between two indexes to ensure make it more likely
+# to use a bitmap index scan
+#
+# The row_number() hack is a way to have something returned (isolationtester
+# doesn't display empty rows) while still allowing for the index-only scan
+# optimization in bitmap heap scans, which requires an empty targetlist.
+step s1_prepare {
+    DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+}
+
+step s1_fetch_1 {
+    FETCH FROM foo;
+}
+
+step s1_fetch_all {
+    FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+    DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+    # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+    # VM to be size 0, due to caching. Can't do that in setup because
+    s2_vacuum
+
+    # delete nearly all rows, to make issue visible
+    s2_mod
+    # create a cursor
+    s1_begin
+    s1_prepare
+
+    # fetch one row from the cursor, that ensures the index scan portion is done
+    # before the vacuum in the next step
+    s1_fetch_1
+
+    # with the bug this vacuum will mark pages as all-visible that the scan in
+    # the next step then considers all-visible, despite all rows from those
+    # pages having been removed.
+    s2_vacuum
+    # if this returns any rows, we're busted
+    s1_fetch_all
+
+    s1_commit
-- 
2.45.2.746.g06e570c0df.dirty
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 11:43 AM Tom Lane wrote:
> Whether that's true or not, it seems like it'd be worth bisecting
> to see if we can finger a commit where the behavior changed (and
> the same goes for the question of why-isnt-it-an-IOS-scan). However,
> the reproducer seems to have quite a low failure probability for me,
> which makes it hard to do bisection testing with much confidence.
> Can we do anything to make the test more reliable? If I'm right
> to suspect autovacuum, maybe triggering lots of manual vacuums
> would improve the odds?

I agree that autovacuum (actually, VACUUM) is important here. I find that the test becomes much more reliable if I create the test table "with (autovacuum_analyze_scale_factor=0.99, vacuum_truncate=off)". More importantly, rather than relying on autovacuum, I just run VACUUM manually from psql. I find it convenient to use "\watch 0.01" to run VACUUM repeatedly.

-- 
Peter Geoghegan
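For reference, the repro recipe described above can be sketched like this (the table and column names are illustrative, not the actual test case):

```sql
-- Session 1: a small table whose pages VACUUM will quickly mark
-- all-visible, using the settings mentioned above.
CREATE TABLE vm_race (a int NOT NULL)
    WITH (autovacuum_analyze_scale_factor = 0.99, vacuum_truncate = off);
INSERT INTO vm_race SELECT g FROM generate_series(1, 10000) g;
CREATE INDEX ON vm_race (a);

-- Session 2: hammer the table with manual vacuums while session 1
-- repeatedly runs the suspect bitmap-scan query.
VACUUM vm_race;
\watch 0.01
```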
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 12:02:39 -0500, Tom Lane wrote:
> Andres Freund writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
>
> Yup. I am saying that that qualifies as too-aggressive setting of the
> all-visible bit. I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

How? This basically would mean we could never set all-visible if there is *any* concurrent scan on the current relation, because any concurrent scan could have an outdated view of all-visible.

Afaict this isn't an issue of "too-aggressive setting of the all-visible bit", it's an issue of setting it at all.

Greetings,

Andres Freund
Re: meson missing test dependencies
On 02.12.24 11:50, Nazir Bilal Yavuz wrote:
> I applied your patches and the cube example worked. But when I edited
> 'test_json_parser_perf.c' and 'test_json_parser_incremental.c', meson did
> not rebuild. I used the 'meson test -C build --suite setup --suite
> test_json_parser' command to test test_json_parser. Did I do something wrong?

Seems to work for me. I edited test_json_parser_incremental.c and then ran:

$ meson test -C build --suite setup --suite test_json_parser -v
ninja: Entering directory `/Users/peter/devel/postgresql/postgresql/build'
[2/2] Linking target src/test/modules/test_json_parser/test_json_parser_incremental
1/7 postgresql:setup / tmp_install  RUNNING ...

Without my patch, you don't get the "Linking target ..." output.
Re: Remove useless casts to (void *)
On Tue, Dec 3, 2024 at 7:06 AM Tom Lane wrote:
> This is from hake[1], which is running OpenIndiana/illumos.
> That platform shows a couple of other strange warnings, so maybe
> re-eliminating these two is not worth worrying about, but
> nonetheless the casts to void * were doing something here.

I wouldn't change that. illumos is selecting the old pre-standard declaration here, but it knows the standard one:

https://github.com/illumos/illumos-gate/blob/27ecbff00d8c86a2647d6fe325cacb220d712115/usr/src/uts/common/sys/shm.h#L129
https://illumos.org/man/2/shmdt

I don't know why we have only one tiny issue if the system headers think we want pre-POSIX stuff. I wonder if the particular header has incorrect guarding, but I don't know how that is supposed to work.
Re: sunsetting md5 password support
On Wed, Nov 20, 2024 at 08:17:07PM -0500, Greg Sabino Mullane wrote:
> Sounds good to me. I think my hesitation was more that the hint was
> overpromising help, so big +1 to more detail and keeping it.

Committed. If anyone wants to try putting together a patch that expands the "migrating to SCRAM" section of the docs before I get to it, please be my guest.

--
nathan
Re: More CppAsString2() in psql's describe.c
> On 2 Dec 2024, at 08:52, Tom Lane wrote:
> But isn't there a way to improve the macro so this'd lead to an error?

That sounds like a pretty decent improvement in general. I experimented with a quick hack using a typeof check on the passed symbol which catches symbol-name typos. It might be totally unfit for purpose but it was an interesting hack.

#define CppAsString2(x) ((__builtin_types_compatible_p(__typeof__(x),char *) ?: CppAsString(x)))

It does require changing any uses of the macro in string generation from

f("pre" CppAsString2(SYM) "post");

into

f_v("pre%spost", CppAsString2(SYM));

however, and using it as part of another macro (TABLESPACE_VERSION_DIRECTORY) doesn't work.

--
Daniel Gustafsson
Re: RISC-V animals sporadically produce weird memory-related failures
On Tue, Dec 3, 2024 at 7:00 AM Alexander Lakhin wrote:
> A build made with clang-19 without llvm passed `make check` successfully.

We heard in another thread[1] that we'd need to use the JITLink API for RISCV, instead of the RuntimeDyld API we're using. I have a newer patch to use JITLink on all architectures, starting at some LLVM version, but it needs a bit more polish and research before sharing.

I'm surprised it's segfaulting instead of producing an error of some sort, though. I wonder why. It would be nice if we could fail gracefully instead.

Hmm, from a quick look in the LLVM main branch, it looks like a bunch of RISCV stuff just landed in recent months under llvm/lib/ExecutionEngine/RuntimeDyld, so maybe that's not true anymore on bleeding-edge LLVM (20-devel). I have no idea what state that's in, but IIUC there is no way RuntimeDyld could work on LLVM 16 or 19.

[1] https://www.postgresql.org/message-id/flat/20220829074622.2474104-1-alex.fan.q%40gmail.com
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 11:52 AM Andres Freund wrote:
> I think the problematic scenario involves tuples that *nobody* can see. During
> the bitmap index scan we don't know that though.

Right, exactly.

FWIW, this same issue is why it is safe for nbtree to drop its pin early during plain index scans, but not during index-only scans -- see _bt_drop_lock_and_maybe_pin, and the nbtree/README section on making concurrent TID recycling safe.

Weirdly, nbtree is specifically aware that it needs to *not* drop its pin in the context of index-only scans (to make sure that VACUUM cannot do unsafe concurrent TID recycling) -- even though an equivalent index scan would be able to drop its pin like this.

The underlying reason why nbtree can discriminate like this is that it "knows" that plain index scans will always visit the heap proper. If a TID points to an LP_UNUSED item, then it is considered dead to the scan (even though in general the heap page itself might be marked all-visible). If some completely unrelated, newly inserted heap tuple is found instead, then it cannot be visible to the plain index scan's MVCC snapshot (it has to be an MVCC snapshot for the leaf page pin to get dropped like this).

--
Peter Geoghegan
Re: Vacuum statistics
On 02.12.2024 11:27, Alexander Korotkov wrote:
> Hi, Alena!
>
> On Wed, Nov 13, 2024 at 6:21 PM Alena Rybakina wrote:
> > Updated 0001-v13 attached, as well as the diff between v12 and v13.
> > Thank you) And I agree with your changes. And included them in patches.
>
> Thank you for the updated patchset. Some points from me.
>
> * I've read the previous discussion on how important it is to keep all
> these fields regarding vacuum statistics, including the points by Andrei
> and Jim. It still worries me that the statistics volume is going to grow
> about 3 times, but I don't have a particular proposal for a more granular
> approach. I wonder if you could propose something.
>
> * Previously PGSTAT_FILE_FORMAT_ID got increased by 1. Your 0001 patch
> increases it by 2. It's a minor note, but I'd like to keep the tradition.
>
> * The commit message for 0001 looks nice, but the commit messages of
> 0002, 0003, and 0004 look messy. Could you please rearrange them?
>
> * The distinction between 0001 and 0002 is not clear. The first line of
> 0001 is "Machinery for grabbing an extended vacuum statistics on heap
> relations", the first line of 0002 is "Machinery for grabbing an extended
> vacuum statistics on heap and index relations." I guess 0001 should be
> about heap relations while 0002 should be about just index relations.
> Is this correct?
>
> * I guess these statistics should work for any table AM, based on what
> has been done in the relation_vacuum() interface method. If that's
> correct, we need to get rid of "heap" terminology and use "table"
> instead.
>
> * 0004 should be a pure documentation patch, but it seems to contain
> changes to isolation tests. Please move them into a more appropriate
> place.

Thank you for your valuable feedback. I am already carefully processing your comments and will update the patches soon.

I will think about what can be done to address the problem of increasing the volume of statistics; perhaps it will be possible to implement a GUC that, when enabled, will accumulate additional information on vacuum statistics. For example, this way one could group statistics by buffers and vacuum statistics.

--
Regards,
Alena Rybakina
Postgres Professional
Re: RISC-V animals sporadically produce weird memory-related failures
Hi Alexander,

On Mon, Dec 2, 2024, at 2:00 PM, Alexander Lakhin wrote:
> These crashes are hardly related to code changes, so maybe there are
> platform-specific issues still...

I naively assumed that because llvm and clang are available in Trixie on riscv64, I could simply install them and enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to revert the --with-llvm again. Sorry about not testing first without submitting results.

> Unfortunately, the log files saved don't include coredump information,
> maybe because of inappropriate core_pattern.

I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default /etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host, but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in a core file being written to the current directory, like it was before the upgrade.

Best regards,

Tom
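For anyone double-checking a similar setup, the two knobs mentioned above can be inspected read-only from inside the container; a minimal sketch (Linux-specific, no root required):

```shell
# Per-process core file size limit -- this is what limits.conf and
# the limits.d overrides ultimately control for login sessions.
ulimit -c

# Where the kernel writes core files; a pattern starting with "|"
# means a helper such as apport or systemd-coredump intercepts them.
cat /proc/sys/kernel/core_pattern 2>/dev/null || echo "core_pattern not readable (non-Linux?)"
```

If `ulimit -c` prints 0 or core_pattern points at a missing helper, crashes will leave no core file for the buildfarm client to pick up.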
Re: Support POSITION with nondeterministic collations
On 26.08.24 08:09, Peter Eisentraut wrote:
> This patch allows using text position search functions with
> nondeterministic collations. These functions are
>
> - position, strpos
> - replace
> - split_part
> - string_to_array
> - string_to_table
>
> which all use common internal infrastructure.

Some exploratory testing could be useful here. The present test coverage was already quite helpful during development, but there is always the possibility that something was overlooked.

Here is a freshly rebased patch version, no functionality changes.

From 272befa6ab78d94ac35d24203da415545b3ad2d9 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:06:28 +0100
Subject: [PATCH v2] Support POSITION with nondeterministic collations

This allows using text position search functions with nondeterministic collations. These functions are

- position, strpos
- replace
- split_part
- string_to_array
- string_to_table

which all use common internal infrastructure.

There was previously no internal implementation of this, so it was met with a not-supported error. This adds the internal implementation and removes the error.

Unlike with deterministic collations, the search cannot use any byte-by-byte optimized techniques but has to go substring by substring. We also need to consider that the found match could have a different length than the needle and that there could be substrings of different length matching at a position. In most cases, we need to find the longest such substring (greedy semantics).
Discussion: https://www.postgresql.org/message-id/flat/582b2613-0900-48ca-8b0d-340c06f4d...@eisentraut.org --- src/backend/utils/adt/varlena.c | 105 ++-- .../regress/expected/collate.icu.utf8.out | 154 +++--- src/test/regress/sql/collate.icu.utf8.sql | 36 +++- 3 files changed, 247 insertions(+), 48 deletions(-) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 533bebc1c7b..3fda7a8aeea 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -54,7 +54,9 @@ typedef struct varlena VarString; */ typedef struct { + pg_locale_t locale; /* collation used for substring matching */ boolis_multibyte_char_in_char; /* need to check char boundaries? */ + boolgreedy; /* find longest possible substring? */ char *str1; /* haystack string */ char *str2; /* needle string */ @@ -65,7 +67,13 @@ typedef struct int skiptablemask; /* mask for ANDing with skiptable subscripts */ int skiptable[256]; /* skip distance for given mismatched char */ + /* +* Note that with nondeterministic collations, the length of the last +* match is not necessarily equal to the length of the "needle" passed in. 
+*/ char *last_match; /* pointer to last match in 'str1' */ + int last_match_len; /* length of last match */ + int last_match_len_tmp; /* same but for internal use */ /* * Sometimes we need to convert the byte position of a match to a @@ -1178,15 +1186,21 @@ text_position(text *t1, text *t2, Oid collid) TextPositionState state; int result; + check_collation_set(collid); + /* Empty needle always matches at position 1 */ if (VARSIZE_ANY_EXHDR(t2) < 1) return 1; /* Otherwise, can't match if haystack is shorter than needle */ - if (VARSIZE_ANY_EXHDR(t1) < VARSIZE_ANY_EXHDR(t2)) + if (VARSIZE_ANY_EXHDR(t1) < VARSIZE_ANY_EXHDR(t2) && + pg_newlocale_from_collation(collid)->deterministic) return 0; text_position_setup(t1, t2, collid, &state); + /* don't need greedy mode here */ + state.greedy = false; + if (!text_position_next(&state)) result = 0; else @@ -1217,18 +1231,17 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state) { int len1 = VARSIZE_ANY_EXHDR(t1); int len2 = VARSIZE_ANY_EXHDR(t2); - pg_locale_t mylocale; check_collation_set(collid); - mylocale = pg_newlocale_from_collation(collid); + state->locale = pg_newlocale_from_collation(collid); - if (!mylocale->deterministic) - ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), -errmsg("nondeterministic collations are not supported for substring searches"))); + /* +* Most callers need greedy mode, but some might want to unset this to +* optimize. +*/ + state->greedy = true; - Assert(len1 > 0); Assert(len2 > 0);
Re: Make tuple deformation faster
On Sat, 30 Nov 2024 at 02:54, Victor Yegorov wrote:
> I've been testing this patch for the last week, I have M3 and i7 based
> MBP around.

Thanks for having a look at this and running the benchmarks.

> Construct
>    sizeof(FormData_pg_attribute) * (src)->natts
> is used in 7 places (in various forms), I thought it might be good
> to use a macro here, say TupleArraySize(natts).

I ended up adjusting the code here so that TupleDescSize() returns the full size and TupleDescAttrAddress() manually calculates the offset to start the FormData_pg_attribute array. That allows TupleDescFullSize() to be deleted.

I changed how TupleDescCopy() works as it used to perform the memcpy in 2 parts. I've changed that to now perform a single memcpy() and reset the ->attrs field after the memcpy so that it correctly points to the address for its own TupleDesc rather than the one from the source.

> In v4-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch
>
> +#define COMPACT_ATTR_IS_PACKABLE(att) \
> +    ((att)->attlen == -1 && att->attispackable)
>
> Seems second att needs parenthesis around it.

Adjusted. Thanks.

> Although I haven't seen 30% speedup, I find this change very good to have.

I think there's a bit more juice to squeeze out still. I started another thread for a much different approach to increasing the tuple deform performance over in [1]. The benchmarks I showed over there show the results with all the v4 patches on this thread plus that patch, and also another set of results from just the v4 patches from here. My Apple M2 very much likes the patch from the other thread. I don't have any decent Intel hardware to test on.

I've attached a v5 set of patches, which I think addresses everything you mentioned. I've also shuffled the patches around a little to how I think they should be committed. Here's a summary:

v5-0001: Adds the CompactAttribute struct, includes it in TupleDesc and adds all the code to populate it. Also includes a very small number of users of CompactAttribute.

v5-0002: Adjusts dozens of locations to use the CompactAttribute struct instead of the Form_pg_attribute struct. Lots of churn, but not complex changes. Separated out from v5-0001 so it's easier to see the important changes 0001 is making.

v5-0003: Change the CompactAttribute.attalign char field to a uint8 attalignby field to optimise alignment calculations and remove branching.

v5-0004: Delete the now unused pg_attribute.attcacheoff column and Form_pg_attribute field.

v5-0005: This is the patch from [1] rebased atop of this patch set. I'll pick up the discussion on that thread, but offering a rebased version here in case you'd like to try that one.

I spent all day today reviewing and fixing up a few missing comments for the v5 patch series. I'm quite happy with these now. If nobody else wants to look or test, I plan on pushing these tomorrow (Tuesday UTC+13). If anyone wants me to delay so they can look, they better let me know soon.

David

[1] https://postgr.es/m/caaphdvo9e0xg71wrefyarv5n4xnplk4k8ljd0msr3c9kr2v...@mail.gmail.com

v5-0001-Introduce-CompactAttribute-array-in-TupleDesc.patch
Description: Binary data

v5-0002-Use-CompactAttribute-instead-of-FormData_pg_attri.patch
Description: Binary data

v5-0003-Optimize-alignment-calculations-in-tuple-form-def.patch
Description: Binary data

v5-0004-Remove-pg_attribute.attcacheoff-column.patch
Description: Binary data

v5-0005-Speedup-tuple-deformation-with-additional-functio.patch
Description: Binary data
RE: Conflict detection for update_deleted in logical replication
On Friday, November 29, 2024 6:35 PM Kuroda, Hayato/黒田 隼人 wrote:
>
> Dear Hou,
>
> Thanks for updating the patch! Here are my comments mainly for 0001.

Thanks for the comments!

> 02. maybe_advance_nonremovable_xid
>
> ```
> +        case RCI_REQUEST_PUBLISHER_STATUS:
> +            request_publisher_status(data);
> +            break;
> ```
>
> I think the part is not reachable because the transit
> RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATUS
> is done in get_candidate_xid()->request_publisher_status().
> Can we remove this?

I changed the code to call maybe_advance_nonremovable_xid() after changing the phase in get_candidate_xid/wait_for_publisher_status, so that the code is reachable.

> 05. request_publisher_status
>
> ```
> +    if (!reply_message)
> +    {
> +        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);
> +
> +        reply_message = makeStringInfo();
> +        MemoryContextSwitchTo(oldctx);
> +    }
> +    else
> +        resetStringInfo(reply_message);
> ```
>
> Same lines exist in two functions: can we provide an inline function?

I personally feel these codes may not be worth a separate function since they are simple. So I didn't change this in this version.

> 06. wait_for_publisher_status
>
> ```
> +    if (!FullTransactionIdIsValid(data->last_phase_at))
> +        data->last_phase_at = FullTransactionIdFromEpochAndXid(data->remote_epoch,
> +                                                               data->remote_nextxid);
> ```
>
> Not sure, is there a possibility that data->last_phase_at is valid here? It is
> initialized just before transiting to RCI_WAIT_FOR_PUBLISHER_STATUS.

Oh, I think last_phase_at should be initialized only in the first phase. Fixed.

Other comments look good to me and have been addressed in V13.

Best Regards,
Hou zj
Re: Re: Added prosupport function for estimating numeric generate_series rows
On Sat, 30 Nov 2024 at 00:38, tsinghualucky...@foxmail.com wrote:
>
> Dear Dean Rasheed, I have reviewed the v4 patch and it is very thoughtful and
> reasonable, with a very clever attention to detail (plus I am very happy that
> we can get rid of the goto, which I was not a big fan of).
>
> This patch looks very good and I have no complaints about it. Thanks again
> for your help from beginning to end!

Cool. Patch committed.

Regards,
Dean
Re: Virtual generated columns
On Fri, Nov 29, 2024 at 3:16 PM Peter Eisentraut wrote:
>
> On 14.11.24 10:46, Amit Kapila wrote:
> >> Moreover, we would have to implement some elaborate cross-checks if a
> >> table gets added to a publication. How would that work? "Can't add
> >> table x to publication because it contains a virtual generated column
> >> with a non-simple expression"? With row filters, this is less of a
> >> problem, because the row filter is a property of the publication.
> >>
> > Because virtual generated columns work in row filters, so I thought it
> > could follow the rules for column lists as well. If the virtual column
> > doesn't adhere to the rules of the row filter then it shouldn't even
> > work there. My response was based on the theory that the expression
> > for virtual columns could be computed during logical decoding. So,
> > let's first clarify that before discussing this point further.
>
> Row filter expressions have restrictions that virtual columns do not
> have. For example, row filter expressions cannot use user-defined
> functions. If you have a virtual column that uses a user-defined
> function and then you create a row filter using that virtual column, you
> get an error when you create the publication. (This does not work
> correctly in the posted patches, but it will in v10 that I will post
> shortly.) This behavior is ok, I think; you get the error when you
> write the faulty expression, and it's straightforward to implement.
>

Fair enough, but the same argument applies to the column list. I mean to say that, based on the same theory, users will get the ERROR when an unsupported virtual column type is used in the column list.

> Now let's say that we implement what you suggest that we compute virtual
> columns during logical decoding. Then we presumably need similar
> restrictions, like not allowing user-defined functions.
>
> Firstly, I don't know if that would be such a good restriction. For row
> filters, that's maybe ok, but for virtual columns, you want to be able
> to write complex and interesting expressions, otherwise you wouldn't
> need a virtual column.
>
> And secondly, we'd then need to implement logic to check that you can't
> add a table with a virtual column with a user-defined function to a
> publication. This would happen not when you write the expression but
> only later when you operate on the table or publication. So it's
> already a dubious user experience.
>
> And the number of combinations and scenarios that you'd need to check
> there is immense. (Not just CREATE PUBLICATION and ALTER PUBLICATION,
> but also CREATE TABLE when a FOR ALL TABLES publication exists, ALTER
> TABLE when new columns are added, new partitions are attached, and so
> on.) Maybe someone wants to work on that, but that's more than I am
> currently signed up for. And given the first point, I'm not sure if
> it's even such a useful feature.
>
> I think, for the first iteration of this virtual generated columns
> feature, the publish_generated_columns option should just not apply to
> it.
>

Ok. But as mentioned above, we should consider it for the column list.

> Whether that means renaming the option or just documenting this is
> something for discussion.
>

We can go either way. Say, if we just document it and in the future we want to support it for virtual columns, then we need to introduce another boolean option like publish_generated_virtual_columns. The other possibility is that we change publish_generated_columns to an enum or string and allow values 's' (stored), 'v' (virtual), and 'n' (none). Now, only 's' and 'n' will be supported. In the future, if one wishes to add support for virtual columns, we have a provision to extend the existing option.

--
With Regards,
Amit Kapila.
Use Python "Limited API" in PL/Python
This patch changes PL/Python to use the Python "limited API". This API has stronger ABI stability guarantees.[0] This means you can build PL/Python against any Python 3.x version and use any other Python 3.x version at run time. This is especially useful for binary packages where the operating system does not come with a fixed suitable version of Python.

For example, Postgres.app (for macOS) would prefer to link against the Python version supplied by python.org (Python.app). But that has a 3.x version that changes over time. So instead they bundle a Python version inside Postgres.app.

The Windows installer used to also bundle Python, but as of PG17 you have to get it yourself, and you have to get a very specific version [1], which is unsatisfactory.

This patch fixes that: You can use any Python version independent of what PL/Python was built against. (There is a mechanism to say "at least 3.N", but for this patch we don't need that; we can stick with the current minimum of 3.2.)

(I have only tested the macOS side of this, not the Windows side. In fact, the patch currently doesn't build on Windows on CI. I haven't figured out why.)

For Linux-style packaging, I don't think this would have any benefit for users right now, since the OS comes with a Python installation and all the packages are built against that. But it could potentially be helpful for packagers. For example, on Debian, this could detach the postgresql packages from Python version transitions. But AFAICT, the Python packaging layout is not prepared for that. (There are only libpython3.x.so libraries, no libpython3.so that one would have to link against.)

Finally, I think this patch is part of a path toward making PL/Python thread-safe. I don't think the patch by itself changes anything, but if you read through [2], using heap types is part of the things mentioned there.
[0]: https://docs.python.org/3/c-api/stable.html [1]: https://github.com/EnterpriseDB/edb-installers/blob/REL-17/server/resources/installation-notes.html#L34-L36 [2]: https://docs.python.org/3/howto/isolating-extensions.html From 76d4e4b628f9e7913b5c361dbbd4c0ea9d570146 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 2 Dec 2024 08:53:42 +0100 Subject: [PATCH 1/2] Remove obsolete Python version check The checked version is already the current minimum supported version (3.2). --- src/pl/plpython/plpy_exec.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/src/pl/plpython/plpy_exec.c b/src/pl/plpython/plpy_exec.c index 0e84bb90829..00747bb811b 100644 --- a/src/pl/plpython/plpy_exec.c +++ b/src/pl/plpython/plpy_exec.c @@ -1066,13 +1066,7 @@ PLy_procedure_call(PLyProcedure *proc, const char *kargs, PyObject *vargs) PG_TRY(); { -#if PY_VERSION_HEX >= 0x0302 - rv = PyEval_EvalCode(proc->code, -proc->globals, proc->globals); -#else - rv = PyEval_EvalCode((PyCodeObject *) proc->code, -proc->globals, proc->globals); -#endif + rv = PyEval_EvalCode(proc->code, proc->globals, proc->globals); /* * Since plpy will only let you close subtransactions that you base-commit: 2f696453d2b39fea800d5f7d8e5d3e1a2266de24 -- 2.47.1 From a031a128140f3f4c61e6a5468c1a603f547a986b Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 2 Dec 2024 08:53:42 +0100 Subject: [PATCH 2/2] Use Python "Limited API" in PL/Python This allows building PL/Python against any Python 3.x version and using another Python 3.x version at run time. Implementation details: - Convert static types to heap types (https://docs.python.org/3/howto/isolating-extensions.html#heap-types). - Replace PyRun_String() with component functions. - Replace PyList_SET_ITEM() with PyList_SetItem(). 
--- src/pl/plpython/plpy_cursorobject.c | 71 +--- src/pl/plpython/plpy_planobject.c| 61 +++-- src/pl/plpython/plpy_procedure.c | 5 +- src/pl/plpython/plpy_resultobject.c | 98 +--- src/pl/plpython/plpy_subxactobject.c | 41 +++- src/pl/plpython/plpy_typeio.c| 6 +- src/pl/plpython/plpython.h | 2 + 7 files changed, 179 insertions(+), 105 deletions(-) diff --git a/src/pl/plpython/plpy_cursorobject.c b/src/pl/plpython/plpy_cursorobject.c index 6108384c9a5..39bcae3f1d9 100644 --- a/src/pl/plpython/plpy_cursorobject.c +++ b/src/pl/plpython/plpy_cursorobject.c @@ -20,7 +20,7 @@ #include "utils/memutils.h" static PyObject *PLy_cursor_query(const char *query); -static void PLy_cursor_dealloc(PyObject *arg); +static void PLy_cursor_dealloc(PLyCursorObject *self); static PyObject *PLy_cursor_iternext(PyObject *self); static PyObject *PLy_cursor_fetch(PyObject *self, PyObject *args); static PyObject *PLy_cursor_close(PyObject *self, PyObject *unused); @@ -33,22
Re: Converting README documentation to Markdown
On 01.10.24 22:15, Daniel Gustafsson wrote:
>> On 1 Oct 2024, at 16:53, Jelte Fennema-Nio wrote:
>> On Tue, 1 Oct 2024 at 15:52, Daniel Gustafsson wrote:
>>> Apart from this, I don't like changing the placeholders to < foo >.
>>> In some cases, this really decreases readability. Maybe we should
>>> look for different approaches there.
>>
>> Agreed.
>
> I took a stab at some of them in the attached. The usage in
> src/test/isolation/README is seemingly the hardest to replace and I'm
> not sure how we should proceed there.
>
>> One way to improve the isolation/README situation is by:
>> 1. indenting the standalone lines by four spaces to make it a code block
>> 2. for the inline cases, replace with `` or `foo`
>
> If we go for following Markdown syntax then for sure, if not it will
> seem a bit off I think.

I took another look through this discussion. I think the v4 patches from 2024-10-01 are a good improvement. I suggest you commit them and then we can be done here.
Re: Memory leak in WAL sender with pgoutput (v10~)
On Mon, Dec 2, 2024 at 12:49 PM Michael Paquier wrote:
>
> On Mon, Dec 02, 2024 at 11:47:09AM +0530, Amit Kapila wrote:
> > We already have PGOutputData->cachectx which could be used for it. I
> > think we should be able to reset such a context when we are
> > revalidating the publications. Even, if we want a new context for some
> > localized handling, we should add that in PGOutputData rather than a
> > local context as the proposed patch is doing at the very least for
> > HEAD.
>
> cachectx is used for the publications and the hash table holding
> all the RelationSyncEntry entries, but we lack control of individual
> parts within it. So you cannot reset the whole context when
> processing a publication invalidation. Perhaps adding that to
> PGOutputData would be better, but that would be inconsistent with
> RelationSyncCache.
>

AFAICS, RelationSyncCache is not allocated in PGOutputData->cachectx. It is allocated in CacheMemoryContext, see the caller of init_rel_sync_cache(). I think you are talking about individual hash entries. Ideally, we could free all entries together and reset cachectx, but right now we are freeing the allocated memory in those entries, if required, at the next access. So, resetting the entire PGOutputData->cachectx won't be possible. But I don't get why adding a new context in PGOutputData for publications would be inconsistent with RelationSyncCache? Anyway, I think it would be okay to retail-free in this case, see the below responses.

> > Can't we consider freeing the publication names individually that can
> > be backpatchable and have no or minimal risk of breaking anything?
>
> Sure. The first thing I did was a loop that goes through the
> publication list and does individual pfree() for the publication
> names. That works, but IMO that's weird as we rely on the internals
> of GetPublication() hidden two levels down in pg_publication.c.
>

We can look at it from a different angle, which is that FreePublication(s) relies on how the knowledge of the Publication structure is built. So it doesn't look weird if we think from that angle.

> >> I am slightly concerned about the current design of GetPublication()
> >> in the long-term, TBH. LoadPublications() has hidden the leak behind
> >> two layers of routines in the WAL sender, and that's easy to miss once
> >> you call anything that loads a Publication depending on how the caller
> >> caches its data. So I would still choose for modifying the structure
> >> on HEAD removing the pstrdup() for the publication name.
> >>
> >
> > BTW, the subscription structure also used the name in a similar way.
> > This will make the publication/subscription names handled differently.
>
> Good point about the inconsistency, so the name could also be switched
> to a fixed-NAMEDATALEN there if we were to do that. The subscription
> has much more pstrdup() fields, though.. How about having some Free()
> routines instead that deal with the whole cleanup of a single list
> entry? If that's kept close to the GetPublication() and
> GetSubscription() routines, a refresh when changing these structures
> would be hard to miss.
>

We already have FreeSubscription() which frees the name and other things before calling list_free_deep. So I thought a call on those lines for publications wouldn't be a bad idea.

--
With Regards,
Amit Kapila.
Re: Truncate logs by max_log_size
On 29.11.24 21:57, Kirill Gavrilov wrote:
> Same thing applies to log_parameter_max_length, for example.
>
> postgres=# set log_parameter_max_length = '1foo';
> ERROR: invalid value for parameter "log_parameter_max_length": "1foo"
> HINT: Valid units for this parameter are "B", "kB", "MB", "GB", and "TB".
> postgres=# set log_parameter_max_length = '1TB';
> ERROR: invalid value for parameter "log_parameter_max_length": "1TB"
> HINT: Value exceeds integer range.
>
> I think we can leave it as is.

I see. So I guess it is out of scope to change this message here.

Small nitpicks:

1) The indentation of the comment at postgresql.conf.sample is a little bit off

#max_log_size = 0       # max size of logged statement
            # 0 disables the feature

IMHO it looks better like this:

#max_log_size = 0       # max size of logged statement
                        # 0 disables the feature

2) You introduced a trailing whitespace at L34 (Not critical :))

+         Zero disables the setting.

It happens to me all the time, so I usually try to apply my patches in a clean branch just to make sure I didn't miss anything.

Other than that, I have nothing more to add at this point.

Thanks

--
Jim
Re: More CppAsString2() in psql's describe.c
On 02.12.24 08:52, Tom Lane wrote:
>> (Moreover, the current structure assumes that the C character literal
>> syntax used by the PROKIND_* and other symbols happens to be the same
>> as the SQL string literal syntax required in those queries, which is
>> just an accident.)
>
> So? There isn't much about C syntax that isn't an accident. Neither
> literal syntax is going to change, so I don't see why it's problematic
> to rely on them being the same.

For example, if you write

#define RELKIND_RELATION '\x72'

then it won't work anymore.

I was also curious whether

#define FOO 'r'
#define RELKIND_RELATION FOO

would work. It appears it does. But this syntactic construction is quite hard to completely understand currently.
pure parsers and reentrant scanners
This patch series changes several parsers in the backend and contrib modules to use bison pure parsers and flex reentrant scanners. This is ultimately toward thread-safety, but I think it's also just nicer in general, and it might also fix a few possible small memory leaks.

I organized this patch series into very small incremental changes so that it's easier to follow. The final commits should probably be combined a bit more (e.g., one per module).

In this patch series I have so far dealt with

* contrib/cube/
* contrib/seg/
* src/backend/replication/repl_*
* src/backend/replication/syncrep_*

These four needed the whole treatment: pure parser, reentrant scanner, and updated memory handling. Also:

* src/backend/utils/adt/jsonpath_scan.l

This one already had a pure parser and palloc-based memory handling, but not a reentrant scanner, so I just did that.

The above are all pretty similar, so it was relatively easy to work through them once I had the first one figured out.

A couple of things that are still missing in the above:

* For repl_scanner.l, I want to use yyextra to deal with the static variables marked /*FIXME*/, but somehow I made that buggy; I'll need to take another look later.

* For both the replication parser and the syncrep parser, get rid of the global variables to pass back the results. Again, here I need another look later. I confused either myself or the compiler on these.

cube, seg, and jsonpath are about as done as I would want them to be.

Not done yet:

* src/backend/utils/misc/guc-file.l
* src/pl/plpgsql/src/pl_gram.y

These have quite different structures and requirements, so I plan to deal with them separately.

Not relevant for backend thread-safety:

* src/backend/bootstrap/

It might make sense to eventually convert that one as well, just so that the APIs are kept similar. But that could be for later.

Note that the core scanner and parser are already reentrant+pure.
Also, there are various other scanners and parsers in frontends (psql, pgbench, ecpg) that are not relevant for this. (Again, it might make sense to convert some of them later, and some of them are already done.)

AFAICT, all the options and coding techniques used here are already in use elsewhere in the tree, so there shouldn't be any concerns about bison or flex version compatibility.

From 01823a7c975fda47ce220db5bb91ecd0959d5123 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:35:37 +0100
Subject: [PATCH v0 01/15] cube: pure parser and reentrant scanner

Use the flex %option reentrant and the bison option %pure-parser to
make the generated scanner and parser pure, reentrant, and thread-safe.

(There are still some issues in the surrounding integration, see
FIXMEs.)
---
 contrib/cube/cube.c      |  7 +++---
 contrib/cube/cubedata.h  | 15
 contrib/cube/cubeparse.y | 15 +---
 contrib/cube/cubescan.l  | 51 +++-
 4 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 1fc447511a1..bf8fc489dca 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -120,13 +120,14 @@ cube_in(PG_FUNCTION_ARGS)
 	char	   *str = PG_GETARG_CSTRING(0);
 	NDBOX	   *result;
 	Size		scanbuflen;
+	yyscan_t	scanner;
 
-	cube_scanner_init(str, &scanbuflen);
+	cube_scanner_init(str, &scanbuflen, &scanner);
 
-	cube_yyparse(&result, scanbuflen, fcinfo->context);
+	cube_yyparse(&result, scanbuflen, fcinfo->context, scanner);
 
 	/* We might as well run this even on failure. */
-	cube_scanner_finish();
+	cube_scanner_finish(scanner);
 
 	PG_RETURN_NDBOX_P(result);
 }

diff --git a/contrib/cube/cubedata.h b/contrib/cube/cubedata.h
index 96fa41a04e7..8bfcc6e99a2 100644
--- a/contrib/cube/cubedata.h
+++ b/contrib/cube/cubedata.h
@@ -59,14 +59,21 @@ typedef struct NDBOX
 #define CubeKNNDistanceEuclid 17	/* <-> */
 #define CubeKNNDistanceChebyshev 18	/* <=> */
 
+/* for cubescan.l and cubeparse.y */
+/* All grammar constructs return strings */
+#define YYSTYPE char *
+typedef void *yyscan_t;
+
 /* in cubescan.l */
-extern int	cube_yylex(void);
+extern int	cube_yylex(YYSTYPE *yylval_param, yyscan_t yyscanner);
 extern void cube_yyerror(NDBOX **result, Size scanbuflen,
 						 struct Node *escontext,
+						 yyscan_t yyscanner,
 						 const char *message);
-extern void cube_scanner_init(const char *str, Size *scanbuflen);
-extern void cube_scanner_finish(void);
+extern void cube_scanner_init(const char *str, Size *scanbuflen, yyscan_t *yyscannerp);
+extern void cube_scanner_finish(yyscan_t yyscanner);
 
 /* in cubeparse.y */
 extern int	cube_yyparse(NDBOX **result, Size scanbuflen,
-
meson missing test dependencies
I have noticed that under meson many tests don't have dependencies on the build artifacts that they are testing. As an example among many, if you make a source code change in contrib/cube/cube.c (see patch 0001 for a demo) and then run

make -C contrib/cube check

the test run will reflect the changed code, because the "check" targets typically depend on the "all" targets. But if you do this under meson with

meson test -C build --suite setup --suite cube

the code will not be rebuilt first, and the test run will not reflect the changed code.

This seems straightforward to fix, see patch 0002. The meson test setup has support for this, but it seems not widely used.

Patch 0003 is another example, this time for a TAP-style test.

Is there any reason this was not done yet?

From 079a7f902177f5e146c004fb7789c9972c86c5ae Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:48:37 +0100
Subject: [PATCH v0 1/3] XXX source code modification for test demonstration

---
 contrib/cube/cube.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 1fc447511a1..e3ad25a2a39 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -278,7 +278,7 @@ cube_subset(PG_FUNCTION_ARGS)
 		if ((dx[i] <= 0) || (dx[i] > DIM(c)))
 			ereport(ERROR,
 					(errcode(ERRCODE_ARRAY_ELEMENT_ERROR),
-					 errmsg("Index out of bounds")));
+					 errmsg("Index out of bounds XXX")));
 		result->x[i] = c->x[dx[i] - 1];
 		if (!IS_POINT(c))
 			result->x[i + dim] = c->x[dx[i] + DIM(c) - 1];
-- 
2.47.1

From 00b367c86dd14840d8412259f098f09737e73205 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:48:37 +0100
Subject: [PATCH v0 2/3] Add test dependency

---
 contrib/cube/meson.build | 1 +
 1 file changed, 1 insertion(+)

diff --git a/contrib/cube/meson.build b/contrib/cube/meson.build
index 21b6f9c43ad..64ba300495d 100644
--- a/contrib/cube/meson.build
+++ b/contrib/cube/meson.build
@@ -53,6 +53,7 @@ tests += {
   'sd': meson.current_source_dir(),
   'bd': meson.current_build_dir(),
   'regress': {
+    'deps': [cube],
     'sql': [
       'cube',
       'cube_sci',
-- 
2.47.1

From 176dc1eb383305025a3de6a895825de36f197678 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:48:37 +0100
Subject: [PATCH v0 3/3] Add test dependency for TAP module

---
 src/test/modules/test_json_parser/meson.build | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/test/modules/test_json_parser/meson.build b/src/test/modules/test_json_parser/meson.build
index 059a8b71bde..f5a8ea44eed 100644
--- a/src/test/modules/test_json_parser/meson.build
+++ b/src/test/modules/test_json_parser/meson.build
@@ -61,5 +61,8 @@ tests += {
       't/003_test_semantic.pl',
       't/004_test_parser_perf.pl'
     ],
+    'test_kwargs': {
+      'depends': [test_json_parser_incremental, test_json_parser_perf],
+    },
   },
 }
-- 
2.47.1
fixing tsearch locale support
Infamously, the tsearch locale support in src/backend/tsearch/ts_locale.c still depends on libc environment variable locale settings and is not caught up with pg_locale_t, collations, ICU, and all that newer stuff. This is used in the tsearch facilities themselves, but also in other modules such as ltree, pg_trgm, and unaccent.

Several of the functions are wrappers around the corresponding ctype functions, like

int
t_isalpha(const char *ptr)
{
	int			clen = pg_mblen(ptr);
	wchar_t		character[WC_BUF_LEN];
	pg_locale_t mylocale = 0;	/* TODO */

	if (clen == 1 || database_ctype_is_c)
		return isalpha(TOUCHAR(ptr));

	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);

	return iswalpha((wint_t) character[0]);
}

So this has multibyte and encoding awareness, but does not observe locale provider or collation settings.

As an easy start toward fixing this, I think several of these functions we don't even need. t_isdigit() and t_isspace() are just used to parse various configuration and data files, and surely we don't need encoding-dependent multibyte support for parsing ASCII digits and ASCII spaces. At least, I didn't find any indications in the documentation of these file formats that they are supposed to support that kind of thing. So these can be replaced by the normal isdigit() and isspace().

There is one call to t_isprint(), which is similarly used only to parse some flags in a configuration file. From the surrounding code you can deduce that it's only called on single-byte characters, so it can similarly be replaced by plain isprint().

Note, pg_trgm has some compile-time options with macros such as KEEPONLYALNUM and IGNORECASE. AFAICT, these are not documented, and the non-default variant is not supported by any test cases. So as part of this undertaking, I'm going to remove the non-default variants if they are in the way of cleanup.
From 7abc0f2333d8045004911040c856f1522d03b050 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH 1/3] Remove t_isdigit()

---
 contrib/ltree/ltree_io.c                | 8
 src/backend/tsearch/spell.c             | 4 ++--
 src/backend/tsearch/ts_locale.c         | 15 ---
 src/backend/utils/adt/tsquery.c         | 2 +-
 src/backend/utils/adt/tsvector_parser.c | 4 ++--
 src/include/tsearch/ts_locale.h         | 1 -
 6 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 11eefc809b2..b54a15d6c68 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -411,7 +411,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITFNUM:
 				if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (t_isdigit(ptr))
+				else if (isdigit((unsigned char) *ptr))
 				{
 					int			low = atoi(ptr);
 
@@ -429,7 +429,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 					UNCHAR;
 				break;
 			case LQPRS_WAITSNUM:
-				if (t_isdigit(ptr))
+				if (isdigit((unsigned char) *ptr))
 				{
 					int			high = atoi(ptr);
 
@@ -460,7 +460,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITCLOSE:
 				if (t_iseq(ptr, '}'))
 					state = LQPRS_WAITEND;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITND:
@@ -471,7 +471,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 				}
 				else if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITEND:

diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index aaedb0aa852..7800f794e84 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -390,7 +390,7 @@ getNextFlagFromString(IspellDict *Conf, const char **sflagset, char *sflag)
 				*sflagset = next;
 				while (**sflagset)
 				{
-
Re: Sample rate added to pg_stat_statements
On 26.11.2024 01:15, Ilia Evdokimov wrote:

On 22.11.2024 09:08, Alexander Korotkov wrote:

On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this exact argument in depth. One piece of it would be to see how much of such "bottlenecks" we would be able to get rid of by integrating pg_stat_statements into the central pgstats with the custom APIs, without pushing the module into core. This means that we would combine the existing hash of pgss to shrink to 8 bytes for objid rather than 13 bytes now, as the current code relies on (toplevel, userid, queryid) for the entry lookup (entry removal is sniped with these three values as well, or dshash seq scans). The odds of conflicts still play in our favor even if we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with high concurrency, then the spin lock in pgss_store() could become pretty much of a bottleneck. And I'm not sure switching all counters to atomics could somehow improve the situation, given we already have pretty many counters. I'm generally +1 for the approach taken in this thread. But I would suggest introducing a threshold value for query execution time, and sample just everything below that threshold. The slower queries shouldn't be sampled, because they can't be too frequent, and also it could be more valuable to count them individually (while very fast queries probably only matter "on average").

--
Regards,
Alexander Korotkov
Supabase

I really liked your idea, and I'd like to propose an enhancement that I believe improves it further. Yes, if a query's execution time exceeds the threshold, it should always be tracked without sampling.
However, for queries with execution times below the threshold, the sampling logic should prioritize shorter queries over those closer to the threshold. In my view, the ideal approach is for shorter queries to have the highest probability of being sampled, while queries closer to the threshold are less likely to be sampled. This behavior can be achieved with the following logic:

pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time

Here's how it works:

* As a query's execution time approaches zero, the probability of it being sampled approaches one.
* Conversely, as a query's execution time approaches the threshold, the probability of it being sampled approaches zero.

In other words, the sampling probability decreases linearly from 1 to 0 as the execution time gets closer to the threshold. I believe this approach offers an ideal user experience.

I have attached a new patch implementing this logic. Please let me know if you have any feedback regarding the comments in the code, the naming of variables or documentation. I'm always open to discussion.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

I've been closely reviewing my last (v5-*.patch) patch on implementing time-based sampling, and I've realized that it might not be the best approach. Let me explain the reasons:

* We can only perform sampling before the 'pgss_planner()' function. However, at that point, we don't yet know the query's execution time, since it only becomes available during 'pgss_ExecutorEnd()' or 'pgss_ProcessUtility()'.

* If we wait to sample until the execution completes and we have the actual execution time, this introduces a problem: by that point, we might have already recorded the query's statistics into shared memory from 'pgss_planner()', making it too late to decide whether to sample the query.

* Delaying sampling until the execution finishes would require waiting for the execution time, which could introduce performance overhead.
This defeats the purpose of sampling, which aims to reduce the cost of tracking queries.

I believe we should reconsider the approach and revert to sampling based on v4-*.patch. If I've missed anything or there's an alternative way to implement time threshold-based sampling efficiently, I'd be grateful to hear your insights.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Re: Proper object locking for GRANT/REVOKE
On 25.11.24 02:24, Noah Misch wrote:
> commit d31bbfb wrote:
>> --- a/src/backend/catalog/aclchk.c
>> +++ b/src/backend/catalog/aclchk.c
>> @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
>>   * objectNamesToOids
>>   *
>>   * Turn a list of object names of a given type into an Oid list.
>> - *
>> - * XXX: This function doesn't take any sort of locks on the objects whose
>> - * names it looks up. In the face of concurrent DDL, we might easily latch
>> - * onto an old version of an object, causing the GRANT or REVOKE statement
>> - * to fail.
>
> To prevent "latch onto an old version" and remove the last sentence of
> the comment, we'd need two more things:
>
> - Use a self-exclusive lock here, not AccessShareLock. With two sessions
>   doing GRANT under AccessShareLock, one will "latch onto an old version".
>
> - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row
>   that GRANT/REVOKE can affect. For example, today's locking in ALTER
>   FUNCTION is the xmax stamped on the old tuple. If GRANT switched to
>   ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause
>   GRANT to "latch onto an old version".

Ok, we should probably put that comment back in slightly altered form, like "XXX This function intentionally takes only an AccessShareLock ... $REASON. In the face of concurrent DDL, we might easily latch onto an old version of an object, causing the GRANT or REVOKE statement to fail."

> I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
> terminate every autovacuum running in the schema and consume a shared
> lock table entry per table in the schema. I think the user-visible
> benefit of commit d31bbfb plus this additional work is just changing
> "ERROR: tuple concurrently updated" to blocking. That's not nothing,
> but I don't see it outweighing autovacuum termination and lock table
> consumption spikes. What other benefits and drawbacks should we weigh?

I think what you are describing is a reasonable tradeoff.
The user experience is tolerable: "tuple concurrently updated" is a mildly useful error message, and it's probably the table owner executing both commands. The change to AccessShareLock at least prevents confusing "cache lookup failed" messages, and might alleviate some security concerns about swapping in a completely different object concurrently (even if we currently think this is not an actual problem).

> --- a/src/test/isolation/expected/intra-grant-inplace.out
> +++ b/src/test/isolation/expected/intra-grant-inplace.out
> @@ -248,6 +248,6 @@
>  relhasindex
>  ---
>  (0 rows)
>
> -s4: WARNING: got: cache lookup failed for relation REDACTED
> +s4: WARNING: got: relation "intra_grant_inplace" does not exist
>
> The affected permutation existed to cover the first LockRelease() in
> SearchSysCacheLocked1(). Since this commit, that line no longer has
> coverage.

Do you have an idea how such a test case could be constructed now?
Re: Interrupts vs signals
On 02/12/2024 09:32, Thomas Munro wrote:
> On Sat, Nov 23, 2024 at 10:58 AM Heikki Linnakangas wrote:
>> Hmm, so this would replace the maybeSleepingOnInterrupts bitmask I
>> envisioned. Makes a lot of sense. If it's a single bit though, that
>> means that you'll still get woken up by interrupts that you're not
>> waiting for. Maybe that's fine. Or we could merge the
>> maybeSleepingOnInterrupts and pendingInterrupts bitmasks to a single
>> atomic word, so that you would have a separate "maybe sleeping" bit
>> for each interrupt bit, but could still use atomic_fetch_or to
>> atomically read the interrupt bits and announce the sleeping.
>
> I think one bit is fine for now. At least, until we have a serious
> problem with interrupts arriving when you're sleeping but not ready to
> service that particular interrupt. The 'interrupt bit already set,
> don't try to wake me' stuff discussed earlier would limit the number of
> useless wakeups to one, until you eventually are ready and consume the
> interrupt.
>
> The main case I can think of, if we fast forward to the
> all-procsignals-become-interrupts patch (which I'll be rebasing on top
> of this when the next version appears), is that you might receive a
> sinval catchup request, but you might be busy running a long query.
> Sinval catchup messages are only processed between queries, so you just
> keep ignoring them until end of query. I think that's fine, and
> unlikely. Do you have other cases in mind?

Yeah, no, I think one bit is good enough. Let's go with that.

> If there is legitimate use case for a more fine-grained maybe-sleeping
> and I've been too optimistic above, I don't think we should give one
> whole maybe-sleeping bit to each interrupt reason.
> We only have 32 bit atomics (spinlock-based emulation of 64 bit atomics
> is not good enough for this, it's not safe in SIGALRM handlers, at
> least not without a lot more pain; admittedly the SIGALRM handlers
> should eventually be replaced but not for a while) so if we used up two
> bits for every interrupt reason we could handle only 16 interrupt
> reasons, and that's already not enough. Perhaps we could add
> maybe-sleeping bits for classes of interrupt if we ever determine that
> one bit for all of them isn't enough?

If we run out of bits in a single pendingInterrupts word, we can have multiple words. SendInterrupt() and ClearInterrupt() would still only need to manipulate one word, the one holding the bit it's setting/clearing. WaitEventSetWait() would need to touch all of them, or at least all the ones that hold bits you want to wait for. That seems OK from a performance point of view. I don't think we need to go there any time soon though; 32 bits should be enough for the use cases we've been discussing.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: meson missing test dependencies
Hi,

On Mon, 2 Dec 2024 at 13:11, Peter Eisentraut wrote:
>
> I have noticed that under meson many tests don't have dependencies on
> the build artifacts that they are testing. As an example among many, if
> you make a source code change in contrib/cube/cube.c (see patch 0001
> for a demo) and then run
>
> make -C contrib/cube check
>
> the test run will reflect the changed code, because the "check" targets
> typically depend on the "all" targets. But if you do this under meson
> with
>
> meson test -C build --suite setup --suite cube
>
> the code will not be rebuilt first, and the test run will not reflect
> the changed code.
>
> This seems straightforward to fix, see patch 0002. The meson test setup
> has support for this, but it seems not widely used.
>
> Patch 0003 is another example, this time for a TAP-style test.
>
> Is there any reason this was not done yet?

This looks useful. I am not sure why this was not done before. I applied your patches and the cube example worked. But when I edited 'test_json_parser_perf.c' and 'test_json_parser_incremental.c', meson did not rebuild. I used the 'meson test -C build --suite setup --suite test_json_parser' command to test test_json_parser. Did I do something wrong?

--
Regards,
Nazir Bilal Yavuz
Microsoft
Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE
> v4 patch attached

Thank you Guillaume, and nice work! I tried to see if there was anywhere else in the documentation that would need updating, but it looks like you covered everywhere already.

> I'm with Robert in that I've not found the buffer counts to be all that
> useful most of the time.

I find the buffer counts especially helpful for educating newer folks on why things are slow, even when they are not necessary for spotting the issue (for more advanced users). One of my hopes is that by educating and empowering newer users on how I/O relates to performance issues, fewer cases will get escalated to more experienced folks.

> the cases I've seen most recently are those where the output is
> mind-numbingly long already.

Are you mostly seeing query plans that have stumped other people already (eg second or third line support), so perhaps seeing more complex plans than the average user? Both Depesz[1] and Tensor[2] have archives of publicly submitted plans, which I found helpful for checking how slow plans look for users of those tools. I have a similar archive, and while we do not publish them (and there are plenty of huge plans) it also suggests that the majority of slow plans people are reviewing have fewer than 20 nodes.

I realise it's optimistic to think that the time experienced hackers would lose having to sift through longer plans would be gained back by having to do so less often, but I thought it was worth raising as part of the aim.

I also looked into the Slow Query Questions page on the wiki that we ask people to review before posting to pgsql-performance, and noticed that it has suggested requesting buffers for the past 12 years[3].

—
Michael

[1]: https://explain.depesz.com/history
[2]: https://explain.tensor.ru/archive
[3]: https://wiki.postgresql.org/index.php?title=Slow_Query_Questions&diff=18308&oldid=16800
Re: Make tuple deformation faster
On Mon, Dec 2, 2024 at 13:24, David Rowley wrote:
> I ended up adjusting the code here so that TupleDescSize() returns the
> full size and TupleDescAttrAddress() manually calculates the offset to
> start the FormData_pg_attribute array. That allows TupleDescFullSize()
> to be deleted. I changed how TupleDescCopy() works as it used to
> perform the memcpy in 2 parts. I've changed that to now perform a
> single memcpy() and reset the ->attrs field after the memcpy so that it
> correctly points to the address for its own TupleDesc rather than the
> one from the source.

Nice!

> I've attached a v5 set of patches, which I think addresses everything
> you mentioned. I've also shuffled the patches around a little to how I
> think they should be committed.

I'm glad that the patch from "More tuple deformation speedups" is moved here; I wanted to mention that both patches should be committed together. All is good, and the benefits are clearly visible (same setup used).

--
Victor Yegorov
code contributions for 2024, WIP version
Hi,

As many of you are probably aware, I have been doing an annual blog post on who contributes to PostgreSQL development for some years now. It includes information on lines of code committed to PostgreSQL, and also emails sent to the list. This year, I got a jump on analyzing the commit log, and a draft of the data covering January-November of 2024 has been uploaded in pg_dump format to here:

https://sites.google.com/site/robertmhaas/contributions

I'm sending this message to invite anyone who is interested to review the data in the commits2024 table and send me corrections. For example, it's possible that there are cases where I've failed to pick out the correct primary author for a commit; or where somebody's name is spelled in two different ways; or where somebody's name is not spelled the way that they prefer.

You'll notice that the table has columns "lines" and "xlines". I have set xlines=0 in cases where (a) I considered the commit to be a large, mechanical commit such as a pgindent run or translation updates; or (b) the commit was reverting some other commit that occurred earlier in 2024; or (c) the commit was subsequently reverted. When I run the final statistics, those commits will still count for the statistics that count the number of commits, but the lines they inserted will not be counted as lines of code contributed in 2024.

Also for clarity, please be aware that the "ncauthor" column is not used in the final reporting; that is just there so that I can set author=coalesce(ncauthor,committer) at a certain phase of the data preparation. Corrections should be made to the author column, not ncauthor.

If you would like to correct the data, please send me your corrections off-list, as a reply to this email, ideally in the form of one or more UPDATE statements.
If you would like to complain about the methodology, I can't stop you, but please bear in mind that (1) this is already a lot of work and (2) I've always been upfront in my blog post about what the limitations of the methodology are and I do my best not to suggest that this method is somehow perfect or unimpeachable and (3) you're welcome to publish your own blog post where you compute things differently. I'm open to reasonable suggestions for improvement, but if your overall view is that this sucks or that I suck for doing it, I'm sorry that you feel that way but giving me that feedback probably will not induce me to do anything differently.

Donning my asbestos underwear, I remain yours faithfully,

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Remove useless casts to (void *)
Peter Eisentraut writes:
> Committed, thanks.

Now that we have a more-or-less full set of buildfarm results on this, I checked for new warnings, and found two:

pg_shmem.c: In function 'PGSharedMemoryIsInUse':
pg_shmem.c:323:33: warning: passing argument 1 of 'shmdt' from incompatible pointer type [-Wincompatible-pointer-types]
  323 |         if (memAddress && shmdt(memAddress) < 0)
      |                                 ^~~~~~~~~~
      |                                 |
      |                                 PGShmemHeader *
In file included from pg_shmem.c:27:
/usr/include/sys/shm.h:131:11: note: expected 'char *' but argument is of type 'PGShmemHeader *'
  131 | int shmdt(char *);
      |           ^~~~~~

pg_shmem.c: In function 'PGSharedMemoryCreate':
pg_shmem.c:838:37: warning: passing argument 1 of 'shmdt' from incompatible pointer type [-Wincompatible-pointer-types]
  838 |         if (oldhdr && shmdt(oldhdr) < 0)
      |                             ^~~~~~
      |                             |
      |                             PGShmemHeader *
/usr/include/sys/shm.h:131:11: note: expected 'char *' but argument is of type 'PGShmemHeader *'
  131 | int shmdt(char *);
      |           ^~~~~~

This is from hake[1], which is running OpenIndiana/illumos. That platform shows a couple of other strange warnings, so maybe re-eliminating these two is not worth worrying about, but nonetheless the casts to void * were doing something here.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hake&dt=2024-12-02%2017%3A19%3A40&stg=make
Re: Add “FOR UPDATE NOWAIT” lock details to the log.
On 2024/10/18 19:07, Seino Yuki wrote:
>> The choice between adding a new GUC or extending the existing one
>> (e.g., log_lock_waits) is debatable, but I prefer the latter. I'm
>> considering extending log_lock_waits to accept a value like "fail". If
>> set to "on" (the current behavior), detailed logs are generated when
>> the lock wait time exceeds deadlock_timeout. If set to "fail", logs
>> are generated whenever a lock wait fails. If both are specified, logs
>> would be triggered when the wait time exceeds deadlock_timeout or when
>> a lock wait fails.
>
> Thanks for the idea. Changed log_lock_waits to an enum type and added
> fail and all.
>
> "off"  : No log message (default).
> "on"   : If over deadlock_timeout (the current behavior).
> "fail" : If lock failed.
> "all"  : All patterns.

I'm still thinking about how we should handle logging for lock failures caused by the nowait option. Extending the existing log_lock_waits parameter has the advantage of avoiding a new GUC, but it might make configuration more complicated. I'm starting to think adding a new GUC might be a better option.

Regarding the patch, when I applied it to HEAD, it failed to compile with the following errors. Could you update the patch to address this?

proc.c:1538:20: error: use of undeclared identifier 'buf'
 1538 |         initStringInfo(&buf);
      |                         ^
proc.c:1539:20: error: use of undeclared identifier 'lock_waiters_sbuf'
 1539 |         initStringInfo(&lock_waiters_sbuf);
      |                         ^
proc.c:1540:20: error: use of undeclared identifier 'lock_holders_sbuf'
 1540 |         initStringInfo(&lock_holders_sbuf);
      |                         ^

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 11:43:42 -0500, Tom Lane wrote:
> Peter Geoghegan writes:
> > This theory seems very believable.
>
> I'm not convinced. I think there are two assumptions underlying
> bitmap scan:
>
> 1. Regardless of index contents, it's not okay for vacuum to remove
> tuples that an open transaction could potentially see. So the heap
> tuple will be there if we look, unless it was already dead (in which
> case it could have been replaced, so we have to check visibility of
> whatever we find).

I think the problematic scenario involves tuples that *nobody* can see. During the bitmap index scan we don't know that though. Thus the tid gets inserted into the bitmap. Then, before we visit the heap, a concurrent vacuum removes the tuple from the indexes and then the heap and marks the page as all-visible, as the deleted row version has been removed. Then, during the bitmap heap scan, we do a VM lookup and see the page being all-visible and thus assume there's a visible tuple pointed to by the tid.

No snapshot semantics protect against this, as the tuple is *not* visible to anyone.

Greetings,

Andres Freund
Re: Incorrect result of bitmap heap scan.
Peter Geoghegan writes: > On Mon, Dec 2, 2024 at 12:02 PM Tom Lane wrote: >> Yup. I am saying that that qualifies as too-aggressive setting of the >> all-visible bit. I'm not sure what rule we should adopt instead of >> the current one, but I'd much rather slow down page freezing than >> institute new page locking rules. > Freezing a page, and setting a page all-visible are orthogonal. Sorry, sloppy wording on my part. regards, tom lane
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 12:02 PM Tom Lane wrote: > Andres Freund writes: > > I think the problematic scenario involves tuples that *nobody* can see. > > During > > the bitmap index scan we don't know that though. Thus the tid gets inserted > > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes > > the tuple from the indexes and then the heap and marks the page as > > all-visible, as the deleted row version has been removed. > > Yup. I am saying that that qualifies as too-aggressive setting of the > all-visible bit. I'm not sure what rule we should adopt instead of > the current one, but I'd much rather slow down page freezing than > institute new page locking rules. Freezing a page, and setting a page all-visible are orthogonal. Nothing has changed about when or how we set pages all-visible in a long time -- VACUUM has always done that to the maximum extent that its OldestXmin cutoff will allow. (The changes to freezing made freezing work a little bit more like that, the general idea being to batch-up work.) -- Peter Geoghegan
Re: Showing applied extended statistics in explain Part 2
On 19.11.2024 00:38, Tomas Vondra wrote:
> On 11/18/24 22:15, Tomas Vondra wrote:
>> ...
>> So I think the correct solution is to not pass any expressions with
>> RestrictInfo to deparse_expression(). Either by stripping the nodes, or
>> by not adding them at all. The patch tries to do the stripping by
>> maybe_extract_actual_clauses(), but that only looks at the top node, and
>> that is not sufficient here.
>>
>> Maybe it would be possible to walk the whole tree, and remove all the
>> RestrictInfo nodes - including intermediate ones, not just the top. But
>> I'm not quite sure it wouldn't cause issues elsewhere (assuming it
>> modifies the existing nodes). It still feels a bit like fixing a problem
>> we shouldn't really have ...
>
> To make this idea a bit more concrete, here's a patch removing all
> RestrictInfo nodes in show_stat_qual(). But still, it feels wrong.
>
> regards

Yes, removing all RestrictInfo nodes during deparsing using expression_tree_mutator() is not optimal. However, I don't see an alternative. Perhaps it could be done earlier in extended_stats.c to avoid the need for cleanup later ...

-- 
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
[18] Unintentional behavior change in commit e9931bfb75
Commit e9931bfb75 (version 18) contained an unexpected behavior change. LOWER('I') returns:

In the locale tr_TR.iso88599 (single-byte encoding):

            17  18
default     i   ı
specified   ı   ı

In the locale tr_TR.utf8:

            17  18
default     ı   ı
specified   ı   ı

(Look carefully to see the dotted vs dotless "i".)

The behavior is commented (commit 176d5bae1d) in formatting.c:

     * ... When using the default
     * collation, we apply the traditional Postgres behavior that
     * forces ASCII-style treatment of I/i, but in non-default
     * collations you get exactly what the collation says.
     */
    for (p = result; *p; p++)
    {
        if (mylocale)
            *p = tolower_l((unsigned char) *p, mylocale->info.lt);
        else
            *p = pg_tolower((unsigned char) *p);
    }

That's a somewhat strange special case (along with similar ones for INITCAP() and UPPER()) that applies to single-byte encodings with the libc provider and the database default collation only. I assume it was done for backwards compatibility?

My commit e9931bfb75 (version 18) unifies the code paths for default and explicit collations, and in the process it eliminates the special case, and always uses tolower_l() for single-byte libc (whether default collation or not).

Should I put the special case back? If not, it could break expression indexes on LOWER()/UPPER() after an upgrade for users with the database default collation of tr_TR who use libc and a single-byte encoding. But preserving the special case seems weirdly inconsistent -- surely the results should not depend on the encoding, right?

Regards,
	Jeff Davis
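The difference between the two code paths can be sketched like this (Python stand-ins for illustration; the Turkish mapping table is hand-rolled, not libc's actual tolower_l()):

```python
# pg_tolower() only lowercases ASCII A-Z; a Turkish-aware tolower_l() maps
# 'I' to dotless 'ı' (U+0131) and 'İ' (U+0130) to 'i'.  Both functions here
# are simplified stand-ins, not PostgreSQL or libc code.

def pg_tolower(c: str) -> str:
    # ASCII-style treatment: the pre-18 default-collation special case
    return chr(ord(c) + 32) if "A" <= c <= "Z" else c

def tolower_l_tr(c: str) -> str:
    # What the tr_TR collation actually says
    return {"I": "\u0131", "\u0130": "i"}.get(c) or c.lower()

print(pg_tolower("I"))    # 'i'  (dotted: old behavior for default collation)
print(tolower_l_tr("I"))  # 'ı'  (dotless: new unified behavior)
```

This is exactly why an expression index on LOWER(col) built under the old rule would no longer match after the change.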
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 3:55 PM Andres Freund wrote:
> I suspect one contributor to this avoiding attention till now was that the
> optimization is fairly narrow:
>
>     /*
>      * We can potentially skip fetching heap pages if we do not need
>      * any columns of the table, either for checking non-indexable
>      * quals or for returning data.  This test is a bit simplistic, as
>      * it checks the stronger condition that there's no qual or return
>      * tlist at all.  But in most cases it's probably not worth working
>      * harder than that.
>      */
>     need_tuples = (node->ss.ps.plan->qual != NIL ||
>                    node->ss.ps.plan->targetlist != NIL);
>
> Even an entry in the targetlist that does not depend on the current row
> disables the optimization.

Good point. I agree that that factor is likely to have masked the problem over the past 6 years.

-- 
Peter Geoghegan
Re: Count and log pages set all-frozen by vacuum
On Fri, Nov 29, 2024 at 2:46 PM Masahiko Sawada wrote:
> Finally, in case where we have:
>
> visibility map: 1 pages set all-visible, 1 pages set all-frozen.
>
> We can understand that 1 pages newly became all-frozen, but have
> no idea how many pages became all-visible but not all-frozen. It could
> be even 0. Users might want to know it to understand how (auto)vacuum
> and freezing are working well.

I agree that this is possible, but I am not clear why the user should care. In scenario #1, the same pages were set all-visible and also all-frozen. In scenario #2, one set of pages were set all-visible and a different set of pages were set all-frozen. Scenario #2 is a little worse, for a couple of reasons.

First, in scenario #2, more pages were dirtied to achieve the same result. However, if the user is concerned about the amount of I/O that vacuum is doing, they will more likely look at the "buffer usage" and "WAL usage" lines in the VACUUM verbose output rather than at the "visibility map" line.

Second, in scenario #2, we have deferred some work to the future and there is a risk that the pages will remain all-visible but not all-frozen until the next aggressive vacuum. I think it is possible someone could want to see more detailed information for this reason. However, I expect that it will be unimportant in a practical sense if Melanie's work in this area gets committed, because her goal is to set things up so that we'll revisit all-visible but not all-frozen pages sooner, so that the work doesn't build up so much prior to the next aggressive vacuum. And I think that work will have its own logging that will make it clear what it is doing, hence I don't foresee the need for more detail here.

All that said, if you really want this broken out into three categories rather than two, I'm not overwhelmingly opposed to that.
It is always possible that more detail could be useful; it is just difficult to decide where the likelihood is high enough to justify adding more output. -- Robert Haas EDB: http://www.enterprisedb.com
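The ambiguity being discussed can be shown with a toy counter (an invented representation: each page's VM state is modeled as an (all_visible, all_frozen) pair, which is not how the real visibility map is stored):

```python
# Toy accounting for the two scenarios compared above.  The report mimics
# the shape of the "visibility map: N pages set all-visible, M pages set
# all-frozen" log line.

def vacuum_report(transitions):
    """transitions: list of ((av_before, af_before), (av_after, af_after))."""
    set_av = sum(1 for before, after in transitions if not before[0] and after[0])
    set_af = sum(1 for before, after in transitions if not before[1] and after[1])
    return set_av, set_af

# Scenario 1: one page becomes all-visible and all-frozen at the same time.
s1 = [((False, False), (True, True))]

# Scenario 2: one page becomes all-visible only, and a different page
# (already all-visible) becomes all-frozen -- two pages dirtied.
s2 = [((False, False), (True, False)),
      ((True, False), (True, True))]

print(vacuum_report(s1))  # (1, 1)
print(vacuum_report(s2))  # (1, 1) -- identical log line despite more work
```

Both scenarios report the same counts, which is the information loss Sawada-san points out; breaking out "all-visible but not all-frozen" as a third counter would distinguish them.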
Incorrect result of bitmap heap scan.
Hi hackers,

The attached script reproduces the problem of incorrect results from `select count(*)` (it returns a larger number of records than are really present in the table). It is not always reproduced, so you may need to repeat it multiple times - on my system it failed 3 times out of 10.

The problem takes place with pg16/17/18 (other versions I have not checked). The test is called `test_ios` (index-only scan), but that is not accurate: an index-only scan is not actually used in this case. And this is the first question to PG17/18: IOS is not used when the number of records is less than 100k (for this particular table):

postgres=# create table t(pk integer primary key);
CREATE TABLE
postgres=# set enable_seqscan = off;
SET
postgres=# set enable_indexscan = off;
SET
postgres=# insert into t values (generate_series(1,1000));
INSERT 0 1000
postgres=# vacuum t;
VACUUM
postgres=# explain select count(*) from t;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=43.02..43.03 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=25.52..40.52 rows=1000 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..25.27 rows=1000 width=0)
(3 rows)

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                         QUERY PLAN
------------------------------------------------------------
 Aggregate  (cost=17.50..17.51 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..15.00 rows=1000 width=0)
         Disabled: true
(3 rows)

So, as you can see, Postgres prefers to use the disabled seqscan, but not IOS.
This is different from pg16, where disabling bitmap scan makes the optimizer choose an index-only scan:

postgres=# explain select count(*) from t;
                        QUERY PLAN
----------------------------------------------------------
 Aggregate  (cost=41.88..41.88 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..35.50 rows=2550 width=0)
(2 rows)

postgres=# set enable_seqscan = off;
SET
postgres=# explain select count(*) from t;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=75.54..75.55 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=33.67..69.17 rows=2550 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..33.03 rows=2550 width=0)
(3 rows)

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=45.77..45.78 rows=1 width=8)
   ->  Index Only Scan using t_pkey on t  (cost=0.28..43.27 rows=1000 width=0)
(2 rows)

This is strange behavior of pg17, which for some reason rejects IOS (though it is used if the number of records in the table is 100k or more).

But the main problem is that the chosen plan, Bitmap Heap Scan + Bitmap Index Scan, may return an incorrect result. Replacing `select count(*)` with `select count(pk)` eliminates the problem, as does disabling autovacuum. It seems clear that the problem is with the visibility map. We have the following code in heap bitmap scan:

	/*
	 * We can skip fetching the heap page if we don't need any fields from the
	 * heap, the bitmap entries don't need rechecking, and all tuples on the
	 * page are visible to our transaction.
	 */
	if (!(scan->rs_flags & SO_NEED_TUPLES) &&
		!tbmres->recheck &&
		VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
	{
		/* can't be lossy in the skip_fetch case */
		Assert(tbmres->ntuples >= 0);
		Assert(hscan->rs_empty_tuples_pending >= 0);

		hscan->rs_empty_tuples_pending += tbmres->ntuples;

		return true;
	}

So if we do not need tuples (the count(*) case) and the page is marked as all-visible in the VM, then we just count tbmres->ntuples elements without extra checks.
I am not so familiar with the internals of the executor, but it is not clear to me how we avoid a race condition between the VM update and the heap bitmap scan. Assume that the bitmap index scan marks all tids available in the index. Some elements in this bitmap can refer to old (invisible) versions. Then vacuum comes, removes the dead elements, and marks the page as all-visible. After that we start the heap bitmap scan, see that the page is all-visible, and count all marked elements on this page, including dead ones (which are not present in the page any more). Which lock or check should prevent such a scenario?

import random
import threading
import time
import psycopg2

def test_ios():
    con = psycopg2.connect(database="postgres")
    n_records = 1000
    n_oltp_writers = 10
    n_oltp_readers = 5
    n_olap_readers = 1
    con.autocommit = True
    cur = con.cursor()
    cur.execute("DROP table if exists t")
    cur.execute("CREATE TABLE t(pk bigint not null)")
    cur.execute(f"insert into t values (generate_series(1,{n_records}))")
    cur.execute("create index on t(pk)")
    cur.execute("vacuum t")
    running = True
    def oltp_writer():
        con = psycopg2.connect(database="postgres")
        con.autocommit = True
        cur = con.cursor()
        while running:
            pk = random.randrang
Re: Incorrect result of bitmap heap scan.
Andres Freund writes: > I think the problematic scenario involves tuples that *nobody* can see. During > the bitmap index scan we don't know that though. Thus the tid gets inserted > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes > the tuple from the indexes and then the heap and marks the page as > all-visible, as the deleted row version has been removed. Yup. I am saying that that qualifies as too-aggressive setting of the all-visible bit. I'm not sure what rule we should adopt instead of the current one, but I'd much rather slow down page freezing than institute new page locking rules. regards, tom lane
Re: Memory leak in WAL sender with pgoutput (v10~)
On 2024-Dec-02, Michael Paquier wrote:

> I am slightly concerned about the current design of GetPublication()
> in the long-term, TBH.  LoadPublications() has hidden the leak behind
> two layers of routines in the WAL sender, and that's easy to miss once
> you call anything that loads a Publication depending on how the caller
> caches its data.  So I would still choose for modifying the structure
> on HEAD removing the pstrdup() for the publication name.  Anyway, your
> suggestion is also OK by me on top of that, that's less conflicts in
> all the branches.

TBH I'm not sure that wastefully allocating NAMEDATALEN for each relation is so great. Our strategy for easing memory management is to use appropriately timed contexts.

I guess if you wanted to make a publication a single palloc block (so that it's easy to free) and not waste so much memory, you could stash the name string at the end of the struct. I think that'd be a change wholly contained in GetPublication.

-- 
Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/
Are you not unsure you want to delete Firefox?
       [Not unsure]  [Not not unsure]  [Cancel]
http://smylers.hates-software.com/2008/01/03/566e45b2.html
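The memory difference being described can be ballparked with a toy layout (the fixed-field set here is an invented subset for illustration, not the real Publication struct from pg_publication.h):

```python
# Compare a fixed NAMEDATALEN name buffer against stashing the
# NUL-terminated name at the end of a single allocation.  Field choices
# (oid + one flag) are hypothetical; only the sizing idea matters.
import struct

NAMEDATALEN = 64
FIXED_FIELDS = struct.Struct("<I?")  # 4-byte oid + 1-byte flag (invented)

def alloc_fixed(oid, alltables, name):
    # One block, but the name always occupies NAMEDATALEN bytes
    return FIXED_FIELDS.pack(oid, alltables) + name.encode().ljust(NAMEDATALEN, b"\0")

def alloc_trailing(oid, alltables, name):
    # Still one block (easy to free), name stored inline at the end
    return FIXED_FIELDS.pack(oid, alltables) + name.encode() + b"\0"

a = alloc_fixed(16384, True, "pub")
b = alloc_trailing(16384, True, "pub")
print(len(a), len(b))  # 69 9
```

Both layouts remain a single allocation that can be freed at once; the trailing-name variant just avoids the padding waste for typical short publication names.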
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 2:43 PM Andres Freund wrote: > Attached is an isolationtest that reliably shows wrong query results. Nice approach with the cursor! I took what you wrote, and repurposed it to prove my old theory about GiST index-only scans being broken due to the lack of an appropriate interlock against concurrent TID recycling. See the attached patch. -- Peter Geoghegan v1-0001-isolationtester-showing-broken-index-only-scans-w.patch Description: Binary data
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 3:56 PM Peter Geoghegan wrote: > I took what you wrote, and repurposed it to prove my old theory about > GiST index-only scans being broken due to the lack of an appropriate > interlock against concurrent TID recycling. See the attached patch. BTW, if you change the test case to use the default B-Tree index AM (by removing "USING GIST"), you'll see that VACUUM blocks on acquiring a cleanup lock (and so the test just times out). The problem is that GiST VACUUM just doesn't care about cleanup locks/TID recycling safety -- though clearly it should. -- Peter Geoghegan
Re: Vacuum statistics
Hi! Thank you for your review!

On 30.11.2024 07:48, Kirill Reshke wrote:

> Hello! After a brief glance, I think this patch set is good. But there
> isn't any more time in the current CF to commit this :(. So I moved it
> to the next CF.

I agree with you. Thank you!)

> I also like the 0001 commit message. This commit message is quite large
> and easy to understand. Actually, it might be too big. Perhaps rather
> than being in the commit message, the final paragraph (pages_frozen -
> number of pages that..) needs to be part of the documentation. Perhaps
> delete the explanation on pages_frozen that we have in 0004?

To be honest, I don't quite understand what you're suggesting. Are you suggesting moving the explanation about pages_frozen from the commit message to the documentation, or fixing something in the documentation about pages_frozen? Can you please explain?

-- 
Regards,
Alena Rybakina
Postgres Professional
Re: Vacuum statistics
On 02.12.2024 17:46, Ilia Evdokimov wrote: In my opinion, the patches are semantically correct. However, not all dead code has been removed - I'm referring to pgstat_update_snapshot(). Also, the tests need to be fixed. Thank you, I'll fix it -- Regards, Alena Rybakina Postgres Professional
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 11:32 AM Andres Freund wrote:
> On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote:
> > Concurrency timeline:
> >
> > Session 1. amgetbitmap() gets snapshot of index contents, containing
> > references to dead tuples on heap P1.
>
> IIUC, an unstated important aspect here is that these dead tuples are *not*
> visible to S1. Otherwise the VACUUM in the next step could not remove the
> dead tuples.

I would state the same thing slightly differently, FWIW: I would say that the assumption that has been violated is that a TID is a stable identifier for a given index tuple/logical row/row version (stable for the duration of the scan).

This emphasis/definition seems slightly more useful, because it makes it clear why this is hard to hit: you have to be fairly unlucky for a dead-to-everyone TID to be returned by your scan (no LP_DEAD bit can be set) and set in the scan's bitmap, only to later be concurrently set LP_UNUSED in the heap -- all without VACUUM randomly being prevented from setting the same page all-visible due to some unrelated not-all-visible heap item making it unsafe.

> Ugh :/
>
> Pretty depressing that we've only found this now, ~6 years after it's been
> released. We're lacking some tooling to find this stuff in a more automated
> fashion.

FWIW I have suspicions about similar problems with index-only scans that run in hot standby, and about all GiST index-only scans:

https://postgr.es/m/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5q...@mail.gmail.com

In general there seems to be a lack of rigor in this area. I'm hoping that Tomas Vondra's work can tighten things up here in passing.

> It seems pretty much impossible to fix that with the current interaction
> between nodeBitmap* and indexam. I wonder if we could, at least in some
> cases, and likely with some heuristics / limits, address this by performing
> some visibility checks during the bitmap build.
> I'm bringing it up here because I suspect that some of the necessary
> changes would overlap with what's needed to repair bitmap index-only
> scans.

This seems like it plays into some of the stuff I've discussed with Tomas, about caching visibility information in local state, as a means of avoiding holding onto pins in the index AM. For things like mark/restore support.

> > This is quite non-trivial, however, as we don't have much in place
> > regarding per-relation shared structures which we could put such a value
> > into, nor a good place to signal changes of the value to bitmap
> > heap-scanning backends.
>
> It doesn't have to be driven off table state, it could be state in
> indexes. Most (all?) already have a metapage.

FWIW GiST doesn't have a metapage (a historical oversight).

> Unfortunately I don't see a better path either.
>
> I think it'd be good if we added a test that shows the failure mechanism so
> that we don't re-introduce this in the future. Evidently this failure isn't
> immediately obvious...

Maybe a good use for injection points?

-- 
Peter Geoghegan
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 13:39:43 -0500, Peter Geoghegan wrote:
> I guess it's natural to suspect more recent work -- commit 7c70996e is
> about 6 years old. But the race condition that I suspect is at play
> here is very narrow.

FWIW, the test I just posted shows the issue down to 11 (although for 11 one has to remove the "(TRUNCATE false)"). 10 returns correct results.

I don't think the race is particularly narrow. Having a vacuum complete between the start of the bitmap indexscan and the end of the bitmap heapscan leaves a lot of time with an expensive query.

I suspect one contributor to this avoiding attention till now was that the optimization is fairly narrow:

	/*
	 * We can potentially skip fetching heap pages if we do not need
	 * any columns of the table, either for checking non-indexable
	 * quals or for returning data.  This test is a bit simplistic, as
	 * it checks the stronger condition that there's no qual or return
	 * tlist at all.  But in most cases it's probably not worth working
	 * harder than that.
	 */
	need_tuples = (node->ss.ps.plan->qual != NIL ||
				   node->ss.ps.plan->targetlist != NIL);

Even an entry in the targetlist that does not depend on the current row disables the optimization. Due to not being able to return any content for those rows, it's also somewhat hard to actually notice that the results are wrong...

> It's pretty unlikely that there'll be a dead-to-all TID returned to a
> scan (not just dead to our MVCC snapshot, dead to everybody's) that is
> subsequently concurrently removed from the index, and then set
> LP_UNUSED in the heap.
> It's probably impossible if you don't have a small table -- VACUUM just
> isn't going to be fast enough to get to the leaf page after the bitmap
> index scan, but still be able to get to the heap before its corresponding
> bitmap heap scan (that uses the VM as an optimization) can do the relevant
> visibility checks (while it could happen with a large table and a slow
> bitmap scan, the chances of the VACUUM being precisely aligned with the
> bitmap scan, in just the wrong way, seem remote in the extreme).

A cursor, waiting for IO, waiting for other parts of the query, ... can all make this window almost arbitrarily large.

Greetings,

Andres Freund
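The need_tuples test Andres quotes can be restated as a one-line predicate (a sketch mirroring the quoted C condition, not executor code; the plan shapes are just illustrative lists of expression strings):

```python
# The skip-fetch optimization is only considered when the plan has no qual
# and no targetlist at all -- the "stronger condition" from the comment.

def skip_fetch_possible(qual, targetlist):
    need_tuples = bool(qual) or bool(targetlist)
    return not need_tuples

assert skip_fetch_possible([], [])              # SELECT count(*) FROM t
assert not skip_fetch_possible([], ["1 + 1"])   # any tlist entry disables it
assert not skip_fetch_possible(["pk > 0"], [])  # so does any non-indexable qual
```

This is why the bug only surfaces for bare count(*)-style queries, which also return no per-row data that could reveal the wrong answer.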
Re: Memory leak in WAL sender with pgoutput (v10~)
On 2024-Dec-02, Amit Kapila wrote: > We already have PGOutputData->cachectx which could be used for it. > I think we should be able to reset such a context when we are > revalidating the publications. I don't see that context being reset anywhere, so I have a hard time imagining that this would work without subtle risk of breakage elsewhere. > Even, if we want a new context for some localized handling, we should > add that in PGOutputData rather than a local context as the proposed > patch is doing at the very least for HEAD. I don't necessarily agree, given that this context is not needed anywhere else. > Can't we consider freeing the publication names individually that can > be backpatchable and have no or minimal risk of breaking anything? That was my first thought, but then it occurred to me that such a thing would be totally pointless. > > you call anything that loads a Publication depending on how the caller > > caches its data. So I would still choose for modifying the structure > > on HEAD removing the pstrdup() for the publication name. > > BTW, the subscription structure also used the name in a similar way. > This will make the publication/subscription names handled differently. True (with conninfo, slotname, synccommit, and origin). FWIW it seems FreeSubscription is incomplete, and not only because it fails to free the publication names ... (Why are we storing a string in Subscription->synccommit?) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Los trabajadores menos efectivos son sistematicamente llevados al lugar donde pueden hacer el menor daño posible: gerencia." (El principio Dilbert)
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 12:15 PM Peter Geoghegan wrote: > The underlying reason why nbtree can discriminate like this is that it > "knows" that plain index scans will always visit the heap proper. If a > TID points to an LP_UNUSED item, then it is considered dead to the > scan (even though in general the heap page itself might be marked > all-visible). If some completely unrelated, newly inserted heap tuple > is found instead, then it cannot be visible to the plain index scan's > MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get > dropped like this). If I add "SET enable_indexonlyscan = false" to the "setup" section of the v1-0001-isolationtester-showing-broken-index-only-scans-w.patch isolationtester test case I posted earlier today, the test will pass. There is no reason to think that plain GiST index scans are broken. The fact that GiST VACUUM won't acquire cleanup locks is a problem *only* because GiST supports index-only scans (AFAIK no other thing within GiST is broken). My point is that index-only scans are generally distinct from plain index scans from an interlocking point of view -- they're not equivalent (not quite). And yet the "62.4. Index Locking Considerations" docs nevertheless say nothing about index-only scans in particular (only about bitmap scans). The docs do say that an MVCC snapshot is protective, though -- which makes it sound like GiST can't be doing anything wrong here (GiST only uses MVCC snapshots). Actually, I don't think that GiST would be broken at all if we'd simply never added support for index-only scans to GiST. I'm fairly sure that index-only scans didn't exist when the bulk of this part of the docs was originally written. ISTM that we ought to do something about these obsolete docs, after we've fixed the bugs themselves. -- Peter Geoghegan
Re: Remove useless casts to (void *)
Thomas Munro writes: > I wouldn't change that. illumos is selecting the old pre-standard > declaration here, but it knows the standard one: > https://github.com/illumos/illumos-gate/blob/27ecbff00d8c86a2647d6fe325cacb220d712115/usr/src/uts/common/sys/shm.h#L129 > https://illumos.org/man/2/shmdt Oh! Kind of looks like we should be defining _POSIX_C_SOURCE=200112L, at least on that platform. > I don't know why we have only one tiny issue if the system headers > think we want pre-POSIX stuff. Agreed, I'd have expected more trouble than this. But persuading the system headers that we want a POSIX version from this century seems like it might be a good idea to forestall future issues. I'm inclined to propose adding something like CPPFLAGS="$CPPFLAGS -D_POSIX_C_SOURCE=200112L" to src/template/solaris. Not sure if we have a corresponding mechanism for meson, though. regards, tom lane
Re: Remove useless casts to (void *)
Hi,

On 2024-12-02 17:11:30 -0500, Tom Lane wrote:
> I'm inclined to propose adding something like
>
> CPPFLAGS="$CPPFLAGS -D_POSIX_C_SOURCE=200112L"
>
> to src/template/solaris.  Not sure if we have a corresponding
> mechanism for meson, though.

elif host_system == 'sunos'
  portname = 'solaris'
  export_fmt = '-Wl,-M@0@'
  cppflags += '-D_POSIX_PTHREAD_SEMANTICS'

Should be trivial to add there.

Greetings,

Andres Freund
Re: optimize file transfer in pg_upgrade
Here is a rebased patch set for cfbot. I'm planning to spend some time getting these patches into a more reviewable state in the near future.

-- 
nathan

>From 81fe66e0f0aa4f958a8707df669f60756c89bb85 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Tue, 5 Nov 2024 15:59:51 -0600
Subject: [PATCH v2 1/8] Export walkdir().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to swap catalog files between
database directories during pg_upgrade.
---
 src/common/file_utils.c         | 5 +
 src/include/common/file_utils.h | 3 +++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334..3f488bf5ec 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -48,9 +48,6 @@
 #ifdef PG_FLUSH_DATA_WORKS
 static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
-static void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
 
 #ifdef HAVE_SYNCFS
@@ -268,7 +265,7 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
  */
-static void
+void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
 		bool process_symlinks)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index e4339fb7b6..5a9519acfe 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -39,6 +39,9 @@ extern void sync_pgdata(const char *pg_data, int serverVersion,
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
+extern void walkdir(const char *path,
+					int (*action) (const char *fname, bool isdir),
+					bool process_symlinks);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

>From 36d6a1aad5cbfeb05954886bb336cfa9ec01c5c3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Wed, 6 Nov 2024 13:59:39 -0600
Subject: [PATCH v2 2/8] Add "void *arg" parameter to walkdir() that is passed
 to function.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This will be used in follow up commits to pass private state to the
functions called by walkdir().
---
 src/bin/pg_basebackup/walmethods.c | 8 +++
 src/bin/pg_dump/pg_backup_custom.c | 2 +-
 src/bin/pg_dump/pg_backup_tar.c    | 2 +-
 src/bin/pg_dump/pg_dumpall.c       | 2 +-
 src/common/file_utils.c            | 38 +++---
 src/include/common/file_utils.h    | 6 ++---
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 215b24597f..51640cb493 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -251,7 +251,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	 */
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tmppath, false) != 0 ||
+		if (fsync_fname(tmppath, false, NULL) != 0 ||
 			fsync_parent_path(tmppath) != 0)
 		{
 			wwmethod->lasterrno = errno;
@@ -486,7 +486,7 @@ dir_close(Walfile *f, WalCloseMethod method)
 	 */
 	if (f->wwmethod->sync)
 	{
-		r = fsync_fname(df->fullpath, false);
+		r = fsync_fname(df->fullpath, false, NULL);
 		if (r == 0)
 			r = fsync_parent_path(df->fullpath);
 	}
@@ -617,7 +617,7 @@ dir_finish(WalWriteMethod *wwmethod)
 	 * Files are fsynced when they are closed, but we need to fsync the
 	 * directory entry here as well.
	 */
-	if (fsync_fname(dir_data->basedir, true) != 0)
+	if (fsync_fname(dir_data->basedir, true, NULL) != 0)
 	{
 		wwmethod->lasterrno = errno;
 		return false;
@@ -1321,7 +1321,7 @@ tar_finish(WalWriteMethod *wwmethod)
 
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tar_data->tarfilename, false) != 0 ||
+		if (fsync_fname(tar_data->tarfilename, false, NULL) != 0 ||
 			fsync_parent_path(tar_data->tarfilename) != 0)
 		{
 			wwmethod->lasterrno = errno;
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb2..51edf147d6 100644
--- a/src/bin/pg_dump/pg_backup_
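The point of the second patch -- threading caller state through the directory walk instead of relying on globals -- looks like this in a Python analog (walkdir() here is a hypothetical stand-in built on os.walk, not the patched C function):

```python
# Python analog of walkdir()'s new "void *arg" parameter: the callback
# receives an opaque state object supplied by the caller, so no globals
# are needed.  All names here are invented for illustration.
import os
import tempfile

def walkdir(path, action, arg):
    # Visit every file and directory under path, passing arg through
    for root, dirs, files in os.walk(path):
        for f in files:
            action(os.path.join(root, f), False, arg)
        for d in dirs:
            action(os.path.join(root, d), True, arg)

def count_entries(fname, isdir, arg):
    # Callback: accumulate counts into the caller-provided state
    arg["dirs" if isdir else "files"] += 1

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    open(os.path.join(tmp, "f1"), "w").close()
    counts = {"files": 0, "dirs": 0}
    walkdir(tmp, count_entries, counts)
    print(counts)  # {'files': 1, 'dirs': 1}
```

In C the same effect is achieved by the extra `void *arg` pointer passed to each action callback, as the patch does for fsync_fname() and friends.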
Re: CREATE SCHEMA ... CREATE DOMAIN support
Vik Fearing writes:
> On 02/12/2024 03:15, Tom Lane wrote:
>> Also, if SQL intended to constrain the search path for unqualified
>> identifiers to be only the new schema, they'd hardly need a concept
>> of <path specification> at all.

> I looked up the original paper (MUN-051) that introduced the <path
> specification> and it says, "The paper is proposing the use of
> paths only to resolve unqualified routine invocations."

Interesting. But still, the spec allows <routine invocation> within <schema definition>, so even that narrow interpretation opens them to the is-this-an-external-reference-or-a-forward-reference problem.

For us, that's clouded further for functions by our overloading rules. If foo(bigint) exists in the search path, and we have a view or whatever that references foo() with an int argument, and there is a CREATE FUNCTION for foo(float8) later in the <schema definition>, what are we supposed to think is the user's intent?

(Just to save people doing the experiment: we'd prefer foo(float8) if both are visible, but foo(bigint) would be perfectly acceptable if not. Other choices of the argument types would yield different results, and none of them seem especially open-and-shut to me.)

I don't know offhand if the spec allows function overloading in the same way.

			regards, tom lane
Re: Proposal: Role Sandboxing for Secure Impersonation
> On 2 Dec 2024, at 18:39, Joe Conway wrote: > > I am very much in favor of functionality of this sort being built in to the > core database. Very similar functionality is available in an extension I > wrote years ago (without the SQL grammar support) -- see > https://github.com/pgaudit/set_user Coincidentally, I’ve created https://github.com/pgaudit/set_user/issues/87 BTW thanks for set_user - really needed and appreciated. — Michal
Re: Incorrect result of bitmap heap scan.
Hi, On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote: > Concurrency timeline: > > Session 1. amgetbitmap() gets snapshot of index contents, containing > references to dead tuples on heap P1. IIUC, an unstated important aspect here is that these dead tuples are *not* visible to S1. Otherwise the VACUUM in the next step could not remove the dead tuples. > Session 2. VACUUM runs on the heap, removes TIDs for P1 from the > index, deletes those TIDs from the heap pages, and finally sets heap > pages' VM bits to ALL_VISIBLE, including the now cleaned page P1 > Session 1. Executes the bitmap heap scan that uses the bitmap without > checking tuples on P1 due to ALL_VISIBLE and a lack of output columns. Ugh :/ Pretty depressing that we've only found this now, ~6 years after it's been released. We're lacking some tooling to find this stuff in a more automated fashion. > PS. > I don't think the optimization itself is completely impossible, and we > can probably re-enable an optimization like that if (or when) we find > a way to reliably keep track of when to stop using the optimization. I > don't think that's an easy fix, though, and definitely not for > backbranches. One way to make the optimization safe could be to move it into the indexam. If an indexam would check the VM bit while blocking removal of the index entries, it should make it safe. Of course that has the disadvantage of needing more VM lookups than before, because it would not yet have been deduplicated... Another issue with bitmap index scans is that they currently can't use killtuples. I've seen several production outages where performance would degrade horribly over time whenever estimates led important queries to switch to bitmap scans, because lots of dead tuples would get accessed over and over. It seems pretty much impossible to fix that with the current interaction between nodeBitmap* and indexam.
I wonder if we could, at least in some cases, and likely with some heuristics / limits, address this by performing some visibility checks during the bitmap build. I'm bringing it up here because I suspect that some of the necessary changes would overlap with what's needed to repair bitmap index-only scans. > The solution I could think to keep most of this optimization requires > the heap bitmap scan to notice that a concurrent process started > removing TIDs from the heap after amgetbitmap was called; i.e., a > "vacuum generation counter" incremented every time heap starts the > cleanup run. I'm not sure this is a good path. We can already clean up pages during index accesses and we are working on being able to set all-visible during "hot pruning"/page-level-vacuuming. That'd probably actually be safe (because we couldn't remove dead items without a real vacuum), but it's starting to get pretty complicated... > This is quite non-trivial, however, as we don't have much in place regarding > per-relation shared structures which we could put such a value into, nor a > good place to signal changes of the value to bitmap heap-scanning backends. It doesn't have to be driven off table state, it could be state in indexes. Most (all?) already have a metapage. We could also add that state to pg_class? We already update pg_class after most vacuums, so I don't think that'd be an issue. Thomas had a WIP patch to add a shared-memory table of all open relations. Which we need for quite a few features by now (e.g. more efficient buffer mapping table, avoiding the relation size lookup during planning / execution, more efficient truncation of relations, ...).
> From 07739e5af83664b67ea02d3db7a461a106d74040 Mon Sep 17 00:00:00 2001 > From: Matthias van de Meent > Date: Mon, 2 Dec 2024 15:59:35 +0100 > Subject: [PATCH v1] Disable BitmapScan's skip_fetch optimization > > The optimization doesn't take concurrent vacuum's removal of TIDs into > account, which can remove dead TIDs and make ALL_VISIBLE pages for which > we have pointers to those dead TIDs in the bitmap. > > The optimization itself isn't impossible, but would require more work > figuring out that vacuum started removing TIDs and then stop using the > optimization. However, we currently don't have such a system in place, > making the optimization unsound to keep around. Unfortunately I don't see a better path either. I think it'd be good if we added a test that shows the failure mechanism so that we don't re-introduce this in the future. Evidently this failure isn't immediately obvious... Greetings, Andres Freund
Re: Proposal: Role Sandboxing for Secure Impersonation
Eric Hanson:

a) Transaction ("local") sandbox:
   - SET LOCAL ROLE alice NO RESET;
   - SET LOCAL ROLE alice WITHOUT RESET;
   - BEGIN AS ROLE alice;

   Transaction-level sandboxes have the benefit that a pooler can simply start a new sandboxed transaction for each request and never have to worry about resetting or reusing them.

b) Session sandbox:
   - SET ROLE alice NO RESET;
   - SET ROLE alice WITHOUT RESET;
   - SET UNRESETTABLE ROLE alice; --veto

   Session-level sandboxes have the benefit that they can do things that can't be done inside a transaction (e.g. create extensions, vacuum, analyze, etc.). It's a fully functional session. However, if RESET ROLE is prohibited for the rest of the session, a connection pooler couldn't reuse it.

c) "Guarded" transaction/session:
   - SET [LOCAL] ROLE alice GUARDED BY reset_token;
   - RESET ROLE WITH TOKEN reset_token;

   Guarded sandboxes are nice because the session can also exit the sandbox if it has the token.

d) SET [LOCAL] ROLE alice WITH <password>;

   This would allow you to change to a role for which you don't have any grants yet. Then, the "authenticator pattern" could be implemented by having an authenticator role without any privileges to start with. The client would provide the password to elevate their privileges. RESET ROLE would not be problematic anymore. This would be much cheaper than having those roles do full logins on a new connection and could be used with connection poolers nicely.

   Possibly, this could also be extended by supporting alternatives to just a password, for example JSON Web Tokens. Maybe building on top of the ongoing OAuth effort? Those tokens would then contain a claim for the role they are allowed to set.

Best,

Wolfgang
Re: Changing shared_buffers without restart
> On Fri, Nov 29, 2024 at 05:47:27PM GMT, Dmitry Dolgov wrote: > > On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote: > > > > I mean, we can do the following to get a nice contiguous empty address > > space no other mmap(NULL)s will get put into: > > > > /* reserve size bytes of memory */ > > base = mmap(NULL, size, PROT_NONE, ...flags, ...); > > /* use the first small_size bytes of that reservation */ > > allocated_in_reserved = mmap(base, small_size, PROT_READ | > > PROT_WRITE, MAP_FIXED, ...); > > > > With the PROT_NONE protection option the OS doesn't actually allocate > > any backing memory, but guarantees no other mmap(NULL, ...) will get > > placed in that area such that it overlaps with that allocation until > > the area is munmap-ed, thus allowing us to reserve a chunk of address > > space without actually using (much) memory. > > From what I understand it's not much different from the scenario when we > just map as much as we want in advance. The actual memory will not be > allocated in both cases due to CoW, oom_score seems to be the same. I > agree it sounds attractive, but after some experimenting it looks like > it won't work with huge pages inside a cgroup v2 (=container). > > The reason is Linux has recently learned to apply memory reservation > limits on hugetlb inside a cgroup, which are applied to mmap. Nowadays > this feature is often configured out of the box in various container > orchestrators, meaning that a scenario "set hugetlb=1GB on a container, > reserve 32GB with PROT_NONE" will fail. I've also tried to mix and > match, reserve some address space via non-hugetlb mapping, and allocate > a hugetlb out of it, but it doesn't work either (the smaller mmap > complains about MAP_HUGETLB with EINVAL). I've asked about that in linux-mm [1]. To my surprise, the recommendations were to stick to creating a large mapping in advance, and slice smaller mappings out of that, which could be resized later.
The OOM score should not be affected, and the hugetlb reservation limit could be avoided by using the MAP_NORESERVE flag for the initial mapping (I've experimented with that; it seems to work just fine, even if the slices are not using MAP_NORESERVE). I guess that means I'll try to experiment with this approach as well. But what do others think? How much research do we need to do to gain some confidence about large shared mappings and make them realistically acceptable? [1]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/t/
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
On Mon, Nov 25, 2024 at 08:54:48PM +, Devulapalli, Raghuveer wrote: > As Nathan suggested, we moved this to a separate thread. The latest set > of patches here need to be applied on top of patches in that thread. Raghuveer, would you mind rebasing this patch set now that the SSE4.2 patch is committed? -- nathan
Re: [18] Unintentional behavior change in commit e9931bfb75
Jeff Davis writes: > The behavior is commented (commit 176d5bae1d) in formatting.c: >* ... When using the default >* collation, we apply the traditional Postgres behavior that >* forces ASCII-style treatment of I/i, but in non-default >* collations you get exactly what the collation says. > That's a somewhat strange special case (along with similar ones for > INITCAP() and UPPER()) that applies to single-byte encodings with the > libc provider and the database default collation only. I assume it was > done for backwards compatibility? Well, also for compatibility with our SQL parser's understanding of identifier lowercasing. > Should I put the special case back? I think so. It's stood for a lot of years now without field complaints, and I'm fairly sure there *were* field complaints before. (I think that behavior long predates 176d5bae1d, which was just restoring the status quo ante after somebody else's overenthusiastic application of system locale infrastructure.) regards, tom lane
Re: Remove useless casts to (void *)
Andres Freund writes: > On 2024-12-02 17:11:30 -0500, Tom Lane wrote: >> I'm inclined to propose adding something like >> CPPFLAGS="$CPPFLAGS -D_POSIX_C_SOURCE=200112L" >> to src/template/solaris. Not sure if we have a corresponding >> mechanism for meson, though. > elif host_system == 'sunos' > portname = 'solaris' > export_fmt = '-Wl,-M@0@' > cppflags += '-D_POSIX_PTHREAD_SEMANTICS' > Should be trivial to add there. Oh! The corresponding bit in configure.ac is # On Solaris, we need this #define to get POSIX-conforming versions # of many interfaces (sigwait, getpwuid_r, ...). if test "$PORTNAME" = "solaris"; then CPPFLAGS="$CPPFLAGS -D_POSIX_PTHREAD_SEMANTICS" fi Barely even need to adjust the comment ;-). I'll proceed with improving that (in master only, don't think we need it in back branches, at least not today) unless somebody pushes back soon. regards, tom lane
Re: Remove useless casts to (void *)
On 2024-12-02 17:42:33 -0500, Tom Lane wrote: > I'll proceed with improving that (in master only, don't think we need it in > back branches, at least not today) unless somebody pushes back soon. +1
Re: Remove useless casts to (void *)
On Tue, Oct 29, 2024 at 9:06 PM Peter Eisentraut wrote: > There are a bunch of (void *) casts in the code that don't make sense to > me. I think some of these were once necessary because char * was used > in place of void * for some function arguments. Some other places that sprang to my eye recently that reminded me of K&R C and the ongoing transition to the standard (/me ducks): Why do we bother with a "Pointer" type? The idiomatic way to do byte-oriented pointer arithmetic is to cast to char *, or uint8_t * etc, which doesn't require the reader to go and figure out what stride that non-standard type name implies. Clearly it has a history linked to the introduction of void *, and it's described as 'holding the address of any memory resident type' and is filed under the 'standard system types' section, but C89 has void * for objects of type unknown to the compiler, and if you want to do arithmetic over those objects you have to be more explicit about their size, which requires writing a suitably-alarming-to-the-reader cast to a pointer to a real type. Also, the explanation about "true ANSI compilers" means nothing to the modern reader who wasn't hacking C 30+ years ago. Maybe it has to do with matching up to DatumGetPointer(). DatumGetPointer() isn't defined to return it, though; it returns char *, callers usually/always cast to a real type, and I don't see Pointer being used near it. DatumGetPointer() should arguably return void * too, to force users to provide a type if they're going to dereference or perform arithmetic. What problem does PointerIsValid(x) solve, when you could literally just write x or if you insist x != NULL in most contexts and it would be 100% idiomatic C, and shorter? Why introduce extra terminological load with "validity"? C programmers had better know all about NULL. Why do all the XLogRegister*() calls have to cast their argument to char *? Seems like a textbook use for void * argument, not requiring a cast.
If the XLog* interfaces want to do byte-oriented arithmetic internally, they should do the cast internally, instead of having an argument that requires all callers to cast their arguments. (While grepping for casts to char *, I had a mistake in my regex and finished up seeing how many places in our code check sizeof(char), which is funny because sizeof is defined as telling you how many chars it takes to hold a type/value; perhaps it has documentation value :-))
Re: Showing applied extended statistics in explain Part 2
Ilia Evdokimov writes: > On 19.11.2024 00:38, Tomas Vondra wrote: >> On 11/18/24 22:15, Tomas Vondra wrote: >>> So I think the correct solution is to not pass any expressions with >>> RestrictInfo to deparse_expression(). Either by stripping the nodes, or >>> by not adding them at all. >>> >>> The patch tries to do the stripping by maybe_extract_actual_clauses(), >>> but that only looks at the top node, and that is not sufficient here. Pardon me for being late to the party, but I don't understand why this is difficult. There should never be more than one layer of RestrictInfos, at the top level of an implicitly-ANDed list of quals. The only exception to this is that RestrictInfos representing OR clauses have an additional field "orclause" that attaches RestrictInfos to the elements of the OR list --- but the main "clause" field doesn't look like that, and you can just ignore "orclause" if it doesn't suit you. So ISTM this doesn't need to be any harder than what extract_actual_clauses() does (and has done for decades). regards, tom lane
Re: UUID v7
On Sat, Nov 30, 2024 at 7:01 AM Daniel Verite wrote: > > Andrey M. Borodin wrote: > > > > I'm sending amendments addressing your review as a separate step in patch > > set. Step 1 of this patch set is identical to v39. > > Some comments about the implementation of monotonicity: > > +/* > + * Get the current timestamp with nanosecond precision for UUID generation. > + * The returned timestamp is ensured to be at least SUBMS_MINIMAL_STEP > greater > + * than the previous returned timestamp (on this backend). > + */ > +static inline int64 > +get_real_time_ns_ascending() > +{ > + static int64 previous_ns = 0; > > [...] > > + /* Guarantee the minimal step advancement of the timestamp */ > + if (previous_ns + SUBMS_MINIMAL_STEP_NS >= ns) > + ns = previous_ns + SUBMS_MINIMAL_STEP_NS; > + previous_ns = ns; > > In the case of parallel execution (uuidv7() being parallel-safe), if > there have been previous calls to uuidv7() in that backend, > previous_ns will be set in the backend process, > but zero in a newly spawned worker process. > If (previous_ns + SUBMS_MINIMAL_STEP_NS >= ns) ever happens > to be true in the main process, it will start at false in the workers, > leading to non-monotonic results within the same query. The monotonicity of generated UUIDv7 is guaranteed only within a single backend. I think your point is that UUIDs in parallel query results might not be ordered. But is it guaranteed that without ORDER BY clause, the results returned by parallel queries are in the same order as the results from non-parallel queries in the first place? > > Also in the case of a backward clock change, we can end up with some > backends sticking to the "old time" plus increment per invocation > until they die, while some other backends spawned after the clock > change are on the "new time". These backends may produce series of > UUIDv7 that would be completely out of sync with each others. 
> A backward clock change is an abnormality, but if it occurs, what's > the best choice? Take the bullet and switch to the new time , or > stick to a time that is permanently decorrelated from the OS > clock? I would think that the latter is worse. IIUC after generating a UUIDv7 with the correct time, even if the system time goes back, the time in the next UUIDv7 will be SUBMS_MINIMAL_STEP_NS nanoseconds ahead of the last correct time. Also, in case where the backend generates its first UUIDv7 with an incorrect (e.g. an old) time, it generates UUIDv7 based on the incorrect timestamp. However, it starts generating UUIDv7 with the correct timestamp as soon as the system time goes back to the correct time. So I think that it doesn't happen that one backend is sticking to an old time while another backend is using the correct timestamp to generate UUIDv7. Note that we use (the previous timestamp + SUBMS_MINIMAL_STEP_NS) only if the system clock didn't move forward by SUBMS_MINIMAL_STEP_NS. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Use streaming read API in pgstattuple.
Hi, On Fri, 29 Nov 2024 at 20:05, Matheus Alcantara wrote: > > Hi, > > > On 29/11/24 04:28, Nazir Bilal Yavuz wrote: > > -for (; blkno < nblocks; blkno++) > > +p.last_exclusive = nblocks; > > + > > +while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL))) > > { > > CHECK_FOR_INTERRUPTS(); > > > > -pagefn(&stat, rel, blkno, bstrategy); > > +pagefn(&stat, rel, buf); > > } > > + > > +Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); > > > > With this change we assume that if stream returns an invalid buffer > > that means stream must be finished. We don't check if the stream > > didn't finish but somehow read an invalid buffer. In the upstream > > version, if we read an invalid buffer then postgres server will fail. > > But in the patched version, the server will continue to run because it > > will think that stream has reached to end. This could be happening for > > other streaming read users; for example at least the > > read_stream_next_buffer() calls in the collect_corrupt_items() > > function face the same issue. > > Just for reference; On pg_prewarm() for example this loop is > implemented as: > > for (block = first_block; block <= last_block; ++block) > { > Buffer buf; > ... > buf = read_stream_next_buffer(stream, NULL); > ... > } > Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); > > > Would this approach make sense on these cases? (this patch and on > collect_corrupt_items) Yes, that should work and I guess it would be easy to apply this approach to this patch but it would be hard to apply this to collect_corrupt_items(). In cases like pg_prewarm or the current scenario, we know the exact number of blocks to read, which allows us to determine when the stream should finish (e.g., after the loop runs blockno times) and return an invalid buffer. But in collect_corrupt_items(), the number of loop iterations required is not directly tied to blockno since some blocks may be skipped. 
This makes it difficult to know when the stream should terminate and return an invalid buffer. -- Regards, Nazir Bilal Yavuz Microsoft
Re: More CppAsString2() in psql's describe.c
On 28.11.24 07:34, Michael Paquier wrote: - "WHERE p.prokind = 'a'\n", + "WHERE p.prokind = " CppAsString2(PROKIND_AGGREGATE) "\n", I noticed something here. If the symbol that is the argument of CppAsString2() is not defined, maybe because there is a typo or because the required header file is not included, then there is no compiler diagnostic. Instead, the symbol is just used as is, which might then result in an SQL error or go unnoticed if there is no test coverage. Now, the same is technically true for the previous code with the hardcoded character values, but I think at least something like "WHERE p.prokind = 'a'\n", is easier to eyeball for correctness, whereas, you know, CppAsString2(PROPARALLEL_RESTRICTED), is quite something. I think this would be more robust if it were written something like "WHERE p.prokind = '%c'\n", PROKIND_AGGREGATE (Moreover, the current structure assumes that the C character literal syntax used by the PROKIND_* and other symbols happens to be the same as the SQL string literal syntax required in those queries, which is just an accident.)
Drop back the redundant "Lock" suffix from LWLock wait event names
Hi hackers,

da952b415f unintentionally added back the "Lock" suffix into the LWLock wait event names:
- "added back" because the "Lock" suffix was removed in 14a9101091
- "unintentionally" because there is nothing in the thread [2] that explicitly mentions that the idea was also to revert 14a9101091

Please find attached a patch to remove it again, so that the pg_stat_activity view reports the LWLock wait events without the "Lock" suffix (as pg_wait_events and the related documentation do). It has been reported in bug #18728 (see [1]).

[1]: https://www.postgresql.org/message-id/flat/18728-450924477056a339%40postgresql.org
[2]: https://www.postgresql.org/message-id/flat/202401231025.gbv4nnte5fmm%40alvherre.pgsql

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 86c30d2daf939a83b578de33fe96cf1790dbd427 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Mon, 2 Dec 2024 07:43:35 +
Subject: [PATCH v1] Drop back the redundant "Lock" suffix from LWLock wait
 event names

da952b415f unintentionally added back the "Lock" suffix into the LWLock
wait event names.

Per bug #18728 from Christophe Courtois.

---
 src/backend/storage/lmgr/generate-lwlocknames.pl | 1 +
 src/backend/storage/lmgr/lwlock.c                | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
 100.0% src/backend/storage/lmgr/

diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index eaddd9d3b9..c4267328d3 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -107,6 +107,7 @@ while (<$lwlocklist>)
 	$lastlockidx = $lockidx;
 	$continue = ",\n";
 
+	# Add the Lock suffix only here as the C code depends on it
 	print $h "#define ${lockname}Lock (&MainLWLockArray[$lockidx].lock)\n";
 }
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab..87cf295262 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -124,7 +124,7 @@ StaticAssertDecl(LW_VAL_EXCLUSIVE > (uint32) MAX_BACKENDS,
  * ... and do not forget to update the documentation's list of wait events.
  */
 static const char *const BuiltinTrancheNames[] = {
-#define PG_LWLOCK(id, lockname) [id] = CppAsString(lockname) "Lock",
+#define PG_LWLOCK(id, lockname) [id] = CppAsString(lockname),
 #include "storage/lwlocklist.h"
 #undef PG_LWLOCK
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
-- 
2.34.1
Re: Vacuum statistics
Hi, Alena! On Wed, Nov 13, 2024 at 6:21 PM Alena Rybakina wrote: > Updated 0001-v13 attached, as well as the diff between v12 and v13. > > Thank you) > > And I agree with your changes. And included them in patches. Thank you for the updated patchset. Some points from me. * I've read the previous discussion on how important it is to keep all these fields regarding vacuum statistics, including points by Andrei and Jim. It still worries me that the statistics volume is going to grow by about 3x, but I don't have a particular proposal on how to make the approach more granular. I wonder if you could propose something. * Previously PGSTAT_FILE_FORMAT_ID got increased by 1. Your 0001 patch increases it by 2. It's a minor note, but I'd like to keep the tradition. * Commit message for 0001 looks nice, but commit messages of 0002, 0003, and 0004 look messy. Could you please rearrange them? * The distinction between 0001 and 0002 is not clear. The first line of 0001 is "Machinery for grabbing an extended vacuum statistics on heap relations", the first line of 0002 is "Machinery for grabbing an extended vacuum statistics on heap and index relations." I guess 0001 should be about heap relations while 0002 should be about just index relations. Is this correct? * I guess these statistics should work for any table AM, based on what has been done in the relation_vacuum() interface method. If that's correct, we need to get rid of "heap" terminology and use "table" instead. * 0004 should be a pure documentation patch, but it seems to contain changes to isolation tests. Please move them into a more appropriate place. -- Regards, Alexander Korotkov Supabase
Re: RISC-V animals sporadically produce weird memory-related failures
02.12.2024 18:25, Tom Turelinckx wrote: These crashes are hardly related to code changes, so maybe there are platform-specific issues still... I naively assumed that because llvm and clang are available in Trixie on riscv64 that I could simply install them and enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to revert the --with-llvm again. Sorry about not first testing without submitting results. Thank you for the clarification! I hadn't noticed the "--with-llvm" option added in the configuration... Now I've re-run `make check` for an llvm-enabled build (made with clang 19.1.4) locally and got the same: 2024-12-02 16:49:47.620 UTC postmaster[21895] LOG: server process (PID 21933) was terminated by signal 11: Segmentation fault 2024-12-02 16:49:47.620 UTC postmaster[21895] DETAIL: Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, BOOLTBL2.* FROM BOOLTBL1, BOOLTBL2 WHERE BOOLTBL2.f1 <> BOOLTBL1.f1; A build made with clang-19 without llvm passed `make check` successfully. I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default /etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host, but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in a core file being written to the current directory, like it was before the upgrade. Thank you for fixing this! Best regards, Alexander
Re: Using read stream in autoprewarm
On Fri, 29 Nov 2024 at 16:19, Nazir Bilal Yavuz wrote: > v4 is attached. Hi! I feel like we are ready to mark this as RFC, WDYT? -- Best regards, Kirill Reshke
Re: Doc: typo in config.sgml
On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote: > Bruce Momjian writes: > > Now that we have a warning about non-emittable characters in the PDF > > build, do you want me to put back the Latin1 characters in the SGML > > files or leave them as HTML entities? > > I think going forward we're going to be putting in people's names > in UTF8 --- I was certainly planning to start doing that. It doesn't Yes, I expected that, and added an item to my release checklist to make a PDF file and check for the warning. I don't normally do that. > matter that much what we do with existing cases, though. Okay, I think Peter had an opinion but I wasn't sure what it was. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
RE: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY
On Friday, November 29, 2024 9:08 PM Shlok Kyal wrote:
>
> Thanks for reviewing the patch. I have fixed the issue and updated the patch.

Thanks for updating the patch. I have reviewed it and have a few suggestions. Please check the attached diff, which includes:

1) Comments in CheckCmdReplicaIdentity()
 * Removed some duplicate descriptions.
 * Fixed the comments for the column list, which I think were not correct.
 * Mentioned the column list and generated columns in the XXX part as well.

2) Doc
 * Since we mentioned the restriction for UPDATEs and DELETEs when a row filter or column list is defined in create_publication.sgml, I feel we may need to mention the generated-column part as well, so I added it in the diff.

3) pub_contains_invalid_column
 * Simplified one condition a bit.

Please check and merge if it looks good to you.

Best Regards,
Hou zj

From abdc729ab5e880543e4702d65e8ea66533325d71 Mon Sep 17 00:00:00 2001
From: Hou Zhijie
Date: Tue, 3 Dec 2024 12:19:50 +0800
Subject: [PATCH] improvements

---
 doc/src/sgml/ref/create_publication.sgml |  9 +++++++++
 src/backend/commands/publicationcmds.c   | 13 +++++--------
 src/backend/executor/execReplication.c   | 27 +++++++++++++--------------
 3 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml
index f8e217d661..ae21018697 100644
--- a/doc/src/sgml/ref/create_publication.sgml
+++ b/doc/src/sgml/ref/create_publication.sgml
@@ -311,6 +311,15 @@ CREATE PUBLICATION name
    system columns.
 
+   If tables added to a publication include generated columns that are part of
+   the REPLICA IDENTITY, it is essential to publish these
+   columns by explicitly listing them in the column list or by enabling the
+   publish_generated_columns option. Otherwise,
+   UPDATE or DELETE operations will be
+   disallowed on those tables.
+
    The row filter on a table becomes redundant if
    FOR TABLES IN SCHEMA is specified and the table
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index d2b4c8e9a6..323f59fae1 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -427,16 +427,13 @@ pub_contains_invalid_column(Oid pubid, Relation relation, List *ancestors,
 	if (columns == NULL)
 	{
 		/*
-		 * Check if pubgencols is false and generated column is part of
-		 * REPLICA IDENTITY
+		 * Break the loop if an unpublished generated column is part of the
+		 * REPLICA IDENTITY.
 		 */
-		if (!pubgencols)
+		if (!pubgencols && att->attgenerated)
 		{
-			*unpublished_gen_col |= att->attgenerated;
-
-			/* Break the loop if unpublished generated columns exist. */
-			if (*unpublished_gen_col)
-				break;
+			*unpublished_gen_col = true;
+			break;
 		}
 
 		/* Skip validating the column list since it is not defined */
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 227a8aeea0..f66ff21159 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -785,27 +785,26 @@ CheckCmdReplicaIdentity(Relation rel, CmdType cmd)
 		return;
 
 	/*
-	 * It is only safe to execute UPDATE/DELETE when:
+	 * It is only safe to execute UPDATE/DELETE if the relation does not
+	 * publish UPDATEs or DELETEs, or all the following conditions are
+	 * satisfied:
 	 *
 	 * 1. All columns, referenced in the row filters from publications which
-	 * the relation is in, are valid - i.e. when all referenced columns are
-	 * part of REPLICA IDENTITY or the table does not publish UPDATEs or
-	 * DELETEs.
+	 *    the relation is in, are valid - i.e. when all referenced columns are
+	 *    part of REPLICA IDENTITY.
 	 *
 	 * 2. All columns, referenced in the column lists from publications which
-	 * the relation is in, are valid - i.e. when all referenced columns are
-	 * part of REPLICA IDENTITY or the table does not publish UPDATEs or
-	 * DELETEs.
+	 *    the relation is in, are valid - i.e. when all columns referenced in
+	 *    the REPLICA IDENTITY are covered by the column list.
 	 *
-	 * 3. All generated columns in REPLICA IDENTITY of the relation, for all
-	 * the publications which the relation is in, are valid - i.e. when
-	 * unpublished generated columns are not part of REPLICA IDENTITY or the
-	 * table does not publish UPDATEs or DELETEs
Re: [18] Unintentional behavior change in commit e9931bfb75
On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
> Well, also for compatibility with our SQL parser's understanding
> of identifier lowercasing.

But why only for single-byte encodings? And why not for ICU?

> > Should I put the special case back?
>
> I think so.

Done. I put the special case back in (commit e3fa2b037c) because the earlier commit wasn't intended to be a behavior change.

I'm still not convinced that the special case behavior is great, but a lot of users are unaffected because they are on UTF8 anyway, so I'm fine keeping it.

Regards,
Jeff Davis
Re: [18] Unintentional behavior change in commit e9931bfb75
Jeff Davis writes:
> On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
>> Well, also for compatibility with our SQL parser's understanding
>> of identifier lowercasing.

> But why only for single-byte encodings? And why not for ICU?

I think the not-for-utf8 exclusion was mostly because that was how it was before, which was probably mostly historical accident. (I do vaguely recall that there was discussion on the point, but I'm too tired to go look in the archives for it.)

As for ICU, that didn't exist back then, and I'm not going to defend whether it was a good idea for that code path to fail to reproduce this behavior.

regards, tom lane
Re: Memory leak in WAL sender with pgoutput (v10~)
On Tue, Dec 03, 2024 at 09:50:42AM +0530, Amit Kapila wrote:
> But that suits the current design more. We allocate PGOutputData and
> other contexts in that structure in a "Logical decoding context". A
> few of its members (publications, publication_names) residing in
> totally unrelated contexts sounds odd. In the first place, we don't
> need to allocate publications under CacheMemoryContext, they should be
> allocated in PGOutputData->cachectx. However, because we need to free
> those entirely at one-shot during invalidation processing, we could
> use a new context as a child context of PGOutputData->cachectx. Unless
> I am missing something, the current memory context usage appears more
> like a coding convenience than a thoughtful design decision.

PGOutputData->cachectx was introduced in 2022 in commit 52e4f0cd4, while the decision to keep RelationSyncEntry and the publication list in CacheMemoryContext goes back to v10, where this logical replication code was introduced. This was a carefully-thought choice back then, because this is data that belongs to the process cache, so yes, this choice makes sense to me.

Even today this choice makes sense: this is still data cached in the process, and it's stored in the memory context for cached data. The problem is that we have two locations where the cached data is stored, and I'd agree to make the whole thing leaner by relying on one or the other entirely, but not both.

If you want to move all that to the single memory context in PGOutputData, perhaps that makes sense in the long run and the code gets simpler. Perhaps also it could be simpler to get everything under CacheMemoryContext and remove PGOutputData->cachectx. However, if you do so, I'd suggest using the same parent context for RelationSyncData as well rather than have two concepts, not only the publication list. That's digressing from the original subject a bit, btw..

-- 
Michael

signature.asc
Description: PGP signature
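The "free those entirely at one-shot" idea discussed above is the defining property of PostgreSQL's memory context trees: deleting a parent context releases every allocation made in it and, recursively, in its child contexts. As a rough standalone sketch of that ownership model (the `ToyContext` type and helpers below are invented for illustration, not the real mcxt.c API):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a MemoryContext: owns its chunks and child contexts. */
typedef struct ToyContext {
    void *chunks[64];
    int nchunks;
    struct ToyContext *children[8];
    int nchildren;
} ToyContext;

static int live_chunks = 0;     /* global count of outstanding allocations */

ToyContext *
ctx_create(ToyContext *parent)
{
    ToyContext *ctx = calloc(1, sizeof(ToyContext));

    if (parent)
        parent->children[parent->nchildren++] = ctx;
    return ctx;
}

void *
ctx_alloc(ToyContext *ctx, size_t size)
{
    void *p = malloc(size);

    ctx->chunks[ctx->nchunks++] = p;
    live_chunks++;
    return p;
}

/* Deleting a context frees its chunks and all descendant contexts in one shot. */
void
ctx_delete(ToyContext *ctx)
{
    for (int i = 0; i < ctx->nchildren; i++)
        ctx_delete(ctx->children[i]);
    for (int i = 0; i < ctx->nchunks; i++)
    {
        free(ctx->chunks[i]);
        live_chunks--;
    }
    free(ctx);
}
```

Under this model, resetting a dedicated child of data->cachectx would release the cached publication list without any per-field pfree calls, which is roughly the longer-term rework being discussed in this thread.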
Re: speedup ALTER TABLE ADD CHECK CONSTRAINT.
On Mon, Dec 2, 2024 at 12:06 AM Sergei Kornilov wrote:
>
> Hello!
> Thanks for the patch, but I think using SPI is still not allowed here:
> https://www.postgresql.org/message-id/9878.1511970441%40sss.pgh.pa.us
>
> > if it uses SPI for sub-operations of ALTER TABLE I think that is sufficient
> > reason to reject it out of hand.
>

Thanks for dropping this link. I wonder what your thoughts were about this (https://www.postgresql.org/message-id/900056D1-32DF-4927-8251-3E0C0DC407FD%40anarazel.de) at that time?
Re: Memory leak in WAL sender with pgoutput (v10~)
On Mon, Dec 02, 2024 at 03:29:56PM +0530, Amit Kapila wrote:
> We can look at it from a different angle which is that the
> FreePublication(s) relies on how the knowledge of Publication
> structure is built. So, it doesn't look weird if we think from that
> angle.

OK, I can live with that on all the stable branches with an extra list free rather than a deep list free.

I agree that the memory handling of this whole area needs some rework to make such leaks harder to introduce in the WAL sender. Still, let's first solve the problem at hand :)

So how about the attached that introduces a FreePublication() matching with GetPublication(), used to do the cleanup? Feel free to comment.

-- 
Michael

From f65c7bf220ba152a868ef52bb7df245809ce6d73 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Tue, 3 Dec 2024 15:45:04 +0900
Subject: [PATCH v3] Fix memory leak in pgoutput with publication list cache

Blah.

Analyzed-by: Jeff Davis, Michael Paquier
Discussion: https://postgr.es/m/z0khf9evmvloc...@paquier.xyz
Backpatch-through: 13
---
 src/include/catalog/pg_publication.h        |  1 +
 src/backend/catalog/pg_publication.c        | 10 ++
 src/backend/replication/pgoutput/pgoutput.c |  8 +---
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/src/include/catalog/pg_publication.h b/src/include/catalog/pg_publication.h
index 9a83a72d6b..4e2de947e9 100644
--- a/src/include/catalog/pg_publication.h
+++ b/src/include/catalog/pg_publication.h
@@ -118,6 +118,7 @@ typedef struct PublicationRelInfo
 } PublicationRelInfo;

 extern Publication *GetPublication(Oid pubid);
+extern void FreePublication(Publication *pub);
 extern Publication *GetPublicationByName(const char *pubname, bool missing_ok);
 extern List *GetRelationPublications(Oid relid);

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 9bbb60463f..2efe86d953 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -1075,6 +1075,16 @@ GetPublication(Oid pubid)
 	return pub;
 }

+/*
+ * Free memory allocated by Publication structure.
+ */
+void
+FreePublication(Publication *pub)
+{
+	pfree(pub->name);
+	pfree(pub);
+}
+
 /*
  * Get Publication using name.
  */
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 5e23453f07..a51b1c15fb 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -2064,11 +2064,13 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 	if (!publications_valid)
 	{
 		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		if (data->publications)
+		foreach(lc, data->publications)
 		{
-			list_free_deep(data->publications);
-			data->publications = NIL;
+			Publication *pub = lfirst(lc);
+
+			FreePublication(pub);
 		}
+		list_free(data->publications);
 		data->publications = LoadPublications(data->publication_names);
 		MemoryContextSwitchTo(oldctx);
 		publications_valid = true;
-- 
2.45.2

signature.asc
Description: PGP signature
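For readers skimming the thread, the shape of the leak and of the fix can be sketched in standalone C. The `Pub` and `Cell` types and helpers below are invented stand-ins, not PostgreSQL's actual Publication or List APIs: a generic deep free releases the list cells and the structs they point to, but knows nothing about a string hanging off each struct, so a type-aware per-element destructor plus a shallow list free is needed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for Publication and a singly linked list. */
typedef struct Pub {
    char *name;                 /* separately allocated, like pub->name */
} Pub;

typedef struct Cell {
    Pub *pub;
    struct Cell *next;
} Cell;

static int live_allocs = 0;     /* outstanding allocations, for demonstration */

static void *xalloc(size_t n) { live_allocs++; return malloc(n); }
static void xfree(void *p)    { live_allocs--; free(p); }

Pub *make_pub(const char *name)
{
    Pub *p = xalloc(sizeof(Pub));

    p->name = xalloc(strlen(name) + 1);
    strcpy(p->name, name);
    return p;
}

Cell *push(Cell *head, Pub *pub)
{
    Cell *c = xalloc(sizeof(Cell));

    c->pub = pub;
    c->next = head;
    return c;
}

/*
 * Deep free in the list_free_deep style: frees the cells and the Pub
 * structs, but cannot know about Pub's internals -- each name string leaks.
 */
void list_free_deep_style(Cell *head)
{
    while (head)
    {
        Cell *next = head->next;

        xfree(head->pub);       /* pub->name is leaked here */
        xfree(head);
        head = next;
    }
}

/* The fix's shape: a type-aware destructor per element, then a shallow free. */
void free_publication(Pub *pub)
{
    xfree(pub->name);
    xfree(pub);
}

void list_free_with_free_publication(Cell *head)
{
    while (head)
    {
        Cell *next = head->next;

        free_publication(head->pub);
        xfree(head);
        head = next;
    }
}
```

This mirrors the patch above: FreePublication() knows to pfree(pub->name) before pfree(pub), after which the remaining list cells are released with a plain list_free().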
Re: Memory leak in WAL sender with pgoutput (v10~)
On Tue, Dec 3, 2024 at 5:47 PM Michael Paquier wrote:
>
> On Mon, Dec 02, 2024 at 03:29:56PM +0530, Amit Kapila wrote:
> > We can look at it from a different angle which is that the
> > FreePublication(s) relies on how the knowledge of Publication
> > structure is built. So, it doesn't look weird if we think from that
> > angle.
>
> OK, I can live with that on all the stable branches with an extra
> list free rather than a deep list free.
>
> I agree that the memory handling of this whole area needs some rework
> to make such leaks harder to introduce in the WAL sender. Still,
> let's first solve the problem at hand :)
>
> So how about the attached that introduces a FreePublication() matching
> with GetPublication(), used to do the cleanup? Feel free to comment.
> --

Perhaps the patch can use the foreach_ptr macro instead of foreach?

==
Kind Regards,
Peter Smith.
Fujitsu Australia
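The suggestion is cosmetic: PostgreSQL's foreach_ptr declares a typed loop variable and hides the lfirst() cast that plain foreach requires. A simplified standalone rendition of the two styles over a toy array-backed list (these macros are invented for illustration, not the real pg_list.h definitions):

```c
#include <stddef.h>

/* Toy array-backed list; real PostgreSQL Lists are more involved. */
typedef struct {
    void *elements[8];
    int length;
} ToyList;

/* foreach style: the caller manages an index and casts each element. */
#define toy_foreach(i, lst)  for ((i) = 0; (i) < (lst)->length; (i)++)
#define toy_lfirst(lst, i)   ((lst)->elements[(i)])

/*
 * foreach_ptr style: declares a typed pointer for the loop body. The inner
 * single-iteration for-loop is a common trick to scope the variable; note
 * that a plain "break" in the body would only exit the inner loop here,
 * a caveat the real macros handle differently.
 */
#define toy_foreach_ptr(type, var, lst) \
    for (int var##_i = 0; var##_i < (lst)->length; var##_i++) \
        for (type *var = (type *) (lst)->elements[var##_i]; var != NULL; var = NULL)
```

With the real macros, the loop in the patch would collapse to foreach_ptr(Publication, pub, data->publications) followed directly by FreePublication(pub), with no explicit ListCell or lfirst() cast.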
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi Nisha, here are a couple of review comments for patch v52-0001.

==
Commit Message

Add check if slot is already acquired, then mark it invalidate directly.

~

/slot/the slot/

"mark it invalidate" ? Maybe you meant: "then invalidate it directly", or "then mark it 'invalidated' directly", or etc.

==
src/backend/replication/logical/slotsync.c

1.
@@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 static void
 update_synced_slots_inactive_since(void)
 {
-	TimestampTz now = 0;
+	TimestampTz now;

 	/*
 	 * We need to update inactive_since only when we are promoting standby to
@@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void)
 	/* The slot sync worker or SQL function mustn't be running by now */
 	Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);

+	/* Use same inactive_since time for all slots */
+	now = GetCurrentTimestamp();
+

Something is broken with these changes. AFAICT, the result after applying patch 0001 still has code:

/* Use the same inactive_since time for all the slots. */
if (now == 0)
	now = GetCurrentTimestamp();

So the end result has multiple/competing assignments to variable 'now'.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY
On Tue, 3 Dec 2024 at 10:21, Zhijie Hou (Fujitsu) wrote:
>
> On Friday, November 29, 2024 9:08 PM Shlok Kyal
> wrote:
> >
> >
> > Thanks for reviewing the patch. I have fixed the issue and updated the
> > patch.
>
> Thanks for updating the patch. I have reviewed and have a few suggestions.
> Please check the attached diff which includes:
>
> 1) Comments in CheckCmdReplicaIdentity()
> * Removed some duplicate descriptions.
> * Fixed the comments for the column list which I think is not correct.
> * Mentioned the column list and generated column in XXX part as well.
>
> 2) Doc
> * Since we mentioned the restriction for UPDATEs and DELETEs when row filter or
> column list is defined in the create_publication.sgml, I feel we may need to
> mention the generated part as well. So, added in the diff.
>
> 3) pub_contains_invalid_column
> * Simplified one condition a bit.
>
> Please check and merge if it looks good to you.
>

The changes look good to me. I have included it in the updated patch.

Thanks and Regards,
Shlok Kyal

v14-0001-Disallow-UPDATE-DELETE-on-table-with-generated-c.patch
Description: Binary data
Re: Consider pipeline implicit transaction as a transaction block
On Mon, Dec 02, 2024 at 09:14:11AM -0500, Robert Haas wrote: > On Thu, Nov 28, 2024 at 2:53 AM Anthonin Bonnefoy > wrote: >> Sorry about that. I didn't have a strong need for this to be >> backpatched and should have made this clearer. > > FWIW, I don't think you did anything wrong. To me, the thread reads > like you just submitted this as a normal patch and Michael decided to > back-patch. Yep, this one's on me. Please don't worry, Anthonin. -- Michael signature.asc Description: PGP signature
Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?
On Wed, Nov 3, 2021 at 7:33 PM Peter Geoghegan wrote:
> But what about index-only scans, which GiST also supports? I think
> that the rules are different there, even though index-only scans use
> an MVCC snapshot.

(Reviving this old thread after 3 years)

I was reminded of this old thread during today's discussion of a tangentially related TID-recycle-safety bug that affects bitmap index scans that use the visibility map as an optimization [1]. Turns out I was right to be concerned. This GiST bug is causally unrelated to that other bug, so I thought it would be helpful to move discussion of the GiST bug to this old thread.

Attached is a refined version of a test case I posted earlier [2], decisively proving that GiST index-only scans are in fact subtly broken. Right now it fails, showing a wrong answer to a query. The patch adds an isolation test case to btree_gist, based on a test case of Andres'.

Offhand, I think that the only way to fix this is to bring GiST in line with nbtree: we ought to teach GiST VACUUM to start acquiring cleanup locks (previously known as super-exclusive locks), and then teach GiST index-only scans to hold onto a leaf page buffer pin as the visibility map (or the heap proper) is accessed for the TIDs to be returned from the leaf page. Arguably, GiST isn't obeying the current contract for amgettuple index AMs at all right now (whether or not it violates the contract as written is open to interpretation, I suppose, but either way the current behavior is wrong).

We probably shouldn't hold onto a buffer pin during plain GiST index scans, though -- plain GiST index scans *aren't* broken, and so we should change as little as possible there. More concretely, we should probably copy more nbtree scan related code into GiST to deal with all this: we could copy nbtree's _bt_drop_lock_and_maybe_pin into GiST to fix this bug, while avoiding changing the performance characteristics of GiST plain index scans.

This will also entail adding a new buffer field to GiST's GISTScanOpaqueData struct -- something similar to nbtree's BTScanOpaqueData.currPos.buf field (it'll go next to the current GISTScanOpaqueData.curBlkno field, just like the nbtree equivalent goes next to its own currPage blkno field).

Long term, code like nbtree's _bt_drop_lock_and_maybe_pin should be generalized and removed from every individual index AM, nbtree included -- I think that the concepts generalize to every index AM that supports amgettuple (the race condition in question has essentially nothing to do with individual index AM requirements). I've discussed this kind of approach with Tomas Vondra (CC'd) recently. That's not going to be possible within the scope of a backpatchable fix, though.

[1] https://postgr.es/m/873c33c5-ef9e-41f6-80b2-2f5e11869...@garret.ru
[2] https://postgr.es/m/CAH2-Wzm6gBqc99iEKO6540ynwpjOqWESt5yjg-bHbt0hc8DPsw%40mail.gmail.com

-- 
Peter Geoghegan

v2-0001-isolationtester-showing-broken-index-only-scans-w.patch
Description: Binary data
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 3:56 PM Peter Geoghegan wrote: > I took what you wrote, and repurposed it to prove my old theory about > GiST index-only scans being broken due to the lack of an appropriate > interlock against concurrent TID recycling. See the attached patch. I've moved discussion of this GiST bug over to the old 2021 thread where I first raised concerns about the issue: https://postgr.es/m/CAH2-Wz=jjinl9fch8c1l-guh15f4wftwub2x+_nucngcddc...@mail.gmail.com The GiST bug is actually causally unrelated to the bitmap index scan bug under discussion, despite the high-level similarity. Seems best to keep discussion of GiST on its own thread, for that reason. -- Peter Geoghegan