Re: RISC-V animals sporadically produce weird memory-related failures
Hello Tom,

17.11.2024 20:28, Tom Turelinckx wrote:
> I have now done just that, but on a new HiFive Premier P550 board [2]. It is running Ubuntu 24.04 LTS with a board-specific kernel, currently 6.6.21-9-premier (2024-11-09). The buildfarm client is executing within a Debian Trixie container created from the official Debian repo. This stack is a lot more recent, should be more future-proof, and the board is significantly faster too. Boomslang has already built all branches and copperhead is currently going through them.

Thank you for upgrading these machines! Could you please take a look at the new failures produced by copperhead recently?

[1]
2024-11-30 19:34:53.302 CET [13395:4] LOG:  server process (PID 13439) was terminated by signal 11: Segmentation fault
2024-11-30 19:34:53.302 CET [13395:5] DETAIL:  Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, BOOLTBL2.* FROM BOOLTBL1, BOOLTBL2 WHERE BOOLTBL2.f1 <> BOOLTBL1.f1;

[2]
2024-11-30 19:54:11.478 CET [27560:15] LOG:  server process (PID 28459) was terminated by signal 11: Segmentation fault
2024-11-30 19:54:11.478 CET [27560:16] DETAIL:  Failed process was running: SELECT count(*) FROM test_tsvector WHERE a @@ any ('{wr,qh}');

These crashes are hardly related to code changes, so maybe there are still platform-specific issues... I've run 100 iterations of `make check` for REL_13_STABLE on trixie/sid 6.8.12-riscv64 (gcc 14.2.0), emulated with qemu-system-riscv64, with no failures.

Unfortunately, the saved log files don't include coredump information, maybe because of an inappropriate core_pattern. (Previously, a stack trace was extracted in case of a crash: [3].)

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A16%3A37
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A35%3A17
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-09-03%2016%3A38%3A46

Best regards,
Alexander
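For the record, the usual knobs for getting core files written on such a machine look like this (the paths and values here are illustrative, not what copperhead actually uses):

```shell
# Allow core files to be written at all; the buildfarm client must
# inherit this limit from the shell/service that starts it.
ulimit -c unlimited

# Have the kernel write cores to a known directory with informative
# names (%e = executable, %p = pid); requires root.
mkdir -p /var/tmp/cores
sysctl -w kernel.core_pattern=/var/tmp/cores/core.%e.%p
```

With a core_pattern like this in place, the buildfarm client's gdb-based crash reporting can find the core files again.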
Re: CREATE SCHEMA ... CREATE DOMAIN support
On 02/12/2024 17:56, Tom Lane wrote:
> Vik Fearing writes:
>> On 02/12/2024 03:15, Tom Lane wrote:
>>> Also, if SQL intended to constrain the search path for unqualified identifiers to be only the new schema, they'd hardly need a concept of <schema path specification> at all.
>> I looked up the original paper (MUN-051) that introduced the <schema path specification> and it says, "The paper is proposing the use of paths only to resolve unqualified routine invocations."
> Interesting. The standard actually does say that that is what it is for. Section 11.1 SR 8: "The <schema name list> of the explicit or implicit <schema path specification> is used as the SQL-path of the schema. The SQL-path is used to effectively qualify unqualified <routine name>s that are immediately contained in <routine invocation>s that are contained in the <schema definition>."
> But still, the spec allows <schema routine>s within <schema definition>s, so even that narrow interpretation opens them to the is-this-an-external-reference-or-a-forward-reference problem.

Surely that is determined by the placement of the schema in its own SQL-path.

> For us, that's clouded further for functions by our overloading rules. If foo(bigint) exists in the search path, and we have a view or whatever that references foo() with an int argument, and there is a CREATE FUNCTION for foo(float8) later in the <schema definition>, what are we supposed to think is the user's intent? (Just to save people doing the experiment: we'd prefer foo(float8) if both are visible, but foo(bigint) would be perfectly acceptable if not. Other choices of the argument types would yield different results, and none of them seem especially open-and-shut to me.)

My answer is the same as above, for unqualified names. However, since there is nothing that says anything either way about forward references, my preference is to just execute them all in the order written. In your example, that would mean choosing otherschema.foo(bigint) over thisschema.foo(float8) if the latter hasn't been created yet.

> I don't know offhand if the spec allows function overloading in the same way.
Feature T-321 has a note saying, "Support for overloaded functions and procedures is not part of Core SQL." -- Vik Fearing
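The ambiguity under discussion is easy to see in a sketch (the schema and function names are hypothetical, and PostgreSQL does not currently accept CREATE FUNCTION as a schema element, so this is spec-style pseudocode rather than working syntax):

```sql
-- Assume otherschema.foo(bigint) already exists and otherschema is on
-- the session's search path when this statement runs.
CREATE SCHEMA thisschema
    CREATE VIEW v AS
        SELECT foo(1)            -- external reference to otherschema.foo(bigint),
                                 -- or forward reference to the foo defined below?
    CREATE FUNCTION foo(float8) RETURNS float8
        RETURN $1;
```

Under the execute-in-order-written reading above, v would bind to otherschema.foo(bigint), because thisschema.foo(float8) does not exist yet when the view is created.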
Re: Proposal: Role Sandboxing for Secure Impersonation
On 12/2/24 08:41, Eric Hanson wrote:
> Hi all,
>
> I'd like to revisit a previously discussed feature [1]: PostgreSQL could benefit from a "role sandbox", a feature that would build on SET [LOCAL] ROLE and prevent or restrict RESET ROLE.
>
> Rationale:
>
> Connection pooling is widely used to optimize database performance by reducing use of memory, process creation, etc. However, connection pools typically operate on a "pool-per-role" basis, because each connection is bound to a single role and can't be reused by another role. For systems that make use of many roles, this limits the effectiveness of connection pooling because each role has its own "pool space" and max_connections puts a hard limit on how many connections can exist.
>
> To work around this, projects (e.g. PostgREST) employ the "user impersonation" pattern:
>
> - All connections use a shared "authenticator" role
> - When a user (e.g. Alice) sends a request to the connection pooler, it temporarily sets the role using: SET [LOCAL] ROLE alice;
> - After processing Alice's request, the session resets the role back to the "authenticator" role by either issuing a "RESET ROLE" or ending the "local" transaction.
>
> This approach works well in theory, but poses a significant security concern: RESET ROLE allows a client to reset the role back to the "authenticator" role, *before* handing the session back to the pooler. Any SQL injection vulnerability or anything else that allows arbitrary SQL allows the client to issue a `RESET ROLE; SET ROLE anybody_else;`, bypassing authentication. Depending on the privileges of the "authenticator" role, the client can become any other user, or worse.
>
> Proposal:
>
> What if PostgreSQL had a "role sandbox", a state where RESET ROLE was prohibited or restricted? If PostgreSQL could guarantee that RESET ROLE was not allowed, even SQL injection vulnerabilities would not allow a client to bypass database privileges and RLS when using user impersonation. Systems with many roles could safely and efficiently use many roles in parallel with connection pooling. The feature probably has other applications as well.
>
> Sandboxing could happen at the session level, or the transaction level; both seem to have benefits. Here are some syntax ideas floating around:
>
> SET ROLE IDEAS
>
> a) Transaction ("local") Sandbox:
>    - SET LOCAL ROLE alice NO RESET;
>    - SET LOCAL ROLE alice WITHOUT RESET;
>    - BEGIN AS ROLE alice;
>
> Transaction-level sandboxes have the benefit that a pooler can simply start a new sandboxed transaction for each request and never have to worry about resetting or reusing them.
>
> b) Session Sandbox:
>    - SET ROLE alice NO RESET;
>    - SET ROLE alice WITHOUT RESET;
>    - SET UNRESETTABLE ROLE alice; --veto
>
> Session-level sandboxes have the benefit that they can do things that can't be done inside a transaction (e.g. create extensions, vacuum, analyze, etc.). It's a fully functional session. However, if RESET ROLE is prohibited for the rest of the session, a connection pooler couldn't reuse it.
>
> c) "Guarded" Transaction/Session:
>    - SET [LOCAL] ROLE alice GUARDED BY reset_token;
>    - RESET ROLE WITH TOKEN reset_token;
>
> Guarded sandboxes are nice because the session can also exit the sandbox if it has the token.
>
> Another aspect of this is SET SESSION AUTHORIZATION. I don't see preventing reset as particularly useful, at least for connection poolers, since the session then couldn't be reused. However, the GUARDED BY token idea would make it restricted but not prevented, which could be useful.
>
> I'd love to hear your thoughts on this feature.

I am very much in favor of functionality of this sort being built in to the core database. Very similar functionality is available in an extension I wrote years ago (without the SQL grammar support) -- see https://github.com/pgaudit/set_user

I have never proposed it (or maybe I did years ago, don't actually remember) because I did not think the community was interested in this approach, but perhaps the time is ripe to discuss it.
-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: revamp row-security tracking
On Fri, Nov 29, 2024 at 10:01:41AM +, Dean Rasheed wrote: > On Thu, 21 Nov 2024 at 18:00, Nathan Bossart wrote: >> The attached patch accomplishes this by establishing a global queue of >> row-security "nest levels" that the aforementioned higher-level callers can >> use. > > I'm not convinced that this is an improvement. Thanks for reviewing. > The code in check_sql_fn_retval() is building a Query struct from > scratch, so it seems perfectly natural for it to be responsible for > setting all the required fields, based on the information it has > available. With this patch, check_sql_fn_retval() is returning a > potentially incorrectly marked Query at the end of the querytree list, > which the caller is responsible for fixing up, which doesn't seem > ideal. While it is indeed natural for the code that builds a Query to be responsible for setting it correctly, unfortunately there's no backstop if someone forgets to do so (as was the case in the recent CVE). I don't think my v1 patch would necessarily prevent all such problems, but I do think it would help prevent some. > There is exactly one place where RLS policies are applied, and it > seems much more natural for it to have responsibility for setting this > flag. I think that a slightly neater way for it to handle that would > be to modify fireRIRrules(), adding an extra parameter "bool > *hasRowSecurity" that it would set to true if RLS is enabled for the > query it is rewriting. Doing that forces all callers to think about > whether or not that affects some outer query. For example, > ApplyRetrieveRule() would then do: > > rule_action = fireRIRrules(rule_action, activeRIRs, >&parsetree->hasRowSecurity); > > rather than having a separate second step to update the flag on > "parsetree", and similarly for fireRIRrules()'s recursive calls to > itself. 
> If, in the future, it becomes necessary to invoke fireRIRrules() on more
> parts of a Query, it's then much more likely that the new code won't
> forget to update the parent query's flag.

I've attempted this in the attached v2 patch. I do think this is an improvement over the status quo, but I worry that it doesn't go far enough.

-- 
nathan

>From 0c81e04c3e62cd178dff9fc5b5f671e41fd39f28 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Mon, 2 Dec 2024 10:24:49 -0600
Subject: [PATCH v2 1/1] revamp row security tracking

---
 src/backend/rewrite/rewriteHandler.c | 42 +---
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index ab2e2cd647..c9ad429703 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -94,7 +94,7 @@ static void markQueryForLocking(Query *qry, Node *jtnode, bool pushedDown);
 static List *matchLocks(CmdType event, Relation relation,
                         int varno, Query *parsetree, bool *hasUpdate);
-static Query *fireRIRrules(Query *parsetree, List *activeRIRs);
+static Query *fireRIRrules(Query *parsetree, List *activeRIRs, bool *hasRowSecurity);
 static Bitmapset *adjust_view_column_set(Bitmapset *cols, List *targetlist);
@@ -1730,6 +1730,7 @@ ApplyRetrieveRule(Query *parsetree,
     RangeTblEntry *rte;
     RowMarkClause *rc;
     int         numCols;
+    bool        hasRowSecurity = false;

     if (list_length(rule->actions) != 1)
         elog(ERROR, "expected just one rule action");
@@ -1843,13 +1844,13 @@ ApplyRetrieveRule(Query *parsetree,
     /*
      * Recursively expand any view references inside the view.
      */
-    rule_action = fireRIRrules(rule_action, activeRIRs);
+    rule_action = fireRIRrules(rule_action, activeRIRs, &hasRowSecurity);

     /*
      * Make sure the query is marked as having row security if the view query
      * does.
      */
-    parsetree->hasRowSecurity |= rule_action->hasRowSecurity;
+    parsetree->hasRowSecurity |= hasRowSecurity;

     /*
      * Now, plug the view query in as a subselect, converting the relation's
@@ -1971,15 +1972,17 @@ fireRIRonSubLink(Node *node, fireRIRonSubLink_context *context)
     if (IsA(node, SubLink))
     {
         SubLink    *sub = (SubLink *) node;
+        bool        hasRowSecurity = false;

         /* Do what we came for */
         sub->subselect = (Node *) fireRIRrules((Query *) sub->subselect,
-                                               context->activeRIRs);
+                                               context->activeRIRs,
+                                               &hasRowSecurity);

         /*
          * Remember if any of
Re: Consider pipeline implicit transaction as a transaction block
On Thu, Nov 28, 2024 at 2:53 AM Anthonin Bonnefoy wrote:
> On Thu, Nov 28, 2024 at 12:26 AM Michael Paquier wrote:
> > I don't mind being more careful here based on your concerns, so I'll
> > go remove that in the stable branches.
>
> Sorry about that. I didn't have a strong need for this to be
> backpatched and should have made this clearer.

FWIW, I don't think you did anything wrong. To me, the thread reads like you just submitted this as a normal patch and Michael decided to back-patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 12:11 PM Tom Lane wrote:
> > Freezing a page, and setting a page all-visible are orthogonal.
>
> Sorry, sloppy wording on my part.

Freezing doesn't affect the contents of the visibility map in any way that seems relevant. The executor only cares about the all-visible bit (and never the all-frozen bit), and the rules around when and how VACUUM sets the all-visible bit (and how everybody else unsets it) haven't changed in forever. So I just can't see it.

I guess it's natural to suspect more recent work -- commit 7c70996e is about 6 years old. But I think the race condition that I suspect is at play here is very narrow. It's pretty unlikely that there'll be a dead-to-all TID returned to a scan (not just dead to our MVCC snapshot, dead to everybody's) that is subsequently concurrently removed from the index, and then set LP_UNUSED in the heap. It's probably impossible if you don't have a small table -- VACUUM just isn't going to be fast enough to get to the leaf page after the bitmap index scan, but still be able to get to the heap before its corresponding bitmap heap scan (that uses the VM as an optimization) can do the relevant visibility checks (while it could happen with a large table and a slow bitmap scan, the chances of the VACUUM being precisely aligned with the bitmap scan, in just the wrong way, seem remote in the extreme). Finally, none of this will happen if some other factor hinders VACUUM from setting the relevant heap page all-visible.

AFAICT this is only a problem because of the involvement of the VM, specifically -- an MVCC snapshot *is* generally sufficient to make bitmap index scans safe from the dangers of concurrent TID recycling, as explained in "62.4. Index Locking Considerations". That only ceases to be true when the visibility map becomes involved (the VM lacks the granular visibility information required to make all this safe). This is essentially the same VM race issue that nbtree's _bt_drop_lock_and_maybe_pin protects against during conventional index-only scans.

-- 
Peter Geoghegan
Re: Partition-wise join with whole row vars
> On Tue, Oct 08, 2024 at 09:24:15AM GMT, Alexander Pyhalov wrote:
> > Attaching rebased patches.

Just to let you know: it looks like the CFBot tests are red again, but this time there are some unexpected differences in a test query plan.
Proposal: Role Sandboxing for Secure Impersonation
Hi all,

I'd like to revisit a previously discussed feature [1]: PostgreSQL could benefit from a "role sandbox", a feature that would build on SET [LOCAL] ROLE and prevent or restrict RESET ROLE.

Rationale:

Connection pooling is widely used to optimize database performance by reducing use of memory, process creation, etc. However, connection pools typically operate on a "pool-per-role" basis, because each connection is bound to a single role and can't be reused by another role. For systems that make use of many roles, this limits the effectiveness of connection pooling because each role has its own "pool space" and max_connections puts a hard limit on how many connections can exist.

To work around this, projects (e.g. PostgREST) employ the "user impersonation" pattern:

- All connections use a shared "authenticator" role
- When a user (e.g. Alice) sends a request to the connection pooler, it temporarily sets the role using: SET [LOCAL] ROLE alice;
- After processing Alice's request, the session resets the role back to the "authenticator" role by either issuing a "RESET ROLE" or ending the "local" transaction.

This approach works well in theory, but poses a significant security concern: RESET ROLE allows a client to reset the role back to the "authenticator" role, *before* handing the session back to the pooler. Any SQL injection vulnerability or anything else that allows arbitrary SQL allows the client to issue a `RESET ROLE; SET ROLE anybody_else;`, bypassing authentication. Depending on the privileges of the "authenticator" role, the client can become any other user, or worse.

Proposal:

What if PostgreSQL had a "role sandbox", a state where RESET ROLE was prohibited or restricted? If PostgreSQL could guarantee that RESET ROLE was not allowed, even SQL injection vulnerabilities would not allow a client to bypass database privileges and RLS when using user impersonation. Systems with many roles could safely and efficiently use many roles in parallel with connection pooling. The feature probably has other applications as well.

Sandboxing could happen at the session level, or the transaction level; both seem to have benefits. Here are some syntax ideas floating around:

SET ROLE IDEAS

a) Transaction ("local") Sandbox:
   - SET LOCAL ROLE alice NO RESET;
   - SET LOCAL ROLE alice WITHOUT RESET;
   - BEGIN AS ROLE alice;

Transaction-level sandboxes have the benefit that a pooler can simply start a new sandboxed transaction for each request and never have to worry about resetting or reusing them.

b) Session Sandbox:
   - SET ROLE alice NO RESET;
   - SET ROLE alice WITHOUT RESET;
   - SET UNRESETTABLE ROLE alice; --veto

Session-level sandboxes have the benefit that they can do things that can't be done inside a transaction (e.g. create extensions, vacuum, analyze, etc.). It's a fully functional session. However, if RESET ROLE is prohibited for the rest of the session, a connection pooler couldn't reuse it.

c) "Guarded" Transaction/Session:
   - SET [LOCAL] ROLE alice GUARDED BY reset_token;
   - RESET ROLE WITH TOKEN reset_token;

Guarded sandboxes are nice because the session can also exit the sandbox if it has the token.

Another aspect of this is SET SESSION AUTHORIZATION. I don't see preventing reset as particularly useful, at least for connection poolers, since the session then couldn't be reused. However, the GUARDED BY token idea would make it restricted but not prevented, which could be useful.

I'd love to hear your thoughts on this feature. If we can finalize the design, I would be willing to try implementing this. I haven't coded C for years though, so I will probably need some help depending on how complex it is. SET ROLE is intertwined with the rest of the SET variable grammar but doesn't seem too hard to extend, if we go that route. Steve Chavez of PostgREST said he'd be willing to help, and could use the feature in PostgREST if it existed. I think other poolers could benefit from it as well.

Thanks,
Eric

[1] https://postgr.es/m/flat/CACA6kxgdzt-oForijaxfXHHhnZ1WBoVGMXVwFrJqUu-Hg3C-jA%40mail.gmail.com
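To make the attack and the fix concrete, here is a sketch of the pooler's per-request cycle; the NO RESET and GUARDED BY clauses are the proposed (not yet existing) syntax from above, so this is illustrative pseudocode rather than something PostgreSQL accepts today:

```sql
-- Today's impersonation pattern (vulnerable):
BEGIN;
SET LOCAL ROLE alice;           -- pooler impersonates Alice for one transaction
-- ... run Alice's request; SQL injection here could issue:
--   RESET ROLE; SET ROLE bob;  -- become anyone the authenticator can become
COMMIT;                         -- transaction end implicitly restores the role

-- With the proposed transaction sandbox (hypothetical syntax):
BEGIN;
SET LOCAL ROLE alice NO RESET;  -- RESET ROLE / SET ROLE now fail until COMMIT
-- ... an injected "RESET ROLE" raises an error instead of escalating
COMMIT;

-- With the proposed guarded variant (hypothetical syntax), the pooler
-- keeps a secret token and can still exit the sandbox itself:
SET ROLE alice GUARDED BY 's3cret-token';
-- client-controlled SQL cannot reset without the token...
RESET ROLE WITH TOKEN 's3cret-token';   -- ...but the pooler can
```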
Re: Consider the number of columns in the sort cost model
Hi folks,

Just wanted to mention that it looks like some CFBot tests are failing, something around level_tracking in pgss.
Re: Drop back the redundant "Lock" suffix from LWLock wait event names
On 2024-Dec-02, Bertrand Drouvot wrote:
> Hi hackers,
>
> da952b415f unintentionally added back the "Lock" suffix into the LWLock wait
> event names:
>
> - "added back" because the "Lock" suffix was removed in 14a9101091
> - "unintentionally" because there is nothing in the thread [2] that explicitly
>   mentions that the idea was also to revert 14a9101091

Oh, you're right, this was unintentional and unnoticed. I'll push this shortly, to both 17 and master.

-- 
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Using read stream in autoprewarm
> On 2 Dec 2024, at 16:16, Kirill Reshke wrote:
>
> I feel like we are ready to mark this as RFC, WDYT?

+1

Best regards, Andrey Borodin.
Re: Using read stream in autoprewarm
Hi,

On Mon, 2 Dec 2024 at 16:30, Andrey M. Borodin wrote:
> > On 2 Dec 2024, at 16:16, Kirill Reshke wrote:
> >
> > I feel like we are ready to mark this as RFC, WDYT?
>
> +1

Done.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent wrote:
> The running theory is that bitmap executor nodes incorrectly assume
> that the rows contained in the bitmap all are still present in the
> index, and thus assume they're allowed to only check the visibility
> map to see if the reference contained in the bitmap is visible.
> However, this seems incorrect: Note that index AMs must hold at least
> pins on the index pages that contain their results when those results
> are returned by amgettuple() [0], and that amgetbitmap() doesn't do
> that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
> from the index (and later, heap) that are still present in the bitmap
> used in the scan.

This theory seems very believable. We hold onto a leaf page buffer pin for index-only scans as an interlock against concurrent TID recycling. If we assume for the sake of argument that the optimization from commit 7c70996e is correct, then why do we even bother with holding onto the pin during index-only scans? In theory we should either do the "buffer pin interlock against TID recycling" thing everywhere, or nowhere -- how could bitmap scans possibly be different here?

> I think this might be an oversight when the feature was originally
> committed in 7c70996e (PG11): we don't know when the VM bit was set,
> and the bitmap we're scanning may thus be out-of-date (and should've
> had TIDs removed if it had been an index), so I propose disabling this
> optimization for now, as attached.

I have a hard time imagining any alternative fix that is suitable for backpatch.

Can we save the optimization on the master branch? Clearly it would be wildly impractical to do the "buffer pin interlock against TID recycling" thing in the context of bitmap scans. The only thing that I can think of that might work is a scheme that establishes a "safe LSN" for a given MVCC snapshot. If the VM page's LSN is later than the "safe LSN", it's not okay to trust its all-visible bits. At least not in the context of bitmap index scans that use the optimization from 7c70996e.

-- 
Peter Geoghegan
Re: CREATE SCHEMA ... CREATE DOMAIN support
On 02/12/2024 03:15, Tom Lane wrote:
> Michael Paquier writes:
>> If I'm parsing the spec right, the doc mentions in its 5)~6) of the syntax rules in CREATE SCHEMA that non-schema-qualified objects should use the new schema name defined in the CREATE SCHEMA query. So that pretty much settles the rules to use when having a new object that has a reference to a non-qualified object created in the same CREATE SCHEMA query?
> I don't see where you're getting that from? DB2 says that unqualified reference names (not to be confused with unqualified creation-target names) are taken to be in the new schema, but I don't see any corresponding restriction in the spec. What I do see (11.1 SR 6 in SQL:2021) is:
>
>     If <schema path specification> is not specified, then a <schema path
>     specification> containing an implementation-defined <schema name list>
>     that contains the <schema name> of the schema is implicit.
>
> What I read this as is that the "search path" during schema-element creation must include at least the new schema, but can also include some other schemas as defined by the implementation. That makes our behavior compliant, because we can define the other schemas as those in the session's prevailing search_path. (DB2's behavior is also compliant, but they're defining the path as containing only the new schema.)
>
> Also, if SQL intended to constrain the search path for unqualified identifiers to be only the new schema, they'd hardly need a concept of <schema path specification> at all.

I looked up the original paper (MUN-051) that introduced the <schema path specification> and it says, "The paper is proposing the use of paths only to resolve unqualified routine invocations."

That doesn't seem to have been explained much by the rest of the spec, but it is visible in the definition of <path specification>, which says, "Specify an order for searching for an SQL-invoked routine."

I can find nowhere that says that the path can or cannot be used for other objects.
-- 
Vik Fearing
Re: Vacuum statistics
In my opinion, the patches are semantically correct. However, not all dead code has been removed - I'm referring to pgstat_update_snapshot(). Also, the tests need to be fixed. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 11:31:48 -0500, Andres Freund wrote:
> I think it'd be good if we added a test that shows the failure mechanism so
> that we don't re-introduce this in the future. Evidently this failure isn't
> immediately obvious...

Attached is an isolationtest that reliably shows wrong query results.

Greetings,

Andres Freund

>From a666e6da7af9a0af31ae506de8c2c229b713f8a6 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 2 Dec 2024 14:38:11 -0500
Subject: [PATCH v1] isolationtester showing broken index-only bitmap heap scan

---
 .../expected/index-only-bitmapscan.out | 28 ++
 src/test/isolation/isolation_schedule  |  1 +
 .../specs/index-only-bitmapscan.spec   | 85 +++
 3 files changed, 114 insertions(+)
 create mode 100644 src/test/isolation/expected/index-only-bitmapscan.out
 create mode 100644 src/test/isolation/specs/index-only-bitmapscan.spec

diff --git a/src/test/isolation/expected/index-only-bitmapscan.out b/src/test/isolation/expected/index-only-bitmapscan.out
new file mode 100644
index 000..132ff1bda70
--- /dev/null
+++ b/src/test/isolation/expected/index-only-bitmapscan.out
@@ -0,0 +1,28 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_mod s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s2_mod:
+  DELETE FROM ios_bitmap WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare:
+    DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+
+step s1_fetch_1:
+    FETCH FROM foo;
+
+row_number
+----------
+         1
+(1 row)
+
+step s2_vacuum: VACUUM (TRUNCATE false) ios_bitmap;
+step s1_fetch_all:
+    FETCH ALL FROM foo;
+
+row_number
+----------
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 143109aa4da..e3c669a29c7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
 test: two-ids
 test: multiple-row-versions
 test: index-only-scan
+test: index-only-bitmapscan
 test: predicate-lock-hot-tuple
 test: update-conflict-out
 test: deadlock-simple
diff --git a/src/test/isolation/specs/index-only-bitmapscan.spec b/src/test/isolation/specs/index-only-bitmapscan.spec
new file mode 100644
index 000..9962b8dc531
--- /dev/null
+++ b/src/test/isolation/specs/index-only-bitmapscan.spec
@@ -0,0 +1,85 @@
+# index-only-bitmapscan test showing wrong results
+#
+setup
+{
+    -- by using a low fillfactor and a wide tuple we can get multiple blocks
+    -- with just few rows
+    CREATE TABLE ios_bitmap (a int NOT NULL, b int not null, pad char(1024) default '')
+    WITH (AUTOVACUUM_ENABLED = false, FILLFACTOR = 10);
+
+    INSERT INTO ios_bitmap SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+    CREATE INDEX ios_bitmap_a ON ios_bitmap(a);
+    CREATE INDEX ios_bitmap_b ON ios_bitmap(b);
+}
+
+teardown
+{
+    DROP TABLE ios_bitmap;
+}
+
+
+session s1
+
+setup {
+    SET enable_seqscan = false;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+
+# The test query uses an or between two indexes to ensure make it more likely
+# to use a bitmap index scan
+#
+# The row_number() hack is a way to have something returned (isolationtester
+# doesn't display empty rows) while still allowing for the index-only scan
+# optimization in bitmap heap scans, which requires an empty targetlist.
+step s1_prepare {
+    DECLARE foo NO SCROLL CURSOR FOR SELECT row_number() OVER () FROM ios_bitmap WHERE a > 0 or b > 0;
+}
+
+step s1_fetch_1 {
+    FETCH FROM foo;
+}
+
+step s1_fetch_all {
+    FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Don't delete row 1 so we have a row for the cursor to "rest" on.
+step s2_mod
+{
+    DELETE FROM ios_bitmap WHERE a > 1;
+}
+
+# Disable truncation, as otherwise we'll just wait for a timeout while trying
+# to acquire the lock
+step s2_vacuum { VACUUM (TRUNCATE false) ios_bitmap; }
+
+permutation
+    # Vacuum first, to ensure VM exists, otherwise the bitmapscan will consider
+    # VM to be size 0, due to caching. Can't do that in setup because
+    s2_vacuum
+
+    # delete nearly all rows, to make issue visible
+    s2_mod
+    # create a cursor
+    s1_begin
+    s1_prepare
+
+    # fetch one row from the cursor, that ensures the index scan portion is done
+    # before the vacuum in the next step
+    s1_fetch_1
+
+    # with the bug this vacuum will mark pages as all-visible that the scan in
+    # the next step then considers all-visible, despite all rows from those
+    # pages having been removed.
+    s2_vacuum
+    # if this returns any rows, we're busted
+    s1_fetch_all
+
+    s1_commit
-- 
2.45.2.746.g06e570c0df.dirty
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 11:43 AM Tom Lane wrote:
> Whether that's true or not, it seems like it'd be worth bisecting
> to see if we can finger a commit where the behavior changed (and
> the same goes for the question of why-isnt-it-an-IOS-scan). However,
> the reproducer seems to have quite a low failure probability for me,
> which makes it hard to do bisection testing with much confidence.
> Can we do anything to make the test more reliable? If I'm right
> to suspect autovacuum, maybe triggering lots of manual vacuums
> would improve the odds?

I agree that autovacuum (actually, VACUUM) is important here. I find that the test becomes much more reliable if I create the test table "with (autovacuum_analyze_scale_factor=0.99, vacuum_truncate=off)". More importantly, rather than relying on autovacuum, I just run VACUUM manually from psql. I find it convenient to use "\watch 0.01" to run VACUUM repeatedly.

-- 
Peter Geoghegan
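For reference, the repro recipe described above can be sketched like this (the table and column names are illustrative, not the actual test case):

```sql
-- Session 1: a small table whose pages VACUUM will quickly mark
-- all-visible, using the settings mentioned above.
CREATE TABLE vm_race (a int NOT NULL)
    WITH (autovacuum_analyze_scale_factor = 0.99, vacuum_truncate = off);
INSERT INTO vm_race SELECT g FROM generate_series(1, 10000) g;
CREATE INDEX ON vm_race (a);

-- Session 2: hammer the table with manual vacuums while session 1
-- repeatedly runs the suspect bitmap-scan query.
VACUUM vm_race;
\watch 0.01
```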
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 12:02:39 -0500, Tom Lane wrote:
> Andres Freund writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
>
> Yup. I am saying that that qualifies as too-aggressive setting of the
> all-visible bit. I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

How? This basically would mean we could never set all-visible if there is *any* concurrent scan on the current relation, because any concurrent scan could have an outdated view of all-visible.

Afaict this isn't an issue of "too-aggressive setting of the all-visible bit", it's an issue of setting it at all.

Greetings,

Andres Freund
Re: meson missing test dependencies
On 02.12.24 11:50, Nazir Bilal Yavuz wrote:
> I applied your patches and the cube example worked. But when I edited
> 'test_json_parser_perf.c' and 'test_json_parser_incremental.c', meson did
> not rebuild. I used the 'meson test -C build --suite setup --suite
> test_json_parser' command to test test_json_parser. Did I do something wrong?

Seems to work for me. I edited test_json_parser_incremental.c and then ran:

$ meson test -C build --suite setup --suite test_json_parser -v
ninja: Entering directory `/Users/peter/devel/postgresql/postgresql/build'
[2/2] Linking target src/test/modules/test_json_parser/test_json_parser_incremental
1/7 postgresql:setup / tmp_install  RUNNING ...

Without my patch, you don't get the "Linking target ..." output.
Re: Remove useless casts to (void *)
On Tue, Dec 3, 2024 at 7:06 AM Tom Lane wrote:
> This is from hake[1], which is running OpenIndiana/illumos.
> That platform shows a couple of other strange warnings, so maybe
> re-eliminating these two is not worth worrying about, but
> nonetheless the casts to void * were doing something here.

I wouldn't change that. illumos is selecting the old pre-standard declaration here, but it knows the standard one:

https://github.com/illumos/illumos-gate/blob/27ecbff00d8c86a2647d6fe325cacb220d712115/usr/src/uts/common/sys/shm.h#L129
https://illumos.org/man/2/shmdt

I don't know why we have only one tiny issue if the system headers think we want pre-POSIX stuff. I wonder if the particular header has incorrect guarding, but I don't know how that is supposed to work.
Re: sunsetting md5 password support
On Wed, Nov 20, 2024 at 08:17:07PM -0500, Greg Sabino Mullane wrote:
> Sounds good to me. I think my hesitation was more that the hint was
> overpromising help, so big +1 to more detail and keeping it.

Committed. If anyone wants to try putting together a patch that expands the "migrating to SCRAM" section of the docs before I get to it, please be my guest.

--
nathan
Re: More CppAsString2() in psql's describe.c
> On 2 Dec 2024, at 08:52, Tom Lane wrote:
> But isn't there a way to improve the macro so this'd lead to an error?

That sounds like a pretty decent improvement in general. I experimented with a quick hack using a typeof check on the passed symbol which catches symbol-name typos. It might be totally unfit for purpose but it was an interesting hack.

#define CppAsString2(x) ((__builtin_types_compatible_p(__typeof__(x),char *) ?: CppAsString(x)))

It does require changing any uses of the macro in string generation from

f("pre" CppAsString2(SYM) "post");

into

f_v("pre%spost", CppAsString2(SYM));

however, and using it as part of another macro (TABLESPACE_VERSION_DIRECTORY) doesn't work.

--
Daniel Gustafsson
Re: RISC-V animals sporadically produce weird memory-related failures
On Tue, Dec 3, 2024 at 7:00 AM Alexander Lakhin wrote:
> A build made with clang-19 without llvm passed `make check` successfully.

We heard in another thread[1] that we'd need to use the JITLink API for RISCV, instead of the RuntimeDyld API we're using. I have a newer patch to use JITLink on all architectures, starting at some LLVM version, but it needs a bit more polish and research before sharing.

I'm surprised it's segfaulting instead of producing an error of some sort, though. I wonder why. It would be nice if we could fail gracefully instead.

Hmm, from a quick look in the LLVM main branch, it looks like a bunch of RISCV stuff just landed in recent months under llvm/lib/ExecutionEngine/RuntimeDyld, so maybe that's not true anymore on bleeding-edge LLVM (20-devel). I have no idea what state that's in, but IIUC there is no way RuntimeDyld could work on LLVM 16 or 19.

[1] https://www.postgresql.org/message-id/flat/20220829074622.2474104-1-alex.fan.q%40gmail.com
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 11:52 AM Andres Freund wrote:
> I think the problematic scenario involves tuples that *nobody* can see. During
> the bitmap index scan we don't know that though.

Right, exactly.

FWIW, this same issue is why it is safe for nbtree to drop its pin early during plain index scans, but not during index-only scans -- see _bt_drop_lock_and_maybe_pin, and the nbtree/README section on making concurrent TID recycling safe.

Weirdly, nbtree is specifically aware that it needs to *not* drop its pin in the context of index-only scans (to make sure that VACUUM cannot do unsafe concurrent TID recycling) -- even though an equivalent index scan would be able to drop its pin like this.

The underlying reason why nbtree can discriminate like this is that it "knows" that plain index scans will always visit the heap proper. If a TID points to an LP_UNUSED item, then it is considered dead to the scan (even though in general the heap page itself might be marked all-visible). If some completely unrelated, newly inserted heap tuple is found instead, then it cannot be visible to the plain index scan's MVCC snapshot (it has to be an MVCC snapshot for the leaf page pin to get dropped like this).

--
Peter Geoghegan
Re: Vacuum statistics
On 02.12.2024 11:27, Alexander Korotkov wrote:
> Hi, Alena!
>
> On Wed, Nov 13, 2024 at 6:21 PM Alena Rybakina wrote:
> > Updated 0001-v13 attached, as well as the diff between v12 and v13.
> > Thank you) And I agree with your changes. And included them in patches.
>
> Thank you for the updated patchset. Some points from me.
>
> * I've read the previous discussion on how important it is to keep all
> these fields regarding vacuum statistics, including the points by Andrei
> and Jim. It still worries me that the statistics volume is going to grow
> about 3 times, but I don't have a particular proposal for a more granular
> approach. I wonder if you could propose something.
>
> * Previously PGSTAT_FILE_FORMAT_ID got increased by 1. Your 0001 patch
> increases it by 2. It's a minor note, but I'd like to keep the tradition.
>
> * The commit message for 0001 looks nice, but the commit messages of
> 0002, 0003, and 0004 look messy. Could you please rearrange them?
>
> * The distinction between 0001 and 0002 is not clear. The first line of
> 0001 is "Machinery for grabbing an extended vacuum statistics on heap
> relations", the first line of 0002 is "Machinery for grabbing an extended
> vacuum statistics on heap and index relations." I guess 0001 should be
> about heap relations while 0002 should be about just index relations.
> Is this correct?
>
> * I guess these statistics should work for any table AM, based on what
> has been done in the relation_vacuum() interface method. If that's
> correct, we need to get rid of "heap" terminology and use "table"
> instead.
>
> * 0004 should be a pure documentation patch, but it seems to contain
> changes to isolation tests. Please move them into a more appropriate
> place.

Thank you for your valuable feedback. I am already carefully processing your comments and will update the patches soon.

I will think about what can be done to address the problem of increasing the volume of statistics; perhaps it will be possible to implement a GUC that, when enabled, will accumulate additional information on vacuum statistics. For example, this way one could group statistics by buffers and vacuum statistics.

--
Regards,
Alena Rybakina
Postgres Professional
Re: RISC-V animals sporadically produce weird memory-related failures
Hi Alexander,

On Mon, Dec 2, 2024, at 2:00 PM, Alexander Lakhin wrote:
> These crashes are hardly related to code changes, so maybe there are
> platform-specific issues still...

I naively assumed that because llvm and clang are available in Trixie on riscv64, I could simply install them and enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to revert the --with-llvm again. Sorry about not testing first without submitting results.

> Unfortunately, the log files saved don't include coredump information,
> maybe because of inappropriate core_pattern.

I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default /etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host, but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in a core file being written to the current directory, like it was before the upgrade.

Best regards,

Tom
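For anyone double-checking a similar setup, the two knobs mentioned above can be inspected read-only from inside the container; a minimal sketch (Linux-specific, no root required):

```shell
# Per-process core file size limit -- this is what limits.conf and
# the limits.d overrides ultimately control for login sessions.
ulimit -c

# Where the kernel writes core files; a pattern starting with "|"
# means a helper such as apport or systemd-coredump intercepts them.
cat /proc/sys/kernel/core_pattern 2>/dev/null || echo "core_pattern not readable (non-Linux?)"
```

If `ulimit -c` prints 0 or core_pattern points at a missing helper, crashes will leave no core file for the buildfarm client to pick up.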
Re: Support POSITION with nondeterministic collations
On 26.08.24 08:09, Peter Eisentraut wrote:
> This patch allows using text position search functions with
> nondeterministic collations. These functions are
>
> - position, strpos
> - replace
> - split_part
> - string_to_array
> - string_to_table
>
> which all use common internal infrastructure.

Some exploratory testing could be useful here. The present test coverage was already quite helpful during development, but there is always the possibility that something was overlooked.

Here is a freshly rebased patch version, no functionality changes.

From 272befa6ab78d94ac35d24203da415545b3ad2d9 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:06:28 +0100
Subject: [PATCH v2] Support POSITION with nondeterministic collations

This allows using text position search functions with nondeterministic collations. These functions are

- position, strpos
- replace
- split_part
- string_to_array
- string_to_table

which all use common internal infrastructure.

There was previously no internal implementation of this, so it was met with a not-supported error. This adds the internal implementation and removes the error.

Unlike with deterministic collations, the search cannot use any byte-by-byte optimized techniques but has to go substring by substring. We also need to consider that the found match could have a different length than the needle and that there could be substrings of different length matching at a position. In most cases, we need to find the longest such substring (greedy semantics).
Discussion: https://www.postgresql.org/message-id/flat/582b2613-0900-48ca-8b0d-340c06f4d...@eisentraut.org --- src/backend/utils/adt/varlena.c | 105 ++-- .../regress/expected/collate.icu.utf8.out | 154 +++--- src/test/regress/sql/collate.icu.utf8.sql | 36 +++- 3 files changed, 247 insertions(+), 48 deletions(-) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 533bebc1c7b..3fda7a8aeea 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -54,7 +54,9 @@ typedef struct varlena VarString; */ typedef struct { + pg_locale_t locale; /* collation used for substring matching */ boolis_multibyte_char_in_char; /* need to check char boundaries? */ + boolgreedy; /* find longest possible substring? */ char *str1; /* haystack string */ char *str2; /* needle string */ @@ -65,7 +67,13 @@ typedef struct int skiptablemask; /* mask for ANDing with skiptable subscripts */ int skiptable[256]; /* skip distance for given mismatched char */ + /* +* Note that with nondeterministic collations, the length of the last +* match is not necessarily equal to the length of the "needle" passed in. 
+*/ char *last_match; /* pointer to last match in 'str1' */ + int last_match_len; /* length of last match */ + int last_match_len_tmp; /* same but for internal use */ /* * Sometimes we need to convert the byte position of a match to a @@ -1178,15 +1186,21 @@ text_position(text *t1, text *t2, Oid collid) TextPositionState state; int result; + check_collation_set(collid); + /* Empty needle always matches at position 1 */ if (VARSIZE_ANY_EXHDR(t2) < 1) return 1; /* Otherwise, can't match if haystack is shorter than needle */ - if (VARSIZE_ANY_EXHDR(t1) < VARSIZE_ANY_EXHDR(t2)) + if (VARSIZE_ANY_EXHDR(t1) < VARSIZE_ANY_EXHDR(t2) && + pg_newlocale_from_collation(collid)->deterministic) return 0; text_position_setup(t1, t2, collid, &state); + /* don't need greedy mode here */ + state.greedy = false; + if (!text_position_next(&state)) result = 0; else @@ -1217,18 +1231,17 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state) { int len1 = VARSIZE_ANY_EXHDR(t1); int len2 = VARSIZE_ANY_EXHDR(t2); - pg_locale_t mylocale; check_collation_set(collid); - mylocale = pg_newlocale_from_collation(collid); + state->locale = pg_newlocale_from_collation(collid); - if (!mylocale->deterministic) - ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), -errmsg("nondeterministic collations are not supported for substring searches"))); + /* +* Most callers need greedy mode, but some might want to unset this to +* optimize. +*/ + state->greedy = true; - Assert(len1 > 0); Assert(len2 > 0);
Re: Make tuple deformation faster
On Sat, 30 Nov 2024 at 02:54, Victor Yegorov wrote:
> I've been testing this patch for the last week, I have M3 and i7 based
> MBP around.

Thanks for having a look at this and running the benchmarks.

> Construct
>    sizeof(FormData_pg_attribute) * (src)->natts
> is used in 7 places (in various forms), I thought it might be good
> to use a macro here, say TupleArraySize(natts).

I ended up adjusting the code here so that TupleDescSize() returns the full size and TupleDescAttrAddress() manually calculates the offset to start the FormData_pg_attribute array. That allows TupleDescFullSize() to be deleted.

I changed how TupleDescCopy() works as it used to perform the memcpy in 2 parts. I've changed that to now perform a single memcpy() and reset the ->attrs field after the memcpy so that it correctly points to the address for its own TupleDesc rather than the one from the source.

> In v4-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch
>
> +#define COMPACT_ATTR_IS_PACKABLE(att) \
> +    ((att)->attlen == -1 && att->attispackable)
>
> Seems second att needs parenthesis around it.

Adjusted. Thanks.

> Although I haven't seen 30% speedup, I find this change very good to have.

I think there's a bit more juice to squeeze out still. I started another thread for a much different approach to increasing the tuple deform performance over in [1]. The benchmarks I showed over there show the results with all the v4 patches on this thread plus that patch, and also another set of results from just the v4 patches from here. My Apple M2 very much likes the patch from the other thread. I don't have any decent Intel hardware to test on.

I've attached a v5 set of patches, which I think addresses everything you mentioned. I've also shuffled the patches around a little to how I think they should be committed. Here's a summary:

v5-0001: Adds the CompactAttribute struct, includes it in TupleDesc and adds all the code to populate it. Also includes a very small number of users of CompactAttribute.

v5-0002: Adjusts dozens of locations to use the CompactAttribute struct instead of the Form_pg_attribute struct. Lots of churn, but not complex changes. Separated out from v5-0001 so it's easier to see the important changes 0001 is making.

v5-0003: Change the CompactAttribute.attalign char field to a uint8 attalignby field to optimise alignment calculations and remove branching.

v5-0004: Delete the now unused pg_attribute.attcacheoff column and Form_pg_attribute field.

v5-0005: This is the patch from [1] rebased atop of this patch set. I'll pick up the discussion on that thread, but offering a rebased version here in case you'd like to try that one.

I spent all day today reviewing and fixing up a few missing comments for the v5 patch series. I'm quite happy with these now. If nobody else wants to look or test, I plan on pushing these tomorrow (Tuesday UTC+13). If anyone wants me to delay so they can look, they better let me know soon.

David

[1] https://postgr.es/m/caaphdvo9e0xg71wrefyarv5n4xnplk4k8ljd0msr3c9kr2v...@mail.gmail.com

v5-0001-Introduce-CompactAttribute-array-in-TupleDesc.patch
Description: Binary data

v5-0002-Use-CompactAttribute-instead-of-FormData_pg_attri.patch
Description: Binary data

v5-0003-Optimize-alignment-calculations-in-tuple-form-def.patch
Description: Binary data

v5-0004-Remove-pg_attribute.attcacheoff-column.patch
Description: Binary data

v5-0005-Speedup-tuple-deformation-with-additional-functio.patch
Description: Binary data
RE: Conflict detection for update_deleted in logical replication
On Friday, November 29, 2024 6:35 PM Kuroda, Hayato/黒田 隼人 wrote:
>
> Dear Hou,
>
> Thanks for updating the patch! Here are my comments mainly for 0001.

Thanks for the comments!

> 02. maybe_advance_nonremovable_xid
>
> ```
> +        case RCI_REQUEST_PUBLISHER_STATUS:
> +            request_publisher_status(data);
> +            break;
> ```
>
> I think the part is not reachable because the transit
> RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATUS
> is done in get_candidate_xid()->request_publisher_status().
> Can we remove this?

I changed the code to call maybe_advance_nonremovable_xid() after changing the phase in get_candidate_xid/wait_for_publisher_status, so that the code is reachable.

> 05. request_publisher_status
>
> ```
> +    if (!reply_message)
> +    {
> +        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);
> +
> +        reply_message = makeStringInfo();
> +        MemoryContextSwitchTo(oldctx);
> +    }
> +    else
> +        resetStringInfo(reply_message);
> ```
>
> Same lines exist in two functions: can we provide an inline function?

I personally feel these codes may not be worth a separate function since they are simple. So I didn't change this in this version.

> 06. wait_for_publisher_status
>
> ```
> +    if (!FullTransactionIdIsValid(data->last_phase_at))
> +        data->last_phase_at = FullTransactionIdFromEpochAndXid(data->remote_epoch,
> +                                                               data->remote_nextxid);
> ```
>
> Not sure, is there a possibility that data->last_phase_at is valid here? It is
> initialized just before transiting to RCI_WAIT_FOR_PUBLISHER_STATUS.

Oh, I think last_phase_at should be initialized only in the first phase. Fixed.

Other comments look good to me and have been addressed in V13.

Best Regards,
Hou zj
Re: Re: Added prosupport function for estimating numeric generate_series rows
On Sat, 30 Nov 2024 at 00:38, tsinghualucky...@foxmail.com wrote:
>
> Dear Dean Rasheed, I have reviewed the v4 patch and it is very thoughtful and
> reasonable, with a very clever attention to detail (plus I am very happy that
> we can get rid of the goto, which I was not a big fan of).
>
> This patch looks very good and I have no complaints about it. Thanks again
> for your help from beginning to end!

Cool. Patch committed.

Regards,
Dean
Re: Virtual generated columns
On Fri, Nov 29, 2024 at 3:16 PM Peter Eisentraut wrote:
>
> On 14.11.24 10:46, Amit Kapila wrote:
> >> Moreover, we would have to implement some elaborate cross-checks if a
> >> table gets added to a publication. How would that work? "Can't add
> >> table x to publication because it contains a virtual generated column
> >> with a non-simple expression"? With row filters, this is less of a
> >> problem, because the row filter is a property of the publication.
> >>
> > Because virtual generated columns work in row filters, so I thought it
> > could follow the rules for column lists as well. If the virtual column
> > doesn't adhere to the rules of the row filter then it shouldn't even
> > work there. My response was based on the theory that the expression
> > for virtual columns could be computed during logical decoding. So,
> > let's first clarify that before discussing this point further.
>
> Row filter expressions have restrictions that virtual columns do not
> have. For example, row filter expressions cannot use user-defined
> functions. If you have a virtual column that uses a user-defined
> function and then you create a row filter using that virtual column, you
> get an error when you create the publication. (This does not work
> correctly in the posted patches, but it will in v10 that I will post
> shortly.) This behavior is ok, I think; you get the error when you
> write the faulty expression, and it's straightforward to implement.
>

Fair enough, but the same argument applies to the column list. I mean to say that, based on the same theory, users will get the ERROR when an unsupported virtual column type is used in the column list.

> Now let's say that we implement what you suggest that we compute virtual
> columns during logical decoding. Then we presumably need similar
> restrictions, like not allowing user-defined functions.
>
> Firstly, I don't know if that would be such a good restriction. For row
> filters, that's maybe ok, but for virtual columns, you want to be able
> to write complex and interesting expressions, otherwise you wouldn't
> need a virtual column.
>
> And secondly, we'd then need to implement logic to check that you can't
> add a table with a virtual column with a user-defined function to a
> publication. This would happen not when you write the expression but
> only later when you operate on the table or publication. So it's
> already a dubious user experience.
>
> And the number of combinations and scenarios that you'd need to check
> there is immense. (Not just CREATE PUBLICATION and ALTER PUBLICATION,
> but also CREATE TABLE when a FOR ALL TABLES publication exists, ALTER
> TABLE when new columns are added, new partitions are attached, and so
> on.) Maybe someone wants to work on that, but that's more than I am
> currently signed up for. And given the first point, I'm not sure if
> it's even such a useful feature.
>
> I think, for the first iteration of this virtual generated columns
> feature, the publish_generated_columns option should just not apply to
> it.
>

Ok. But as mentioned above, we should consider it for the column list.

> Whether that means renaming the option or just documenting this is
> something for discussion.
>

We can go either way. Say, if we just document it and in the future we want to support it for virtual columns, then we need to introduce another boolean option like publish_generated_virtual_columns. The other possibility is that we change publish_generated_columns to an enum or string and allow values 's' (stored), 'v' (virtual), and 'n' (none). Now, only 's' and 'n' will be supported. In the future, if one wishes to add support for virtual columns, we have a provision to extend the existing option.

--
With Regards,
Amit Kapila.
Use Python "Limited API" in PL/Python
This patch changes PL/Python to use the Python "limited API". This API has stronger ABI stability guarantees.[0] This means you can build PL/Python against any Python 3.x version and use any other Python 3.x version at run time. This is especially useful for binary packages where the operating system does not come with a fixed suitable version of Python.

For example, Postgres.app (for macOS) would prefer to link against the Python version supplied by python.org (Python.app). But that has a 3.x version that changes over time. So instead they bundle a Python version inside Postgres.app.

The Windows installer used to also bundle Python, but as of PG17 you have to get it yourself, and you have to get a very specific version [1], which is unsatisfactory.

This patch fixes that: You can use any Python version independent of what PL/Python was built against. (There is a mechanism to say "at least 3.N", but for this patch we don't need that; we can stick with the current minimum of 3.2.)

(I have only tested the macOS side of this, not the Windows side. In fact, the patch currently doesn't build on Windows on CI. I haven't figured out why.)

For Linux-style packaging, I don't think this would have any benefit for users right now, since the OS comes with a Python installation and all the packages are built against that. But it could potentially be helpful for packagers. For example, on Debian, this could detach the postgresql packages from Python version transitions. But AFAICT, the Python packaging layout is not prepared for that. (There are only libpython3.x.so libraries, no libpython3.so that one would have to link against.)

Finally, I think this patch is part of a path toward making PL/Python thread-safe. I don't think the patch by itself changes anything, but if you read through [2], using heap types is part of the things mentioned there.
[0]: https://docs.python.org/3/c-api/stable.html [1]: https://github.com/EnterpriseDB/edb-installers/blob/REL-17/server/resources/installation-notes.html#L34-L36 [2]: https://docs.python.org/3/howto/isolating-extensions.html From 76d4e4b628f9e7913b5c361dbbd4c0ea9d570146 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 2 Dec 2024 08:53:42 +0100 Subject: [PATCH 1/2] Remove obsolete Python version check The checked version is already the current minimum supported version (3.2). --- src/pl/plpython/plpy_exec.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/src/pl/plpython/plpy_exec.c b/src/pl/plpython/plpy_exec.c index 0e84bb90829..00747bb811b 100644 --- a/src/pl/plpython/plpy_exec.c +++ b/src/pl/plpython/plpy_exec.c @@ -1066,13 +1066,7 @@ PLy_procedure_call(PLyProcedure *proc, const char *kargs, PyObject *vargs) PG_TRY(); { -#if PY_VERSION_HEX >= 0x0302 - rv = PyEval_EvalCode(proc->code, -proc->globals, proc->globals); -#else - rv = PyEval_EvalCode((PyCodeObject *) proc->code, -proc->globals, proc->globals); -#endif + rv = PyEval_EvalCode(proc->code, proc->globals, proc->globals); /* * Since plpy will only let you close subtransactions that you base-commit: 2f696453d2b39fea800d5f7d8e5d3e1a2266de24 -- 2.47.1 From a031a128140f3f4c61e6a5468c1a603f547a986b Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 2 Dec 2024 08:53:42 +0100 Subject: [PATCH 2/2] Use Python "Limited API" in PL/Python This allows building PL/Python against any Python 3.x version and using another Python 3.x version at run time. Implementation details: - Convert static types to heap types (https://docs.python.org/3/howto/isolating-extensions.html#heap-types). - Replace PyRun_String() with component functions. - Replace PyList_SET_ITEM() with PyList_SetItem(). 
--- src/pl/plpython/plpy_cursorobject.c | 71 +--- src/pl/plpython/plpy_planobject.c| 61 +++-- src/pl/plpython/plpy_procedure.c | 5 +- src/pl/plpython/plpy_resultobject.c | 98 +--- src/pl/plpython/plpy_subxactobject.c | 41 +++- src/pl/plpython/plpy_typeio.c| 6 +- src/pl/plpython/plpython.h | 2 + 7 files changed, 179 insertions(+), 105 deletions(-) diff --git a/src/pl/plpython/plpy_cursorobject.c b/src/pl/plpython/plpy_cursorobject.c index 6108384c9a5..39bcae3f1d9 100644 --- a/src/pl/plpython/plpy_cursorobject.c +++ b/src/pl/plpython/plpy_cursorobject.c @@ -20,7 +20,7 @@ #include "utils/memutils.h" static PyObject *PLy_cursor_query(const char *query); -static void PLy_cursor_dealloc(PyObject *arg); +static void PLy_cursor_dealloc(PLyCursorObject *self); static PyObject *PLy_cursor_iternext(PyObject *self); static PyObject *PLy_cursor_fetch(PyObject *self, PyObject *args); static PyObject *PLy_cursor_close(PyObject *self, PyObject *unused); @@ -33,22
Re: Converting README documentation to Markdown
On 01.10.24 22:15, Daniel Gustafsson wrote:
>> On 1 Oct 2024, at 16:53, Jelte Fennema-Nio wrote:
>> On Tue, 1 Oct 2024 at 15:52, Daniel Gustafsson wrote:
>>> Apart from this, I don't like changing the placeholders to < foo >.
>>> In some cases, this really decreases readability. Maybe we should
>>> look for different approaches there.
>>
>> Agreed.
>
> I took a stab at some of them in the attached. The usage in
> src/test/isolation/README is seemingly the hardest to replace and I'm
> not sure how we should proceed there.
>
>> One way to improve the isolation/README situation is by:
>> 1. indenting the standalone lines by four spaces to make it a code block
>> 2. for the inline cases, replace with `` or `foo`
>
> If we go for following Markdown syntax then for sure, if not it will
> seem a bit off I think.

I took another look through this discussion. I think the v4 patches from 2024-10-01 are a good improvement. I suggest you commit them and then we can be done here.
Re: Memory leak in WAL sender with pgoutput (v10~)
On Mon, Dec 2, 2024 at 12:49 PM Michael Paquier wrote:
>
> On Mon, Dec 02, 2024 at 11:47:09AM +0530, Amit Kapila wrote:
> > We already have PGOutputData->cachectx which could be used for it. I
> > think we should be able to reset such a context when we are
> > revalidating the publications. Even, if we want a new context for some
> > localized handling, we should add that in PGOutputData rather than a
> > local context as the proposed patch is doing at the very least for
> > HEAD.
>
> cachectx is used for the publications and the hash table holding
> all the RelationSyncEntry entries, but we lack control of individual
> parts within it. So you cannot reset the whole context when
> processing a publication invalidation. Perhaps adding that to
> PGOutputData would be better, but that would be inconsistent with
> RelationSyncCache.
>

AFAICS, RelationSyncCache is not allocated in PGOutputData->cachectx. It is allocated in CacheMemoryContext, see the caller of init_rel_sync_cache(). I think you are talking about individual hash entries. Ideally, we could free all entries together and reset cachectx, but right now we are freeing the allocated memory in those entries, if required, at the next access. So, resetting the entire PGOutputData->cachectx won't be possible. But I don't get why adding a new context in PGOutputData for publications would be inconsistent with RelationSyncCache? Anyway, I think it would be okay to retail-free in this case, see the below responses.

> > Can't we consider freeing the publication names individually that can
> > be backpatchable and have no or minimal risk of breaking anything?
>
> Sure. The first thing I did was a loop that goes through the
> publication list and does individual pfree() for the publication
> names. That works, but IMO that's weird as we rely on the internals
> of GetPublication() hidden two levels down in pg_publication.c.
>

We can look at it from a different angle, which is that FreePublication(s) relies on how the knowledge of the Publication structure is built. So it doesn't look weird if we think from that angle.

> >> I am slightly concerned about the current design of GetPublication()
> >> in the long-term, TBH. LoadPublications() has hidden the leak behind
> >> two layers of routines in the WAL sender, and that's easy to miss once
> >> you call anything that loads a Publication depending on how the caller
> >> caches its data. So I would still choose for modifying the structure
> >> on HEAD removing the pstrdup() for the publication name.
> >>
> >
> > BTW, the subscription structure also used the name in a similar way.
> > This will make the publication/subscription names handled differently.
>
> Good point about the inconsistency, so the name could also be switched
> to a fixed-NAMEDATALEN there if we were to do that. The subscription
> has much more pstrdup() fields, though.. How about having some Free()
> routines instead that deal with the whole cleanup of a single list
> entry? If that's kept close to the GetPublication() and
> GetSubscription() routines, a refresh when changing these structures
> would be hard to miss.
>

We already have FreeSubscription() which frees the name and other things before calling list_free_deep. So I thought a call on those lines for publications wouldn't be a bad idea.

--
With Regards,
Amit Kapila.
Re: Truncate logs by max_log_size
On 29.11.24 21:57, Kirill Gavrilov wrote:
> Same thing applies to log_parameter_max_length, for example.
>
> postgres=# set log_parameter_max_length = '1foo';
> ERROR: invalid value for parameter "log_parameter_max_length": "1foo"
> HINT: Valid units for this parameter are "B", "kB", "MB", "GB", and "TB".
> postgres=# set log_parameter_max_length = '1TB';
> ERROR: invalid value for parameter "log_parameter_max_length": "1TB"
> HINT: Value exceeds integer range.
>
> I think we can leave it as is.

I see. So I guess it is out of scope to change this message here.

Small nitpicks:

1) The indentation of the comment at postgresql.conf.sample is a little bit off

#max_log_size = 0       # max size of logged statement
            # 0 disables the feature

IMHO it looks better like this:

#max_log_size = 0       # max size of logged statement
                        # 0 disables the feature

2) You introduced a trailing whitespace at L34 (Not critical :))

+         Zero disables the setting.

It happens to me all the time, so I usually try to apply my patches in a clean branch just to make sure I didn't miss anything.

Other than that, I have nothing more to add at this point.

Thanks

--
Jim
Re: More CppAsString2() in psql's describe.c
On 02.12.24 08:52, Tom Lane wrote:
>> (Moreover, the current structure assumes that the C character literal
>> syntax used by the PROKIND_* and other symbols happens to be the same
>> as the SQL string literal syntax required in those queries, which is
>> just an accident.)
>
> So? There isn't much about C syntax that isn't an accident. Neither
> literal syntax is going to change, so I don't see why it's problematic
> to rely on them being the same.

For example, if you write

#define RELKIND_RELATION '\x72'

then it won't work anymore.

I was also curious whether

#define FOO 'r'
#define RELKIND_RELATION FOO

would work. It appears it does. But this syntactic construction is quite hard to completely understand currently.
pure parsers and reentrant scanners
This patch series changes several parsers in the backend and contrib modules to use bison pure parsers and flex reentrant scanners. This is ultimately toward thread-safety, but I think it's also just nicer in general, and it might also fix a few possible small memory leaks.

I organized this patch series into very small incremental changes so that it's easier to follow. The final commits should probably be combined a bit more (e.g., one per module).

In this patch series I have so far dealt with

* contrib/cube/
* contrib/seg/
* src/backend/replication/repl_*
* src/backend/replication/syncrep_*

These four needed the whole treatment: pure parser, reentrant scanner, and updated memory handling. Also:

* src/backend/utils/adt/jsonpath_scan.l

This one already had a pure parser and palloc-based memory handling, but not a reentrant scanner, so I just did that.

The above are all pretty similar, so it was relatively easy to work through them once I had the first one figured out.

A couple of things that are still missing in the above:

* For repl_scanner.l, I want to use yyextra to deal with the static variables marked /*FIXME*/, but somehow I made that buggy; I'll need to take another look later.

* For both the replication parser and the syncrep parser, get rid of the global variables to pass back the results. Again, here I need another look later. I confused either myself or the compiler on these.

cube, seg, and jsonpath are about as done as I would want them to be.

Not done yet:

* src/backend/utils/misc/guc-file.l
* src/pl/plpgsql/src/pl_gram.y

These have quite different structures and requirements, so I plan to deal with them separately.

Not relevant for backend thread-safety:

* src/backend/bootstrap/

It might make sense to eventually convert that one as well, just so that the APIs are kept similar. But that could be for later.

Note that the core scanner and parser are already reentrant+pure.
Also, there are various other scanners and parsers in frontends (psql, pgbench, ecpg) that are not relevant for this. (Again, it might make sense to convert some of them later, and some of them are already done.)

AFAICT, all the options and coding techniques used here are already in use elsewhere in the tree, so there shouldn't be any concerns about bison or flex version compatibility.

From 01823a7c975fda47ce220db5bb91ecd0959d5123 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:35:37 +0100
Subject: [PATCH v0 01/15] cube: pure parser and reentrant scanner

Use the flex %option reentrant and the bison option %pure-parser to
make the generated scanner and parser pure, reentrant, and thread-safe.

(There are still some issues in the surrounding integration, see
FIXMEs.)
---
 contrib/cube/cube.c      |  7 +++---
 contrib/cube/cubedata.h  | 15
 contrib/cube/cubeparse.y | 15 +---
 contrib/cube/cubescan.l  | 51 +++-
 4 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 1fc447511a1..bf8fc489dca 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -120,13 +120,14 @@ cube_in(PG_FUNCTION_ARGS)
 	char	   *str = PG_GETARG_CSTRING(0);
 	NDBOX	   *result;
 	Size		scanbuflen;
+	yyscan_t	scanner;
 
-	cube_scanner_init(str, &scanbuflen);
+	cube_scanner_init(str, &scanbuflen, &scanner);
 
-	cube_yyparse(&result, scanbuflen, fcinfo->context);
+	cube_yyparse(&result, scanbuflen, fcinfo->context, scanner);
 
 	/* We might as well run this even on failure. */
-	cube_scanner_finish();
+	cube_scanner_finish(scanner);
 
 	PG_RETURN_NDBOX_P(result);
 }

diff --git a/contrib/cube/cubedata.h b/contrib/cube/cubedata.h
index 96fa41a04e7..8bfcc6e99a2 100644
--- a/contrib/cube/cubedata.h
+++ b/contrib/cube/cubedata.h
@@ -59,14 +59,21 @@ typedef struct NDBOX
 #define CubeKNNDistanceEuclid 17	/* <-> */
 #define CubeKNNDistanceChebyshev 18	/* <=> */
 
+/* for cubescan.l and cubeparse.y */
+/* All grammar constructs return strings */
+#define YYSTYPE char *
+typedef void *yyscan_t;
+
 /* in cubescan.l */
-extern int	cube_yylex(void);
+extern int	cube_yylex(YYSTYPE *yylval_param, yyscan_t yyscanner);
 extern void cube_yyerror(NDBOX **result, Size scanbuflen,
 						 struct Node *escontext,
+						 yyscan_t yyscanner,
 						 const char *message);
-extern void cube_scanner_init(const char *str, Size *scanbuflen);
-extern void cube_scanner_finish(void);
+extern void cube_scanner_init(const char *str, Size *scanbuflen, yyscan_t *yyscannerp);
+extern void cube_scanner_finish(yyscan_t yyscanner);
 
 /* in cubeparse.y */
 extern int	cube_yyparse(NDBOX **result, Size scanbuflen,
-
meson missing test dependencies
I have noticed that under meson many tests don't have dependencies on the build artifacts that they are testing. As an example among many, if you make a source code change in contrib/cube/cube.c (see patch 0001 for a demo) and then run

make -C contrib/cube check

the test run will reflect the changed code, because the "check" targets typically depend on the "all" targets. But if you do this under meson with

meson test -C build --suite setup --suite cube

the code will not be rebuilt first, and the test run will not reflect the changed code.

This seems straightforward to fix, see patch 0002. The meson test setup has support for this, but it seems not widely used.

Patch 0003 is another example, this time for a TAP-style test.

Is there any reason this was not done yet?

From 079a7f902177f5e146c004fb7789c9972c86c5ae Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:48:37 +0100
Subject: [PATCH v0 1/3] XXX source code modification for test demonstration

---
 contrib/cube/cube.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 1fc447511a1..e3ad25a2a39 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -278,7 +278,7 @@ cube_subset(PG_FUNCTION_ARGS)
 		if ((dx[i] <= 0) || (dx[i] > DIM(c)))
 			ereport(ERROR,
 					(errcode(ERRCODE_ARRAY_ELEMENT_ERROR),
-					 errmsg("Index out of bounds")));
+					 errmsg("Index out of bounds XXX")));
 		result->x[i] = c->x[dx[i] - 1];
 		if (!IS_POINT(c))
 			result->x[i + dim] = c->x[dx[i] + DIM(c) - 1];
-- 
2.47.1

From 00b367c86dd14840d8412259f098f09737e73205 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:48:37 +0100
Subject: [PATCH v0 2/3] Add test dependency

---
 contrib/cube/meson.build | 1 +
 1 file changed, 1 insertion(+)

diff --git a/contrib/cube/meson.build b/contrib/cube/meson.build
index 21b6f9c43ad..64ba300495d 100644
--- a/contrib/cube/meson.build
+++ b/contrib/cube/meson.build
@@ -53,6 +53,7 @@ tests += {
   'sd': meson.current_source_dir(),
   'bd': meson.current_build_dir(),
   'regress': {
+    'deps': [cube],
     'sql': [
       'cube',
       'cube_sci',
-- 
2.47.1

From 176dc1eb383305025a3de6a895825de36f197678 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 10:48:37 +0100
Subject: [PATCH v0 3/3] Add test dependency for TAP module

---
 src/test/modules/test_json_parser/meson.build | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/test/modules/test_json_parser/meson.build b/src/test/modules/test_json_parser/meson.build
index 059a8b71bde..f5a8ea44eed 100644
--- a/src/test/modules/test_json_parser/meson.build
+++ b/src/test/modules/test_json_parser/meson.build
@@ -61,5 +61,8 @@ tests += {
       't/003_test_semantic.pl',
       't/004_test_parser_perf.pl'
     ],
+    'test_kwargs': {
+      'depends': [test_json_parser_incremental, test_json_parser_perf],
+    },
   },
 }
-- 
2.47.1
fixing tsearch locale support
Infamously, the tsearch locale support in src/backend/tsearch/ts_locale.c still depends on libc environment variable locale settings and is not caught up with pg_locale_t, collations, ICU, and all that newer stuff. This is used in the tsearch facilities themselves, but also in other modules such as ltree, pg_trgm, and unaccent.

Several of the functions are wrappers around the corresponding ctype functions, like

int
t_isalpha(const char *ptr)
{
	int			clen = pg_mblen(ptr);
	wchar_t		character[WC_BUF_LEN];
	pg_locale_t mylocale = 0;	/* TODO */

	if (clen == 1 || database_ctype_is_c)
		return isalpha(TOUCHAR(ptr));

	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);

	return iswalpha((wint_t) character[0]);
}

So this has multibyte and encoding awareness, but does not observe locale provider or collation settings.

As an easy start toward fixing this, I think several of these functions we don't even need. t_isdigit() and t_isspace() are just used to parse various configuration and data files, and surely we don't need encoding-dependent multibyte support for parsing ASCII digits and ASCII spaces. At least, I didn't find any indications in the documentation of these file formats that they are supposed to support that kind of thing. So these can be replaced by the normal isdigit() and isspace().

There is one call to t_isprint(), which is similarly used only to parse some flags in a configuration file. From the surrounding code you can deduce that it's only called on single-byte characters, so it can similarly be replaced by plain isprint().

Note, pg_trgm has some compile-time options with macros such as KEEPONLYALNUM and IGNORECASE. AFAICT, these are not documented, and the non-default variant is not supported by any test cases. So as part of this undertaking, I'm going to remove the non-default variants if they are in the way of cleanup.
From 7abc0f2333d8045004911040c856f1522d03b050 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH 1/3] Remove t_isdigit()

---
 contrib/ltree/ltree_io.c                | 8
 src/backend/tsearch/spell.c             | 4 ++--
 src/backend/tsearch/ts_locale.c         | 15 ---
 src/backend/utils/adt/tsquery.c         | 2 +-
 src/backend/utils/adt/tsvector_parser.c | 4 ++--
 src/include/tsearch/ts_locale.h         | 1 -
 6 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 11eefc809b2..b54a15d6c68 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -411,7 +411,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITFNUM:
 				if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (t_isdigit(ptr))
+				else if (isdigit((unsigned char) *ptr))
 				{
 					int			low = atoi(ptr);
 
@@ -429,7 +429,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 					UNCHAR;
 				break;
 			case LQPRS_WAITSNUM:
-				if (t_isdigit(ptr))
+				if (isdigit((unsigned char) *ptr))
 				{
 					int			high = atoi(ptr);
 
@@ -460,7 +460,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITCLOSE:
 				if (t_iseq(ptr, '}'))
 					state = LQPRS_WAITEND;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITND:
@@ -471,7 +471,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 				}
 				else if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITEND:

diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index aaedb0aa852..7800f794e84 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -390,7 +390,7 @@ getNextFlagFromString(IspellDict *Conf, const char **sflagset, char *sflag)
 				*sflagset = next;
 				while (**sflagset)
 				{
-
Re: Sample rate added to pg_stat_statements
On 26.11.2024 01:15, Ilia Evdokimov wrote:

On 22.11.2024 09:08, Alexander Korotkov wrote:

On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this exact argument in depth. One piece of it would be to see how much of such "bottlenecks" we would be able to get rid of by integrating pg_stat_statements into the central pgstats with the custom APIs, without pushing the module into core. This means that we would combine the existing hash of pgss to shrink to 8 bytes for objid rather than 13 bytes now, as the current code relies on (toplevel, userid, queryid) for the entry lookup (entry removal is sniped with these three values as well, or dshash seq scans). The odds of conflicts still play in our favor even if we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with high concurrency, then the spin lock in pgss_store() could become pretty much of a bottleneck. And I'm not sure switching all counters to atomics could somehow improve the situation, given we already have pretty many counters. I'm generally +1 for the approach taken in this thread. But I would suggest introducing a threshold value for query execution time, and sample just everything below that threshold. The slower queries shouldn't be sampled, because they can't be too frequent, and also it could be more valuable to count them individually (while very fast queries probably only matter "on average").

--
Regards,
Alexander Korotkov
Supabase

I really liked your idea, and I'd like to propose an enhancement that I believe improves it further. Yes, if a query's execution time exceeds the threshold, it should always be tracked without sampling.
However, for queries with execution times below the threshold, the sampling logic should prioritize shorter queries over those closer to the threshold. In my view, the ideal approach is for shorter queries to have the highest probability of being sampled, while queries closer to the threshold are less likely to be sampled. This behavior can be achieved with the following logic:

pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time

Here's how it works:

* As a query's execution time approaches zero, the probability of it being sampled approaches one.
* Conversely, as a query's execution time approaches the threshold, the probability of it being sampled approaches zero.

In other words, the sampling probability decreases linearly from 1 to 0 as the execution time gets closer to the threshold. I believe this approach offers an ideal user experience.

I have attached a new patch implementing this logic. Please let me know if you have any feedback regarding the comments in the code, the naming of variables or documentation. I'm always open to discussion.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

I've been closely reviewing my last (v5-*.patch) patch on implementing time-based sampling, and I've realized that it might not be the best approach. Let me explain the reasons:

* We can only perform sampling before the 'pgss_planner()' function. However, at that point, we don't yet know the query's execution time, since it only becomes available during 'pgss_ExecutorEnd()' or 'pgss_ProcessUtility()'.

* If we wait to sample until the execution completes and we have the actual execution time, this introduces a problem: by that point, we might have already recorded the query's statistics into shared memory from 'pgss_planner()', making it too late to decide whether to sample the query.

* Delaying sampling until the execution finishes would require waiting for the execution time, which could introduce performance overhead.
This defeats the purpose of sampling, which aims to reduce the cost of tracking queries.

I believe we should reconsider the approach and revert to sampling based on v4-*.patch. If I've missed anything or there's an alternative way to implement time threshold-based sampling efficiently, I'd be grateful to hear your insights.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Re: Proper object locking for GRANT/REVOKE
On 25.11.24 02:24, Noah Misch wrote:
> commit d31bbfb wrote:
>> --- a/src/backend/catalog/aclchk.c
>> +++ b/src/backend/catalog/aclchk.c
>> @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
>>   * objectNamesToOids
>>   *
>>   * Turn a list of object names of a given type into an Oid list.
>> - *
>> - * XXX: This function doesn't take any sort of locks on the objects whose
>> - * names it looks up. In the face of concurrent DDL, we might easily latch
>> - * onto an old version of an object, causing the GRANT or REVOKE statement
>> - * to fail.
>
> To prevent "latch onto an old version" and remove the last sentence of
> the comment, we'd need two more things:
>
> - Use a self-exclusive lock here, not AccessShareLock. With two sessions
>   doing GRANT under AccessShareLock, one will "latch onto an old version".
>
> - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row
>   that GRANT/REVOKE can affect. For example, today's locking in ALTER
>   FUNCTION is the xmax stamped on the old tuple. If GRANT switched to
>   ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause
>   GRANT to "latch onto an old version".

Ok, we should probably put that comment back in slightly altered form, like "XXX This function intentionally takes only an AccessShareLock ... $REASON. In the face of concurrent DDL, we might easily latch onto an old version of an object, causing the GRANT or REVOKE statement to fail."

> I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
> terminate every autovacuum running in the schema and consume a shared
> lock table entry per table in the schema. I think the user-visible
> benefit of commit d31bbfb plus this additional work is just changing
> "ERROR: tuple concurrently updated" to blocking. That's not nothing,
> but I don't see it outweighing autovacuum termination and lock table
> consumption spikes. What other benefits and drawbacks should we weigh?

I think what you are describing is a reasonable tradeoff.
The user experience is tolerable: "tuple concurrently updated" is a mildly useful error message, and it's probably the table owner executing both commands. The change to AccessShareLock at least prevents confusing "cache lookup failed" messages, and might alleviate some security concerns about swapping in a completely different object concurrently (even if we currently think this is not an actual problem).

> --- a/src/test/isolation/expected/intra-grant-inplace.out
> +++ b/src/test/isolation/expected/intra-grant-inplace.out
> @@ -248,6 +248,6 @@
>  relhasindex
>  ---
>  (0 rows)
>
> -s4: WARNING: got: cache lookup failed for relation REDACTED
> +s4: WARNING: got: relation "intra_grant_inplace" does not exist
>
> The affected permutation existed to cover the first LockRelease() in
> SearchSysCacheLocked1(). Since this commit, that line no longer has
> coverage.

Do you have an idea how such a test case could be constructed now?
Re: Interrupts vs signals
On 02/12/2024 09:32, Thomas Munro wrote:
> On Sat, Nov 23, 2024 at 10:58 AM Heikki Linnakangas wrote:
>> Hmm, so this would replace the maybeSleepingOnInterrupts bitmask I
>> envisioned. Makes a lot of sense. If it's a single bit though, that
>> means that you'll still get woken up by interrupts that you're not
>> waiting for. Maybe that's fine. Or we could merge the
>> maybeSleepingOnInterrupts and pendingInterrupts bitmasks to a single
>> atomic word, so that you would have a separate "maybe sleeping" bit
>> for each interrupt bit, but could still use atomic_fetch_or to
>> atomically read the interrupt bits and announce the sleeping.
>
> I think one bit is fine for now. At least, until we have a serious
> problem with interrupts arriving when you're sleeping but not ready to
> service that particular interrupt. The 'interrupt bit already set,
> don't try to wake me' stuff discussed earlier would limit the number of
> useless wakeups to one, until you eventually are ready and consume the
> interrupt.
>
> The main case I can think of, if we fast forward to the
> all-procsignals-become-interrupts patch (which I'll be rebasing on top
> of this when the next version appears), is that you might receive a
> sinval catchup request, but you might be busy running a long query.
> Sinval catchup messages are only processed between queries, so you just
> keep ignoring them until end of query. I think that's fine, and
> unlikely. Do you have other cases in mind?

Yeah, no, I think one bit is good enough. Let's go with that.

> If there is legitimate use case for a more fine-grained maybe-sleeping
> and I've been too optimistic above, I don't think we should give one
> whole maybe-sleeping bit to each interrupt reason.
> We only have 32 bit atomics (spinlock-based emulation of 64 bit atomics
> is not good enough for this, it's not safe in SIGALRM handlers, at
> least not without a lot more pain; admittedly the SIGALRM handlers
> should eventually be replaced but not for a while) so if we used up two
> bits for every interrupt reason we could handle only 16 interrupt
> reasons, and that's already not enough. Perhaps we could add
> maybe-sleeping bits for classes of interrupt if we ever determine that
> one bit for all of them isn't enough?

If we run out of bits in a single pendingInterrupts word, we can have multiple words. SendInterrupt() and ClearInterrupt() would still only need to manipulate one word, the one holding the bit it's setting/clearing. WaitEventSetWait() would need to touch all of them, or at least all the ones that hold bits you want to wait for. That seems OK from a performance point of view. I don't think we need to go there any time soon though; 32 bits should be enough for the use cases we've been discussing.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: meson missing test dependencies
Hi,

On Mon, 2 Dec 2024 at 13:11, Peter Eisentraut wrote:
>
> I have noticed that under meson many tests don't have dependencies on
> the build artifacts that they are testing. As an example among many, if
> you make a source code change in contrib/cube/cube.c (see patch 0001
> for a demo) and then run
>
> make -C contrib/cube check
>
> the test run will reflect the changed code, because the "check" targets
> typically depend on the "all" targets. But if you do this under meson
> with
>
> meson test -C build --suite setup --suite cube
>
> the code will not be rebuilt first, and the test run will not reflect
> the changed code.
>
> This seems straightforward to fix, see patch 0002. The meson test setup
> has support for this, but it seems not widely used.
>
> Patch 0003 is another example, this time for a TAP-style test.
>
> Is there any reason this was not done yet?

This looks useful. I am not sure why this was not done before. I applied your patches and the cube example worked. But when I edited 'test_json_parser_perf.c' and 'test_json_parser_incremental.c', meson did not rebuild. I used the 'meson test -C build --suite setup --suite test_json_parser' command to test test_json_parser. Did I do something wrong?

--
Regards,
Nazir Bilal Yavuz
Microsoft
Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE
> v4 patch attached

Thank you Guillaume, and nice work! I tried to see if there was anywhere else in the documentation that would need updating, but it looks like you covered everywhere already.

> I'm with Robert in that I've not found the buffer counts to be all that
> useful most of the time.

I find the buffer counts especially helpful for educating newer folks on why things are slow, even when they are not necessary for spotting the issue (for more advanced users). One of my hopes is that by educating and empowering newer users on how I/O relates to performance issues, fewer cases will get escalated to more experienced folks.

> the cases I've seen most recently are those where the output is
> mind-numbingly long already.

Are you mostly seeing query plans that have stumped other people already (eg second or third line support), so perhaps seeing more complex plans than the average user? Both Depesz[1] and Tensor[2] have archives of publicly submitted plans, which I found helpful for checking how slow plans look for users of those tools. I have a similar archive, and while we do not publish them (and there are plenty of huge plans) it also suggests that the majority of slow plans people are reviewing have fewer than 20 nodes.

I realise it's optimistic to think that the time experienced hackers would lose having to sift through longer plans would be gained back by having to do so less often, but I thought it was worth raising as part of the aim.

I also looked into the Slow Query Questions page on the wiki that we ask people to review before posting to pgsql-performance, and noticed that it has suggested requesting buffers for the past 12 years[3].

—
Michael

[1]: https://explain.depesz.com/history
[2]: https://explain.tensor.ru/archive
[3]: https://wiki.postgresql.org/index.php?title=Slow_Query_Questions&diff=18308&oldid=16800
Re: Make tuple deformation faster
On Mon, Dec 2, 2024 at 13:24, David Rowley wrote:
> I ended up adjusting the code here so that TupleDescSize() returns the
> full size and TupleDescAttrAddress() manually calculates the offset to
> start the FormData_pg_attribute array. That allows TupleDescFullSize()
> to be deleted. I changed how TupleDescCopy() works as it used to
> perform the memcpy in 2 parts. I've changed that to now perform a
> single memcpy() and reset the ->attrs field after the memcpy so that it
> correctly points to the address for its own TupleDesc rather than the
> one from the source.

Nice!

> I've attached a v5 set of patches, which I think addresses everything
> you mentioned. I've also shuffled the patches around a little to how I
> think they should be committed.

I'm glad that the patch from "More tuple deformation speedups" is moved here; I wanted to mention that both patches should be committed together. All is good, and the benefits are clearly visible (same setup used).

--
Victor Yegorov
code contributions for 2024, WIP version
Hi,

As many of you are probably aware, I have been doing an annual blog post on who contributes to PostgreSQL development for some years now. It includes information on lines of code committed to PostgreSQL, and also emails sent to the list. This year, I got a jump on analyzing the commit log, and a draft of the data covering January-November of 2024 has been uploaded in pg_dump format to here:

https://sites.google.com/site/robertmhaas/contributions

I'm sending this message to invite anyone who is interested to review the data in the commits2024 table and send me corrections. For example, it's possible that there are cases where I've failed to pick out the correct primary author for a commit; or where somebody's name is spelled in two different ways; or where somebody's name is not spelled the way that they prefer.

You'll notice that the table has columns "lines" and "xlines". I have set xlines=0 in cases where (a) I considered the commit to be a large, mechanical commit such as a pgindent run or translation updates; or (b) the commit was reverting some other commit that occurred earlier in 2024; or (c) the commit was subsequently reverted. When I run the final statistics, those commits will still count for the statistics that count the number of commits, but the lines they inserted will not be counted as lines of code contributed in 2024.

Also for clarity, please be aware that the "ncauthor" column is not used in the final reporting; that is just there so that I can set author=coalesce(ncauthor,committer) at a certain phase of the data preparation. Corrections should be made to the author column, not ncauthor.

If you would like to correct the data, please send me your corrections off-list, as a reply to this email, ideally in the form of one or more UPDATE statements.
If you would like to complain about the methodology, I can't stop you, but please bear in mind that (1) this is already a lot of work and (2) I've always been upfront in my blog post about what the limitations of the methodology are and I do my best not to suggest that this method is somehow perfect or unimpeachable and (3) you're welcome to publish your own blog post where you compute things differently. I'm open to reasonable suggestions for improvement, but if your overall view is that this sucks or that I suck for doing it, I'm sorry that you feel that way but giving me that feedback probably will not induce me to do anything differently.

Donning my asbestos underwear, I remain yours faithfully,

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Remove useless casts to (void *)
Peter Eisentraut writes:
> Committed, thanks.

Now that we have a more-or-less full set of buildfarm results on this, I checked for new warnings, and found two:

pg_shmem.c: In function 'PGSharedMemoryIsInUse':
pg_shmem.c:323:33: warning: passing argument 1 of 'shmdt' from incompatible pointer type [-Wincompatible-pointer-types]
  323 |         if (memAddress && shmdt(memAddress) < 0)
      |                                 ^~~~~~~~~~
      |                                 |
      |                                 PGShmemHeader *
In file included from pg_shmem.c:27:
/usr/include/sys/shm.h:131:11: note: expected 'char *' but argument is of type 'PGShmemHeader *'
  131 | int shmdt(char *);
      |           ^~~~~~

pg_shmem.c: In function 'PGSharedMemoryCreate':
pg_shmem.c:838:37: warning: passing argument 1 of 'shmdt' from incompatible pointer type [-Wincompatible-pointer-types]
  838 |         if (oldhdr && shmdt(oldhdr) < 0)
      |                             ^~~~~~
      |                             |
      |                             PGShmemHeader *
/usr/include/sys/shm.h:131:11: note: expected 'char *' but argument is of type 'PGShmemHeader *'
  131 | int shmdt(char *);
      |           ^~~~~~

This is from hake[1], which is running OpenIndiana/illumos. That platform shows a couple of other strange warnings, so maybe re-eliminating these two is not worth worrying about, but nonetheless the casts to void * were doing something here.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hake&dt=2024-12-02%2017%3A19%3A40&stg=make
Re: Add “FOR UPDATE NOWAIT” lock details to the log.
On 2024/10/18 19:07, Seino Yuki wrote:
>> The choice between adding a new GUC or extending the existing one
>> (e.g., log_lock_waits) is debatable, but I prefer the latter. I'm
>> considering extending log_lock_waits to accept a value like "fail". If
>> set to "on" (the current behavior), detailed logs are generated when
>> the lock wait time exceeds deadlock_timeout. If set to "fail", logs
>> are generated whenever a lock wait fails. If both are specified, logs
>> would be triggered when the wait time exceeds deadlock_timeout or when
>> a lock wait fails.
>
> Thanks for the idea. Changed log_lock_waits to an enum type and added
> fail and all.
>
> "off"  : No log message (default).
> "on"   : If over deadlock_timeout (the current behavior).
> "fail" : If lock failed.
> "all"  : All patterns.

I'm still thinking about how we should handle logging for lock failures caused by the nowait option. Extending the existing log_lock_waits parameter has the advantage of avoiding a new GUC, but it might make configuration more complicated. I'm starting to think adding a new GUC might be a better option.

Regarding the patch, when I applied it to HEAD, it failed to compile with the following errors. Could you update the patch to address this?

proc.c:1538:20: error: use of undeclared identifier 'buf'
 1538 |         initStringInfo(&buf);
      |                         ^
proc.c:1539:20: error: use of undeclared identifier 'lock_waiters_sbuf'
 1539 |         initStringInfo(&lock_waiters_sbuf);
      |                         ^
proc.c:1540:20: error: use of undeclared identifier 'lock_holders_sbuf'
 1540 |         initStringInfo(&lock_holders_sbuf);
      |                         ^

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 11:43:42 -0500, Tom Lane wrote:
> Peter Geoghegan writes:
> > This theory seems very believable.
>
> I'm not convinced. I think there are two assumptions underlying
> bitmap scan:
>
> 1. Regardless of index contents, it's not okay for vacuum to remove
> tuples that an open transaction could potentially see. So the heap
> tuple will be there if we look, unless it was already dead (in which
> case it could have been replaced, so we have to check visibility of
> whatever we find).

I think the problematic scenario involves tuples that *nobody* can see. During the bitmap index scan we don't know that though. Thus the tid gets inserted into the bitmap. Then, before we visit the heap, a concurrent vacuum removes the tuple from the indexes and then the heap and marks the page as all-visible, as the deleted row version has been removed. Then, during the bitmap heap scan, we do a VM lookup and see the page being all-visible and thus assume there's a visible tuple pointed to by the tid.

No snapshot semantics protect against this, as the tuple is *not* visible to anyone.

Greetings,

Andres Freund
Re: Incorrect result of bitmap heap scan.
Peter Geoghegan writes: > On Mon, Dec 2, 2024 at 12:02 PM Tom Lane wrote: >> Yup. I am saying that that qualifies as too-aggressive setting of the >> all-visible bit. I'm not sure what rule we should adopt instead of >> the current one, but I'd much rather slow down page freezing than >> institute new page locking rules. > Freezing a page, and setting a page all-visible are orthogonal. Sorry, sloppy wording on my part. regards, tom lane
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 12:02 PM Tom Lane wrote: > Andres Freund writes: > > I think the problematic scenario involves tuples that *nobody* can see. > > During > > the bitmap index scan we don't know that though. Thus the tid gets inserted > > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes > > the tuple from the indexes and then the heap and marks the page as > > all-visible, as the deleted row version has been removed. > > Yup. I am saying that that qualifies as too-aggressive setting of the > all-visible bit. I'm not sure what rule we should adopt instead of > the current one, but I'd much rather slow down page freezing than > institute new page locking rules. Freezing a page, and setting a page all-visible are orthogonal. Nothing has changed about when or how we set pages all-visible in a long time -- VACUUM has always done that to the maximum extent that its OldestXmin cutoff will allow. (The changes to freezing made freezing work a little bit more like that, the general idea being to batch-up work.) -- Peter Geoghegan
Re: Showing applied extended statistics in explain Part 2
On 19.11.2024 00:38, Tomas Vondra wrote:
> On 11/18/24 22:15, Tomas Vondra wrote:
>> ...
>> So I think the correct solution is to not pass any expressions with
>> RestrictInfo to deparse_expression(). Either by stripping the nodes, or
>> by not adding them at all. The patch tries to do the stripping by
>> maybe_extract_actual_clauses(), but that only looks at the top node, and
>> that is not sufficient here.
>>
>> Maybe it would be possible to walk the whole tree, and remove all the
>> RestrictInfo nodes - including intermediate ones, not just the top. But
>> I'm not quite sure it wouldn't cause issues elsewhere (assuming it
>> modifies the existing nodes). It still feels a bit like fixing a problem
>> we shouldn't really have ...
>
> To make this idea a bit more concrete, here's a patch removing all
> RestrictInfo nodes in show_stat_qual(). But still, it feels wrong.
>
> regards

Yes, removing all RestrictInfo nodes during deparsing using expression_tree_mutator() is not optimal. However, I don't see an alternative. Perhaps it could be done earlier in extended_stats.c to avoid the need for cleanup later ...

-- 
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
[18] Unintentional behavior change in commit e9931bfb75
Commit e9931bfb75 (version 18) contained an unexpected behavior change. LOWER('I') returns:

In the locale tr_TR.iso88599 (single-byte encoding):

            17  18
default     i   ı
specified   ı   ı

In the locale tr_TR.utf8:

            17  18
default     ı   ı
specified   ı   ı

(Look carefully to see the dotted vs dotless "i".)

The behavior is commented (commit 176d5bae1d) in formatting.c:

     * ... When using the default
     * collation, we apply the traditional Postgres behavior that
     * forces ASCII-style treatment of I/i, but in non-default
     * collations you get exactly what the collation says.
     */
    for (p = result; *p; p++)
    {
        if (mylocale)
            *p = tolower_l((unsigned char) *p, mylocale->info.lt);
        else
            *p = pg_tolower((unsigned char) *p);
    }

That's a somewhat strange special case (along with similar ones for INITCAP() and UPPER()) that applies to single-byte encodings with the libc provider and the database default collation only. I assume it was done for backwards compatibility?

My commit e9931bfb75 (version 18) unifies the code paths for default and explicit collations, and in the process it eliminates the special case, and always uses tolower_l() for single-byte libc (whether default collation or not).

Should I put the special case back? If not, it could break expression indexes on LOWER()/UPPER() after an upgrade for users with the database default collation of tr_TR who use libc and a single-byte encoding. But preserving the special case seems weirdly inconsistent -- surely the results should not depend on the encoding, right?

Regards,
	Jeff Davis
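The difference between the two code paths can be sketched like this (Python stand-ins for illustration; the Turkish mapping table is hand-rolled, not libc's actual tolower_l()):

```python
# pg_tolower() only lowercases ASCII A-Z; a Turkish-aware tolower_l() maps
# 'I' to dotless 'ı' (U+0131) and 'İ' (U+0130) to 'i'.  Both functions here
# are simplified stand-ins, not PostgreSQL or libc code.

def pg_tolower(c: str) -> str:
    # ASCII-style treatment: the pre-18 default-collation special case
    return chr(ord(c) + 32) if "A" <= c <= "Z" else c

def tolower_l_tr(c: str) -> str:
    # What the tr_TR collation actually says
    return {"I": "\u0131", "\u0130": "i"}.get(c) or c.lower()

print(pg_tolower("I"))    # 'i'  (dotted: old behavior for default collation)
print(tolower_l_tr("I"))  # 'ı'  (dotless: new unified behavior)
```

This is exactly why an expression index on LOWER(col) built under the old rule would no longer match after the change.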
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 3:55 PM Andres Freund wrote:
> I suspect one contributor to this avoiding attention till now was that the
> optimization is fairly narrow:
>
>     /*
>      * We can potentially skip fetching heap pages if we do not need
>      * any columns of the table, either for checking non-indexable
>      * quals or for returning data.  This test is a bit simplistic, as
>      * it checks the stronger condition that there's no qual or return
>      * tlist at all.  But in most cases it's probably not worth working
>      * harder than that.
>      */
>     need_tuples = (node->ss.ps.plan->qual != NIL ||
>                    node->ss.ps.plan->targetlist != NIL);
>
> Even an entry in the targetlist that does not depend on the current row
> disables the optimization.

Good point. I agree that that factor is likely to have masked the problem over the past 6 years.

-- 
Peter Geoghegan
Re: Count and log pages set all-frozen by vacuum
On Fri, Nov 29, 2024 at 2:46 PM Masahiko Sawada wrote:
> Finally, in case where we have:
>
> visibility map: 1 pages set all-visible, 1 pages set all-frozen.
>
> We can understand that 1 pages newly became all-frozen, but have
> no idea how many pages became all-visible but not all-frozen. It could
> be even 0. Users might want to know it to understand how (auto)vacuum
> and freezing are working well.

I agree that this is possible, but I am not clear why the user should care. In scenario #1, the same pages were set all-visible and also all-frozen. In scenario #2, one set of pages were set all-visible and a different set of pages were set all-frozen. Scenario #2 is a little worse, for a couple of reasons.

First, in scenario #2, more pages were dirtied to achieve the same result. However, if the user is concerned about the amount of I/O that vacuum is doing, they will more likely look at the "buffer usage" and "WAL usage" lines in the VACUUM verbose output rather than at the "visibility map" line.

Second, in scenario #2, we have deferred some work to the future and there is a risk that the pages will remain all-visible but not all-frozen until the next aggressive vacuum. I think it is possible someone could want to see more detailed information for this reason. However, I expect that it will be unimportant in a practical sense if Melanie's work in this area gets committed, because her goal is to set things up so that we'll revisit all-visible but not all-frozen pages sooner, so that the work doesn't build up so much prior to the next aggressive vacuum. And I think that work will have its own logging that will make it clear what it is doing, hence I don't foresee the need for more detail here.

All that said, if you really want this broken out into three categories rather than two, I'm not overwhelmingly opposed to that.
It is always possible that more detail could be useful; it is just difficult to decide where the likelihood is high enough to justify adding more output. -- Robert Haas EDB: http://www.enterprisedb.com
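The ambiguity being discussed can be shown with a toy counter (an invented representation: each page's VM state is modeled as an (all_visible, all_frozen) pair, which is not how the real visibility map is stored):

```python
# Toy accounting for the two scenarios compared above.  The report mimics
# the shape of the "visibility map: N pages set all-visible, M pages set
# all-frozen" log line.

def vacuum_report(transitions):
    """transitions: list of ((av_before, af_before), (av_after, af_after))."""
    set_av = sum(1 for before, after in transitions if not before[0] and after[0])
    set_af = sum(1 for before, after in transitions if not before[1] and after[1])
    return set_av, set_af

# Scenario 1: one page becomes all-visible and all-frozen at the same time.
s1 = [((False, False), (True, True))]

# Scenario 2: one page becomes all-visible only, and a different page
# (already all-visible) becomes all-frozen -- two pages dirtied.
s2 = [((False, False), (True, False)),
      ((True, False), (True, True))]

print(vacuum_report(s1))  # (1, 1)
print(vacuum_report(s2))  # (1, 1) -- identical log line despite more work
```

Both scenarios report the same counts, which is the information loss Sawada-san points out; breaking out "all-visible but not all-frozen" as a third counter would distinguish them.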
Incorrect result of bitmap heap scan.
Hi hackers,

The attached script reproduces the problem of incorrect results from `select count(*)` (it returns a larger number of records than are really present in the table). It is not always reproduced, so you may need to repeat it multiple times - on my system it failed 3 times out of 10.

The problem takes place with pg16/17/18 (other versions I have not checked). The test is called `test_ios` (index-only scan), but that is not accurate: an index-only scan is not actually used in this case. And this is the first question to PG17/18: IOS is not used when the number of records is less than 100k (for this particular table):

postgres=# create table t(pk integer primary key);
CREATE TABLE
postgres=# set enable_seqscan = off;
SET
postgres=# set enable_indexscan = off;
SET
postgres=# insert into t values (generate_series(1,1000));
INSERT 0 1000
postgres=# vacuum t;
VACUUM
postgres=# explain select count(*) from t;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=43.02..43.03 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=25.52..40.52 rows=1000 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..25.27 rows=1000 width=0)
(3 rows)

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                         QUERY PLAN
------------------------------------------------------------
 Aggregate  (cost=17.50..17.51 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..15.00 rows=1000 width=0)
         Disabled: true
(3 rows)

So, as you can see, Postgres prefers to use the disabled seqscan, but not IOS.
This is different from pg16, where disabling bitmap scan makes the optimizer choose an index-only scan:

postgres=# explain select count(*) from t;
                        QUERY PLAN
----------------------------------------------------------
 Aggregate  (cost=41.88..41.88 rows=1 width=8)
   ->  Seq Scan on t  (cost=0.00..35.50 rows=2550 width=0)
(2 rows)

postgres=# set enable_seqscan = off;
SET
postgres=# explain select count(*) from t;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=75.54..75.55 rows=1 width=8)
   ->  Bitmap Heap Scan on t  (cost=33.67..69.17 rows=2550 width=0)
         ->  Bitmap Index Scan on t_pkey  (cost=0.00..33.03 rows=2550 width=0)
(3 rows)

postgres=# set enable_bitmapscan = off;
SET
postgres=# explain select count(*) from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=45.77..45.78 rows=1 width=8)
   ->  Index Only Scan using t_pkey on t  (cost=0.28..43.27 rows=1000 width=0)
(2 rows)

This is strange behavior of pg17, which for some reason rejects IOS (though it is used if the number of records in the table is 100k or more).

But the main problem is that the chosen plan, Bitmap Heap Scan + Bitmap Index Scan, may return an incorrect result. Replacing `select count(*)` with `select count(pk)` eliminates the problem, as does disabling autovacuum. It seems clear that the problem is with the visibility map. We have the following code in heap bitmap scan:

	/*
	 * We can skip fetching the heap page if we don't need any fields from the
	 * heap, the bitmap entries don't need rechecking, and all tuples on the
	 * page are visible to our transaction.
	 */
	if (!(scan->rs_flags & SO_NEED_TUPLES) &&
		!tbmres->recheck &&
		VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer))
	{
		/* can't be lossy in the skip_fetch case */
		Assert(tbmres->ntuples >= 0);
		Assert(hscan->rs_empty_tuples_pending >= 0);

		hscan->rs_empty_tuples_pending += tbmres->ntuples;

		return true;
	}

So if we do not need tuples (the count(*) case) and the page is marked as all-visible in the VM, then we just count tbmres->ntuples elements without extra checks.
I am not so familiar with the internals of the executor, but it is not clear to me how we avoid a race condition between the VM update and the heap bitmap scan. Assume that the bitmap index scan marks all tids available in the index. Some elements in this bitmap can refer to old (invisible) versions. Then vacuum comes, removes the dead elements, and marks the page as all-visible. After that we start the heap bitmap scan, see that the page is all-visible, and count all marked elements on this page, including dead ones (which are not present in the page any more). Which lock or check should prevent such a scenario?

import random
import threading
import time
import psycopg2

def test_ios():
    con = psycopg2.connect(database="postgres")
    n_records = 1000
    n_oltp_writers = 10
    n_oltp_readers = 5
    n_olap_readers = 1
    con.autocommit = True
    cur = con.cursor()
    cur.execute("DROP table if exists t")
    cur.execute("CREATE TABLE t(pk bigint not null)")
    cur.execute(f"insert into t values (generate_series(1,{n_records}))")
    cur.execute("create index on t(pk)")
    cur.execute("vacuum t")
    running = True
    def oltp_writer():
        con = psycopg2.connect(database="postgres")
        con.autocommit = True
        cur = con.cursor()
        while running:
            pk = random.randrang
Re: Incorrect result of bitmap heap scan.
Andres Freund writes: > I think the problematic scenario involves tuples that *nobody* can see. During > the bitmap index scan we don't know that though. Thus the tid gets inserted > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes > the tuple from the indexes and then the heap and marks the page as > all-visible, as the deleted row version has been removed. Yup. I am saying that that qualifies as too-aggressive setting of the all-visible bit. I'm not sure what rule we should adopt instead of the current one, but I'd much rather slow down page freezing than institute new page locking rules. regards, tom lane
Re: Memory leak in WAL sender with pgoutput (v10~)
On 2024-Dec-02, Michael Paquier wrote:

> I am slightly concerned about the current design of GetPublication()
> in the long-term, TBH.  LoadPublications() has hidden the leak behind
> two layers of routines in the WAL sender, and that's easy to miss once
> you call anything that loads a Publication depending on how the caller
> caches its data.  So I would still choose for modifying the structure
> on HEAD removing the pstrdup() for the publication name.  Anyway, your
> suggestion is also OK by me on top of that, that's less conflicts in
> all the branches.

TBH I'm not sure that wastefully allocating NAMEDATALEN for each relation is so great. Our strategy for easing memory management is to use appropriately timed contexts.

I guess if you wanted to make a publication a single palloc block (so that it's easy to free) and not waste so much memory, you could stash the name string at the end of the struct. I think that'd be a change wholly contained in GetPublication.

-- 
Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/
Are you not unsure you want to delete Firefox?
       [Not unsure]  [Not not unsure]  [Cancel]
http://smylers.hates-software.com/2008/01/03/566e45b2.html
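The memory difference being described can be ballparked with a toy layout (the fixed-field set here is an invented subset for illustration, not the real Publication struct from pg_publication.h):

```python
# Compare a fixed NAMEDATALEN name buffer against stashing the
# NUL-terminated name at the end of a single allocation.  Field choices
# (oid + one flag) are hypothetical; only the sizing idea matters.
import struct

NAMEDATALEN = 64
FIXED_FIELDS = struct.Struct("<I?")  # 4-byte oid + 1-byte flag (invented)

def alloc_fixed(oid, alltables, name):
    # One block, but the name always occupies NAMEDATALEN bytes
    return FIXED_FIELDS.pack(oid, alltables) + name.encode().ljust(NAMEDATALEN, b"\0")

def alloc_trailing(oid, alltables, name):
    # Still one block (easy to free), name stored inline at the end
    return FIXED_FIELDS.pack(oid, alltables) + name.encode() + b"\0"

a = alloc_fixed(16384, True, "pub")
b = alloc_trailing(16384, True, "pub")
print(len(a), len(b))  # 69 9
```

Both layouts remain a single allocation that can be freed at once; the trailing-name variant just avoids the padding waste for typical short publication names.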
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 2:43 PM Andres Freund wrote: > Attached is an isolationtest that reliably shows wrong query results. Nice approach with the cursor! I took what you wrote, and repurposed it to prove my old theory about GiST index-only scans being broken due to the lack of an appropriate interlock against concurrent TID recycling. See the attached patch. -- Peter Geoghegan v1-0001-isolationtester-showing-broken-index-only-scans-w.patch Description: Binary data
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 3:56 PM Peter Geoghegan wrote: > I took what you wrote, and repurposed it to prove my old theory about > GiST index-only scans being broken due to the lack of an appropriate > interlock against concurrent TID recycling. See the attached patch. BTW, if you change the test case to use the default B-Tree index AM (by removing "USING GIST"), you'll see that VACUUM blocks on acquiring a cleanup lock (and so the test just times out). The problem is that GiST VACUUM just doesn't care about cleanup locks/TID recycling safety -- though clearly it should. -- Peter Geoghegan
Re: Vacuum statistics
Hi! Thank you for your review!

On 30.11.2024 07:48, Kirill Reshke wrote:

> Hello! After a brief glance, I think this patch set is good. But there
> isn't any more time in the current CF to commit this :(. So I moved it
> to the next CF.

I agree with you. Thank you!)

> I also like the 0001 commit message. This commit message is quite large
> and easy to understand. Actually, it might be too big. Perhaps rather
> than being in the commit message, the final paragraph (pages_frozen -
> number of pages that..) needs to be part of the documentation. Perhaps
> delete the explanation on pages_frozen that we have in 0004?

To be honest, I don't quite understand what you're suggesting. Are you suggesting moving the explanation about pages_frozen from the commit message to the documentation, or fixing something in the documentation about pages_frozen? Can you please explain?

-- 
Regards,
Alena Rybakina
Postgres Professional
Re: Vacuum statistics
On 02.12.2024 17:46, Ilia Evdokimov wrote: In my opinion, the patches are semantically correct. However, not all dead code has been removed - I'm referring to pgstat_update_snapshot(). Also, the tests need to be fixed. Thank you, I'll fix it -- Regards, Alena Rybakina Postgres Professional
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 11:32 AM Andres Freund wrote:
> On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote:
> > Concurrency timeline:
> >
> > Session 1. amgetbitmap() gets snapshot of index contents, containing
> > references to dead tuples on heap P1.
>
> IIUC, an unstated important aspect here is that these dead tuples are *not*
> visible to S1. Otherwise the VACUUM in the next step could not remove the
> dead tuples.

I would state the same thing slightly differently, FWIW: I would say that the assumption that has been violated is that a TID is a stable identifier for a given index tuple/logical row/row version (stable for the duration of the scan).

This emphasis/definition seems slightly more useful, because it makes it clear why this is hard to hit: you have to be fairly unlucky for a dead-to-everyone TID to be returned by your scan (no LP_DEAD bit can be set) and set in the scan's bitmap, only to later be concurrently set LP_UNUSED in the heap -- all without VACUUM randomly being prevented from setting the same page all-visible due to some unrelated not-all-visible heap item making it unsafe.

> Ugh :/
>
> Pretty depressing that we've only found this now, ~6 years after it's been
> released. We're lacking some tooling to find this stuff in a more automated
> fashion.

FWIW I have suspicions about similar problems with index-only scans that run in hot standby, and about all GiST index-only scans:

https://postgr.es/m/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5q...@mail.gmail.com

In general there seems to be a lack of rigor in this area. I'm hoping that Tomas Vondra's work can tighten things up here in passing.

> It seems pretty much impossible to fix that with the current interaction
> between nodeBitmap* and indexam. I wonder if we could, at least in some
> cases, and likely with some heuristics / limits, address this by performing
> some visibility checks during the bitmap build.
> I'm bringing it up here because I suspect that some of the necessary
> changes would overlap with what's needed to repair bitmap index-only
> scans.

This seems like it plays into some of the stuff I've discussed with Tomas, about caching visibility information in local state, as a means of avoiding holding onto pins in the index AM. For things like mark/restore support.

> > This is quite non-trivial, however, as we don't have much in place
> > regarding per-relation shared structures which we could put such a value
> > into, nor a good place to signal changes of the value to bitmap
> > heap-scanning backends.
>
> It doesn't have to be driven off table state, it could be state in
> indexes. Most (all?) already have a metapage.

FWIW GiST doesn't have a metapage (a historical oversight).

> Unfortunately I don't see a better path either.
>
> I think it'd be good if we added a test that shows the failure mechanism so
> that we don't re-introduce this in the future. Evidently this failure isn't
> immediately obvious...

Maybe a good use for injection points?

-- 
Peter Geoghegan
Re: Incorrect result of bitmap heap scan.
Hi,

On 2024-12-02 13:39:43 -0500, Peter Geoghegan wrote:
> I guess it's natural to suspect more recent work -- commit 7c70996e is
> about 6 years old. But the race condition that I suspect is at play
> here is very narrow.

FWIW, the test I just posted shows the issue down to 11 (although for 11 one has to remove the "(TRUNCATE false)"). 10 returns correct results.

I don't think the race is particularly narrow. Having a vacuum complete between the start of the bitmap indexscan and the end of the bitmap heapscan leaves a lot of time with an expensive query.

I suspect one contributor to this avoiding attention till now was that the optimization is fairly narrow:

	/*
	 * We can potentially skip fetching heap pages if we do not need
	 * any columns of the table, either for checking non-indexable
	 * quals or for returning data.  This test is a bit simplistic, as
	 * it checks the stronger condition that there's no qual or return
	 * tlist at all.  But in most cases it's probably not worth working
	 * harder than that.
	 */
	need_tuples = (node->ss.ps.plan->qual != NIL ||
				   node->ss.ps.plan->targetlist != NIL);

Even an entry in the targetlist that does not depend on the current row disables the optimization. Due to not being able to return any content for those rows, it's also somewhat hard to actually notice that the results are wrong...

> It's pretty unlikely that there'll be a dead-to-all TID returned to a
> scan (not just dead to our MVCC snapshot, dead to everybody's) that is
> subsequently concurrently removed from the index, and then set
> LP_UNUSED in the heap.
> It's probably impossible if you don't have a small table -- VACUUM just
> isn't going to be fast enough to get to the leaf page after the bitmap
> index scan, but still be able to get to the heap before its corresponding
> bitmap heap scan (that uses the VM as an optimization) can do the relevant
> visibility checks (while it could happen with a large table and a slow
> bitmap scan, the chances of the VACUUM being precisely aligned with the
> bitmap scan, in just the wrong way, seem remote in the extreme).

A cursor, waiting for IO, waiting for other parts of the query, ... can all make this window almost arbitrarily large.

Greetings,

Andres Freund
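The need_tuples test Andres quotes can be restated as a one-line predicate (a sketch mirroring the quoted C condition, not executor code; the plan shapes are just illustrative lists of expression strings):

```python
# The skip-fetch optimization is only considered when the plan has no qual
# and no targetlist at all -- the "stronger condition" from the comment.

def skip_fetch_possible(qual, targetlist):
    need_tuples = bool(qual) or bool(targetlist)
    return not need_tuples

assert skip_fetch_possible([], [])              # SELECT count(*) FROM t
assert not skip_fetch_possible([], ["1 + 1"])   # any tlist entry disables it
assert not skip_fetch_possible(["pk > 0"], [])  # so does any non-indexable qual
```

This is why the bug only surfaces for bare count(*)-style queries, which also return no per-row data that could reveal the wrong answer.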
Re: Memory leak in WAL sender with pgoutput (v10~)
On 2024-Dec-02, Amit Kapila wrote: > We already have PGOutputData->cachectx which could be used for it. > I think we should be able to reset such a context when we are > revalidating the publications. I don't see that context being reset anywhere, so I have a hard time imagining that this would work without subtle risk of breakage elsewhere. > Even, if we want a new context for some localized handling, we should > add that in PGOutputData rather than a local context as the proposed > patch is doing at the very least for HEAD. I don't necessarily agree, given that this context is not needed anywhere else. > Can't we consider freeing the publication names individually that can > be backpatchable and have no or minimal risk of breaking anything? That was my first thought, but then it occurred to me that such a thing would be totally pointless. > > you call anything that loads a Publication depending on how the caller > > caches its data. So I would still choose for modifying the structure > > on HEAD removing the pstrdup() for the publication name. > > BTW, the subscription structure also used the name in a similar way. > This will make the publication/subscription names handled differently. True (with conninfo, slotname, synccommit, and origin). FWIW it seems FreeSubscription is incomplete, and not only because it fails to free the publication names ... (Why are we storing a string in Subscription->synccommit?) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Los trabajadores menos efectivos son sistematicamente llevados al lugar donde pueden hacer el menor daño posible: gerencia." (El principio Dilbert)
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 12:15 PM Peter Geoghegan wrote: > The underlying reason why nbtree can discriminate like this is that it > "knows" that plain index scans will always visit the heap proper. If a > TID points to an LP_UNUSED item, then it is considered dead to the > scan (even though in general the heap page itself might be marked > all-visible). If some completely unrelated, newly inserted heap tuple > is found instead, then it cannot be visible to the plain index scan's > MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get > dropped like this). If I add "SET enable_indexonlyscan = false" to the "setup" section of the v1-0001-isolationtester-showing-broken-index-only-scans-w.patch isolationtester test case I posted earlier today, the test will pass. There is no reason to think that plain GiST index scans are broken. The fact that GiST VACUUM won't acquire cleanup locks is a problem *only* because GiST supports index-only scans (AFAIK no other thing within GiST is broken). My point is that index-only scans are generally distinct from plain index scans from an interlocking point of view -- they're not equivalent (not quite). And yet the "62.4. Index Locking Considerations" docs nevertheless say nothing about index-only scans in particular (only about bitmap scans). The docs do say that an MVCC snapshot is protective, though -- which makes it sound like GiST can't be doing anything wrong here (GiST only uses MVCC snapshots). Actually, I don't think that GiST would be broken at all if we'd simply never added support for index-only scans to GiST. I'm fairly sure that index-only scans didn't exist when the bulk of this part of the docs was originally written. ISTM that we ought to do something about these obsolete docs, after we've fixed the bugs themselves. -- Peter Geoghegan
Re: Remove useless casts to (void *)
Thomas Munro writes: > I wouldn't change that. illumos is selecting the old pre-standard > declaration here, but it knows the standard one: > https://github.com/illumos/illumos-gate/blob/27ecbff00d8c86a2647d6fe325cacb220d712115/usr/src/uts/common/sys/shm.h#L129 > https://illumos.org/man/2/shmdt Oh! Kind of looks like we should be defining _POSIX_C_SOURCE=200112L, at least on that platform. > I don't know why we have only one tiny issue if the system headers > think we want pre-POSIX stuff. Agreed, I'd have expected more trouble than this. But persuading the system headers that we want a POSIX version from this century seems like it might be a good idea to forestall future issues. I'm inclined to propose adding something like CPPFLAGS="$CPPFLAGS -D_POSIX_C_SOURCE=200112L" to src/template/solaris. Not sure if we have a corresponding mechanism for meson, though. regards, tom lane
Re: Remove useless casts to (void *)
Hi,

On 2024-12-02 17:11:30 -0500, Tom Lane wrote:
> I'm inclined to propose adding something like
>
> CPPFLAGS="$CPPFLAGS -D_POSIX_C_SOURCE=200112L"
>
> to src/template/solaris.  Not sure if we have a corresponding
> mechanism for meson, though.

elif host_system == 'sunos'
  portname = 'solaris'
  export_fmt = '-Wl,-M@0@'
  cppflags += '-D_POSIX_PTHREAD_SEMANTICS'

Should be trivial to add there.

Greetings,

Andres Freund
Re: optimize file transfer in pg_upgrade
Here is a rebased patch set for cfbot. I'm planning to spend some time getting these patches into a more reviewable state in the near future.

-- 
nathan

>From 81fe66e0f0aa4f958a8707df669f60756c89bb85 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Tue, 5 Nov 2024 15:59:51 -0600
Subject: [PATCH v2 1/8] Export walkdir().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to swap catalog files between
database directories during pg_upgrade.
---
 src/common/file_utils.c         | 5 +
 src/include/common/file_utils.h | 3 +++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334..3f488bf5ec 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -48,9 +48,6 @@
 #ifdef PG_FLUSH_DATA_WORKS
 static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
-static void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
 
 #ifdef HAVE_SYNCFS
@@ -268,7 +265,7 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
  */
-static void
+void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
 		bool process_symlinks)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index e4339fb7b6..5a9519acfe 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -39,6 +39,9 @@ extern void sync_pgdata(const char *pg_data, int serverVersion,
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
+extern void walkdir(const char *path,
+					int (*action) (const char *fname, bool isdir),
+					bool process_symlinks);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

>From 36d6a1aad5cbfeb05954886bb336cfa9ec01c5c3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Wed, 6 Nov 2024 13:59:39 -0600
Subject: [PATCH v2 2/8] Add "void *arg" parameter to walkdir() that is passed
 to function.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This will be used in follow up commits to pass private state to the
functions called by walkdir().
---
 src/bin/pg_basebackup/walmethods.c | 8 +++
 src/bin/pg_dump/pg_backup_custom.c | 2 +-
 src/bin/pg_dump/pg_backup_tar.c    | 2 +-
 src/bin/pg_dump/pg_dumpall.c       | 2 +-
 src/common/file_utils.c            | 38 +++---
 src/include/common/file_utils.h    | 6 ++---
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 215b24597f..51640cb493 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -251,7 +251,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	 */
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tmppath, false) != 0 ||
+		if (fsync_fname(tmppath, false, NULL) != 0 ||
 			fsync_parent_path(tmppath) != 0)
 		{
 			wwmethod->lasterrno = errno;
@@ -486,7 +486,7 @@ dir_close(Walfile *f, WalCloseMethod method)
 	 */
 	if (f->wwmethod->sync)
 	{
-		r = fsync_fname(df->fullpath, false);
+		r = fsync_fname(df->fullpath, false, NULL);
 		if (r == 0)
 			r = fsync_parent_path(df->fullpath);
 	}
@@ -617,7 +617,7 @@ dir_finish(WalWriteMethod *wwmethod)
 	 * Files are fsynced when they are closed, but we need to fsync the
 	 * directory entry here as well.
	 */
-	if (fsync_fname(dir_data->basedir, true) != 0)
+	if (fsync_fname(dir_data->basedir, true, NULL) != 0)
 	{
 		wwmethod->lasterrno = errno;
 		return false;
@@ -1321,7 +1321,7 @@ tar_finish(WalWriteMethod *wwmethod)
 
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tar_data->tarfilename, false) != 0 ||
+		if (fsync_fname(tar_data->tarfilename, false, NULL) != 0 ||
 			fsync_parent_path(tar_data->tarfilename) != 0)
 		{
 			wwmethod->lasterrno = errno;
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb2..51edf147d6 100644
--- a/src/bin/pg_dump/pg_backup_
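The point of the second patch -- threading caller state through the directory walk instead of relying on globals -- looks like this in a Python analog (walkdir() here is a hypothetical stand-in built on os.walk, not the patched C function):

```python
# Python analog of walkdir()'s new "void *arg" parameter: the callback
# receives an opaque state object supplied by the caller, so no globals
# are needed.  All names here are invented for illustration.
import os
import tempfile

def walkdir(path, action, arg):
    # Visit every file and directory under path, passing arg through
    for root, dirs, files in os.walk(path):
        for f in files:
            action(os.path.join(root, f), False, arg)
        for d in dirs:
            action(os.path.join(root, d), True, arg)

def count_entries(fname, isdir, arg):
    # Callback: accumulate counts into the caller-provided state
    arg["dirs" if isdir else "files"] += 1

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    open(os.path.join(tmp, "f1"), "w").close()
    counts = {"files": 0, "dirs": 0}
    walkdir(tmp, count_entries, counts)
    print(counts)  # {'files': 1, 'dirs': 1}
```

In C the same effect is achieved by the extra `void *arg` pointer passed to each action callback, as the patch does for fsync_fname() and friends.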
Re: CREATE SCHEMA ... CREATE DOMAIN support
Vik Fearing writes:
> On 02/12/2024 03:15, Tom Lane wrote:
>> Also, if SQL intended to constrain the search path for unqualified
>> identifiers to be only the new schema, they'd hardly need a concept
>> of <path specification> at all.

> I looked up the original paper (MUN-051) that introduced the <path
> specification> and it says, "The paper is proposing the use of
> paths only to resolve unqualified routine invocations."

Interesting. But still, the spec allows <routine invocation> within <schema definition>, so even that narrow interpretation opens them to the is-this-an-external-reference-or-a-forward-reference problem.

For us, that's clouded further for functions by our overloading rules. If foo(bigint) exists in the search path, and we have a view or whatever that references foo() with an int argument, and there is a CREATE FUNCTION for foo(float8) later in the <schema definition>, what are we supposed to think is the user's intent?

(Just to save people doing the experiment: we'd prefer foo(float8) if both are visible, but foo(bigint) would be perfectly acceptable if not. Other choices of the argument types would yield different results, and none of them seem especially open-and-shut to me.)

I don't know offhand if the spec allows function overloading in the same way.

			regards, tom lane
Re: Proposal: Role Sandboxing for Secure Impersonation
> On 2 Dec 2024, at 18:39, Joe Conway wrote: > > I am very much in favor of functionality of this sort being built in to the > core database. Very similar functionality is available in an extension I > wrote years ago (without the SQL grammar support) -- see > https://github.com/pgaudit/set_user Coincidentally, I’ve created https://github.com/pgaudit/set_user/issues/87 BTW thanks for set_user - really needed and appreciated. — Michal
Re: Incorrect result of bitmap heap scan.
Hi, On 2024-12-02 16:08:02 +0100, Matthias van de Meent wrote: > Concurrency timeline: > > Session 1. amgetbitmap() gets snapshot of index contents, containing > references to dead tuples on heap P1. IIUC, an unstated important aspect here is that these dead tuples are *not* visible to S1. Otherwise the VACUUM in the next step could not remove the dead tuples. > Session 2. VACUUM runs on the heap, removes TIDs for P1 from the > index, deletes those TIDs from the heap pages, and finally sets heap > pages' VM bits to ALL_VISIBLE, including the now cleaned page P1 > Session 1. Executes the bitmap heap scan that uses the bitmap without > checking tuples on P1 due to ALL_VISIBLE and a lack of output columns. Ugh :/ Pretty depressing that we've only found this now, ~6 years after it's been released. We're lacking some tooling to find this stuff in a more automated fashion. > PS. > I don't think the optimization itself is completely impossible, and we > can probably re-enable an optimization like that if (or when) we find > a way to reliably keep track of when to stop using the optimization. I > don't think that's an easy fix, though, and definitely not for > backbranches. One way to make the optimization safe could be to move it into the indexam. If an indexam would check the VM bit while blocking removal of the index entries, it should make it safe. Of course that has the disadvantage of needing more VM lookups than before, because it would not yet have been deduplicated... Another issue with bitmap index scans is that they currently can't use killtuples. I've seen several production outages where performance would degrade horribly over time whenever estimates led important queries to switch to bitmap scans, because lots of dead tuples would get accessed over and over. It seems pretty much impossible to fix that with the current interaction between nodeBitmap* and indexam.
I wonder if we could, at least in some cases, and likely with some heuristics / limits, address this by performing some visibility checks during the bitmap build. I'm bringing it up here because I suspect that some of the necessary changes would overlap with what's needed to repair bitmap index-only scans. > The solution I could think to keep most of this optimization requires > the heap bitmap scan to notice that a concurrent process started > removing TIDs from the heap after amgetbitmap was called; i.e., a > "vacuum generation counter" incremented every time heap starts the > cleanup run. I'm not sure this is a good path. We can already clean up pages during index accesses and we are working on being able to set all-visible during "hot pruning"/page-level-vacuuming. That'd probably actually be safe (because we couldn't remove dead items without a real vacuum), but it's starting to get pretty complicated... > This is quite non-trivial, however, as we don't have much in place regarding > per-relation shared structures which we could put such a value into, nor a > good place to signal changes of the value to bitmap heap-scanning backends. It doesn't have to be driven off table state, it could be state in indexes. Most (all?) already have a metapage. We could also add that state to pg_class? We already update pg_class after most vacuums, so I don't think that'd be an issue. Thomas had a WIP patch to add a shared-memory table of all open relations. Which we need for quite a few features by now (e.g. more efficient buffer mapping table, avoiding the relation size lookup during planning / execution, more efficient truncation of relations, ...).
> From 07739e5af83664b67ea02d3db7a461a106d74040 Mon Sep 17 00:00:00 2001 > From: Matthias van de Meent > Date: Mon, 2 Dec 2024 15:59:35 +0100 > Subject: [PATCH v1] Disable BitmapScan's skip_fetch optimization > > The optimization doesn't take concurrent vacuum's removal of TIDs into > account, which can remove dead TIDs and make ALL_VISIBLE pages for which > we have pointers to those dead TIDs in the bitmap. > > The optimization itself isn't impossible, but would require more work > figuring out that vacuum started removing TIDs and then stop using the > optimization. However, we currently don't have such a system in place, > making the optimization unsound to keep around. Unfortunately I don't see a better path either. I think it'd be good if we added a test that shows the failure mechanism so that we don't re-introduce this in the future. Evidently this failure isn't immediately obvious... Greetings, Andres Freund
Re: Proposal: Role Sandboxing for Secure Impersonation
Eric Hanson:

a) Transaction ("local") sandbox:
   - SET LOCAL ROLE alice NO RESET;
   - SET LOCAL ROLE alice WITHOUT RESET;
   - BEGIN AS ROLE alice;

   Transaction-level sandboxes have the benefit that a pooler can simply start a new sandboxed transaction for each request and never have to worry about resetting or reusing them.

b) Session sandbox:
   - SET ROLE alice NO RESET;
   - SET ROLE alice WITHOUT RESET;
   - SET UNRESETTABLE ROLE alice; --veto

   Session-level sandboxes have the benefit that they can do things that can't be done inside a transaction (e.g. create extensions, vacuum, analyze, etc.). It's a fully functional session. However, if RESET ROLE is prohibited for the rest of the session, a connection pooler couldn't reuse it.

c) "Guarded" transaction/session:
   - SET [LOCAL] ROLE alice GUARDED BY reset_token;
   - RESET ROLE WITH TOKEN reset_token;

   Guarded sandboxes are nice because the session can also exit the sandbox if it has the token.

d) SET [LOCAL] ROLE alice WITH <password>;

   This would allow you to change to a role for which you don't have any grants yet. Then, the "authenticator pattern" could be implemented by having an authenticator role without any privileges to start with. The client would provide the password to elevate their privileges. RESET ROLE would not be problematic anymore. This would be much cheaper than having those roles do full logins on a new connection and could be used with connection poolers nicely.

   Possibly, this could also be extended by supporting alternatives to just a password, for example JSON Web Tokens. Maybe building on top of the ongoing OAuth effort? Those tokens would then contain a claim for the role they are allowed to set.

Best,

Wolfgang
Re: Changing shared_buffers without restart
> On Fri, Nov 29, 2024 at 05:47:27PM GMT, Dmitry Dolgov wrote: > > On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote: > > > > I mean, we can do the following to get a nice contiguous empty address > > space no other mmap(NULL)s will get put into: > > > > /* reserve size bytes of memory */ > > base = mmap(NULL, size, PROT_NONE, ...flags, ...); > > /* use the first small_size bytes of that reservation */ > > allocated_in_reserved = mmap(base, small_size, PROT_READ | > > PROT_WRITE, MAP_FIXED, ...); > > > > With the PROT_NONE protection option the OS doesn't actually allocate > > any backing memory, but guarantees no other mmap(NULL, ...) will get > > placed in that area such that it overlaps with that allocation until > > the area is munmap-ed, thus allowing us to reserve a chunk of address > > space without actually using (much) memory. > > From what I understand it's not much different from the scenario when we > just map as much as we want in advance. The actual memory will not be > allocated in both cases due to CoW, oom_score seems to be the same. I > agree it sounds attractive, but after some experimenting it looks like > it won't work with huge pages inside a cgroup v2 (=container). > > The reason is Linux has recently learned to apply memory reservation > limits on hugetlb inside a cgroup, which are applied to mmap. Nowadays > this feature is often configured out of the box in various container > orchestrators, meaning that a scenario "set hugetlb=1GB on a container, > reserve 32GB with PROT_NONE" will fail. I've also tried to mix and > match, reserve some address space via non-hugetlb mapping, and allocate > a hugetlb out of it, but it doesn't work either (the smaller mmap > complains about MAP_HUGETLB with EINVAL). I've asked about that in linux-mm [1]. To my surprise, the recommendations were to stick to creating a large mapping in advance, and slice smaller mappings out of that, which could be resized later.
The OOM score should not be affected, and the hugetlb reservation limit could be avoided by using the MAP_NORESERVE flag for the initial mapping (I've experimented with that; it seems to work just fine, even if the slices are not using MAP_NORESERVE). I guess that means I'll try to experiment with this approach as well. But what do others think? How much research do we need to do to gain some confidence about large shared mappings and make them realistically acceptable? [1]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/t/
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
On Mon, Nov 25, 2024 at 08:54:48PM +, Devulapalli, Raghuveer wrote: > As Nathan suggested, we moved this to a separate thread. The latest set > of patches here need to be applied on top of patches in that thread. Raghuveer, would you mind rebasing this patch set now that the SSE4.2 patch is committed? -- nathan
Re: [18] Unintentional behavior change in commit e9931bfb75
Jeff Davis writes: > The behavior is commented (commit 176d5bae1d) in formatting.c: >* ... When using the default >* collation, we apply the traditional Postgres behavior that >* forces ASCII-style treatment of I/i, but in non-default >* collations you get exactly what the collation says. > That's a somewhat strange special case (along with similar ones for > INITCAP() and UPPER()) that applies to single-byte encodings with the > libc provider and the database default collation only. I assume it was > done for backwards compatibility? Well, also for compatibility with our SQL parser's understanding of identifier lowercasing. > Should I put the special case back? I think so. It's stood for a lot of years now without field complaints, and I'm fairly sure there *were* field complaints before. (I think that behavior long predates 176d5bae1d, which was just restoring the status quo ante after somebody else's overenthusiastic application of system locale infrastructure.) regards, tom lane
Re: Remove useless casts to (void *)
Andres Freund writes: > On 2024-12-02 17:11:30 -0500, Tom Lane wrote: >> I'm inclined to propose adding something like >> CPPFLAGS="$CPPFLAGS -D_POSIX_C_SOURCE=200112L" >> to src/template/solaris. Not sure if we have a corresponding >> mechanism for meson, though. > elif host_system == 'sunos' > portname = 'solaris' > export_fmt = '-Wl,-M@0@' > cppflags += '-D_POSIX_PTHREAD_SEMANTICS' > Should be trivial to add there. Oh! The corresponding bit in configure.ac is # On Solaris, we need this #define to get POSIX-conforming versions # of many interfaces (sigwait, getpwuid_r, ...). if test "$PORTNAME" = "solaris"; then CPPFLAGS="$CPPFLAGS -D_POSIX_PTHREAD_SEMANTICS" fi Barely even need to adjust the comment ;-). I'll proceed with improving that (in master only, don't think we need it in back branches, at least not today) unless somebody pushes back soon. regards, tom lane
Re: Remove useless casts to (void *)
On 2024-12-02 17:42:33 -0500, Tom Lane wrote: > I'll proceed with improving that (in master only, don't think we need it in > back branches, at least not today) unless somebody pushes back soon. +1
Re: Remove useless casts to (void *)
On Tue, Oct 29, 2024 at 9:06 PM Peter Eisentraut wrote: > There are a bunch of (void *) casts in the code that don't make sense to > me. I think some of these were once necessary because char * was used > in place of void * for some function arguments. Some other places that sprang to my eye recently that reminded me of K&R C and the ongoing transition to the standard (/me ducks): Why do we bother with a "Pointer" type? The idiomatic way to do byte-oriented pointer arithmetic is to cast to char *, or uint8_t * etc, which doesn't require the reader to go and figure out what stride that non-standard type name implies. Clearly it has a history linked to the introduction of void *, and it's described as 'holding the address of any memory resident type' and is filed under the 'standard system types' section, but C89 has void * for objects of type unknown to the compiler, and if you want to do arithmetic over those objects you have to be more explicit about their size, which requires writing a suitably-alarming-to-the-reader cast to a pointer to a real type. Also, the explanation about "true ANSI compilers" means nothing to the modern reader who wasn't hacking C 30+ years ago. Maybe it has to do with matching up to DatumGetPointer(). DatumGetPointer() isn't defined to return it, though; it returns char *, callers usually/always cast to a real type, and I don't see Pointer being used near it. DatumGetPointer() should arguably return void * too, to force users to provide a type if they're going to dereference or perform arithmetic. What problem does PointerIsValid(x) solve, when you could literally just write x or if you insist x != NULL in most contexts and it would be 100% idiomatic C, and shorter? Why introduce extra terminological load with "validity"? C programmers had better know all about NULL. Why do all the XLogRegister*() calls have to cast their argument to char *? Seems like a textbook use for void * argument, not requiring a cast.
If the XLog* interfaces want to do byte-oriented arithmetic internally, they should do the cast internally, instead of having an argument that requires all callers to cast their arguments. (While grepping for casts to char *, I had a mistake in my regex and finished up seeing how many places in our code check sizeof(char), which is funny because sizeof is defined as telling you how many chars it takes to hold a type/value; perhaps it has documentation value :-))
Re: Showing applied extended statistics in explain Part 2
Ilia Evdokimov writes: > On 19.11.2024 00:38, Tomas Vondra wrote: >> On 11/18/24 22:15, Tomas Vondra wrote: >>> So I think the correct solution is to not pass any expressions with >>> RestrictInfo to deparse_expression(). Either by stripping the nodes, or >>> by not adding them at all. >>> >>> The patch tries to do the stripping by maybe_extract_actual_clauses(), >>> but that only looks at the top node, and that is not sufficient here. Pardon me for being late to the party, but I don't understand why this is difficult. There should never be more than one layer of RestrictInfos, at the top level of an implicitly-ANDed list of quals. The only exception to this is that RestrictInfos representing OR clauses have an additional field "orclause" that attaches RestrictInfos to the elements of the OR list --- but the main "clause" field doesn't look like that, and you can just ignore "orclause" if it doesn't suit you. So ISTM this doesn't need to be any harder than what extract_actual_clauses() does (and has done for decades). regards, tom lane
Re: UUID v7
On Sat, Nov 30, 2024 at 7:01 AM Daniel Verite wrote: > > Andrey M. Borodin wrote: > > > > I'm sending amendments addressing your review as a separate step in patch > > set. Step 1 of this patch set is identical to v39. > > Some comments about the implementation of monotonicity: > > +/* > + * Get the current timestamp with nanosecond precision for UUID generation. > + * The returned timestamp is ensured to be at least SUBMS_MINIMAL_STEP > greater > + * than the previous returned timestamp (on this backend). > + */ > +static inline int64 > +get_real_time_ns_ascending() > +{ > + static int64 previous_ns = 0; > > [...] > > + /* Guarantee the minimal step advancement of the timestamp */ > + if (previous_ns + SUBMS_MINIMAL_STEP_NS >= ns) > + ns = previous_ns + SUBMS_MINIMAL_STEP_NS; > + previous_ns = ns; > > In the case of parallel execution (uuidv7() being parallel-safe), if > there have been previous calls to uuidv7() in that backend, > previous_ns will be set in the backend process, > but zero in a newly spawned worker process. > If (previous_ns + SUBMS_MINIMAL_STEP_NS >= ns) ever happens > to be true in the main process, it will start at false in the workers, > leading to non-monotonic results within the same query. The monotonicity of generated UUIDv7 is guaranteed only within a single backend. I think your point is that UUIDs in parallel query results might not be ordered. But is it guaranteed that without ORDER BY clause, the results returned by parallel queries are in the same order as the results from non-parallel queries in the first place? > > Also in the case of a backward clock change, we can end up with some > backends sticking to the "old time" plus increment per invocation > until they die, while some other backends spawned after the clock > change are on the "new time". These backends may produce series of > UUIDv7 that would be completely out of sync with each others. 
> A backward clock change is an abnormality, but if it occurs, what's > the best choice? Take the bullet and switch to the new time , or > stick to a time that is permanently decorrelated from the OS > clock? I would think that the latter is worse. IIUC after generating a UUIDv7 with the correct time, even if the system time goes back, the time in the next UUIDv7 will be SUBMS_MINIMAL_STEP_NS nanoseconds ahead of the last correct time. Also, in case where the backend generates its first UUIDv7 with an incorrect (e.g. an old) time, it generates UUIDv7 based on the incorrect timestamp. However, it starts generating UUIDv7 with the correct timestamp as soon as the system time goes back to the correct time. So I think that it doesn't happen that one backend is sticking to an old time while another backend is using the correct timestamp to generate UUIDv7. Note that we use (the previous timestamp + SUBMS_MINIMAL_STEP_NS) only if the system clock didn't move forward by SUBMS_MINIMAL_STEP_NS. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Use streaming read API in pgstattuple.
Hi, On Fri, 29 Nov 2024 at 20:05, Matheus Alcantara wrote: > > Hi, > > > On 29/11/24 04:28, Nazir Bilal Yavuz wrote: > > -for (; blkno < nblocks; blkno++) > > +p.last_exclusive = nblocks; > > + > > +while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL))) > > { > > CHECK_FOR_INTERRUPTS(); > > > > -pagefn(&stat, rel, blkno, bstrategy); > > +pagefn(&stat, rel, buf); > > } > > + > > +Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); > > > > With this change we assume that if stream returns an invalid buffer > > that means stream must be finished. We don't check if the stream > > didn't finish but somehow read an invalid buffer. In the upstream > > version, if we read an invalid buffer then postgres server will fail. > > But in the patched version, the server will continue to run because it > > will think that stream has reached to end. This could be happening for > > other streaming read users; for example at least the > > read_stream_next_buffer() calls in the collect_corrupt_items() > > function face the same issue. > > Just for reference; On pg_prewarm() for example this loop is > implemented as: > > for (block = first_block; block <= last_block; ++block) > { > Buffer buf; > ... > buf = read_stream_next_buffer(stream, NULL); > ... > } > Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); > > > Would this approach make sense on these cases? (this patch and on > collect_corrupt_items) Yes, that should work and I guess it would be easy to apply this approach to this patch but it would be hard to apply this to collect_corrupt_items(). In cases like pg_prewarm or the current scenario, we know the exact number of blocks to read, which allows us to determine when the stream should finish (e.g., after the loop runs blockno times) and return an invalid buffer. But in collect_corrupt_items(), the number of loop iterations required is not directly tied to blockno since some blocks may be skipped. 
This makes it difficult to know when the stream should terminate and return an invalid buffer. -- Regards, Nazir Bilal Yavuz Microsoft
Re: More CppAsString2() in psql's describe.c
On 28.11.24 07:34, Michael Paquier wrote: - "WHERE p.prokind = 'a'\n", + "WHERE p.prokind = " CppAsString2(PROKIND_AGGREGATE) "\n", I noticed something here. If the symbol that is the argument of CppAsString2() is not defined, maybe because there is a typo or because the required header file is not included, then there is no compiler diagnostic. Instead, the symbol is just used as is, which might then result in an SQL error or go unnoticed if there is no test coverage. Now, the same is technically true for the previous code with the hardcoded character values, but I think at least something like "WHERE p.prokind = 'a'\n", is easier to eyeball for correctness, whereas, you know, CppAsString2(PROPARALLEL_RESTRICTED), is quite something. I think this would be more robust if it were written something like "WHERE p.prokind = '%c'\n", PROKIND_AGGREGATE (Moreover, the current structure assumes that the C character literal syntax used by the PROKIND_* and other symbols happens to be the same as the SQL string literal syntax required in those queries, which is just an accident.)
Drop back the redundant "Lock" suffix from LWLock wait event names
Hi hackers,

da952b415f unintentionally added back the "Lock" suffix into the LWLock wait event names:
- "added back" because the "Lock" suffix was removed in 14a9101091
- "unintentionally" because there is nothing in the thread [2] that explicitly mentions that the idea was also to revert 14a9101091

Please find attached a patch to remove it again, so that the pg_stat_activity view reports the LWLock wait events without the "Lock" suffix (as pg_wait_events and the related documentation do). It has been reported in bug #18728 (see [1]).

[1]: https://www.postgresql.org/message-id/flat/18728-450924477056a339%40postgresql.org
[2]: https://www.postgresql.org/message-id/flat/202401231025.gbv4nnte5fmm%40alvherre.pgsql

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 86c30d2daf939a83b578de33fe96cf1790dbd427 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Mon, 2 Dec 2024 07:43:35 +
Subject: [PATCH v1] Drop back the redundant "Lock" suffix from LWLock wait
 event names

da952b415f unintentionally added back the "Lock" suffix into the LWLock
wait event names.

Per bug #18728 from Christophe Courtois.

---
 src/backend/storage/lmgr/generate-lwlocknames.pl | 1 +
 src/backend/storage/lmgr/lwlock.c                | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
 100.0% src/backend/storage/lmgr/

diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index eaddd9d3b9..c4267328d3 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -107,6 +107,7 @@ while (<$lwlocklist>)
 	$lastlockidx = $lockidx;
 	$continue = ",\n";
 
+	# Add the Lock suffix only here as the C code depends on it
 	print $h "#define ${lockname}Lock (&MainLWLockArray[$lockidx].lock)\n";
 }
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab..87cf295262 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -124,7 +124,7 @@ StaticAssertDecl(LW_VAL_EXCLUSIVE > (uint32) MAX_BACKENDS,
  * ... and do not forget to update the documentation's list of wait events.
  */
 static const char *const BuiltinTrancheNames[] = {
-#define PG_LWLOCK(id, lockname) [id] = CppAsString(lockname) "Lock",
+#define PG_LWLOCK(id, lockname) [id] = CppAsString(lockname),
 #include "storage/lwlocklist.h"
 #undef PG_LWLOCK
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
-- 
2.34.1
Re: Vacuum statistics
Hi, Alena! On Wed, Nov 13, 2024 at 6:21 PM Alena Rybakina wrote: > Updated 0001-v13 attached, as well as the diff between v12 and v13. > > Thank you) > > And I agree with your changes. And included them in patches. Thank you for the updated patchset. Some points from me. * I've read the previous discussion on how important it is to keep all these fields regarding vacuum statistics, including points by Andrei and Jim. It still worries me that the statistics volume is going to grow by about 3x, but I don't have a particular proposal on how to make the approach more granular. I wonder if you could propose something. * Previously PGSTAT_FILE_FORMAT_ID got increased by 1. Your 0001 patch increases it by 2. It's a minor note, but I'd like to keep the tradition. * Commit message for 0001 looks nice, but commit messages of 0002, 0003, and 0004 look messy. Could you please rearrange them? * The distinction between 0001 and 0002 is not clear. The first line of 0001 is "Machinery for grabbing an extended vacuum statistics on heap relations", the first line of 0002 is "Machinery for grabbing an extended vacuum statistics on heap and index relations." I guess 0001 should be about heap relations while 0002 should be about just index relations. Is this correct? * I guess these statistics should work for any table AM, based on what has been done in the relation_vacuum() interface method. If that's correct, we need to get rid of "heap" terminology and use "table" instead. * 0004 should be a pure documentation patch, but it seems to contain changes to isolation tests. Please move them into a more appropriate place. -- Regards, Alexander Korotkov Supabase
Re: RISC-V animals sporadically produce weird memory-related failures
02.12.2024 18:25, Tom Turelinckx wrote: These crashes are hardly related to code changes, so maybe there are platform-specific issues still... I naively assumed that because llvm and clang are available in Trixie on riscv64 that I could simply install them and enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to revert the --with-llvm again. Sorry about not first testing without submitting results. Thank you for the clarification! I hadn't noticed the "--with-llvm" option added in the configuration... Now I've re-run `make check` for an llvm-enabled build (made with clang 19.1.4) locally and got the same: 2024-12-02 16:49:47.620 UTC postmaster[21895] LOG: server process (PID 21933) was terminated by signal 11: Segmentation fault 2024-12-02 16:49:47.620 UTC postmaster[21895] DETAIL: Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, BOOLTBL2.* FROM BOOLTBL1, BOOLTBL2 WHERE BOOLTBL2.f1 <> BOOLTBL1.f1; A build made with clang-19 without llvm passed `make check` successfully. I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default /etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host, but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in a core file being written to the current directory, like it was before the upgrade. Thank you for fixing this! Best regards, Alexander
Re: Using read stream in autoprewarm
On Fri, 29 Nov 2024 at 16:19, Nazir Bilal Yavuz wrote: > v4 is attached. Hi! I feel like we are ready to mark this as RFC, WDYT? -- Best regards, Kirill Reshke
Re: Doc: typo in config.sgml
On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote: > Bruce Momjian writes: > > Now that we have a warning about non-emittable characters in the PDF > > build, do you want me to put back the Latin1 characters in the SGML > > files or leave them as HTML entities? > > I think going forward we're going to be putting in people's names > in UTF8 --- I was certainly planning to start doing that. It doesn't Yes, I expected that, and added an item to my release checklist to make a PDF file and check for the warning. I don't normally do that. > matter that much what we do with existing cases, though. Okay, I think Peter had an opinion but I wasn't sure what it was. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
RE: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY
On Friday, November 29, 2024 9:08 PM Shlok Kyal wrote:
>
> Thanks for reviewing the patch. I have fixed the issue and updated the patch.

Thanks for updating the patch. I have reviewed it and have a few suggestions. Please check the attached diff, which includes:

1) Comments in CheckCmdReplicaIdentity()
 * Removed some duplicate descriptions.
 * Fixed the comments for the column list, which I think were not correct.
 * Mentioned the column list and generated columns in the XXX part as well.

2) Doc
 * Since we mentioned the restriction for UPDATEs and DELETEs when a row filter or column list is defined in create_publication.sgml, I feel we may need to mention the generated-column part as well, so I added it in the diff.

3) pub_contains_invalid_column
 * Simplified one condition a bit.

Please check and merge if it looks good to you.

Best Regards,
Hou zj

From abdc729ab5e880543e4702d65e8ea66533325d71 Mon Sep 17 00:00:00 2001
From: Hou Zhijie
Date: Tue, 3 Dec 2024 12:19:50 +0800
Subject: [PATCH] improvements

---
 doc/src/sgml/ref/create_publication.sgml |  9 +++++++++
 src/backend/commands/publicationcmds.c   | 13 +++++--------
 src/backend/executor/execReplication.c   | 27 +++++++++++++--------------
 3 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml
index f8e217d661..ae21018697 100644
--- a/doc/src/sgml/ref/create_publication.sgml
+++ b/doc/src/sgml/ref/create_publication.sgml
@@ -311,6 +311,15 @@ CREATE PUBLICATION name
    system columns.
 
+   If tables added to a publication include generated columns that are part of
+   the REPLICA IDENTITY, it is essential to publish these
+   columns by explicitly listing them in the column list or by enabling the
+   publish_generated_columns option. Otherwise,
+   UPDATE or DELETE operations will be
+   disallowed on those tables.
+
    The row filter on a table becomes redundant if
    FOR TABLES IN SCHEMA is specified and the table
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index d2b4c8e9a6..323f59fae1 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -427,16 +427,13 @@ pub_contains_invalid_column(Oid pubid, Relation relation, List *ancestors,
 	if (columns == NULL)
 	{
 		/*
-		 * Check if pubgencols is false and generated column is part of
-		 * REPLICA IDENTITY
+		 * Break the loop if an unpublished generated column is part of the
+		 * REPLICA IDENTITY.
 		 */
-		if (!pubgencols)
+		if (!pubgencols && att->attgenerated)
 		{
-			*unpublished_gen_col |= att->attgenerated;
-
-			/* Break the loop if unpublished generated columns exist. */
-			if (*unpublished_gen_col)
-				break;
+			*unpublished_gen_col = true;
+			break;
 		}
 
 		/* Skip validating the column list since it is not defined */
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 227a8aeea0..f66ff21159 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -785,27 +785,26 @@ CheckCmdReplicaIdentity(Relation rel, CmdType cmd)
 		return;
 
 	/*
-	 * It is only safe to execute UPDATE/DELETE when:
+	 * It is only safe to execute UPDATE/DELETE if the relation does not
+	 * publish UPDATEs or DELETEs, or all the following conditions are
+	 * satisfied:
 	 *
 	 * 1. All columns, referenced in the row filters from publications which
-	 * the relation is in, are valid - i.e. when all referenced columns are
-	 * part of REPLICA IDENTITY or the table does not publish UPDATEs or
-	 * DELETEs.
+	 *    the relation is in, are valid - i.e. when all referenced columns are
+	 *    part of REPLICA IDENTITY.
 	 *
 	 * 2. All columns, referenced in the column lists from publications which
-	 * the relation is in, are valid - i.e. when all referenced columns are
-	 * part of REPLICA IDENTITY or the table does not publish UPDATEs or
-	 * DELETEs.
+	 *    the relation is in, are valid - i.e. when all columns referenced in
+	 *    the REPLICA IDENTITY are covered by the column list.
 	 *
-	 * 3. All generated columns in REPLICA IDENTITY of the relation, for all
-	 * the publications which the relation is in, are valid - i.e. when
-	 * unpublished generated columns are not part of REPLICA IDENTITY or the
-	 * table does not publish UPDATEs or DELETEs
Re: [18] Unintentional behavior change in commit e9931bfb75
On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
> Well, also for compatibility with our SQL parser's understanding
> of identifier lowercasing.

But why only for single-byte encodings? And why not for ICU?

> > Should I put the special case back?
>
> I think so.

Done. I put the special case back in (commit e3fa2b037c) because the earlier commit wasn't intended to be a behavior change.

I'm still not convinced that the special case behavior is great, but a lot of users are unaffected because they are on UTF8 anyway, so I'm fine keeping it.

Regards,
Jeff Davis
Re: [18] Unintentional behavior change in commit e9931bfb75
Jeff Davis writes:
> On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
>> Well, also for compatibility with our SQL parser's understanding
>> of identifier lowercasing.

> But why only for single-byte encodings? And why not for ICU?

I think the not-for-utf8 exclusion was mostly because that was how it was before, which was probably mostly historical accident. (I do vaguely recall that there was discussion on the point, but I'm too tired to go look in the archives for it.)

As for ICU, that didn't exist back then, and I'm not going to defend whether it was a good idea for that code path to fail to reproduce this behavior.

regards, tom lane
Re: Memory leak in WAL sender with pgoutput (v10~)
On Tue, Dec 03, 2024 at 09:50:42AM +0530, Amit Kapila wrote:
> But that suits the current design more. We allocate PGOutputData and
> other contexts in that structure in a "Logical decoding context". A
> few of its members (publications, publication_names) residing in
> totally unrelated contexts sounds odd. In the first place, we don't
> need to allocate publications under CacheMemoryContext, they should be
> allocated in PGOutputData->cachectx. However, because we need to free
> those entirely at one-shot during invalidation processing, we could
> use a new context as a child context of PGOutputData->cachectx. Unless
> I am missing something, the current memory context usage appears more
> like a coding convenience than a thoughtful design decision.

PGOutputData->cachectx was introduced in 2022 in commit 52e4f0cd4, while the decision to keep RelationSyncEntry and the publication list in CacheMemoryContext goes back to v10, where this logical replication code was introduced. This was a carefully-thought choice back then, because this is data that belongs to the process cache, so yes, this choice makes sense to me.

Even today this choice makes sense: this is still data cached in the process, and it's stored in the memory context for cached data. The problem is that we have two locations where the cached data is stored, and I'd agree to make the whole thing leaner by relying on one or the other entirely, but not both.

If you want to move all that to the single memory context in PGOutputData, perhaps that makes sense in the long run and the code gets simpler. Perhaps also it could be simpler to get everything under CacheMemoryContext and remove PGOutputData->cachectx. However, if you do so, I'd suggest using the same parent context for RelationSyncData as well rather than have two concepts, not only the publication list. That's digressing from the original subject a bit, btw..

-- 
Michael

signature.asc
Description: PGP signature
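The "free those entirely at one-shot" idea discussed above is the defining property of PostgreSQL's memory context trees: deleting a parent context releases every allocation made in it and, recursively, in its child contexts. As a rough standalone sketch of that ownership model (the `ToyContext` type and helpers below are invented for illustration, not the real mcxt.c API):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a MemoryContext: owns its chunks and child contexts. */
typedef struct ToyContext {
    void *chunks[64];
    int nchunks;
    struct ToyContext *children[8];
    int nchildren;
} ToyContext;

static int live_chunks = 0;     /* global count of outstanding allocations */

ToyContext *
ctx_create(ToyContext *parent)
{
    ToyContext *ctx = calloc(1, sizeof(ToyContext));

    if (parent)
        parent->children[parent->nchildren++] = ctx;
    return ctx;
}

void *
ctx_alloc(ToyContext *ctx, size_t size)
{
    void *p = malloc(size);

    ctx->chunks[ctx->nchunks++] = p;
    live_chunks++;
    return p;
}

/* Deleting a context frees its chunks and all descendant contexts in one shot. */
void
ctx_delete(ToyContext *ctx)
{
    for (int i = 0; i < ctx->nchildren; i++)
        ctx_delete(ctx->children[i]);
    for (int i = 0; i < ctx->nchunks; i++)
    {
        free(ctx->chunks[i]);
        live_chunks--;
    }
    free(ctx);
}
```

Under this model, resetting a dedicated child of data->cachectx would release the cached publication list without any per-field pfree calls, which is roughly the longer-term rework being discussed in this thread.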
Re: speedup ALTER TABLE ADD CHECK CONSTRAINT.
On Mon, Dec 2, 2024 at 12:06 AM Sergei Kornilov wrote:
>
> Hello!
> Thanks for the patch, but I think using SPI is still not allowed here:
> https://www.postgresql.org/message-id/9878.1511970441%40sss.pgh.pa.us
>
> > if it uses SPI for sub-operations of ALTER TABLE I think that is sufficient
> > reason to reject it out of hand.
>

Thanks for dropping this link. I wonder what your thoughts were about this (https://www.postgresql.org/message-id/900056D1-32DF-4927-8251-3E0C0DC407FD%40anarazel.de) at that time?
Re: Memory leak in WAL sender with pgoutput (v10~)
On Mon, Dec 02, 2024 at 03:29:56PM +0530, Amit Kapila wrote:
> We can look at it from a different angle which is that the
> FreePublication(s) relies on how the knowledge of Publication
> structure is built. So, it doesn't look weird if we think from that
> angle.

OK, I can live with that on all the stable branches with an extra list free rather than a deep list free.

I agree that the memory handling of this whole area needs some rework to make such leaks harder to introduce in the WAL sender. Still, let's first solve the problem at hand :)

So how about the attached that introduces a FreePublication() matching with GetPublication(), used to do the cleanup? Feel free to comment.

-- 
Michael

From f65c7bf220ba152a868ef52bb7df245809ce6d73 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Tue, 3 Dec 2024 15:45:04 +0900
Subject: [PATCH v3] Fix memory leak in pgoutput with publication list cache

Blah.

Analyzed-by: Jeff Davis, Michael Paquier
Discussion: https://postgr.es/m/z0khf9evmvloc...@paquier.xyz
Backpatch-through: 13
---
 src/include/catalog/pg_publication.h        |  1 +
 src/backend/catalog/pg_publication.c        | 10 ++
 src/backend/replication/pgoutput/pgoutput.c |  8 +---
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/src/include/catalog/pg_publication.h b/src/include/catalog/pg_publication.h
index 9a83a72d6b..4e2de947e9 100644
--- a/src/include/catalog/pg_publication.h
+++ b/src/include/catalog/pg_publication.h
@@ -118,6 +118,7 @@ typedef struct PublicationRelInfo
 } PublicationRelInfo;

 extern Publication *GetPublication(Oid pubid);
+extern void FreePublication(Publication *pub);
 extern Publication *GetPublicationByName(const char *pubname, bool missing_ok);
 extern List *GetRelationPublications(Oid relid);

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 9bbb60463f..2efe86d953 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -1075,6 +1075,16 @@ GetPublication(Oid pubid)
 	return pub;
 }

+/*
+ * Free memory allocated by Publication structure.
+ */
+void
+FreePublication(Publication *pub)
+{
+	pfree(pub->name);
+	pfree(pub);
+}
+
 /*
  * Get Publication using name.
  */
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 5e23453f07..a51b1c15fb 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -2064,11 +2064,13 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 	if (!publications_valid)
 	{
 		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		if (data->publications)
+		foreach(lc, data->publications)
 		{
-			list_free_deep(data->publications);
-			data->publications = NIL;
+			Publication *pub = lfirst(lc);
+
+			FreePublication(pub);
 		}
+		list_free(data->publications);
 		data->publications = LoadPublications(data->publication_names);
 		MemoryContextSwitchTo(oldctx);
 		publications_valid = true;
-- 
2.45.2

signature.asc
Description: PGP signature
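For readers skimming the thread, the shape of the leak and of the fix can be sketched in standalone C. The `Pub` and `Cell` types and helpers below are invented stand-ins, not PostgreSQL's actual Publication or List APIs: a generic deep free releases the list cells and the structs they point to, but knows nothing about a string hanging off each struct, so a type-aware per-element destructor plus a shallow list free is needed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for Publication and a singly linked list. */
typedef struct Pub {
    char *name;                 /* separately allocated, like pub->name */
} Pub;

typedef struct Cell {
    Pub *pub;
    struct Cell *next;
} Cell;

static int live_allocs = 0;     /* outstanding allocations, for demonstration */

static void *xalloc(size_t n) { live_allocs++; return malloc(n); }
static void xfree(void *p)    { live_allocs--; free(p); }

Pub *make_pub(const char *name)
{
    Pub *p = xalloc(sizeof(Pub));

    p->name = xalloc(strlen(name) + 1);
    strcpy(p->name, name);
    return p;
}

Cell *push(Cell *head, Pub *pub)
{
    Cell *c = xalloc(sizeof(Cell));

    c->pub = pub;
    c->next = head;
    return c;
}

/*
 * Deep free in the list_free_deep style: frees the cells and the Pub
 * structs, but cannot know about Pub's internals -- each name string leaks.
 */
void list_free_deep_style(Cell *head)
{
    while (head)
    {
        Cell *next = head->next;

        xfree(head->pub);       /* pub->name is leaked here */
        xfree(head);
        head = next;
    }
}

/* The fix's shape: a type-aware destructor per element, then a shallow free. */
void free_publication(Pub *pub)
{
    xfree(pub->name);
    xfree(pub);
}

void list_free_with_free_publication(Cell *head)
{
    while (head)
    {
        Cell *next = head->next;

        free_publication(head->pub);
        xfree(head);
        head = next;
    }
}
```

This mirrors the patch above: FreePublication() knows to pfree(pub->name) before pfree(pub), after which the remaining list cells are released with a plain list_free().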
Re: Memory leak in WAL sender with pgoutput (v10~)
On Tue, Dec 3, 2024 at 5:47 PM Michael Paquier wrote:
>
> On Mon, Dec 02, 2024 at 03:29:56PM +0530, Amit Kapila wrote:
> > We can look at it from a different angle which is that the
> > FreePublication(s) relies on how the knowledge of Publication
> > structure is built. So, it doesn't look weird if we think from that
> > angle.
>
> OK, I can live with that on all the stable branches with an extra
> list free rather than a deep list free.
>
> I agree that the memory handling of this whole area needs some rework
> to make such leaks harder to introduce in the WAL sender. Still,
> let's first solve the problem at hand :)
>
> So how about the attached that introduces a FreePublication() matching
> with GetPublication(), used to do the cleanup? Feel free to comment.
> --

Perhaps the patch can use the foreach_ptr macro instead of foreach?

==
Kind Regards,
Peter Smith.
Fujitsu Australia
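The suggestion is cosmetic: PostgreSQL's foreach_ptr declares a typed loop variable and hides the lfirst() cast that plain foreach requires. A simplified standalone rendition of the two styles over a toy array-backed list (these macros are invented for illustration, not the real pg_list.h definitions):

```c
#include <stddef.h>

/* Toy array-backed list; real PostgreSQL Lists are more involved. */
typedef struct {
    void *elements[8];
    int length;
} ToyList;

/* foreach style: the caller manages an index and casts each element. */
#define toy_foreach(i, lst)  for ((i) = 0; (i) < (lst)->length; (i)++)
#define toy_lfirst(lst, i)   ((lst)->elements[(i)])

/*
 * foreach_ptr style: declares a typed pointer for the loop body. The inner
 * single-iteration for-loop is a common trick to scope the variable; note
 * that a plain "break" in the body would only exit the inner loop here,
 * a caveat the real macros handle differently.
 */
#define toy_foreach_ptr(type, var, lst) \
    for (int var##_i = 0; var##_i < (lst)->length; var##_i++) \
        for (type *var = (type *) (lst)->elements[var##_i]; var != NULL; var = NULL)
```

With the real macros, the loop in the patch would collapse to foreach_ptr(Publication, pub, data->publications) followed directly by FreePublication(pub), with no explicit ListCell or lfirst() cast.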
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi Nisha, here are a couple of review comments for patch v52-0001.

==
Commit Message

Add check if slot is already acquired, then mark it invalidate directly.

~

/slot/the slot/

"mark it invalidate" ? Maybe you meant: "then invalidate it directly", or "then mark it 'invalidated' directly", or etc.

==
src/backend/replication/logical/slotsync.c

1.
@@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 static void
 update_synced_slots_inactive_since(void)
 {
-	TimestampTz now = 0;
+	TimestampTz now;

 	/*
 	 * We need to update inactive_since only when we are promoting standby to
@@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void)
 	/* The slot sync worker or SQL function mustn't be running by now */
 	Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);

+	/* Use same inactive_since time for all slots */
+	now = GetCurrentTimestamp();
+

Something is broken with these changes. AFAICT, the result after applying patch 0001 still has code:

/* Use the same inactive_since time for all the slots. */
if (now == 0)
	now = GetCurrentTimestamp();

So the end result has multiple/competing assignments to variable 'now'.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY
On Tue, 3 Dec 2024 at 10:21, Zhijie Hou (Fujitsu) wrote:
>
> On Friday, November 29, 2024 9:08 PM Shlok Kyal
> wrote:
> >
> >
> > Thanks for reviewing the patch. I have fixed the issue and updated the
> > patch.
>
> Thanks for updating the patch. I have reviewed and have a few suggestions.
> Please check the attached diff which includes:
>
> 1) Comments in CheckCmdReplicaIdentity()
> * Removed some duplicate descriptions.
> * Fixed the comments for the column list which I think is not correct.
> * Mentioned the column list and generated column in XXX part as well.
>
> 2) Doc
> * Since we mentioned the restriction for UPDATEs and DELETEs when row filter or
> column list is defined in the create_publication.sgml, I feel we may need to
> mention the generated part as well. So, added in the diff.
>
> 3) pub_contains_invalid_column
> * Simplified one condition a bit.
>
> Please check and merge if it looks good to you.
>

The changes look good to me. I have included it in the updated patch.

Thanks and Regards,
Shlok Kyal

v14-0001-Disallow-UPDATE-DELETE-on-table-with-generated-c.patch
Description: Binary data
Re: Consider pipeline implicit transaction as a transaction block
On Mon, Dec 02, 2024 at 09:14:11AM -0500, Robert Haas wrote: > On Thu, Nov 28, 2024 at 2:53 AM Anthonin Bonnefoy > wrote: >> Sorry about that. I didn't have a strong need for this to be >> backpatched and should have made this clearer. > > FWIW, I don't think you did anything wrong. To me, the thread reads > like you just submitted this as a normal patch and Michael decided to > back-patch. Yep, this one's on me. Please don't worry, Anthonin. -- Michael signature.asc Description: PGP signature
Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?
On Wed, Nov 3, 2021 at 7:33 PM Peter Geoghegan wrote:
> But what about index-only scans, which GiST also supports? I think
> that the rules are different there, even though index-only scans use
> an MVCC snapshot.

(Reviving this old thread after 3 years)

I was reminded of this old thread during today's discussion of a tangentially related TID-recycle-safety bug that affects bitmap index scans that use the visibility map as an optimization [1]. Turns out I was right to be concerned. This GiST bug is causally unrelated to that other bug, so I thought it would be helpful to move discussion of the GiST bug to this old thread.

Attached is a refined version of a test case I posted earlier [2], decisively proving that GiST index-only scans are in fact subtly broken. Right now it fails, showing a wrong answer to a query. The patch adds an isolation test case to btree_gist, based on a test case of Andres'.

Offhand, I think that the only way to fix this is to bring GiST in line with nbtree: we ought to teach GiST VACUUM to start acquiring cleanup locks (previously known as super-exclusive locks), and then teach GiST index-only scans to hold onto a leaf page buffer pin as the visibility map (or the heap proper) is accessed for the TIDs to be returned from the leaf page. Arguably, GiST isn't obeying the current contract for amgettuple index AMs at all right now (whether or not it violates the contract as written is open to interpretation, I suppose, but either way the current behavior is wrong).

We probably shouldn't hold onto a buffer pin during plain GiST index scans, though -- plain GiST index scans *aren't* broken, and so we should change as little as possible there. More concretely, we should probably copy more nbtree scan related code into GiST to deal with all this: we could copy nbtree's _bt_drop_lock_and_maybe_pin into GiST to fix this bug, while avoiding changing the performance characteristics of GiST plain index scans.

This will also entail adding a new buffer field to GiST's GISTScanOpaqueData struct -- something similar to nbtree's BTScanOpaqueData.currPos.buf field (it'll go next to the current GISTScanOpaqueData.curBlkno field, just like the nbtree equivalent goes next to its own currPage blkno field).

Long term, code like nbtree's _bt_drop_lock_and_maybe_pin should be generalized and removed from every individual index AM, nbtree included -- I think that the concepts generalize to every index AM that supports amgettuple (the race condition in question has essentially nothing to do with individual index AM requirements). I've discussed this kind of approach with Tomas Vondra (CC'd) recently. That's not going to be possible within the scope of a backpatchable fix, though.

[1] https://postgr.es/m/873c33c5-ef9e-41f6-80b2-2f5e11869...@garret.ru
[2] https://postgr.es/m/CAH2-Wzm6gBqc99iEKO6540ynwpjOqWESt5yjg-bHbt0hc8DPsw%40mail.gmail.com

-- 
Peter Geoghegan

v2-0001-isolationtester-showing-broken-index-only-scans-w.patch
Description: Binary data
Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 3:56 PM Peter Geoghegan wrote: > I took what you wrote, and repurposed it to prove my old theory about > GiST index-only scans being broken due to the lack of an appropriate > interlock against concurrent TID recycling. See the attached patch. I've moved discussion of this GiST bug over to the old 2021 thread where I first raised concerns about the issue: https://postgr.es/m/CAH2-Wz=jjinl9fch8c1l-guh15f4wftwub2x+_nucngcddc...@mail.gmail.com The GiST bug is actually causally unrelated to the bitmap index scan bug under discussion, despite the high-level similarity. Seems best to keep discussion of GiST on its own thread, for that reason. -- Peter Geoghegan