date:20241220

Fix logging for invalid recovery timeline

2024-12-20 Thread David Steele


Hackers,

I noticed while debugging a user issue that the checkpoint logged in

"Latest checkpoint is at %X/%X on timeline %u, but in the history of the 
requested timeline, the server forked off from that timeline at %X/%X."


is being pulled from the value stored in the control file rather than 
the value read from backup_label (in the case where there is a 
backup_label). This only affects logging since the timeline check is 
done against the checkpoint/TLI as read from backup_label.


This patch updates the checkpoint and TLI to (what I believe are) the 
correct values for logging.


I think this should also be back-patched.

Regards,
-DavidFrom f14e30b18cde216131bd3e069ee8ecd5da3301b0 Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Fri, 20 Dec 2024 15:13:59 +
Subject: Fix logging for invalid recovery timeline.

If the requested recovery timeline is not reachable the logged checkpoint and
TLI should to be the values read from backup_label when backup_label is
present.
---
 src/backend/access/transam/xlogrecovery.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c 
b/src/backend/access/transam/xlogrecovery.c
index c6994b78282..99afd01bf38 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -844,13 +844,13 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
*wasShutdown_ptr,
 * tliSwitchPoint will throw an error if the checkpoint's 
timeline is
 * not in expectedTLEs at all.
 */
-   switchpoint = 
tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs, NULL);
+   switchpoint = tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);
ereport(FATAL,
(errmsg("requested timeline %u is not a child 
of this server's history",
recoveryTargetTLI),
 errdetail("Latest checkpoint is at %X/%X on 
timeline %u, but in the history of the requested timeline, the server forked 
off from that timeline at %X/%X.",
-  
LSN_FORMAT_ARGS(ControlFile->checkPoint),
-  
ControlFile->checkPointCopy.ThisTimeLineID,
+  
LSN_FORMAT_ARGS(CheckPointLoc),
+  CheckPointTLI,
   
LSN_FORMAT_ARGS(switchpoint;
}
 
-- 
2.34.1

Re: FileFallocate misbehaving on XFS

2024-12-20 Thread Andres Freund

Hi,

On 2024-12-19 17:47:13 +1100, Michael Harris wrote:
> I finally managed to get the patched version installed in a production
> database where the error is occurring very regularly.

Thanks!

> Here is a sample of the output:
> 
> 2024-12-19 01:08:50 CET [2533222]:  LOG:  mdzeroextend FileFallocate
> failing with ENOSPC: free space for filesystem containing
> "pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
> 2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
> 1073741376, f_ffree: 1069933796

That's ~700 GB of free space...

It'd be interesting to see filefrag -v for that segment.

> This is a different system to those I previously provided logs from.
> It is also running RHEL8 with a similar configuration to the other
> system.

Given it's a RHEL system, have you raised this as an issue with RH? They
probably have somebody with actual XFS hacking experience on staff.

RH's kernels are *heavily* patched, so it's possible the issue is actually RH
specific.

> I have so far not installed the bpftrace that Jakub suggested before -
> as I say this is a production machine and I am wary of triggering a
> kernel panic or worse (even though it seems like the risk for that
> would be low?). While a kernel stack trace would no doubt be helpful
> to the XFS developers, from a postgres point of view, would that be
> likely to help us decide what to do about this?

Well, I'm personally wary of installing workarounds for a problem I don't
understand and can't reproduce, which might be specific to old filesystems
and/or heavily patched kernels.  This clearly is an FS bug.

That said, if we learn that somehow this is a fundamental XFS issue that can
be triggered on every XFS filesystem, with current kernels, it becomes more
reasonable to implement a workaround in PG.

Another thing I've been wondering about is if we could reduce the frequency of
hitting problems by rounding up the number of blocks we extend by to powers of
two. That would probably reduce fragmentation, and the extra space would be
quickly used in workloads where we extend by a bunch of blocks at once,
anyway.

Greetings,

Andres Freund

Document How Commit Handles Aborted Transactions

2024-12-20 Thread David G. Johnston

Hi,

The commit reference page lacks an "Outputs" section even though it is
capable of outputting both "COMMIT" and "ROLLBACK".

The attached adds this section, describes when each applies, and then
incorporates the same into the main description for commit as well as the
transaction section of the tutorial - which presently seems to be the main
discussion area for the topic (the Concurrency Control chapter lacks a
section for this introductory material).

This was noted as being needed by Tom Lane back into 2006 but it
never happened.
https://www.postgresql.org/message-id/28798.1142608067%40sss.pgh.pa.us

It came up again when I was answering a question on Slack regarding "commit
and chain" wondering whether the "and chain" could be made conditional
(i.e., could the new transaction start aborted) on whether commit outputted
"commit" or "rollback".

Its left implied that this behavior of "rollback" is standard-conforming.
Please feel free to suggest/add language to the Compatibility section if
this is not the case.

David J.
From f9c40e72c8e62ae7b364a82d26a0f73995ef5082 Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Fri, 20 Dec 2024 08:40:47 -0700
Subject: [PATCH] doc: Commit performs rollback of aborted transactions

The Commit command handles an aborted transaction in the same
manner as the Rollback command.  This needs to be documented.
Add it to both the reference page as well mentioning the behavior
in the official material for introducting transactions - the tutorial.

In passing, make the description of the Commit reference page
self-contained by mentioning the 'and chain' behavior.
---
 doc/src/sgml/advanced.sgml   | 18 +++---
 doc/src/sgml/ref/commit.sgml | 33 +
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/advanced.sgml b/doc/src/sgml/advanced.sgml
index 755c9f1485..b5cae2335c 100644
--- a/doc/src/sgml/advanced.sgml
+++ b/doc/src/sgml/advanced.sgml
@@ -149,7 +149,8 @@ DETAIL:  Key (city)=(Berkeley) is not present in table "cities".
 systems.  The essential point of a transaction is that it bundles
 multiple steps into a single, all-or-nothing operation.  The intermediate
 states between the steps are not visible to other concurrent transactions,
-and if some failure occurs that prevents the transaction from completing,
+and if some failure
+occurs that prevents the transaction from completing,
 then none of the steps affect the database at all.

 
@@ -218,7 +219,8 @@ UPDATE branches SET balance = balance + 100.00

 In PostgreSQL, a transaction is set up by surrounding
 the SQL commands of the transaction with
-BEGIN and COMMIT commands.  So our banking
+ and 
+ commands.  So our banking
 transaction would actually look like:
 
 
@@ -233,7 +235,7 @@ COMMIT;

 If, partway through the transaction, we decide we do not want to
 commit (perhaps we just noticed that Alice's balance went negative),
-we can issue the command ROLLBACK instead of
+we can issue the command  instead of
 COMMIT, and all our updates so far will be canceled.

 
@@ -256,6 +258,16 @@ COMMIT;
 

 
+   
+When a failure does occur during a transaction it is not ended but instead
+goes into an aborted state.  While in this state all commands except
+ and  are ignored and,
+importantly, both those commands behave identically - they roll-back and close
+the current transaction and return the session to a state where new commands can
+be issued.  They will also automatically begin a new transaction if executed
+with a AND CHAIN parameter.
+   
+

 It's possible to control the statements in a transaction in a more
 granular fashion through the use of savepoints.  Savepoints
diff --git a/doc/src/sgml/ref/commit.sgml b/doc/src/sgml/ref/commit.sgml
index 7e2dcac5a3..e4ef573fe5 100644
--- a/doc/src/sgml/ref/commit.sgml
+++ b/doc/src/sgml/ref/commit.sgml
@@ -33,6 +33,19 @@ COMMIT [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
changes made by the transaction become visible to others
and are guaranteed to be durable if a crash occurs.
   
+  
+   If no changes have been made - because the transaction is in an
+   aborted state - the effect of the commit will look like a rollback,
+   including the command tag output.
+  
+  
+   In either situation, if the AND CHAIN parameter is
+   specified a new, identically configured, transaction is started.
+  
+  
+   For more information regarding transactions see
+   .
+  
  
 
  
@@ -67,6 +80,25 @@ COMMIT [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
   
  
 
+ 
+  Outputs
+
+  
+   On successful completion on a non-aborted transaction, a COMMIT
+   command returns a command tag of the form
+
+COMMIT
+
+  
+  
+   However, if the transaction being affected is aborted, a COMMIT
+   command returns a command tag of the form
+
+ROLLBACK
+
+  
+ 
+
  
   Notes
 
@@ -107,6 +139,7 @@ CO

Re: Sort functions with specialized comparators

2024-12-20 Thread Andrey M. Borodin

> On 16 Dec 2024, at 14:02, John Naylor  wrote:
> 
> Sorry, I forgot this part earlier. Yes, let's have the private function.

PFA v6.

I was poking around intarray and trying not to bash everything. It seems to me 
that overall code readability should be seriously reworked. Even if no one is 
going to invent anything new around this code. Looks like overall project code 
standards matured far above, but this particular extension is forgotten. There 
is a lot of useless checks and optimizations for n < 2, copying data to new 
allocation when it's not necessary, small inconsistencies etc.

I don't think it's a matter of this particular thread though.

Best regards, Andrey Borodin.

v6-0001-Use-specialized-sort-facilities.patch
Description: Binary data

v6-0003-prefer-non-resizing-to-constructing-empty-array.patch
Description: Binary data

v6-0002-Use-PREPAREARR-where-possible.patch
Description: Binary data

Re: FileFallocate misbehaving on XFS

2024-12-20 Thread Jakub Wartak

On Thu, Dec 19, 2024 at 7:49 AM Michael Harris  wrote:

> Hello,
>
> I finally managed to get the patched version installed in a production
> database where the error is occurring very regularly.
>
> Here is a sample of the output:
>
> 2024-12-19 01:08:50 CET [2533222]:  LOG:  mdzeroextend FileFallocate
> failing with ENOSPC: free space for filesystem containing
> "pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
> 2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
> 1073741376, f_ffree: 1069933796


[..]

> I have attached a file containing all the errors I collected. The
> error is happening pretty regularly - over 400 times in a ~6 hour
> period. The number of blocks being extended varies from ~9 to ~15, and
> the statfs result shows plenty of available space & inodes at the
> time. The errors do seem to come in bursts.
>

I couldn't resist: you seem to have entered the quantum realm of free disk
space AKA Schrodinger's free space: you both have the space and dont have
it... ;)

No one else has responded, so I'll try. My take is that we got very limited
number of reports (2-3) of this stuff happening and it always seem to be
>90% space used, yet the adoption of PG16 is rising, so we may or may not
see more errors of those kind, but on another side of things: it's
frequency is so rare that it's really wild we don't see more reports like
this one. Lots of OS upgrades in the wild are performed by building new
standbys (maybe that lowers the fs fragmentation), rather than in-place OS
upgrades. To me it sounds like a new bug in XFS that is rare. You can
probably live with #undef HAVE_POSIX_FALLOCATE as a way to survive, another
would be probably to try to run xfs_fsr to defragment the fs.

Longer-term: other than collecting the eBPF data to start digging from
where it is really triggered, I don't see a way forward. It would be
suboptimal to just abandon fallocate() optimizations from commit
31966b151e6ab7a6284deab6e8fe5faddaf2ae4c, just because of very unusual
combinations of factors (XFS bug).

Well we could be having some kludge like pseudo-code: if(posix_falloc() ==
ENOSPC && statfs().free_space_pct >= 1) fallback_to_pwrites(), but it is
ugly. Another is GUC (or even two -- how much to extend or to use or not
the posix_fallocate()), but people do not like more GUCs...

>  I have so far not installed the bpftrace that Jakub suggested before -
> as I say this is a production machine and I am wary of triggering a
> kernel panic or worse (even though it seems like the risk for that
> would be low?). While a kernel stack trace would no doubt be helpful
> to the XFS developers, from a postgres point of view, would that be
> likely to help us decide what to do about this?[..]

Well you could try having reproduction outside of production, or even clone
the storage (but not using backup/restore), but literally clone the XFS
LUNs on the storage itself and connect those separate VM to have a safe
testbed (or even use dd(1) of some smaller XFS fs exhibiting such behaviour
to some other place)

As for eBPF/bpftrace: it is safe (it's sandboxed anyway), lots of customers
are using it, but as always YMMV.

There's also xfs_fsr that might help overcome.

You can also experiment if -o allocsize helps or just even try -o
allocsize=0 (but that might have some negative effects on performance
probably)

-J.

Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-12-20 Thread Sterrett, Matthew




On 12/7/2024 12:42 AM, Devulapalli, Raghuveer wrote:

[0] https://cirrus-ci.com/task/6023394207989760
[1] https://cirrus-ci.com/task/5460444254568448
[2] https://cirrus-ci.com/task/6586344161411072


I was able to fix [0] and [1], but I can't think of why [2] fails. When I tried 
to reproduce this locally, I get a different unrelated error. Any idea why I am 
seeing this?

LINK : fatal error LNK1181: cannot open input file 'C:\Program Files\Git\nologo'

Commands: meson setup build && cd build && meson compile


Hello! I'm Matthew Sterrett and I'm a coworker of Raghuveer; he asked me 
to look into the Windows build failures related to pg_comp_crc32c.


It seems that the only thing that was required to fix that is to mark 
pg_comp_crc32c as PGDLLIMPORT, so I added a patch that does just that. 
I'm new to working with mailing lists, so please tell me if I messed 
anything up!


Matthew Sterrett
From 74d085d44d41af8ffb01f7bf2377ac487c7d4cc1 Mon Sep 17 00:00:00 2001
From: Paul Amonson 
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v10 1/4] Add a Postgres SQL function for crc32c benchmarking.

Add a drive_crc32c() function to use for benchmarking crc32c
computation. The function takes 2 arguments:

(1) count: num of times CRC32C is computed in a loop.
(2) num: #bytes in the buffer to calculate crc over.

Signed-off-by: Paul Amonson 
Signed-off-by: Raghuveer Devulapalli 
---
 src/test/modules/meson.build  |  1 +
 src/test/modules/test_crc32c/Makefile | 20 
 src/test/modules/test_crc32c/meson.build  | 22 +
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c| 47 +++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 6 files changed, 95 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/meson.build
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b61953..68d8904dd0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -15,6 +15,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
+subdir('test_crc32c')
 subdir('test_ddl_deparse')
 subdir('test_dsa')
 subdir('test_dsm_registry')
diff --git a/src/test/modules/test_crc32c/Makefile 
b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 00..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:   CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/meson.build 
b/src/test/modules/test_crc32c/meson.build
new file mode 100644
index 00..7021a6d6cf
--- /dev/null
+++ b/src/test/modules/test_crc32c/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_crc32c_sources = files(
+  'test_crc32c.c',
+)
+
+if host_system == 'windows'
+  test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+'--NAME', 'test_crc32c',
+'--FILEDESC', 'test_crc32c - test code for crc32c library',])
+endif
+
+test_crc32c = shared_module('test_crc32c',
+  test_crc32c_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_crc32c
+
+test_install_data += files(
+  'test_crc32c.control',
+  'test_crc32c--1.0.sql',
+)
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql 
b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 00..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 
'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c 
b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 00..b350caf5ce
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,47 @@
+/* select drive_crc32c(100, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "common/pg_prng.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the nuimber of loops to perform
+ *
+ * num is the number byte in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+   int64

Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-12-20 Thread Ranier Vilela

Em qui., 19 de dez. de 2024 às 19:50, Melanie Plageman <
melanieplage...@gmail.com> escreveu:

> On Wed, Dec 18, 2024 at 9:50 PM Richard Guo 
> wrote:
> >
> > On Thu, Dec 19, 2024 at 8:18 AM Melanie Plageman
> >  wrote:
> > > I pushed the straightforward option for now so that it's fixed.
> >
> > I think this binary search code now has a risk of underflow.  If 'mid'
> > is calculated as zero, the second 'if' branch will cause 'end' to
> > underflow.
>
> Thanks, Richard!
>
> > Maybe we need to do something like below.
> >
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -2600,7 +2600,11 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
> > if (tupoffset == curoffset)
> > return true;
> > else if (tupoffset < curoffset)
> > +   {
> > +   if (mid == 0)
> > +   return false;
> > end = mid - 1;
> > +   }
> > else
> > start = mid + 1;
> > }
> >
> > Alternatively, we can revert 'start' and 'end' to signed int as they
> > were before.
>
> What about this instead:
>
> diff --git a/src/backend/access/heap/heapam_handler.c
> b/src/backend/access/heap/heapam_handler.c
> index 2da4e4da13e..fb90fd869c2 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2574,11 +2574,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
>
> if (scan->rs_flags & SO_ALLOW_PAGEMODE)
> {
> -   uint32  start,
> -   end;
> -
> -   if (hscan->rs_ntuples == 0)
> -   return false;
> +   uint32  start = 0,
> +   end = hscan->rs_ntuples;
>
> /*
>  * In pageatatime mode, heap_prepare_pagescan() already did
> visibility
> @@ -2589,10 +2586,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
>  * in increasing order, but it's not clear that there would be
> enough
>  * gain to justify the restriction.
>  */
> -   start = 0;
> -   end = hscan->rs_ntuples - 1;
>
> -   while (start <= end)
> +   while (start < end)
> {
> uint32  mid = (start + end) / 2;
> OffsetNumber curoffset = hscan->rs_vistuples[mid];
> @@ -2600,7 +2595,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
> if (tupoffset == curoffset)
> return true;
> else if (tupoffset < curoffset)
> -   end = mid - 1;
> +   end = mid;
> else
> start = mid + 1;
> }
>
> Or to make it easier to read, here:
>
> uint32start = 0,
>   end = hscan->rs_ntuples;
>
> while (start < end)
> {
> uint32mid = (start + end) / 2;
> OffsetNumber curoffset = hscan->rs_vistuples[mid];
>
> if (tupoffset == curoffset)
> return true;
> else if (tupoffset < curoffset)
> end = mid;
> else
> start = mid + 1;
> }
>
> I think this gets rid of any subtraction and is still the same.
>
Look goods to me.
I think that you propose, can get rid of the early test too.
Note the way we can avoid an overflow in the mid calculation.

diff --git a/src/backend/access/heap/heapam_handler.c
b/src/backend/access/heap/heapam_handler.c
index 9f17baea5d..bd1335276a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2577,9 +2577,6 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
buffer,
  uint32 start,
  end;

- if (hscan->rs_ntuples == 0)
- return false;
-
  /*
  * In pageatatime mode, heap_prepare_pagescan() already did visibility
  * checks, so just look at the info it left in rs_vistuples[].
@@ -2590,17 +2587,17 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
buffer,
  * gain to justify the restriction.
  */
  start = 0;
- end = hscan->rs_ntuples - 1;
+ end = hscan->rs_ntuples;

- while (start <= end)
+ while (start < end)
  {
- uint32 mid = (start + end) / 2;
+ uint32 mid = (start + end) >> 1;
  OffsetNumber curoffset = hscan->rs_vistuples[mid];

  if (tupoffset == curoffset)
  return true;
  else if (tupoffset < curoffset)
- end = mid - 1;
+ end = mid;
  else
  start = mid + 1;
  }

best regards,
Ranier Vilela

Re: A few patches to clarify snapshot management

2024-12-20 Thread Heikki Linnakangas


On 16/12/2024 23:56, Nathan Bossart wrote:

On Mon, Dec 16, 2024 at 12:06:33PM +0200, Heikki Linnakangas wrote:

While working on the CSN snapshot patch, I got sidetracked looking closer
into the snapshot tracking in snapmgr.c. Attached are a few patches to
clarify some things.


I haven't yet looked closely at what you are proposing, but big +1 from me
for the general idea.  I recently found myself wishing for a lot more
commentary about this stuff [0].

[0] https://postgr.es/m/Z0dB1ld2iPcS6nC9%40nathan


While playing around some more with this, I noticed that this code in 
GetTransactionSnapshot() is never reached, and AFAICS has always been 
dead code:



Snapshot
GetTransactionSnapshot(void)
{
/*
 * Return historic snapshot if doing logical decoding. We'll never need 
a
 * non-historic transaction snapshot in this (sub-)transaction, so 
there's
 * no need to be careful to set one up for later calls to
 * GetTransactionSnapshot().
 */
if (HistoricSnapshotActive())
{
Assert(!FirstSnapshotSet);
return HistoricSnapshot;
}


when you think about it, that's good, because it doesn't really make 
sense to call GetTransactionSnapshot() during logical decoding. We jump 
through hoops to make the historic catalog decoding possible with 
historic snapshots, tracking subtransactions that modify catalogs and 
WAL-logging command ids, but they're not suitable for general purpose 
queries. So I think we should turn that into an error, per attached patch.


Another observation is that we only ever use regular MVCC snapshots as 
active snapshots. I added a "Assert(snapshot->snapshot_type == 
SNAPSHOT_MVCC);" to PushActiveSnapshotWithLevel() and all regression 
tests passed. That's also good, because we assumed that much in a few 
places anyway: there are a couple of calls that amount to 
"XidInMVCCSnapshot(..., GetActiveSnapshot()"), in 
find_inheritance_children_extended() and RelationGetPartitionDesc(). We 
could add comments and that assertion to make that assumption explicit.



And that thought takes me deeper down the rabbit hole:


/*
 * Struct representing all kind of possible snapshots.
 *
 * There are several different kinds of snapshots:
 * * Normal MVCC snapshots
 * * MVCC snapshots taken during recovery (in Hot-Standby mode)
 * * Historic MVCC snapshots used during logical decoding
 * * snapshots passed to HeapTupleSatisfiesDirty()
 * * snapshots passed to HeapTupleSatisfiesNonVacuumable()
 * * snapshots used for SatisfiesAny, Toast, Self where no members are
 *   accessed.
 *
 * TODO: It's probably a good idea to split this struct using a NodeTag
 * similar to how parser and executor nodes are handled, with one type for
 * each different kind of snapshot to avoid overloading the meaning of
 * individual fields.
 */
typedef struct SnapshotData


I'm thinking of implementing that TODO, splitting SnapshotData into 
separate structs like MVCCSnapshotData, SnapshotDirtyData, etc. It seems 
to me most places can assume that you're dealing with MVCC snapshots, 
and if we had separate types for them, could be using MVCCSnapshot 
instead of the generic Snapshot. Only the table and index AM functions 
need to deal with non-MVCC snapshots.


--
Heikki Linnakangas
Neon (https://neon.tech)
From ec248c69cb42a0747ecc6a63ac4e4682cce2ee93 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 20 Dec 2024 18:37:44 +0200
Subject: [PATCH 1/1] Don't allow GetTransactionSnapshot() in logical decoding

A historic snapshot should only be used for catalog access, not
general queries. We never call GetTransactionSnapshot() during logical
decoding, which is good because it wouldn't be very sensible, so the
code to deal with that was unreachable and untested. Turn it into an
error, to avoid doing that in the future either.
---
 src/backend/utils/time/snapmgr.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index e60360338d5..3717869f736 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,16 +212,12 @@ Snapshot
 GetTransactionSnapshot(void)
 {
 	/*
-	 * Return historic snapshot if doing logical decoding. We'll never need a
-	 * non-historic transaction snapshot in this (sub-)transaction, so there's
-	 * no need to be careful to set one up for later calls to
-	 * GetTransactionSnapshot().
+	 * This should not be called while doing logical decoding.  Historic
+	 * snapshots are only usable for catalog access, not for general-purpose
+	 * queries.
 	 */
 	if (HistoricSnapshotActive())
-	{
-		Assert(!FirstSnapshotSet);
-		return HistoricSnapshot;
-	}
+		elog(ERROR, "cannot take query snapshot during logical decoding");
 
 	/* First call in transaction? */
 	if (!FirstSnapshotSet)
-- 
2.39.5

Re: AIO v2.0

2024-12-20 Thread Jelte Fennema-Nio

On Fri, 20 Dec 2024 at 01:54, Andres Freund  wrote:
> Arguably the configuration *did* tell us, by having a higher hard limit...
> 
> But opting into a higher rlimit, while obviously adhering to the hard limit
> and perhaps some other config knob, seems fine?

Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:

https://0pointer.net/blog/file-descriptor-limits.html

Re: AIO v2.0

2024-12-20 Thread Andres Freund

Hi,

On 2024-12-20 18:27:13 +0100, Jelte Fennema-Nio wrote:
> On Fri, 20 Dec 2024 at 01:54, Andres Freund  wrote:
> > Arguably the configuration *did* tell us, by having a higher hard limit...
> > 
> > But opting into a higher rlimit, while obviously adhering to the hard limit
> > and perhaps some other config knob, seems fine?
> 
> Yes, totally fine. That's exactly the reasoning why the hard limit is
> so much larger than the soft limit by default on systems with systemd:
> 
> https://0pointer.net/blog/file-descriptor-limits.html

Good link.

This isn't just relevant for using io_uring:

There obviously are several people working on threaded postgres. Even if we
didn't duplicate fd.c file descriptors between threads (we probably will, at
least initially), the client connection FDs alone will mean that we have a lot
more FDs open. Due to the select() issue the soft limit won't be increased
beyond 1024, requiring everyone to add a 'ulimit -n $somehighnumber' before
starting postgres on linux doesn't help anyone.

Greetings,

Andres Freund

Re: per backend I/O statistics

2024-12-20 Thread Bertrand Drouvot

Hi,

On Fri, Dec 20, 2024 at 08:00:00AM +0200, Alexander Lakhin wrote:
> Hello Michael,
> 
> 19.12.2024 06:21, Michael Paquier wrote:
> > Fixed that, bumped the two version counters, and done.
> 
> Could you, please, look at recent failures produced by grassquit (which
> has fsync = on in it's config), on an added test case? For instance, [1]:
> --- 
> /home/bf/bf-build/grassquit/HEAD/pgsql/src/test/regress/expected/stats.out 
> 2024-12-19 04:44:08.779311933 +
> +++ 
> /home/bf/bf-build/grassquit/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out
> 2024-12-19 16:37:41.351784840 +
> @@ -1333,7 +1333,7 @@
>    AND :my_io_sum_shared_after_fsyncs= 0);
>   ?column?
>  --
> - t
> + f
>  (1 row)
> 
> The complete query is:
> SELECT current_setting('fsync') = 'off'
>   OR (:my_io_sum_shared_after_fsyncs = :my_io_sum_shared_before_fsyncs
>   AND :my_io_sum_shared_after_fsyncs= 0);
> 
> And the corresponding query in 027_stream_regress_primary.log is:
> 2024-12-19 16:37:39.907 UTC [4027467][client backend][15/1980:0] LOG:  
> statement: SELECT current_setting('fsync') = 'off'
>   OR (1 = 1
>   AND 1= 0);
> 
> (I can reproduce this locally with an asan-enabled build too.)
> 
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-12-19%2016%3A28%3A58

Thanks for the report! I was not able able to reproduce (even with asan-enabled)
but I think the test is wrong. Indeed the backend could fsync too during the 
test
(see register_dirty_segment() and the case where the request queue is full).

I think the test should look like the attached instead, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From a83def4ca95c154e824378452cb15f100887a56c Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 20 Dec 2024 08:59:53 +
Subject: [PATCH v1] Fix per-backend IO stats regression test

The tests added in 9aea73fc61 did not take into account that the backend can
fsync in some circumstances.
---
 src/test/regress/expected/stats.out | 3 +--
 src/test/regress/sql/stats.sql  | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)
  50.0% src/test/regress/expected/
  50.0% src/test/regress/sql/

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 150b6dcf74..d18fc7c081 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1329,8 +1329,7 @@ SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 (1 row)
 
 SELECT current_setting('fsync') = 'off'
-  OR (:my_io_sum_shared_after_fsyncs = :my_io_sum_shared_before_fsyncs
-  AND :my_io_sum_shared_after_fsyncs= 0);
+  OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
  ?column? 
 --
  t
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 1e7d0ff665..fc0d2fde31 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -648,8 +648,7 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   WHERE object = 'relation' \gset my_io_sum_shared_after_
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
-  OR (:my_io_sum_shared_after_fsyncs = :my_io_sum_shared_before_fsyncs
-  AND :my_io_sum_shared_after_fsyncs= 0);
+  OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
-- 
2.34.1

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-12-20 Thread Amit Kapila

On Mon, Dec 16, 2024 at 4:10 PM Nisha Moond  wrote:
>
> Here is the v56 patch set with the above comments incorporated.
>

Review comments:
===
1.
+ {
+ {"idle_replication_slot_timeout", PGC_SIGHUP, REPLICATION_SENDING,
+ gettext_noop("Sets the duration a replication slot can remain idle before "
+ "it is invalidated."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &idle_replication_slot_timeout_ms,

I think users are going to keep idele_slot timeout at least in hours.
So, millisecond seems the wrong choice to me. I suggest to keep the
units in minutes. I understand that writing a test would be
challenging as spending a minute or more on one test is not advisable.
But I don't see any test testing the other GUCs that are in minutes
(wal_summary_keep_time and log_rotation_age). The default value should
be one day.

2.
+ /*
+ * An error is raised if error_if_invalid is true and the slot is found to
+ * be invalid.
+ */
+ if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
+ {
+ StringInfoData err_detail;
+
+ initStringInfo(&err_detail);
+
+ switch (s->data.invalidated)
+ {
+ case RS_INVAL_WAL_REMOVED:
+ appendStringInfo(&err_detail, _("This slot has been invalidated
because the required WAL has been removed."));
+ break;
+
+ case RS_INVAL_HORIZON:
+ appendStringInfo(&err_detail, _("This slot has been invalidated
because the required rows have been removed."));
+ break;
+
+ case RS_INVAL_WAL_LEVEL:
+ /* translator: %s is a GUC variable name */
+ appendStringInfo(&err_detail, _("This slot has been invalidated
because \"%s\" is insufficient for slot."),
+ "wal_level");
+ break;
+
+ case RS_INVAL_NONE:
+ pg_unreachable();
+ }
+
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("can no longer get changes from replication slot \"%s\"",
+NameStr(s->data.name)),
+ errdetail_internal("%s", err_detail.data));
+ }
+

This should be moved to a separate function.

3.
+static inline bool
+IsSlotIdleTimeoutPossible(ReplicationSlot *s)

Would it be better to name this function as CanInvalidateIdleSlot()?
The current name doesn't seem to match with similar other
functionalities.

-- 
With Regards,
Amit Kapila.

Re: per backend I/O statistics

2024-12-20 Thread Bertrand Drouvot

Hi,

On Thu, Dec 19, 2024 at 06:12:04AM +, Bertrand Drouvot wrote:
> On Thu, Dec 19, 2024 at 01:21:54PM +0900, Michael Paquier wrote:
> > bumped the two version counters, and done.

> >The existing structure could be expanded in the
> >future to add more information about other statistics related to
> >backends, depending on requirements or ideas.

BTW, now that the per backend I/O statistics is done, I'll start working on per
backend wal statistics.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Parametrization minimum password lenght

2024-12-20 Thread Bertrand Drouvot

On Thu, Dec 19, 2024 at 09:57:31AM -0600, Nathan Bossart wrote:
> On Thu, Dec 19, 2024 at 09:36:17AM -0600, Nathan Bossart wrote:
> > On Thu, Dec 19, 2024 at 07:25:30AM +, Bertrand Drouvot wrote:
> >> -   errmsg("password is too short")));
> >> +   errmsg("password is too short: %d (< 
> >> %d)", pwdlen, min_password_length)));
> > 
> > I think we should explicitly point to the parameter and note its current
> > value.
> 
> Like so.

LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: per backend I/O statistics

2024-12-20 Thread Michael Paquier

On Fri, Dec 20, 2024 at 09:09:00AM +, Bertrand Drouvot wrote:
> Thanks for the report! I was not able able to reproduce (even with 
> asan-enabled)
> but I think the test is wrong. Indeed the backend could fsync too during the 
> test
> (see register_dirty_segment() and the case where the request queue is full).
> 
> I think the test should look like the attached instead, thoughts?

Hmm.  I cannot reproduce that here either.  FWIW, I use these settings
by default in my local builds, similarly to the CI which did not
complain:
ASAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0"
UBSAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2"
CFLAGS+="-fsanitize=undefined "
LDFLAGS+="-fsanitize=undefined "

grassquit uses something a bit different, which don't allow me to see
a problem with 027 even with fsync enabled:
ASAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0"
CFLAGS+="-fsanitize=address -fno-sanitize-recover=all "
LDFLAGS+="-fsanitize=address -fno-sanitize-recover=all "

Anyway, I was arriving at the same conclusion as you, because this is
just telling us that there could be "some" sync activity.  As
rewritten, this would just check that the after-the-fact counter is
never lower than the before-the-fact counter, which is still better
than removing the query.  I was thinking about doing the latter and
remove the query, but I'm also OK with what you have here, keeping the
query with a relaxed check.  I'll go adjust that in a bit..
--
Michael


signature.asc
Description: PGP signature

Re: Test to dump and restore objects left behind by regression

2024-12-20 Thread Ashutosh Bapat

On Wed, Dec 18, 2024 at 7:39 PM Daniel Gustafsson  wrote:
>
> > On 18 Dec 2024, at 12:28, Ashutosh Bapat  
> > wrote:
>
> In general I think it's fine to have such an expensive test gated behind a
> PG_TEST_EXTRA flag, and since it's only run on demand we might as well run it
> for all formats while at it.  If this ran just once per week in the buildfarm
> it would still allow us to catch things in time at fairly low overall cost.
>
> > I have rebased my patches on the current HEAD. The test now passes and
> > does not show any new diff or bug.
>
> A few comments on this version of the patch:
>
> +   regression run. Not enabled by default because it is time consuming.
> Since this test consumes both time and to some degree diskspace (the 
> dumpfiles)
> I wonder if this should be "time and resource consuming".

Done.

>
>
> +   if (   $ENV{PG_TEST_EXTRA}
> +   && $ENV{PG_TEST_EXTRA} =~ /\bregress_dump_test\b/)
> Should this also test that $oldnode and $newnode have matching pg_version to
> keep this from running in a cross-version upgrade test?  While it can be 
> argued
> that running this in a cross-version upgrade is breaking it and getting to 
> keep
> both pieces, it's also not ideal to run a resource intensive test we know will
> fail.  (It can't be done at this exact callsite, just picked to illustrate.)
>

You already wrote it in parenthesis. At the exact callsite $oldnode
and $newnode can not be of different versions. In fact newnode is yet
to be created at this point. But $oldnode has the same version as the
server run from the code. In a cross-version upgrade this test will
not be executed. I am confused as to what this comment is about.

>
> -sub filter_dump
> +sub filter_dump_for_upgrade
> What is the reason for the rename?  filter_dump() is perhaps generic but it's
> also local to the upgrade test so it's also not too unclear.
>

In one of the earlier versions of the patch, there was
filter_dump_for_regress or some such function which was used to filter
the dump from the regression database. Name was changed to
differentiate between the two functions. But the new function is now
named as adjust_regress_dumpfile() so this name change is not required
anymore. Reverting it. I have left the comment change since the test
file now has tests for both upgrade and dump/restore.

>
> +  my $format_spec = substr($format, 0, 1);
> This doesn't seem great for readability, how about storing the formats and
> specfiers in an array of Perl hashes which can be iterated over with
> descriptive names, like $format{'name'} and $format{'spec'}?
>

Instead of an array of hashes, I used a single hash with format
description as key and format spec as value. Hope that's acceptable.

>
> + || die "opening $dump_adjusted ";
> Please include the errno in the message using ": $!" appended to the error
> message, it could help in debugging.
>

I didn't see this being used with other open calls in the file. For
that matter we are not using $! with open() in many test files. But it
seems useful. Done

> +compare the results of dump and retore tests
> s/retore/restore/
>

Thanks for pointing out. Fixed.

>
> +   else
> +   {
> +   note('first dump file: ' . $dump1);
> +   note('second dump file: ' . $dump2);
> +   }
> +
> This doesn't seem particularly helpful, if the tests don't fail then printing
> the names won't bring any needed information.  What we could do here is to add
> an is() test in compare_dump()s to ensure the filenames differ to catch any
> programmer error in passing in the same file twice.

Good suggestion. Done.

0001 - same as 0001 from previous version
0002 - addresses above comments

--
Best Wishes,
Ashutosh Bapat
From 5ab6dd99438dbb1a77151f5faa0a4104aec5ce74 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 27 Jun 2024 10:03:53 +0530
Subject: [PATCH 1/2] Test pg_dump/restore of regression objects

002_pg_upgrade.pl tests pg_upgrade of the regression database left
behind by regression run. Modify it to test dump and restore of the
regression database as well.

Regression database created by regression run contains almost all the
database objects supported by PostgreSQL in various states. Hence the
new testcase covers dump and restore scenarios not covered by individual
dump/restore cases. Many regression tests mention tht they leave objects
behind for dump/restore testing. But till now 002_pg_upgrade only tested
dump/restore through pg_upgrade which is different from dump/restore
through pg_dump. Adding the new testcase closes that gap.

Testing dump and restore of regression database makes this test run
longer for a relatively smaller benefit. Hence run it only when
explicitly requested by user by specifying "regress_dump_test" in
PG_TEST_EXTRA.

Note For the reviewer:
The new test has uncovered two bugs so far in one year.
1. Introduced by 14e87ffa5c54. Fixed in fd41ba93e4630921a72ed5127cd0d552a8f3f8fc.
2. Introduced by 0413a556990ba628a3de8a0b58be020fd9a14ed0. Reverted in 7456

Re: Make tuple deformation faster

2024-12-20 Thread David Rowley

On Thu, 5 Dec 2024 at 13:09, Andres Freund  wrote:
> On 2024-12-05 11:52:01 +1300, David Rowley wrote:
> > On Thu, 5 Dec 2024 at 03:51, Andres Freund  wrote:
> > > Possibly stupid idea: Could we instead store the attributes *before* the 
> > > main
> > > TupleDescData, with increasing "distance" for increased attnos?  That way 
> > > we
> > > wouldn't need to calculate any variable offsets. Of course the price 
> > > would be
> > > to have some slightly more complicated invocation of pfree(), but that's
> > > comparatively rare.
> >
> > Are you thinking this to make the address calculation cheaper? or so
> > that the hacky code that truncates the TupleDesc would work without
> > crashing still?
>
> Primarily to make the address calculation cheaper.

I mostly got that working, but there were quite several adjustments
needed to fix various things. For example, there are a few places that
are pfreeing a TupleDesc without going through FreeTupleDesc(); namely
index_truncate_tuple() and InsertOneTuple(). There was also a bit more
complexity around the TupleDescs stored in DSA memory as the base
address of those is the start of the FormData_pg_attribute array, so
some offsetting is needed to get the TupleDesc address.  None of those
changes are complex themselves, but it was quite painful to find all
those places and I started to worry that there might end up being a
bit more fallout from that method and I basically chickened out. I
agree the address calculation is cheaper, but accessing
FormData_pg_attribute isn't as performance-critical anymore as
CompactAttribute is doing the performance-critical work now.

I've now pushed the 0001 patch and the 0002 patch as one commit.  I'm
going to give the buildfarm a bit of time then push the commit to
remove pg_attribute.attcacheoff. I'm planning on letting that settle
overnight then if all is well push the attalignby changes.

David

Re: Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment

2024-12-20 Thread Matthias van de Meent

On Sat, 21 Dec 2024 at 01:05, Thomas Munro  wrote:
>
> On Sat, Dec 21, 2024 at 11:41 AM Matthias van de Meent
>  wrote:
> > The unlinking of forks in the FileTag infrastructure has been broken
> > since b0a55e43 in PG16,
> > while a segment number other than 0 has never
> > been unlinked (at least not since the introduction of the system with
> > 3eb77eba in PG12)
>
> Right, and that predates the FileTag refactoring, it's just that the
> earlier coding didn't explicitly mention the segment number so it
> looks slightly different.  In fact it was hard-coded to unlink
> relpathperm(entry->rnode, MAIN_FORKNUM) before that work, so both fork
> and segment number were always fixed, it's just that the FileTag
> mechanism was made slightly more general, really for the SYNC stuff,
> not so much for the UNLINK stuff which uses the same tags.

I see.

> > However, extensions may still make use of this and
> > incorrectly assume that only the requested file of the requested fork
> > 's segment will be unlinked, when it actually unlinks data from the
> > main fork.
>
> It seems unlikely to be useful for any purpose other than tombstones.
> And it seems like if someone is already using it, they would have been
> in touch to say that it doesn't work.  Or perhaps you tried to use it
> and noticed this flaw, or know of someone who would like to use it?
> Or more likely I guess you're working on smgr extension support.

I noticed it when I was browsing NBuffers-sized allocations, which got
me looking into the FileTag infrastructure, which got me trying to
figure out what FileTag.segno is used for that would require it to be
a uint64 in addition to the RelFileNode, which got me looking through
this code.

So, not exactly for SMGR extension support here, but my experience in
that did make it easier for me to figure out that the code doesn't
behave as I'd expected it to.

> > The attached fixes that for PG16+. PG13-15 will take a little bit more
> > effort due to code changes in PG16; though it'd probably still require
> > a relatively minor change.
>
> The patch does not seem unreasonable and I'd like to help tidy this
> up, but ... hmm, could we also consider going the other way?
> register_unlink_segment(), mdunlinkfiletag() and the macro that
> populates md.c's FileTag are internal to md.c, and we don't expect
> external code to be pushing md.c SYNC_UNLINK_REQUEST requests into the
> request queue (who would do that and what could the motivation
> possibly be?)  Doesn't feel like a supported usage to me...  So my
> question is: what bad thing would happen if we just renamed
> register_unlink_segment() to register_unlink_tombstone() without
> fork/seg arguments, to make it clear that it's not really a general
> purpose unreliable segment unlink mechanism that we want anyone to
> build more stuff on top of?

I just noticed I misinterpreted the conditions in mdunlinkfork, so
that I thought it allowed a user to pass their own forknum into
register_unlink_segment (as that is called with the user-provided
forknum). Instead, that branch is only taken when forknum ==
MAIN_FORKNUM, so I think you might be right that going in the other
direction is more desirable.

In that case, something along the lines of the attached would then be
better - it removes the fork and segno from register_unlink_segment's
arguments (renamed to register_unlink_tombstone), and Asserts() that
mdunlinkfiletag only receives a FileTag that contains expected values.

Kind regards,

Matthias van de Meent.

v2-0001-MD-smgr-Clarify-FileTag-based-unlinking.patch
Description: Binary data

Re: a litter question about mdunlinkfiletag function

2024-12-20 Thread Matthias van de Meent

On Tue, 15 Oct 2024 at 04:50, px shi  wrote:
>>
>> You will find other places where relpathperm() is called without having
>> a FileTag structure available, e.g. ReorderBufferProcessTXN().
>
>
> I apologize for the confusion. What I meant to say is that in the 
> mdunlinkfiletag() function, the forknum is currently hardcoded as 
> MAIN_FORKNUM when calling relpathperm(). While mdunlinkfiletag currently only 
> handles MAIN_FORKNUM, wouldn’t it be more flexible to retrieve the forknum 
> from the ftag structure instead?

I just noticed this mail thread as I was searching the archives for
other mentions of `mdunlinkfiletag` when doing some more digging on
uses of that function, to back my own bug report of what looks like
the same issue. See [0].

As was explained to me by Thomas, the reason why MAIN_FORKNUM is
hardcoded here (and why ftag.segno is also ignored) is that this code
is only ever reached for FileTag values with forknum=MAIN_FORKNUM (and
segno is also always 0) with the code in Postgres' repository. The
patch proposed in [0] is supposed to make that more clear to
developers.

I suspect the code will be further updated to include the correct fork
number and segment number when there is a need to unlink
non-MAIN_FORKNUM or non-segno=0 files in mdunlinkfiletag.

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/CAEze2WiWt%2B9%2BOnqW1g9rKz0gqxymmt%3Doe6pKAEDrutdfpDMpTw%40mail.gmail.com

Re: Possible integer overflow in bringetbitmap()

2024-12-20 Thread Michael Paquier

On Tue, Dec 10, 2024 at 12:33:08PM +0900, Michael Paquier wrote:
> Sure, you could do (a) and (b) together.  It also seems to me that it
> is just simpler to make totalpages a int64 to map automatically with
> the result expected by the caller of bringetbitmap(), and we know that
> based on MaxBlockNumber we'll never run out of bits.

That should be simple enough.  Are you planning to send a proposal of
patch?
--
Michael


signature.asc
Description: PGP signature

Re: Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment

2024-12-20 Thread Thomas Munro

On Sat, Dec 21, 2024 at 11:41 AM Matthias van de Meent
 wrote:
> The unlinking of forks in the FileTag infrastructure has been broken
> since b0a55e43 in PG16,

Well spotted.

-p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+p = _mdfd_segpath_rflb(rlfb, ftag->forknum, ftag->segno);

As you say, a harmless thinko as far as core is concerned, as we only
ever use it for the "tombstone" file preventing relfilenode recycling.
The tombstone is the bare relfilenode (main segment 0), since every
other segment is unlinked on commit.

> while a segment number other than 0 has never
> been unlinked (at least not since the introduction of the system with
> 3eb77eba in PG12)

Right, and that predates the FileTag refactoring, it's just that the
earlier coding didn't explicitly mention the segment number so it
looks slightly different.  In fact it was hard-coded to unlink
relpathperm(entry->rnode, MAIN_FORKNUM) before that work, so both fork
and segment number were always fixed, it's just that the FileTag
mechanism was made slightly more general, really for the SYNC stuff,
not so much for the UNLINK stuff which uses the same tags.

> However, extensions may still make use of this and
> incorrectly assume that only the requested file of the requested fork
> 's segment will be unlinked, when it actually unlinks data from the
> main fork.

It seems unlikely to be useful for any purpose other than tombstones.
And it seems like if someone is already using it, they would have been
in touch to say that it doesn't work.  Or perhaps you tried to use it
and noticed this flaw, or know of someone who would like to use it?
Or more likely I guess you're working on smgr extension support.  It
is not a reliable mechanism (pull the power after the checkpoint
record is written and before it processes that list and you've leaked
a file), and it's dealing with an edge case we should close in a
better way, and then get rid of it.

> The attached fixes that for PG16+. PG13-15 will take a little bit more
> effort due to code changes in PG16; though it'd probably still require
> a relatively minor change.

The patch does not seem unreasonable and I'd like to help tidy this
up, but ... hmm, could we also consider going the other way?
register_unlink_segment(), mdunlinkfiletag() and the macro that
populates md.c's FileTag are internal to md.c, and we don't expect
external code to be pushing md.c SYNC_UNLINK_REQUEST requests into the
request queue (who would do that and what could the motivation
possibly be?)  Doesn't feel like a supported usage to me...  So my
question is: what bad thing would happen if we just renamed
register_unlink_segment() to register_unlink_tombstone() without
fork/seg arguments, to make it clear that it's not really a general
purpose unreliable segment unlink mechanism that we want anyone to
build more stuff on top of?

Add XMLNamespaces to XMLElement

2024-12-20 Thread Jim Jones

Hi,

I'd like to propose the implementation of the XMLNamespaces option for
XMLElement.

XMLNAMESPACES(nsuri AS nsprefix)
XMLNAMESPACES(DEFAULT default-nsuri)
XMLNAMESPACES(NO DEFAULT)

* nsprefix:  Namespace's prefix.
* nsuri: Namespace's URI.
* DEFAULT default-nsuri: Specifies the DEFAULT namespace to use within
the scope of a namespace declaration.
* NO DEFAULT:    Specifies that NO DEFAULT namespace is to be
used within the scope of a namespace declaration.

This basically works pretty much like XMLAttributes, but with a few more
restrictions (see SQL/XML:2023, 11.2 ):

* XML namespace declaration shall contain at most one DEFAULT namespace
declaration item.
* No namespace prefix shall be equivalent to xml or xmlns.
* No namespace URI shall be identical to http://www.w3.org/2000/xmlns/
or to http://www.w3.org/XML/1998/namespace.
* The value of a namespace URI contained in an regular namespace
declaration item (no DEFAULT) shall not be a zero-length string.

Examples:

SELECT xmlelement(NAME "foo", xmlnamespaces('http://x.y' AS bar));
  xmlelement   
---
 http://x.y"/>

SELECT xmlelement(NAME "foo", xmlnamespaces(DEFAULT 'http://x.y'));
    xmlelement     
---
 http://x.y"/>

SELECT xmlelement(NAME "foo", xmlnamespaces(NO DEFAULT));
   xmlelement    
-
 

In transformXmlExpr() it seemed convenient to use the same parameters to
store the prefixes and URIs as in XMLAttributes (arg_names and
named_args), but I am still not so sure it is the right approach. Is
there perhaps a better way?

Any thoughts? Feedback welcome!

Best, Jim
From 8dee3772be0d89b3d49eff17344ff53440e5f566 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Fri, 20 Dec 2024 14:50:51 +0100
Subject: [PATCH v1] Add XMLNamespaces option to XMLElement

This patch adds the scoped option XMLNamespaces to XMLElement,
as described in ISO/IEC 9075-14:2023, 11.2 XML lexically scoped
options:

xmlnamespaces(uri AS prefix, ...)
xmlnamespaces(DEFAULT uri, ...)
xmlnamespaces(NO DEFAULT, ...)

* prefix: Namespace's prefix.
* uri:Namespace's URI.
* DEFAULT prefix: Specifies the DEFAULT namespace to use within
the scope of a namespace declaration.
* NO DEFAULT: Specifies that NO DEFAULT namespace is to be
used within the scope of a namespace declaration.

== Examples ==

SELECT xmlelement(NAME "foo", xmlnamespaces('http:/x.y' AS bar));
  xmlelement
--
 

SELECT xmlelement(NAME "foo", xmlnamespaces(DEFAULT 'http:/x.y'));
xmlelement
--
 

 SELECT xmlelement(NAME "foo", xmlnamespaces(NO DEFAULT));
   xmlelement
-
 

Tests and documentation were updated accordingly.
---
 doc/src/sgml/func.sgml  |  57 +++-
 src/backend/parser/gram.y   |  73 +++---
 src/backend/parser/parse_expr.c |  73 ++
 src/backend/utils/adt/xml.c |  32 -
 src/include/nodes/primnodes.h   |   4 +-
 src/include/utils/xml.h |   6 +
 src/test/regress/expected/xml.out   | 205 
 src/test/regress/expected/xml_1.out | 151 
 src/test/regress/expected/xml_2.out | 205 
 src/test/regress/sql/xml.sql| 100 ++
 10 files changed, 884 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 47370e581a..b33663bc09 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -14482,7 +14482,10 @@ SELECT xmlconcat('', '
 
 
-xmlelement ( NAME name , XMLATTRIBUTES ( attvalue  AS attname  , ... )  , content , ... ) xml
+xmlelement ( NAME name
+, XMLATTRIBUTES ( attvalue  AS attname  , ... ) 
+, XMLNAMESPACES ( {regular-nsuri AS nsprefix | DEFAULT default-nsuri | NO DEFAULT} , ... ) 
+, content , ... ) xml
 
 
 
@@ -14495,7 +14498,39 @@ SELECT xmlconcat('', 'PostgreSQL data type.  The
  argument(s) within XMLATTRIBUTES generate attributes
  of the XML element; the content value(s) are
- concatenated to form its content.
+ concatenated to form its content. The arguments within XMLNAMESPACES
+ constuct namespace declarations from values provided in nsuri
+ and nsprefix, which correspond to the URI of a namespace and
+ its prefix, respectively. The option DEFAULT can be used to set the
+ default namespace declaration (without a prefix) to the URI provided in default-nsuri.
+ The option NO DEFAULT states that a namespace scope has no default namespace. A valid
+ XMLNAMESPACES item must fulfill the following conditions:
+
+
+
+ 
+  Only a single DEFAULT declaration item within the same scope.
+ 
+
+
+ 
+  No two nsuri can be equal within the same scope.
+ 
+
+
+ 
+  No nsprefix can be equal to xml or xmlns,
+  and no nsuri can be equal to http://www.w3.org/2000/xmlns/
+

Re: Statistics Import and Export

2024-12-20 Thread Jeff Davis

On Thu, 2024-12-19 at 21:23 -0800, Jeff Davis wrote:
> > 0001-0005 - changes to pg_dump/pg_upgrade
> 
> Attached is a version 36j...

The testing can use some work here. I noticed that if I take out the
stats entirely, the tests still pass, because pg_upgrade still gets the
same before/after result.

Also, we need some testing of the output and ordering of pg_dump.
Granted, in most cases problems would result in errors during the
reload. But we have those tests for other kinds of objects, so we
should have the tests for stats, too.

I like the description "STATISTICS DATA" because it differentiates from
the extended stats definitions. It might be worth differentiating
between "RELATION STATISTICS DATA" and "ATTRIBUTE STATISTICS DATA" but
I'm not sure if there's value in that.

But how did you determine what to use for the .tag and prefix? In the
output, it uses the form:

  Name: STATISTICS DATA ; Type: STATISTICS DATA; ...

Should that be:

  Name: ; Type: STATISTICS DATA; ...

Or:

  Data for Name: ...; Name: ...; Type: STATISTICS DATA; ...

Or:

  Statistics for Name: ...; Name: ...; Type: STATISTICS DATA; ...

Regards,
Jeff Davis

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Tom Lane

Trey Boudreau  writes:
> so I'd like to propose a 'LISTEN *' equivalent to 'UNLISTEN *'.

Seems reasonable in the abstract, and given the UNLISTEN * precedent
it's hard to quibble with that syntax choice.  I think what actually
needs discussing are the semantics, specifically how this'd interact
with other LISTEN/UNLISTEN actions.  Explain what you think should
be the behavior after:

LISTEN foo;
LISTEN *;
UNLISTEN *;
-- are we still listening on foo?

LISTEN *;
LISTEN foo;
UNLISTEN *;
-- how about now?

LISTEN *;
UNLISTEN foo;
-- how about now?

LISTEN *;
LISTEN foo;
UNLISTEN foo;
-- does that make a difference?

I don't have any strong preferences about this, but we ought to
have a clear idea of the behavior we want before we start coding.

regards, tom lane

Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-12-20 Thread Daniel Gustafsson

> On 20 Dec 2024, at 02:00, Jacob Champion  
> wrote:

Thanks for the new version, I was doing a v39 review but I'll roll that over
into a v40 review now.

As I was reading I was trying to identify parts can be broken out and committed
ahead of time.  This not only to trim down size, but mostly to shape the final
commit into a coherent single commit that brings a single functionality
utilizing existing APIs.  Basically I think we should keep generic
functionality out of the final commit and keep that focused on OAuth and the
required APIs and infra.

The async auth support seemed like a candidate to go in before the rest.  While
there won't be any consumers of it, it's also not limited to OAuth.  What do
you think about slicing that off and get in ahead of time?  I took a small stab
at separating out the generic bits (it includes the PG_MAX_AUTH_TOKEN_LENGTH
move as well which is unrelated, but could also be committed ahead of time)
along with some small tweaks on it.

--
Daniel Gustafsson

diff --git a/src/include/libpq/auth.h b/src/include/libpq/auth.h
index 227b41daf6..01bc3c7250 100644
--- a/src/include/libpq/auth.h
+++ b/src/include/libpq/auth.h
@@ -16,6 +16,22 @@
 
 #include "libpq/libpq-be.h"
 
+/*
+ * Maximum accepted size of GSS and SSPI authentication tokens.
+ * We also use this as a limit on ordinary password packet lengths.
+ *
+ * Kerberos tickets are usually quite small, but the TGTs issued by Windows
+ * domain controllers include an authorization field known as the Privilege
+ * Attribute Certificate (PAC), which contains the user's Windows permissions
+ * (group memberships etc.). The PAC is copied into all tickets obtained on
+ * the basis of this TGT (even those issued by Unix realms which the Windows
+ * realm trusts), and can be several kB in size. The maximum token size
+ * accepted by Windows systems is determined by the MaxAuthToken Windows
+ * registry setting. Microsoft recommends that it is not set higher than
+ * 65535 bytes, so that seems like a reasonable limit for us as well.
+ */
+#define PG_MAX_AUTH_TOKEN_LENGTH   65535
+
 extern PGDLLIMPORT char *pg_krb_server_keyfile;
 extern PGDLLIMPORT bool pg_krb_caseins_users;
 extern PGDLLIMPORT bool pg_gss_accept_delegation;
diff --git a/src/interfaces/libpq/fe-auth-sasl.h 
b/src/interfaces/libpq/fe-auth-sasl.h
index 258bfd0564..b47011d077 100644
--- a/src/interfaces/libpq/fe-auth-sasl.h
+++ b/src/interfaces/libpq/fe-auth-sasl.h
@@ -30,6 +30,7 @@ typedef enum
SASL_COMPLETE = 0,
SASL_FAILED,
SASL_CONTINUE,
+   SASL_ASYNC,
 } SASLStatus;
 
 /*
@@ -77,6 +78,8 @@ typedef struct pg_fe_sasl_mech
 *
 *  state: The opaque mechanism state returned by init()
 *
+*  final: true if the server has sent a final exchange outcome
+*
 *  input: The challenge data sent by the server, or NULL when
 * generating a client-first initial response 
(that is, when
 * the server expects the client to send a 
message to start
@@ -101,12 +104,17 @@ typedef struct pg_fe_sasl_mech
 *
 *  SASL_CONTINUE:  The output buffer is filled with a client 
response.
 *  Additional server challenge is 
expected
+*  SASL_ASYNC: Some asynchronous processing external 
to the
+*  connection needs to be done 
before a response can be
+*  generated. The mechanism is 
responsible for setting up
+*  conn->async_auth appropriately 
before returning.
 *  SASL_COMPLETE:  The SASL exchange has completed successfully.
 *  SASL_FAILED:The exchange has failed and the connection 
should be
 *  dropped.
 *
 */
-   SASLStatus  (*exchange) (void *state, char *input, int inputlen,
+   SASLStatus  (*exchange) (void *state, bool final,
+char *input, int 
inputlen,
 char **output, int 
*outputlen);
 
/*
diff --git a/src/interfaces/libpq/fe-auth-scram.c 
b/src/interfaces/libpq/fe-auth-scram.c
index 0bb820e0d9..da168eb2f5 100644
--- a/src/interfaces/libpq/fe-auth-scram.c
+++ b/src/interfaces/libpq/fe-auth-scram.c
@@ -24,7 +24,8 @@
 /* The exported SCRAM callback mechanism. */
 static void *scram_init(PGconn *conn, const char *password,
const char *sasl_mechanism);
-static SASLStatus scram_exchange(void *opaq, char *input, int inputlen,
+static SASLStatus scram_exchange(void *opaq, bool final,
+char *input, 
int inputlen,

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread David G. Johnston

On Friday, December 20, 2024, Tom Lane  wrote:

> Trey Boudreau  writes:
>
> * "Listen to all but X" seems like a reasonable desire.
>

This I concur with, and would add: let me name my channels
accounting.payables, accounting.receivables, sales.leads; and let me listen
or ignore all accounting/sales channel names.

But staying within the existing “deny default, permissive grants only”
design to meet this specific goal seems like a reasonable incremental step
to accept.  Let others wanting to work on a more expansive capability
change brings those patches forth.

As for exposing this to the user, this allow-all “channel” would be
presented as any other normal channel.  The reader would need to know about
the special meaning of whatever label we end up using. IOW, the wildcard is
the label and no attempt to tie real in-use channel names to it should or
even could be attempted.

David J.

Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment

2024-12-20 Thread Matthias van de Meent

Hi,

I noticed that the MD smgr internals misbehave when unlink requests
for specific forks or specific segments are sent through SyncOps, as
it currently always unlinks segment 0 of the main fork, even if only a
different fork and/or segment was requested.

While probably not extremely critical, it seems bad to not unlink the
right segment while in recovery, so here's a patch that unlinks the
exact requested files.

The unlinking of forks in the FileTag infrastructure has been broken
since b0a55e43 in PG16, while a segment number other than 0 has never
been unlinked (at least not since the introduction of the system with
3eb77eba in PG12). However, extensions may still make use of this and
incorrectly assume that only the requested file of the requested fork
's segment will be unlinked, when it actually unlinks data from the
main fork.

The attached fixes that for PG16+. PG13-15 will take a little bit more
effort due to code changes in PG16; though it'd probably still require
a relatively minor change.

Kind regards,

Matthias van de Meent.
Neon (https://neon.tech)


v1-0001-MD-smgr-Unlink-the-requested-file-segment-not-mai.patch
Description: Binary data

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Tom Lane

"David G. Johnston"  writes:
> On Friday, December 20, 2024, Tom Lane  wrote:
>> * "Listen to all but X" seems like a reasonable desire.

> This I concur with, and would add: let me name my channels
> accounting.payables, accounting.receivables, sales.leads; and let me listen
> or ignore all accounting/sales channel names.

Hmm.  That reminds me that there was recently a proposal to allow
LISTEN/UNLISTEN with pattern arguments.  (It wasn't anything you'd
expect like regex patterns or LIKE patterns, but some off-the-wall
syntax, which I doubt we'd accept in that form.  But clearly there's
some desire for that out there.)

While I don't say we need to implement that as part of this,
it'd be a good idea to anticipate that that will happen.  And
that kind of blows a hole in my idea, because mine was predicated on
the assumption that you could unambiguously match UNLISTENs against
LISTENs.  A patterned UNLISTEN might revoke a superset or subset
of previous LISTENs, and I'm not sure you could readily tell which.

I think we can still hold to the idea that LISTEN * or UNLISTEN *
cancels all previous requests, but it's feeling like we might
have to accumulate subsequent requests without trying to make
contradictory ones cancel out.  Is it okay if the behavior is
explicitly dependent on the order of those requests, more or
less "last match wins"?  If not, how do we avoid that?

> As for exposing this to the user, this allow-all “channel” would be
> presented as any other normal channel.  The reader would need to know about
> the special meaning of whatever label we end up using. IOW, the wildcard is
> the label and no attempt to tie real in-use channel names to it should or
> even could be attempted.

Don't think that quite flies.  We might have to regurgitate the
state explicitly:

LISTEN *
UNLISTEN foo.*
LISTEN foo.bar.*

showing that we're listening to channels foo.bar.*, but not other
channels beginning "foo", and also to all channels not beginning
"foo".

regards, tom lane

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Daniel Gustafsson

> On 20 Dec 2024, at 23:07, Tom Lane  wrote:

> ..it makes "LISTEN *" act the same as though you had somehow explicitly listed
> every possible channel.

When thinking about it while reading this thread, this is what I came up with
as well.  Since the current workings of LISTEN is so well established I can't
see how we could make this anything but a natural extension of the current.

--
Daniel Gustafsson

Re: pure parsers and reentrant scanners

2024-12-20 Thread Peter Eisentraut


On 20.12.24 02:07, Tom Lane wrote:

I noticed that lapwing is bleating about

ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -g -O2 -fPIC -fvisibility=hidden -I. -I. 
-I../../src/include  -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE 
-I/usr/include/libxml2  -I/usr/include/et  -c -o cubescan.o cubescan.c
cubescan.c:1689:5: warning: no previous prototype for 'cube_yyget_column' 
[-Wmissing-prototypes]
cubescan.c:1765:6: warning: no previous prototype for 'cube_yyset_column' 
[-Wmissing-prototypes]

and likewise in segscan.c.  lapwing is using flex 2.5.35, so probably
this is the same bug worked around in parser/scan.l:


Ok, we can fix that, but maybe this is also a good moment to think about 
whether that is useful.  I could not reproduce the issue with flex 
2.5.39.  I could find no download of flex 2.5.35.  The github site only 
offers back to 2.5.39, the sourceforce site back to 2.5.36.  lapwing 
says it's Debian 7.0, which went out of support in 2016 and out of 
super-duper-extended support in 2020.  It also doesn't have a supported 
OpenSSL version anymore, and IIRC, it has a weird old compiler that 
occasionally gives bogus warnings.  I think it's time to stop supporting 
this.

Re: downgrade some aclchk.c errors to internal

2024-12-20 Thread Peter Eisentraut


On 20.12.24 12:47, Peter Eisentraut wrote:
In aclchk.c, there are a few error messages that use ereport() but it 
seems like they should be internal error messages.  Moreover, they are 
using get_object_class_descr(), which is only meant for internal errors. 
  (For example, it does not have translation support.)  I dug through 
this and it seems like these errors are indeed not or no longer user- 
facing, so we can downgrade them to internal.  See commit messages in 
the attached patches for further explanations.


There was a mistake in the second patch, I had missed some other callers 
that I have fixed up here.  Amended patch set attached.
From b9a2945b367bc32f1a1dda421a72e176fecacdda Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 20 Dec 2024 10:10:19 +0100
Subject: [PATCH v2 1/2] Downgrade errors in object_ownercheck() to internal

The "does not exist" errors in object_ownership() were written as
ereport(), suggesting that they are user-facing.  But no code path
except one can reach this function without first checking that the
object exists.  If this were actually a user-facing error message,
then there would be some problems: get_object_class_descr() is meant
to be for internal errors only and does not support translation.

The one case that can reach this without first checking the object
existence is from be_lo_unlink().  (This makes some sense since large
objects are referred to by their OID directly.)  In this one case, we
can add a line of code to check the object existence explicitly,
consistent with other LO code.

For the rest, downgrade the error messages to elog()s.  The new
message wordings are the same as in DropObjectById().
---
 src/backend/catalog/aclchk.c   | 10 --
 src/backend/libpq/be-fsstubs.c |  5 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index aaf96a965e4..840122dca44 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -4082,9 +4082,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid)
 
tuple = SearchSysCache1(cacheid, ObjectIdGetDatum(objectid));
if (!HeapTupleIsValid(tuple))
-   ereport(ERROR,
-   (errcode(ERRCODE_UNDEFINED_OBJECT),
-errmsg("%s with OID %u does not 
exist", get_object_class_descr(classid), objectid)));
+   elog(ERROR, "cache lookup failed for %s %u",
+get_object_class_descr(classid), objectid);
 
ownerId = DatumGetObjectId(SysCacheGetAttrNotNull(cacheid,

  tuple,
@@ -4113,9 +4112,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid)
 
tuple = systable_getnext(scan);
if (!HeapTupleIsValid(tuple))
-   ereport(ERROR,
-   (errcode(ERRCODE_UNDEFINED_OBJECT),
-errmsg("%s with OID %u does not 
exist", get_object_class_descr(classid), objectid)));
+   elog(ERROR, "could not find tuple for %s %u",
+get_object_class_descr(classid), objectid);
 
ownerId = DatumGetObjectId(heap_getattr(tuple,

get_object_attnum_owner(classid),
diff --git a/src/backend/libpq/be-fsstubs.c b/src/backend/libpq/be-fsstubs.c
index 27d317dfdc0..66bec869cbc 100644
--- a/src/backend/libpq/be-fsstubs.c
+++ b/src/backend/libpq/be-fsstubs.c
@@ -317,6 +317,11 @@ be_lo_unlink(PG_FUNCTION_ARGS)
 
PreventCommandIfReadOnly("lo_unlink()");
 
+   if (!LargeObjectExists(lobjId))
+   ereport(ERROR,
+   (errcode(ERRCODE_UNDEFINED_OBJECT),
+errmsg("large object %u does not exist", 
lobjId)));
+
/*
 * Must be owner of the large object.  It would be cleaner to check this
 * in inv_drop(), but we want to throw the error before not after 
closing

base-commit: 8ac0021b6f10928a46b7f3d1b25bc21c0ac7f8c5
-- 
2.47.1

From b68de2c26fac1d11c52dcfa391992454b65d7f52 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 20 Dec 2024 14:45:28 +0100
Subject: [PATCH v2 2/2] Downgrade error in object_aclmask_ext() to internal

The "does not exist" error in object_aclmask_ext() was written as
ereport(), suggesting that it is user-facing.  This is problematic:
get_object_class_descr() is meant to be for internal errors only and
does not support translation.

For the has_xxx_privilege functions, the error has not been
user-facing since commit 403ac226ddd.  The remaining users are
pg_database_size() and pg_tablespace_size().  The call stack here is
pretty deep and this dependency

Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-12-20 Thread Melanie Plageman

On Thu, Dec 19, 2024 at 9:36 PM Richard Guo  wrote:
>
> On Fri, Dec 20, 2024 at 7:50 AM Melanie Plageman
>  wrote:
> Looks correct to me.
>
> BTW, I kind of doubt that the overflow risk fixed in 28328ec87 is a
> real issue in real-world scenarios.  Is it realistically possible to
> have more than INT_MAX tuples fit on one heap page?
>
> If it is, then the statement below could also be prone to overflow.
>
> uint32  mid = (start + end) / 2;
>
> Maybe you want to change it to:
>
> uint32  mid = start + (end - start) / 2;

I've done this and pushed.

Thanks to you and Ranier for your help in identifying and fixing this!

- Melanie

Additional comments around need_escapes in pg_parse_json()

2024-12-20 Thread Corey Huinker

I recently had occasion to use the pg_parse_json() function for creating
input functions for pg_ndistinct and pg_dependencies.

While it is obvious now that I should have been parsing with
need_escapes=true, it wasn't obvious from the outset, and so I'm proposing
adding a few comments to help the next person come to that conclusion
sooner than I did.
From 84a8990622a4b83729600d9db43e8678502548f5 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Fri, 20 Dec 2024 05:08:50 -0500
Subject: [PATCH v1] Explain impact of need_escapes in JSON parsing.

Adding some additional comments to help the reader discover the
purpose and effect of the need_escapes parameter.

Nearly every new use of pg_parse_json and pg_parse_json_incremental will
need need_escapes=true.
---
 src/include/common/jsonapi.h | 11 +++
 src/common/jsonapi.c |  7 ++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/include/common/jsonapi.h b/src/include/common/jsonapi.h
index 167615a557..fa19005822 100644
--- a/src/include/common/jsonapi.h
+++ b/src/include/common/jsonapi.h
@@ -118,6 +118,11 @@ typedef struct JsonLexContext
 	struct jsonapi_StrValType *errormsg;
 } JsonLexContext;
 
+/*
+ * Function types for custom json parsing actions.
+ *
+ * fname and token will always be NULL if the context has need_escapes=false.
+ */
 typedef JsonParseErrorType (*json_struct_action) (void *state);
 typedef JsonParseErrorType (*json_ofield_action) (void *state, char *fname, bool isnull);
 typedef JsonParseErrorType (*json_aelem_action) (void *state, bool isnull);
@@ -197,6 +202,12 @@ extern JsonParseErrorType json_count_array_elements(JsonLexContext *lex,
  * struct is allocated.
  *
  * If need_escapes is true, ->strval stores the unescaped lexemes.
+ *
+ * Setting need_escapes to true is necessary if the operation needs
+ * to reference field names or scalar values. This is true of any
+ * operation beyond purely checking the json-validaity of the source
+ * document.
+ *
  * Unescaping is expensive, so only request it when necessary.
  *
  * If need_escapes is true or lex was given as NULL, then the caller is
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 0e2a82ad7a..307182e478 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -1297,7 +1297,10 @@ parse_scalar(JsonLexContext *lex, const JsonSemAction *sem)
 		return result;
 	}
 
-	/* invoke the callback, which may take ownership of val */
+	/*
+	 * invoke the callback, which may take ownership of val.
+	 * val always NULL when need_escapes=false
+	 */
 	result = (*sfunc) (sem->semstate, val, tok);
 
 	if (lex->flags & JSONLEX_CTX_OWNS_TOKENS)
@@ -1349,6 +1352,7 @@ parse_object_field(JsonLexContext *lex, const JsonSemAction *sem)
 
 	if (ostart != NULL)
 	{
+		/* fname always NULL when need_escapes=false */
 		result = (*ostart) (sem->semstate, fname, isnull);
 		if (result != JSON_SUCCESS)
 			goto ofield_cleanup;
@@ -1370,6 +1374,7 @@ parse_object_field(JsonLexContext *lex, const JsonSemAction *sem)
 
 	if (oend != NULL)
 	{
+		/* fname always NULL when need_escapes=false */
 		result = (*oend) (sem->semstate, fname, isnull);
 		if (result != JSON_SUCCESS)
 			goto ofield_cleanup;
-- 
2.47.1

Some ExecSeqScan optimizations

2024-12-20 Thread Amit Langote

Hi,

I’ve been looking into possible optimizations for ExecSeqScan() and
chatting with colleagues about cases where it shows up prominently in
analytics-style query plans.

For example, in queries like SELECT agg(somecol) FROM big_table WHERE
, ExecScan() often dominates the profile. Digging into it,
I found two potential sources of overhead:

1. Run-time checks for PlanState.qual and PlanState.ps_ProjInfo
nullness: these checks are done repeatedly, which seems unnecessary if
we know the values at plan init time.

2. Overhead from ExecScanFetch() when EvalPlanQual() isn’t relevant:
Andres pointed out that ExecScanFetch() introduces unnecessary
overhead even in the common case where EvalPlanQual() isn’t
applicable.

To address (1), I tried assigning specialized functions to
PlanState.ExecProcNode in ExecInitSeqScan() based on whether qual or
projInfo are NULL. Inspired by David Rowley’s suggestion to look at
ExecHashJoinImpl(), I wrote variants like ExecSeqScanNoQual() (for
qual == NULL) and ExecSeqScanNoProj() (for projInfo == NULL). These
call a local version of ExecScan() that lives in nodeSeqScan.c, marked
always-inline. This local copy takes qual and projInfo as arguments,
letting compilers inline and optimize unnecessary branches away.

For (2), the local ExecScan() copy avoids the generic ExecScanFetch()
logic, simplifying things further when EvalPlanQual() doesn’t apply.
That has the additional benefit of allowing SeqNext() to be called
directly instead of via an indirect function pointer. This reduces the
overhead of indirect calls and enables better compiler optimizations
like inlining.

Junwang Zhao helped with creating a benchmark to test the patch, the
results of which can be accessed in the spreadsheet at [1].  The
results show that the patch makes the latency of queries of shape
`SELECT agg(somecol or *) FROM big_table WHERE ` generally
faster with up to 5% improvement in some cases.

Would love to hear thoughts.

--
Thanks, Amit Langote

[1] 
https://docs.google.com/spreadsheets/d/1AsJOUgIfSsYIJUJwbXk4aO9FVOFOrBCvrfmdQYkHIw4/edit?usp=sharing


v1-0001-Introduce-optimized-ExecSeqScan-variants-for-tail.patch
Description: Binary data

downgrade some aclchk.c errors to internal

2024-12-20 Thread Peter Eisentraut

In aclchk.c, there are a few error messages that use ereport() but it 
seems like they should be internal error messages.  Moreover, they are 
using get_object_class_descr(), which is only meant for internal errors. 
 (For example, it does not have translation support.)  I dug through 
this and it seems like these errors are indeed not or no longer 
user-facing, so we can downgrade them to internal.  See commit messages 
in the attached patches for further explanations.From b9a2945b367bc32f1a1dda421a72e176fecacdda Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 20 Dec 2024 10:10:19 +0100
Subject: [PATCH 1/2] Downgrade errors in object_ownercheck() to internal

The "does not exist" errors in object_ownership() were written as
ereport(), suggesting that they are user-facing.  But no code path
except one can reach this function without first checking that the
object exists.  If this were actually a user-facing error message,
then there would be some problems: get_object_class_descr() is meant
to be for internal errors only and does not support translation.

The one case that can reach this without first checking the object
existence is from be_lo_unlink().  (This makes some sense since large
objects are referred to by their OID directly.)  In this one case, we
can add a line of code to check the object existence explicitly,
consistent with other LO code.

For the rest, downgrade the error messages to elog()s.  The new
message wordings are the same as in DropObjectById().
---
 src/backend/catalog/aclchk.c   | 10 --
 src/backend/libpq/be-fsstubs.c |  5 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index aaf96a965e4..840122dca44 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -4082,9 +4082,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid)

tuple = SearchSysCache1(cacheid, ObjectIdGetDatum(objectid));
if (!HeapTupleIsValid(tuple))
-   ereport(ERROR,
-   (errcode(ERRCODE_UNDEFINED_OBJECT),
-errmsg("%s with OID %u does not 
exist", get_object_class_descr(classid), objectid)));
+   elog(ERROR, "cache lookup failed for %s %u",
+get_object_class_descr(classid), objectid);

ownerId = DatumGetObjectId(SysCacheGetAttrNotNull(cacheid,

  tuple,
@@ -4113,9 +4112,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid)

tuple = systable_getnext(scan);
if (!HeapTupleIsValid(tuple))
-   ereport(ERROR,
-   (errcode(ERRCODE_UNDEFINED_OBJECT),
-errmsg("%s with OID %u does not 
exist", get_object_class_descr(classid), objectid)));
+   elog(ERROR, "could not find tuple for %s %u",
+get_object_class_descr(classid), objectid);

ownerId = DatumGetObjectId(heap_getattr(tuple,

get_object_attnum_owner(classid),
diff --git a/src/backend/libpq/be-fsstubs.c b/src/backend/libpq/be-fsstubs.c
index 27d317dfdc0..66bec869cbc 100644
--- a/src/backend/libpq/be-fsstubs.c
+++ b/src/backend/libpq/be-fsstubs.c
@@ -317,6 +317,11 @@ be_lo_unlink(PG_FUNCTION_ARGS)

PreventCommandIfReadOnly("lo_unlink()");

+   if (!LargeObjectExists(lobjId))
+   ereport(ERROR,
+   (errcode(ERRCODE_UNDEFINED_OBJECT),
+errmsg("large object %u does not exist", 
lobjId)));
+
/*
 * Must be owner of the large object.  It would be cleaner to check this
 * in inv_drop(), but we want to throw the error before not after 
closing

base-commit: 8ac0021b6f10928a46b7f3d1b25bc21c0ac7f8c5
-- 
2.47.1

From 320a62cd669cc7a43cab950ed60f956984cd65b4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 20 Dec 2024 10:48:20 +0100
Subject: [PATCH 2/2] Downgrade error in object_aclmask_ext() to internal

The "does not exist" error in object_aclmask_ext() was written as
ereport(), suggesting that it is user-facing.  But this has not been
true since commit 403ac226ddd.  So we can downgrade this to a normal
"cache lookup failed" internal error.  If this were actually a
user-facing error message, then there would be some problems:
get_object_class_descr() is meant to be for internal errors only and
does not support translation.
---
 src/backend/catalog/aclchk.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 840122dca44..25b15027447 100644
---

Re: Fix crash when non-creator being an iteration on shared radix tree

2024-12-20 Thread John Naylor

On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada  wrote:
>
> On Wed, Dec 18, 2024 at 10:32 PM John Naylor  wrote:

> > 2. The iter_context is separate because the creator's new context
> > could be a bump context which doesn't support pfree. But above we
> > assume we can pfree in the caller's context. Also, IIUC we only
> > allocate small iter objects, and it'd be unusual to need more than one
> > at a time per backend, so it's a bit strange to have an entire context
> > for that. Since we use a standard pattern of "begin; while(iter);
> > end;", it seems unlikely that someone will cause a leak because of a
> > coding mistake in iteration.

v3-0001 allocates the iter data in the caller's context. It's a small
patch, but still a core behavior change so proposed for master-only. I
believe your v1 is still the least invasive fix for PG17.

> > If these tiny admin structs were always, not sometimes, in the callers
> > current context, I think it would be easier to reason about because
> > then the creator's passed context would be used only for local memory,
> > specifically only for leaves and the inner node child contexts.

0002 does this.

> Fair points. Given that we need only one iterator at a time per
> backend, it would be simpler if the caller passes the pointer to an
> iterator that is a stack variable to RT_BEGIN_ITEREATE(). For example,
> TidStoreBeginIterate() would be like:
>
> if (TidStoreIsShared(ts))
> shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
> else
>local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.shared);

Hard for me to tell if it'd be simpler.

> > 3. I was never a fan of trying to second-guess the creator's new
> > context and instead use slab for fixed-sized leaf allocations. If the
> > creator passes a bump context, we say "no, no, no, use slab -- it's
> > good for ya". Let's assume the caller knows what they're doing.
>
> That's a valid argument but how can a user use the slab context for
> leaf allocations?

It's trivial after 0001-02: 0003 removes makes one test use slab as
the passed context (only 32-bit systems would actually use it
currently).

Also, with a bit more work we could allow a NULL context for when the
caller has purposely arranged to use pointer-sized values. Did you see
any of Heikki's CSN patches? There is a radix tree used as a cache in
a context where the tree could be created and destroyed frequently.
Something about the memory blocks seems to have tickled some bad case
in the glibc allocator, and one less context might be good insurance:

https://www.postgresql.org/message-id/718d1788-b058-40e6-bc37-8f15612b5646%40iki.fi

--
John Naylor
Amazon Web Services
From cd5f4e63dd7a49a0884249e55bf2b78293faa4b9 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 20 Dec 2024 12:01:50 +0700
Subject: [PATCH v3 1/3] Use caller's current memory context for radix tree
 iteration

---
 src/include/lib/radixtree.h | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fee4..8fe2302652 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -719,7 +719,6 @@ struct RT_RADIX_TREE
 	/* leaf_context is used only for single-value leaves */
 	MemoryContextData *leaf_context;
 #endif
-	MemoryContextData *iter_context;
 };
 
 /*
@@ -1836,14 +1835,6 @@ RT_CREATE(MemoryContext ctx)
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
 	tree->context = ctx;
 
-	/*
-	 * Separate context for iteration in case the tree context doesn't support
-	 * pfree
-	 */
-	tree->iter_context = AllocSetContextCreate(ctx,
-			   RT_STR(RT_PREFIX) "_radix_tree iter context",
-			   ALLOCSET_SMALL_SIZES);
-
 #ifdef RT_SHMEM
 	tree->dsa = dsa;
 	dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
@@ -2075,7 +2066,8 @@ RT_FREE(RT_RADIX_TREE * tree)
 /* ITERATION */
 
 /*
- * Create and return the iterator for the given radix tree.
+ * Create and return the iterator for the given radix tree,
+ * in the caller's current memory context.
  *
  * Taking a lock in shared mode during the iteration is the caller's
  * responsibility.
@@ -2086,8 +2078,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
 	RT_ITER*iter;
 	RT_CHILD_PTR root;
 
-	iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
-			  sizeof(RT_ITER));
+	iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
 	iter->tree = tree;
 
 	Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
-- 
2.47.1

From 79d22b266f0df433dcb9147d0d64db53c681f6c7 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 20 Dec 2024 14:48:24 +0700
Subject: [PATCH v3 3/3] Always use the caller-provided context for radix tree
 leaves

Previously, RT_CREATE would create an additional slab context
if the values were fixed-length and larger than pointer size.
Commit X arranged so that the caller-provided context is only used for
leaves and nothing else, so if we ove

Re: Make tuple deformation faster

2024-12-20 Thread David Rowley

On Fri, 20 Dec 2024 at 23:04, David Rowley  wrote:
> I've now pushed the 0001 patch and the 0002 patch as one commit.  I'm
> going to give the buildfarm a bit of time then push the commit to
> remove pg_attribute.attcacheoff. I'm planning on letting that settle
> overnight then if all is well push the attalignby changes.

The attcacheoff removal is now pushed. I've attached the two remaining patches.

David


v8-0001-Optimize-alignment-calculations-in-tuple-form-def.patch
Description: Binary data


v8-0002-Speedup-tuple-deformation-with-additional-functio.patch
Description: Binary data

Re: pure parsers and reentrant scanners

2024-12-20 Thread Tom Lane

Peter Eisentraut  writes:
> On 20.12.24 02:07, Tom Lane wrote:
>> I noticed that lapwing is bleating about
>> cubescan.c:1689:5: warning: no previous prototype for 'cube_yyget_column' 
>> [-Wmissing-prototypes]
>> cubescan.c:1765:6: warning: no previous prototype for 'cube_yyset_column' 
>> [-Wmissing-prototypes]
>> and likewise in segscan.c.  lapwing is using flex 2.5.35, so probably
>> this is the same bug worked around in parser/scan.l:

> Ok, we can fix that, but maybe this is also a good moment to think about 
> whether that is useful.  I could not reproduce the issue with flex 
> 2.5.39.  I could find no download of flex 2.5.35.  The github site only 
> offers back to 2.5.39, the sourceforce site back to 2.5.36.  lapwing 
> says it's Debian 7.0, which went out of support in 2016 and out of 
> super-duper-extended support in 2020.  It also doesn't have a supported 
> OpenSSL version anymore, and IIRC, it has a weird old compiler that 
> occasionally gives bogus warnings.  I think it's time to stop supporting 
> this.

OK, that's fair.  I do see lapwing called out a lot in the commit log,
though it's not clear how much of that is about 32-bitness and how
much about old tools.  It's surely still valuable to have i386
machines in the buildfarm, but I agree that supporting unobtainable
tool versions is a bit much.  Could we get that animal updated to
some newer OS version?

Presumably, we should also rip out the existing yyget_column and
yyset_column kluges in

src/backend/parser/scan.l: extern int  core_yyget_column(yyscan_t 
yyscanner);
src/bin/psql/psqlscanslash.l: extern int   slash_yyget_column(yyscan_t 
yyscanner);
src/bin/pgbench/exprscan.l: extern int expr_yyget_column(yyscan_t 
yyscanner);
src/fe_utils/psqlscan.l: extern intpsql_yyget_column(yyscan_t 
yyscanner);

regards, tom lane

Re: pure parsers and reentrant scanners

2024-12-20 Thread Julien Rouhaud

On Fri, Dec 20, 2024 at 11:23 PM Tom Lane  wrote:
>
> Could we get that animal updated to
> some newer OS version?

There is already adder animal that is running debian sid on i386.  The
only remaining interest in lapwing is to have older versions of
everything, so if that's useless I can just trash that vm.

Re: pure parsers and reentrant scanners

2024-12-20 Thread Tom Lane

Julien Rouhaud  writes:
> On Fri, Dec 20, 2024 at 11:23 PM Tom Lane  wrote:
>> Could we get that animal updated to
>> some newer OS version?

> There is already adder animal that is running debian sid on i386.  The
> only remaining interest in lapwing is to have older versions of
> everything, so if that's useless I can just trash that vm.

Hmm, sid is the opposite extreme no?  Maybe switching lapwing to
whatever is currently the oldest supported Debian release would
be a good answer.

regards, tom lane

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

2024-12-20 Thread Jeff Davis

On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
> "sv-SE-x-icu") FROM generate_series(1, 100) i);
> 
> master:  ~540 ms
> Patched: ~460 ms
> glibc:   ~410 ms

It looks like you are opening and closing the UCaseMap object each
time. Why not save it in pg_locale_t? That should speed it up even more
and hopefully beat libc.

Also, to support older ICU versions consistently, we need to fix up the
locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
that logic?

Regards,
Jeff Davis

Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-20 Thread Masahiko Sawada

On Thu, Dec 19, 2024 at 6:31 PM Michael Paquier  wrote:
>
> On Thu, Dec 19, 2024 at 09:27:04AM -0800, Masahiko Sawada wrote:
> > On Wed, Dec 18, 2024 at 6:26 PM Michael Paquier  wrote:
> >> v2 does not have these weaknesses by design.
> >
> > I agree that v2 is better than v3 in terms of that.
>
> Okay.  In terms of the backbranches, would you prefer that I handle
> this patch myself as I have done the HEAD part?  This would need a
> second, closer, review but I could do that at the beginning of next
> week.

Thanks. Please proceed with this fix as you've already fixed the HEAD part.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Tom Lane

Trey Boudreau  writes:
> My first pass at the documentation looks like this:

> 
>  The special wildcard * cancels all listener
>  registrations for the current session and replaces them with a
>  virtual registration that matches all channels. Further
>  LISTEN and UNLISTEN   class="parameter">channel commands will
>  be ignored until the session sees the UNLISTEN *
>  command.
> 

Hmph.  After thinking about it a bit I have a different idea
(and I see David has yet a third one).  So maybe this is more
contentious than it seems.  But at any rate, I have two
fundamental thoughts:

* "Listen to all but X" seems like a reasonable desire.

* The existing implementation already has the principle that
you can't listen to a channel more than once; that is,
LISTEN foo;
LISTEN foo;  -- this is a no-op, not a duplicate subscription

Therefore I propose:

* "LISTEN *" wipes away all previous listen state, and
sets up a state where you're listening to all channels
(within your database).

* "UNLISTEN *" wipes away all previous listen state, and
sets up a state where you're listening to no channels
(which is the same as it does now).

* "LISTEN foo" adds "foo" to what you are listening to,
with no effect if you already were listening to foo
(whether it was a virtual or explicit listen).

* "UNLISTEN foo" removes "foo" from what you are listening to,
with no effect if you already weren't listening to foo.

This is just about the same as the current behavior, and it makes
"LISTEN *" act the same as though you had somehow explicitly listed
every possible channel.  Which I think is a lot cleaner than
conceptualizing it as an independent gating behavior, as well
as more useful because it'll permit "all but" behavior.

The implementation of this could be something like

struct {
boolall;/* true if listening to all */
List   *plus;   /* channels explicitly listened */
List   *minus;  /* channels explicitly unlistened */
} ListenChannels;

with the proviso that "plus" must be empty if "all" is true,
while "minus" must be empty if "all" is false.  The two lists
are always empty right after LISTEN * or UNLISTEN *, but could
be manipulated by subsequent channel-specific LISTEN/UNLISTEN.

(Since only one list would be in use at a time, you could
alternatively combine "plus" and "minus" into a single list
of exceptions to the all/none state.  I suspect that would
be confusingly error-prone to code; but perhaps it would turn
out elegantly.)

One other thing that needs to be thought about in any case
is what the pg_listening_channels() function ought to return
in these newly-possible states.

regards, tom lane

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Trey Boudreau



> On Dec 20, 2024, at 2:58 PM, Tom Lane  wrote:
> Seems reasonable in the abstract, and given the UNLISTEN * precedent
> it's hard to quibble with that syntax choice.  I think what actually
> needs discussing are the semantics, specifically how this'd interact
> with other LISTEN/UNLISTEN actions.

My first pass at the documentation looks like this:


 The special wildcard * cancels all listener
 registrations for the current session and replaces them with a
 virtual registration that matches all channels. Further
 LISTEN and UNLISTEN channel commands will
 be ignored until the session sees the UNLISTEN *
 command.


> Explain what you think should
> be the behavior after:
> 
> LISTEN foo;
> LISTEN *;
> UNLISTEN *;
> -- are we still listening on foo?
> 
No, as the ‘LISTEN *’ wipes existing registrations.

> LISTEN *;
> LISTEN foo;
> UNLISTEN *;
> -- how about now?
Not listening on ‘foo’ or anything else.

> LISTEN *;
> UNLISTEN foo;
> -- how about now?
‘UNLISTEN foo’ ignored.

> LISTEN *;
> LISTEN foo;
> UNLISTEN foo;
> -- does that make a difference?
‘LISTEN foo’ and ‘UNLISTEN foo’ ignored, leaving only the wildcard.

> I don't have any strong preferences about this, but we ought to
> have a clear idea of the behavior we want before we start coding.
These semantics made sense to me, but I have limited experience and
a very specific use case in mind. Changing the behavior of ‘UNLISTEN *’
feels extremely impolite, and if we leave that alone I don’t see using
the ‘LISTEN *’ syntax with behavior that leaves other LISTENs in place.

We could have a different set of keywords, like LISTEN_ALL/UNLISTEN_ALL
that doesn’t interfere with the existing behavior.
-- Trey

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread David G. Johnston

On Fri, Dec 20, 2024 at 1:58 PM Tom Lane  wrote:

> Trey Boudreau  writes:
> > so I'd like to propose a 'LISTEN *' equivalent to 'UNLISTEN *'.
>
> Seems reasonable in the abstract, and given the UNLISTEN * precedent
> it's hard to quibble with that syntax choice.  I think what actually
> needs discussing are the semantics, specifically how this'd interact
> with other LISTEN/UNLISTEN actions.  Explain what you think should
> be the behavior after:
>
>
Answers premised on the framing explained below:

LISTEN foo;
> LISTEN *;
> UNLISTEN *;
> -- are we still listening on foo?
>

Yes; the channels are orthogonal and thus order doesn't matter.

> LISTEN *;
> LISTEN foo;
> UNLISTEN *;
> -- how about now?
>

Yes

> LISTEN *;
> UNLISTEN foo;
> -- how about now?
>

The unlisten was a no-op since listen foo was not issued; * receives
everything, always.

> LISTEN *;
> LISTEN foo;
> UNLISTEN foo;
> -- does that make a difference?
>

If any notify foo happened in between listen foo and unlisten foo the
session would receive the notify message twice - once implicitly via * and
once explicitly via foo.
Alternatively, the server could see that "foo" is subscribed too for PID
listener, send the message and then skip over looking for a * subscription
for PID listener.  Basically document that we won't send duplicates if both
listen * and listen foo are present.

> I don't have any strong preferences about this, but we ought to
> have a clear idea of the behavior we want before we start coding.
>
>
I'm inclined to make this clearly distinct from the semantics of
listen/notify.  Both in command form, what is affected, and the message.

Something like:
MONITOR NOTIFICATION QUEUE;
UNMONITOR NOTIFICATION QUEUE;
Asynchronous notification "foo" [with payload ...] sent by server process
with PID nnn.

If you also LISTEN foo you would also receive:
Asynchronous notification "foo" [with payload ...] received from server
process with PID nnn.

Unlisten undoes Listen
Unmonitor undoes Monitor
Upon session disconnect both Unlisten * and Unmonitor are executed.

If we must shoehorn this into the existing syntax and messages I'd still
want to say that * is simply a special channel name that the system
recognizes and sends all notify messages to.  There is no way to limit
which messages get sent to you via unlisten and if you also listen to the
channel foo explicitly you end up receiving multiple messages.
(Alternatively, send it just to foo and have the server not look for a *
listen for that specific session.)

Adding a "do not send" listing (or option) to the implementation doesn't
seem beneficial enough to deal with, and would be the only way:  Listen *;
Unlisten foo; would be capable of not having foo messages sent to the *
subscribing client.  In short, a "deny (do not send) all" base posture and
then permit-only policies built on top of it.  Listen * is the permit-all
policy.

David J.

Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Trey Boudreau

Howdy all,

NOTE:  Grey-beard coder, pgsql newbie. All info/tips/suggestions welcome!

I have a use-case where I’d like to LISTEN for all NOTIFY channels. Right now I 
simply
issue a LISTEN for every channel name of interest, but in production the 
channels will
number in the low thousands. The current implementation uses a linked list, and 
a linear
probe through the list of desired channels which will always return true 
becomes quite
expensive at this scale.

I have a work-around available by creating the “ALL” channel and making the 
payload
include the actual channel name, but this has a few of drawbacks:
 * it does not play nice with clients that actually want a small subset of 
channels;
 * it requires code modification at every NOTIFY;
 * it requires extra code on the client side.

The work-around subjects the developer (me :-) to significant risk of foot-gun 
disease,
so I'd like to propose a 'LISTEN *' equivalent to 'UNLISTEN *'.

The implementation in src/backend/commands/async.c seems straightforward 
enough, but it
feels prudent to select a syntax that doesn't make some kind of actual pattern 
matching
syntactically ugly in the future. Choosing 'LISTEN *' has a nice symmetry with 
'UNLISTEN
*', but I don't have enough SQL chops to know if it cause problems.

If anyone has a better work-around, please speak up! If not, and we can come to 
some
resolution on a future-resistant syntax, I'd happily start working up a patch 
set.

Thanks,
-- Trey Boudreau

Re: Fix crash when non-creator being an iteration on shared radix tree

2024-12-20 Thread Masahiko Sawada

On Fri, Dec 20, 2024 at 2:27 AM John Naylor  wrote:
>
> On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 18, 2024 at 10:32 PM John Naylor  
> > wrote:
>
> > > 2. The iter_context is separate because the creator's new context
> > > could be a bump context which doesn't support pfree. But above we
> > > assume we can pfree in the caller's context. Also, IIUC we only
> > > allocate small iter objects, and it'd be unusual to need more than one
> > > at a time per backend, so it's a bit strange to have an entire context
> > > for that. Since we use a standard pattern of "begin; while(iter);
> > > end;", it seems unlikely that someone will cause a leak because of a
> > > coding mistake in iteration.
>
> v3-0001 allocates the iter data in the caller's context. It's a small
> patch, but still a core behavior change so proposed for master-only. I
> believe your v1 is still the least invasive fix for PG17.

I agree to use v1 for v17.

>
> > > If these tiny admin structs were always, not sometimes, in the callers
> > > current context, I think it would be easier to reason about because
> > > then the creator's passed context would be used only for local memory,
> > > specifically only for leaves and the inner node child contexts.
>
> 0002 does this.
>
> > Fair points. Given that we need only one iterator at a time per
> > backend, it would be simpler if the caller passes the pointer to an
> > iterator that is a stack variable to RT_BEGIN_ITEREATE(). For example,
> > TidStoreBeginIterate() would be like:
> >
> > if (TidStoreIsShared(ts))
> > shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
> > else
> >local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.shared);
>
> Hard for me to tell if it'd be simpler.
>
> > > 3. I was never a fan of trying to second-guess the creator's new
> > > context and instead use slab for fixed-sized leaf allocations. If the
> > > creator passes a bump context, we say "no, no, no, use slab -- it's
> > > good for ya". Let's assume the caller knows what they're doing.
> >
> > That's a valid argument but how can a user use the slab context for
> > leaf allocations?
>
> It's trivial after 0001-02: 0003 removes makes one test use slab as
> the passed context (only 32-bit systems would actually use it
> currently).

These changes make sense to me. Here are a few comments:

RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
local memory. Do we want to declare only in the !RT_SHMEM case?

---
/*
 * Similar to TidStoreCreateLocal() but create a shared TidStore on a
 * DSA area. The TID storage will live in the DSA area, and the memory
 * context rt_context will have only meta data of the radix tree.
 *
 * The returned object is allocated in backend-local memory.
 */

We need to update the comment about rt_context too since we no longer
use rt_context for shared tidstore.

>
> Also, with a bit more work we could allow a NULL context for when the
> caller has purposely arranged to use pointer-sized values. Did you see
> any of Heikki's CSN patches? There is a radix tree used as a cache in
> a context where the tree could be created and destroyed frequently.
> Something about the memory blocks seems to have tickled some bad case
> in the glibc allocator, and one less context might be good insurance:
>
> https://www.postgresql.org/message-id/718d1788-b058-40e6-bc37-8f15612b5646%40iki.fi

Will check these patches. Thanks!

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: allow changing autovacuum_max_workers without restarting

2024-12-20 Thread Nathan Bossart

I think I've been saying I would commit this since August, but now I am
planning to do so first thing in the new year.  In v11 of the patch, I
moved the initial startup WARNING to autovac_init() to avoid repeatedly
logging when the launcher restarts (e.g., for emergency vacuums when the
autovacuum parameter is disabled).  Otherwise, I just made a couple of
cosmetic alterations and added a commit message.

-- 
nathan
>From dad43c4138c2cc0712006565bb4984c2a211e8fe Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 20 Dec 2024 13:33:26 -0600
Subject: [PATCH v11 1/1] Allow changing autovacuum_max_workers without
 restarting.

This commit introduces a new parameter named
autovacuum_worker_slots that controls how many autovacuum worker
slots to reserve during server startup.  Modifying this new
parameter's value does require a server restart, but it should
typically be set to the upper bound of what you might realistically
need to set autovacuum_max_workers.  With that new parameter in
place, autovacuum_max_workers can now be changed with a SIGHUP
(e.g., pg_ctl reload).

If autovacuum_max_workers is set higher than
autovacuum_worker_slots, a WARNING is emitted, and the server will
only start up to autovacuum_worker_slots workers at a given time.
If autovacuum_max_workers is set to a value less than the number of
currently-running autovacuum workers, the existing workers will
continue running, but no new workers will be started until the
number of running autovacuum workers drops below
autovacuum_max_workers.

Reviewed-by: Sami Imseih, Justin Pryzby, Robert Haas, Andres Freund, Yogesh 
Sharma
Discussion: https://postgr.es/m/20240410212344.GA1824549%40nathanxps13
---
 doc/src/sgml/config.sgml  | 28 ++-
 doc/src/sgml/runtime.sgml |  4 +-
 src/backend/access/transam/xlog.c |  2 +-
 src/backend/postmaster/autovacuum.c   | 82 +++
 src/backend/postmaster/pmchild.c  |  4 +-
 src/backend/storage/lmgr/proc.c   |  6 +-
 src/backend/utils/init/postinit.c |  6 +-
 src/backend/utils/misc/guc_tables.c   | 11 ++-
 src/backend/utils/misc/postgresql.conf.sample |  3 +-
 src/include/postmaster/autovacuum.h   |  1 +
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  1 +
 11 files changed, 117 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbdd6ce5740..740ff5d5044 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8630,6 +8630,25 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;
   
  
 
+ 
+  autovacuum_worker_slots (integer)
+  
+   autovacuum_worker_slots configuration 
parameter
+  
+  
+  
+   
+Specifies the number of backend slots to reserve for autovacuum worker
+processes.  The default is 16.  This parameter can only be set at 
server
+start.
+   
+   
+When changing this value, consider also adjusting
+.
+   
+  
+ 
+
  
   autovacuum_max_workers (integer)
   
@@ -8640,7 +8659,14 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;

 Specifies the maximum number of autovacuum processes (other than the
 autovacuum launcher) that may be running at any one time.  The default
-is three.  This parameter can only be set at server start.
+is three.  This parameter can only be set in the
+postgresql.conf file or on the server command 
line.
+   
+   
+Note that a setting for this value which is higher than
+ will have no effect,
+since autovacuum workers are taken from the pool of slots established
+by that setting.

   
  
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 94135e9d5ee..537e356315d 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -839,7 +839,7 @@ psql: error: connection to server on socket 
"/tmp/.s.PGSQL.5432" failed: No such
 When using System V semaphores,
 PostgreSQL uses one semaphore per allowed 
connection
 (), allowed autovacuum worker process
-(), allowed WAL sender process
+(), allowed WAL sender process
 (), allowed background
 process (), etc., in sets of 16.
 The runtime-computed parameter 
@@ -892,7 +892,7 @@ $ postgres -D $PGDATA -C 
num_os_semaphores
 When using POSIX semaphores, the number of semaphores needed is the
 same as for System V, that is one semaphore per allowed connection
 (), allowed autovacuum worker process
-(), allowed WAL sender process
+(), allowed WAL sender process
 (), allowed background
 process (), etc.
 On the platforms where this option is preferred, there is no specific
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 6f58412bcab..706f2127def 100644
--- a/src/backend/acces

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread David G. Johnston

On Fri, Dec 20, 2024 at 2:42 PM Trey Boudreau  wrote:

>
> > On Dec 20, 2024, at 2:58 PM, Tom Lane  wrote:
> > Seems reasonable in the abstract, and given the UNLISTEN * precedent
> > it's hard to quibble with that syntax choice.  I think what actually
> > needs discussing are the semantics, specifically how this'd interact
> > with other LISTEN/UNLISTEN actions.
>
> My first pass at the documentation looks like this:
>
> 
>  The special wildcard * cancels all listener
>  registrations for the current session and replaces them with a
>  virtual registration that matches all channels. Further
>  LISTEN and UNLISTEN   class="parameter">channel commands will
>  be ignored until the session sees the UNLISTEN *
>  command.
> 
>

I just sent my thoughts here as well.  The choice to "cancel all listener
registrations" seems unintuitive and unnecessary - so long as we either
document or handle deduplication internally.

As I noted in my email, * is a permit-all policy in a "deny by default"
system.  Such a system is allowed to have other more targeted "allow"
policies existing at the same time.  If the permit-all policy gets removed
then those individual allow policies immediately become useful again.  If
you want to remove those targeted allowed policies execute Unlisten *
before executing Listen *.

I dislike the non-symmetric meaning of * in the command sequence above but
it likely is better than inventing a whole new syntax.

David J.

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread David G. Johnston

On Fri, Dec 20, 2024 at 2:42 PM Trey Boudreau  wrote:

> We could have a different set of keywords, like LISTEN_ALL/UNLISTEN_ALL
> that doesn’t interfere with the existing behavior.
>
>
I think we will need something along these lines.  We've given * a meaning
in UNLISTEN * that doesn't match what this proposal wants to accomplish.

I suggested using monitor/unmonitor but I suppose any unquoted symbol or
keyword that is invalid as a channel name would work within the
Listen/Unlisten syntax.

Otherwise I mis-spoke in my previous design since regardless of whether
Listen * unregisters existing channels or not Unlisten * will remove
everything and leave the session back at nothing.  In which case you might
as well just remove the redundant channel listeners.

David J.

wrong comments in ClassifyUtilityCommandAsReadOnly

2024-12-20 Thread jian he

hi.

/*
 * Determine the degree to which a utility command is read only.
 *
 * Note the definitions of the relevant flags in src/include/utility/tcop.h.
 */
static int
ClassifyUtilityCommandAsReadOnly(Node *parsetree)

Is the comment wrong?

it should be
" * Note the definitions of the relevant flags in src/include/tcop/utility.h."

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Vik Fearing




On 20/12/2024 23:45, Tom Lane wrote:

Don't think that quite flies.  We might have to regurgitate the
state explicitly:

LISTEN *
UNLISTEN foo.*
LISTEN foo.bar.*

showing that we're listening to channels foo.bar.*, but not other
channels beginning "foo", and also to all channels not beginning
"foo".




Could I perhaps propose a sort of wildmat[1] syntax?

The above sequence could be expressed simply as:

    LISTEN *,!foo.*,foo.bar.*

I would like this in psql's backslash commands, too.


[1] https://en.wikipedia.org/wiki/Wildmat

--

Vik Fearing

Re: Fix logging for invalid recovery timeline

2024-12-20 Thread Andrey M. Borodin

> On 20 Dec 2024, at 20:37, David Steele  wrote:
> 
> "Latest checkpoint is at %X/%X on timeline %u, but in the history of the 
> requested timeline, the server forked off from that timeline at %X/%X."

I think errdetai here is very hard to follow. I seem to understand what is 
going on after reading errmsg, but errdetai makes me uncertain.

If we call "tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);"
don't we risk to have again
ereport(ERROR,
(errmsg("requested timeline %u is not in this server's history",
tli)));
?

Best regards, Andrey Borodin.

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Vik Fearing




On 21/12/2024 05:23, Tom Lane wrote:

Vik Fearing  writes:

Could I perhaps propose a sort of wildmat[1] syntax?
The above sequence could be expressed simply as:
      LISTEN *,!foo.*,foo.bar.*

That doesn't absolve you from having to say what happens if the
user then issues another "LISTEN zed" or "UNLISTEN foo.bar.baz"
command.  We can't break the existing behavior that "LISTEN foo"
followed by "LISTEN bar" results in listening to both channels.
So on the whole this seems like it just adds complexity without
removing any.  I'm inclined to limit things to one pattern per
LISTEN/UNLISTEN command, with more complex behaviors reached
by issuing a sequence of commands.



Fair enough.

--

Vik Fearing

Re: Discussion on a LISTEN-ALL syntax

2024-12-20 Thread Tom Lane

Vik Fearing  writes:
> Could I perhaps propose a sort of wildmat[1] syntax?
> The above sequence could be expressed simply as:
>      LISTEN *,!foo.*,foo.bar.*

That doesn't absolve you from having to say what happens if the
user then issues another "LISTEN zed" or "UNLISTEN foo.bar.baz"
command.  We can't break the existing behavior that "LISTEN foo"
followed by "LISTEN bar" results in listening to both channels.
So on the whole this seems like it just adds complexity without
removing any.  I'm inclined to limit things to one pattern per
LISTEN/UNLISTEN command, with more complex behaviors reached
by issuing a sequence of commands.

regards, tom lane

Re: Fix crash when non-creator being an iteration on shared radix tree

2024-12-20 Thread John Naylor

On Sat, Dec 21, 2024 at 2:17 AM Masahiko Sawada  wrote:
>
> On Fri, Dec 20, 2024 at 2:27 AM John Naylor  wrote:

> > v3-0001 allocates the iter data in the caller's context. It's a small
> > patch, but still a core behavior change so proposed for master-only. I
> > believe your v1 is still the least invasive fix for PG17.
>
> I agree to use v1 for v17.

Okay, did you want to commit that separately, or together with my 0001
on master? Either way, I've put a bit more effort into the commit
message in v4 and adjusted the comment slightly.

> > It's trivial after 0001-02: 0003 removes makes one test use slab as
> > the passed context (only 32-bit systems would actually use it
> > currently).
>
> These changes make sense to me. Here are a few comments:
>
> RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
> local memory. Do we want to declare only in the !RT_SHMEM case?

That's already the case, if I understand your statement correctly.

> ---
> /*
>  * Similar to TidStoreCreateLocal() but create a shared TidStore on a
>  * DSA area. The TID storage will live in the DSA area, and the memory
>  * context rt_context will have only meta data of the radix tree.
>  *
>  * The returned object is allocated in backend-local memory.
>  */
>
> We need to update the comment about rt_context too since we no longer
> use rt_context for shared tidstore.

Fixed. BTW, it seems TidStore.context is unused?

--
John Naylor
Amazon Web Services
From 679cf5c305b3f0affb5c4a679cf18eb0674e8691 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 20 Dec 2024 14:48:24 +0700
Subject: [PATCH v4 3/3] Always use the caller-provided context for radix tree
 leaves

Previously, it may not have worked for a caller to pass a slab context,
since it would have been also used for other things which may have
incompatible size. In an attempt to helpfully avoid wasting space due
to aset's power-of-two rounding, RT_CREATE would create an additional
slab context if the values were fixed-length and larger than pointer
size. The problem was, we have since added the bump context type,
and the generation context was a possibility as well, so silently
overriding the caller's choice is not friendly.

Commit X arranged so that the caller-provided context is
only used for leaves and nothing else, so it's safe for the caller
to use slab here if they wish. As demonstration, use slab in one of
the radix tree regression tests.

Reviewed by Masahiko Sawada
---
 src/include/lib/radixtree.h  | 14 --
 src/test/modules/test_radixtree/test_radixtree.c |  5 +++--
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 97aff227c5..fc7474e0c7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1849,21 +1849,7 @@ RT_CREATE(MemoryContext ctx)
 size_class.allocsize);
 	}
 
-	/* By default we use the passed context for leaves. */
 	tree->leaf_context = ctx;
-
-#ifndef RT_VARLEN_VALUE_SIZE
-
-	/*
-	 * For leaves storing fixed-length values, we use a slab context to avoid
-	 * the possibility of space wastage by power-of-2 rounding up.
-	 */
-	if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
-		tree->leaf_context = SlabContextCreate(ctx,
-			   RT_STR(RT_PREFIX) "_radix_tree leaf context",
-			   RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
-			   sizeof(RT_VALUE_TYPE));
-#endif			/* !RT_VARLEN_VALUE_SIZE */
 #endif			/* RT_SHMEM */
 
 	/* add root node now so that RT_SET can assume it exists */
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 8074b83695..5c54962edf 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -313,9 +313,10 @@ test_random(void)
 #else
 	MemoryContext radixtree_ctx;
 
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+	radixtree_ctx = SlabContextCreate(CurrentMemoryContext,
 		  "test_radix_tree",
-		  ALLOCSET_SMALL_SIZES);
+		  SLAB_DEFAULT_BLOCK_SIZE,
+		  sizeof(TestValueType));
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
-- 
2.47.1

From 46056af82a7516250b5cdb817579aeeb3952b8de Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Sat, 21 Dec 2024 10:55:31 +0700
Subject: [PATCH v4 1/3] Get rid of radix tree's separate iterator context

Instead, just use the caller's memory context. This relieves backends
from having to create this context when they attach to a tree in
shared memory, and delete when they detach.

The iterator state is freed when ending iteration, and even if a
developer failed to follow existing examples of iteration, the state
struct is small enough to pose low risk.
---
 src/include/lib/radixtree.h | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fe

Re: Add XMLNamespaces to XMLElement

2024-12-20 Thread Pavel Stehule

Hi

so 21. 12. 2024 v 0:51 odesílatel Jim Jones 
napsal:

> Hi,
>
> I'd like to propose the implementation of the XMLNamespaces option for
> XMLElement.
>
> XMLNAMESPACES(nsuri AS nsprefix)
> XMLNAMESPACES(DEFAULT default-nsuri)
> XMLNAMESPACES(NO DEFAULT)
>
> * nsprefix:  Namespace's prefix.
> * nsuri: Namespace's URI.
> * DEFAULT default-nsuri: Specifies the DEFAULT namespace to use within
> the scope of a namespace declaration.
> * NO DEFAULT:Specifies that NO DEFAULT namespace is to be
> used within the scope of a namespace declaration.
>
> This basically works pretty much like XMLAttributes, but with a few more
> restrictions (see SQL/XML:2023, 11.2 ):
>
> * XML namespace declaration shall contain at most one DEFAULT namespace
> declaration item.
> * No namespace prefix shall be equivalent to xml or xmlns.
> * No namespace URI shall be identical to http://www.w3.org/2000/xmlns/
> or to http://www.w3.org/XML/1998/namespace.
> * The value of a namespace URI contained in an regular namespace
> declaration item (no DEFAULT) shall not be a zero-length string.
>
> Examples:
>
> SELECT xmlelement(NAME "foo", xmlnamespaces('http://x.y' AS bar));
>   xmlelement
> ---
>  http://x.y"/>
>
> SELECT xmlelement(NAME "foo", xmlnamespaces(DEFAULT 'http://x.y'));
> xmlelement
> ---
>  http://x.y"/>
>
> SELECT xmlelement(NAME "foo", xmlnamespaces(NO DEFAULT));
>xmlelement
> -
>  
>
> In transformXmlExpr() it seemed convenient to use the same parameters to
> store the prefixes and URIs as in XMLAttributes (arg_names and
> named_args), but I am still not so sure it is the right approach. Is
> there perhaps a better way?
>
> Any thoughts? Feedback welcome!
>

+1

Pavel

>
> Best, Jim
>

57 matches

Mail list logo