Re: [PATCH] Add roman support for to_number function

2024-09-12 Thread Hunaid Sohail
Hi,

I have started working on the next version of the patch and have addressed
the smaller issues identified by Tom. Before proceeding further, I would
like to have opinions on some comments/questions in my previous email.

Regards,
Hunaid Sohail

On Mon, Sep 9, 2024 at 5:45 PM Hunaid Sohail  wrote:

> Hi,
>
> On Sun, Sep 8, 2024 at 4:09 AM Tom Lane  wrote:
>
>> A few notes from a quick read of the patch:
>>
>> * roman_to_int() should have a header comment, notably explaining its
>> result convention.  I find it fairly surprising that "0" means an
>> error --- I realize that Roman notation can't represent zero, but
>> wouldn't it be better to use -1?
>>
>
> Noted.
>
>
>>
>> * Do *not* rely on toupper().  There are multiple reasons why not,
>> but in this context the worst one is that in Turkish locales it's
>> likely to translate "i" to "İ", on which you will fail.  I'd use
>> pg_ascii_toupper().
>>
>
> Noted.
>
>
>>
>> * I think roman_to_int() is under-commented internally too.
>> To start with, where did the magic "15" come from?  And why
>> should we have such a test anyway --- what if the format allows
>> for trailing stuff after the roman numeral?  (That is, I think
>> you need to fix this so that roman_to_int reports how much data
>> it ate, instead of assuming that it must read to the end of the
>> input.)
>
>
> MMMDCCCLXXXVIII is the longest valid standard roman numeral (15
> characters). I will add an appropriate comment on the length check.
>
> I am not sure I understand the latter part. Could you please
> elaborate on it?
> Are you referring to cases like these?
> SELECT to_number('CXXIII foo', 'RN foo');
> SELECT to_number('CXXIII foo', 'RN');
> Please check my comments on Oracle's behaviour below.
>
>
>
>> The logic for detecting invalid numeral combinations
>> feels a little opaque too.  Do you have a source for the rules
>> you're implementing, and if so could you cite it?
>>
>
> There are many sources on the internet.
> A few sources:
> 1. https://www.toppr.com/guides/maths/knowing-our-numbers/roman-numerals/
> 2. https://www.cuemath.com/numbers/roman-numerals/
>
> Note that we are following the standard form:
> https://en.wikipedia.org/wiki/Roman_numerals#Standard_form
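>
> For illustration, here is a rough sketch (not the patch code itself) of
> parsing a standard-form numeral one decimal place at a time. It also shows
> where the 15-character limit comes from (MMM + DCCC + LXXX + VIII) and how
> the amount of consumed input could be reported back to the caller:
> ```
> #include <string.h>
>
> /* Allowed standard-form spellings for each decimal place. */
> static const char *const rn_ones[] =
>     {"", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"};
> static const char *const rn_tens[] =
>     {"", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"};
> static const char *const rn_hundreds[] =
>     {"", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"};
> static const char *const rn_thousands[] =
>     {"", "M", "MM", "MMM"};
>
> /* Match one decimal place, advance *s, and return its digit (0 if none). */
> static int
> rn_match_place(const char **s, const char *const spellings[], int max_digit)
> {
>     for (int d = max_digit; d >= 1; d--)
>     {
>         size_t  len = strlen(spellings[d]);
>
>         if (strncmp(*s, spellings[d], len) == 0)
>         {
>             *s += len;
>             return d;
>         }
>     }
>     return 0;
> }
>
> /*
>  * Parse a standard-form Roman numeral.  Returns 1..3999 and sets *consumed
>  * to the number of characters eaten, or returns -1 if no valid numeral was
>  * found.  The caller decides what to do with any trailing input.
>  */
> static int
> roman_to_int_sketch(const char *input, int *consumed)
> {
>     const char *p = input;
>     int         value;
>
>     value = 1000 * rn_match_place(&p, rn_thousands, 3);
>     value += 100 * rn_match_place(&p, rn_hundreds, 9);
>     value += 10 * rn_match_place(&p, rn_tens, 9);
>     value += rn_match_place(&p, rn_ones, 9);
>
>     *consumed = (int) (p - input);
>     return value > 0 ? value : -1;
> }
> ```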
>
>
>>
>> * This code will get run through pgindent before commit, so you
>> might want to revisit whether your comments will still look nice
>> afterwards.  There's not a lot of consistency in them about initial
>> cap versus lower case or trailing period versus not, too.
>>
>
> Noted.
>
>
>>
>> * roman_result could be declared where used, no?
>>
>
> Noted.
>
>
>>
>> * I'm not really convinced that we need a new errcode
>> ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
>> ERRCODE_INVALID_TEXT_REPRESENTATION.  However, if we do do it
>> like that, why did you pick the out-of-sequence value 22P07?
>>
>
> 22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
> However, I do agree with you that we can use
> ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.
>
>
>>
>> * Might be good to have a few test cases demonstrating what happens
>> when there's other stuff combined with the RN format spec.  Notably,
>> even if RN itself won't eat whitespace, there had better be a way
>> to write the format string to allow that.
>>
>
> The current patch (v3) simply ignores other format patterns used with RN,
> except when EEEE is used, which returns an error.
> ```
> postgres=# SELECT to_number('CXXIII', 'foo RN');
>  to_number
> ---
>123
> (1 row)
>
> postgres=# SELECT to_number('CXXIII', 'MIrn');
>  to_number
> ---
>123
> (1 row)
>
> postgres=# SELECT to_number('CXXIII', 'EEEErn');
> ERROR:  "EEEE" must be the last pattern used
> ```
>
> However, I think that other format patterns except FM should be rejected
> when used with RN in the NUMDesc_prepare function. Any opinions?
>
> If we look at Oracle, it is strict in this regard and doesn't allow any
> other format with RN.
> 1. SELECT to_char(12, 'MIrn') from dual;
> 2. SELECT to_char(12, 'foo RN') from dual;
> results in error:
> ORA-01481: invalid number format model
>
> I also found that there is no check when RN is used twice (both in to_char
> and to_number) and that results in unexpected output.
>
> ```
> postgres=# SELECT to_number('CXXIII', 'RNrn');
>  to_number
> ---
>123
> (1 row)
>
> postgres=# SELECT to_char(12, 'RNrn');
> to_char
> 
>  XIIxii
> (1 row)
>
> postgres=# SELECT to_char(12, 'rnrn');
> to_char
> 
>  xiixii
> (1 row)
>
> postgres=# SELECT to_char(12, 'FMrnrn');
>  to_char
> -
>  xiixii
> (1 row)
> ```
>
>
>> * Further to Aleksander's point about lack of test coverage for
>> the to_char direction, I see from
>>
>> https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
>> that almost none of the existing roman-number code paths are covered today.

Re: not null constraints, again

2024-09-12 Thread jian he
> > On Wed, Sep 11, 2024 at 2:18 AM Alvaro Herrera  
> > wrote:
> > >
> > > Hello, here's a v2 of this patch.  I have fixed --I think-- all the
> > > issues you and Tender Wang reported (unless I declined a fix in some
> > > previous email).
> > >

PREPARE constr_meta (name[]) AS
with cte as
(
select con.oid as conoid, conrelid::regclass, pc.relkind as relkind,
con.coninhcount as inhcnt
,con.conname, con.contype, con.connoinherit as noinherit
,con.conislocal as islocal, pa.attname, pa.attnum
from  pg_constraint con join pg_class pc on pc.oid = con.conrelid
join  pg_attribute pa on pa.attrelid = pc.oid
and pa.attnum = any(conkey)
where con.contype in ('n', 'p', 'c') and
pc.relname = ANY ($1)
)
select pg_get_constraintdef(conoid), * from cte
order by contype, inhcnt, islocal, attnum;

The above constr_meta is used to query meta info, nothing fancy.

---exampleA
drop table pp1,cc1, cc2;
create table pp1 (f1 int);
create table cc1 (f2 text, f3 int) inherits (pp1);
create table cc2(f4 float) inherits(pp1,cc1);
alter table pp1 alter column f1 set not null;
execute constr_meta('{pp1,cc1, cc2}');

---exampleB
drop table pp1,cc1, cc2;
create table pp1 (f1 int not null);
create table cc1 (f2 text, f3 int) inherits (pp1);
create table cc2(f4 float) inherits(pp1,cc1);
execute constr_meta('{pp1,cc1, cc2}');

Should exampleA and exampleB return the same pg_constraint->coninhcount
for the not-null constraint "cc2_f1_not_null"?





We only have this Synopsis
ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL

--tests from src/test/regress/sql/inherit.sql
CREATE TABLE inh_nn_parent (a int, NOT NULL a NO INHERIT);
ALTER TABLE inh_nn_parent ALTER a SET NOT NULL;
This currently fails in ATExecSetNotNull:
ERROR:  cannot change NO INHERIT status of NOT NULL constraint
"inh_nn_parent_a_not_null" on relation "inh_nn_parent"

It seems we translate
ALTER TABLE inh_nn_parent ALTER a SET NOT NULL;
to
ALTER TABLE inh_nn_parent ALTER a SET NOT NULL INHERIT

but we cannot (syntax error)
ALTER TABLE inh_nn_parent ALTER a SET NOT NULL NO INHERIT;

In this case, why not make it a no-op, since column "a" is already NOT NULL?

So ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL
would only change the not-null information; there would be no need to
consider other not-null related information.


 /*
- * Return the address of the modified column.  If the column was already NOT
- * NULL, InvalidObjectAddress is returned.
+ * ALTER TABLE ALTER COLUMN SET NOT NULL
+ *
+ * Add a not-null constraint to a single table and its children.  Returns
+ * the address of the constraint added to the parent relation, if one gets
+ * added, or InvalidObjectAddress otherwise.
+ *
+ * We must recurse to child tables during execution, rather than using
+ * ALTER TABLE's normal prep-time recursion.
  */
 static ObjectAddress
-ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
- const char *colName, LOCKMODE lockmode)
+ATExecSetNotNull(List **wqueue, Relation rel, char *conName, char *colName,
+ bool recurse, bool recursing, List **readyRels,
+ LOCKMODE lockmode)

You introduced two booleans, "recurse" and "recursing", without enough
explanation. That is too much to comprehend.

" * We must recurse to child tables during execution, rather than using
" * ALTER TABLE's normal prep-time recursion.
What does this sentence mean for these two booleans, "recurse" and "recursing"?

Finally, I made some cosmetic changes and also improved the error message
in MergeConstraintsIntoExisting.


v2-0001-minor-coesmetic-change-for-not-null-patch-v2.no-cfbot
Description: Binary data


v2-0002-imporve-error-message-in-MergeConstraintsIntoE.no-cfbot
Description: Binary data


Re: Trim the heap free memory

2024-09-12 Thread David Rowley
On Thu, 12 Sept 2024 at 14:40, shawn wang  wrote:
> Could you please perform another round of testing to ensure that everything 
> is functioning as expected with this modification?

One way to get a few machines with various build systems testing this
is to register the patch on the commitfest app in [1]. You can then
see if the patch is passing the continuous integration tests in [2].
One day soon the features of [2] should be combined with [1].

David

[1] https://commitfest.postgresql.org/50/
[2] http://cfbot.cputube.org/




Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest

2024-09-12 Thread Jelte Fennema-Nio
On Thu, 12 Sept 2024 at 04:07, David Rowley  wrote:
> Maybe instead of the URLs showing the CF
> number that the patch was first added to, if it could have a reference
> in the URL that looks up the maximum CF which contains the patch and
> show that one. e.g. https://commitfest.postgresql.org/latest/4751/ .

Yeah agreed, that's why I added support for
https://commitfest.postgresql.org/patch/4751/ a while ago. That URL
will be easily discoverable on the commitfest entry page soon (patch
for this is in flight).




Re: Add 64-bit XIDs into PostgreSQL 15

2024-09-12 Thread Maxim Orlov
Apparently, the original thread will inevitably disintegrate into many
separate ones. To me, it looks like some kind of road map: one for 64-bit
GUCs, another one to remove short file names from SLRUs and, to make things
more complicated, [1] for CF entry [0], to get rid of MultiXactOffset
wraparound by switching to 64 bits.

[0] https://commitfest.postgresql.org/49/5205/
[1]
https://www.postgresql.org/message-id/flat/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw%40mail.gmail.com

-- 
Best regards,
Maxim Orlov.


Re: Accept invalidation messages before the query starts inside a transaction

2024-09-12 Thread Andrei Lepikhov

On 12/9/2024 08:31, David G. Johnston wrote:
On Wednesday, September 11, 2024, Andrei Lepikhov wrote:



I don't know whether to classify this as a bug.

[1] https://www.postgresql.org/docs/16/mvcc-caveats.html



Seems we need to add another sentence to that final paragraph.  
Something like:


However, once an object is accessed within a transaction its definition 
is cached for the duration of the transaction.  Of particular note are 
routine bodies modified using create or replace.  If a replacement gets 
committed mid-transaction the old code body will still be executed for 
the remainder of the transaction.

I'm not sure we need any changes here yet unless Yurii provides an 
example of a real issue. But for the record, let me explain the initial 
reason in more detail.
Extensions have always been challenging to maintain because they include 
hook calls from the core and stored procedures visible in databases 
where the extension was created. At the same time, we should control the 
UPDATE/DROP/CREATE extension commands. And don't forget, we have some 
data in shared memory as well that can't be easily versioned.


Our primary goal is to establish a stable point that can provide 
extension developers (and users who call their functions) with a 
reliable guarantee that we are up to date with all the changes made to 
the extension.


Last week, Michael Paquier's patch introduced syscaches and invalidation 
callbacks to watch such actions. But curiously, when we just execute one of 
an extension's functions, like:

SELECT pg_extension.smth();

we don't lock any table, don't apply invalidation messages at all, and have 
no information about updates/alterations/deletions of our extension's 
objects!
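
Just to illustrate the kind of hook I mean, here is a minimal sketch of an
extension registering an invalidation callback. The pg_extension syscache id
is an assumption here (spelled EXTENSIONOID for the sake of the example);
substitute whatever the committed patch actually exposes:

#include "postgres.h"
#include "fmgr.h"
#include "utils/inval.h"
#include "utils/syscache.h"

PG_MODULE_MAGIC;

static bool my_ext_metadata_stale = false;

/* Invalidation callback: just mark our extension-local state as stale. */
static void
my_ext_inval_callback(Datum arg, int cacheid, uint32 hashvalue)
{
    my_ext_metadata_stale = true;
}

void
_PG_init(void)
{
    /* EXTENSIONOID is assumed -- the syscache id covering pg_extension. */
    CacheRegisterSyscacheCallback(EXTENSIONOID,
                                  my_ext_inval_callback,
                                  (Datum) 0);
}

Of course, the callback only fires when invalidation messages are actually
accepted, which is exactly the gap described above.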


--
regards, Andrei Lepikhov





Re: [PATCH] Refactor SLRU to always use long file names

2024-09-12 Thread Aleksander Alekseev
Hi Michael,

> On Wed, Sep 11, 2024 at 04:07:06PM +0300, Aleksander Alekseev wrote:
> > Commit 4ed8f0913bfd introduced long SLRU file names. The proposed
> > patch removes SlruCtl->long_segment_names flag and makes SLRU always
> > use long file names. This simplifies both the code and the API.
> > Corresponding changes to pg_upgrade are included.
>
> That's leaner, indeed.
>
> > One drawback I see is that technically SLRU is an exposed API and
> > changing it may affect third-party code. I'm not sure if we should
> > seriously worry about this. Firstly, the change is trivial and
> > secondly, it's not clear whether such third-party code even exists (we
> > broke this API just recently in 4ed8f0913bfd and no one complained).
>
> Any third-party code using custom SLRUs would need to take care of
> handling their upgrade path outside pg_upgrade.  Not sure there are
> any of them, TBH, but let's see.
>
> > I didn't include any tests for the new pg_upgrade code. To my
> > knowledge we test it manually, with buildfarm members and during
> > alpha- and beta-testing periods. Please let me know if you think there
> > should be a corresponding TAP test.
>
> Removing the old API means that it is impossible to test a move from
> short to long file names.  That's OK by me to rely on the pg_upgrade
> paths in the buildfarm code.  We have a few of them.

Thanks for the feedback.

> There is one thing I am wondering, here, though, which is to think
> harder about a validity check at the end of 002_pg_upgrade.pl to make
> sure that all the SLRU use long file names after running the tests.
> That would mean thinking about a mechanism to list all of them from a
> backend, rather than hardcode a list of them.  Perhaps that's not
> worth it, just dropping an idea in the bucket of ideas.  I would guess
> in the shape of a catalog that's able to represent at SQL level all
> the SLRUs that exist in a backend.

Hmm... IMO it would be a rather niche facility to maintain in PG core.
At least I'm not aware of cases where a DBA wanted to list the initialized
SLRUs. Would it be convenient for core / extension developers? Setting a
breakpoint on SimpleLruInit() or adding a temporary elog() sounds simpler
to me.

It wouldn't hurt to re-check the segment file names in the TAP test, but
this would mean hardcoding catalog names, which, as I understand, you want
to avoid. With high probability PG wouldn't even start if the corresponding
piece of pg_upgrade were wrong (I checked more than once :). So I'm not
entirely sure it's worth the effort, but let's see what others think.

-- 
Best regards,
Aleksander Alekseev




Re: Allow logical failover slots to wait on synchronous replication

2024-09-12 Thread shveta malik
On Wed, Sep 11, 2024 at 2:40 AM John H  wrote:
>
> Hi Shveta,
>
> On Sun, Sep 8, 2024 at 11:16 PM shveta malik  wrote:
>
> >
> > I was trying to have a look at the patch again, it doesn't apply on
> > the head, needs rebase.
> >
>
> Rebased with the latest changes.
>
> > Regarding 'mode = SyncRepWaitMode', FWIW, SyncRepWaitForLSN() also
> > does in a similar way. It gets mode in local var initially and uses it
> > later. See [1]. So isn't there a chance too that 'SyncRepWaitMode' can
> > change in between.
> >
> > [1]:
> > mode = SyncRepWaitMode;
> > .
> > 
> > if (!WalSndCtl->sync_standbys_defined ||
> > lsn <= WalSndCtl->lsn[mode])
> > {
> > LWLockRelease(SyncRepLock);
> > return;
> > }
>
> You are right, thanks for the correction. I tried reproducing with GDB
> where SyncRepWaitMode
> changes due to pg_ctl reload but was unable to do so. It seems like
> SIGHUP only sets ConfigReloadPending = true,
> which gets processed in the next loop in WalSndLoop() and that's
> probably where I was getting confused.

Yes, SIGHUP will be processed in the caller of
StandbySlotsHaveCaughtup() (see ProcessConfigFile() in
WaitForStandbyConfirmation()). So we can use 'SyncRepWaitMode'
directly, as it is not going to change inside StandbySlotsHaveCaughtup()
even if the user triggers a change. And thus it was also okay to use it
via a local variable, similar to how SyncRepWaitForLSN() does it.

> In the latest patch, I've added:
>
> Assert(SyncRepWaitMode >= 0);
>
> which should be true since we call SyncRepConfigured() at the
> beginning of StandbySlotsHaveCaughtup(),
> and used SyncRepWaitMode directly.

Yes, I think it should be okay, as SyncRepRequested() at the beginning
makes sure synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH, and
thus SyncRepWaitMode should map to one of WAIT_WRITE/FLUSH/APPLY.
Will review further.

thanks
Shveta




Re: PG_TEST_EXTRA and meson

2024-09-12 Thread Ashutosh Bapat
Thanks Nazir,

> > -# Test suites that are not safe by default but can be run if selected
> > -# by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
> > -# Export PG_TEST_EXTRA so it can be checked in individual tap tests.
> > -test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
> >
> > A naive question. What happens if we add PG_TEST_EXTRA in meson.build
> > itself rather than passing it via testwrap? E.g. like
> > if "PG_TEST_EXTRA" not in os.environ
> > test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
>
> Then this configure time option will be passed to the test environment
> and there is no way to change it without reconfiguring if we don't
> touch the testwrap file.
>
> > I am worried that we might have to add an extra argument to testwrap
> > for every environment variable that influences the tests. Avoiding it
> > would be better.
>
> If we want to have both configure time and run time variables, I
> believe that this is the only way for now.

Here's what I understand, please correct me: The code in meson.build
is only called at setup time, not during meson test. Hence we cannot
check the existence of a runtime environment variable in that file.
The things in test_env override those set at run time. So we save it
as an argument to --pg_test_extra and then use it if PG_TEST_EXTRA is
not set at run time. I tried to find whether there's some other place
to store "environment variables that can be overridden at runtime" but
I could not find one. So it looks like this is the best we can do for
now.

If it comes to a point where there are more such environment variables
that need to be passed, probably we will pass a key-value string of
those to testwrap. For now, for a single variable, this looks ok.

>
> > +# If PG_TEST_EXTRA is not set in the environment, then look for its Meson
> > +# configure option.
> > +if "PG_TEST_EXTRA" not in os.environ and args.pg_test_extra:
> > + env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
> > +
> >
> > If somebody looks at just these lines, it's not clear how
> > PG_TEST_EXTRA is passed to the test environment if it's available in
> > os.environ. So one has to understand that env_dict is the test
> > environment. If that's so, the code and comment rewritten as below
> > makes more sense to me. What do you think?
> >
> > # If PG_TEST_EXTRA is not already part of the test environment, check if 
> > it's
> > # passed via program argument --pg_test_extra. Thus we also override
> > # configuration time value by run time value of PG_TEST_EXTRA.
> > if "PG_TEST_EXTRA" not in env_dict and args.pg_test_extra:
> > env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
>
> I think your suggestion is better, done.

I didn't see the expected change. I was talking about something like the attached.

Also:
1. I have made some changes to the comment,
2. renamed the argument --pg_test_extra to --pg-test-extra, following the
convention used for other arguments, and
3. made a few other cosmetic changes.

Please review and incorporate those in the respective patches and
tests. Sorry for a single diff.

Once this is done, I think we can mark this CF entry as RFC.

--
Best Wishes,
Ashutosh Bapat


pg_test_extra.diffs
Description: Binary data


[PATCH] Add some documentation on how to call internal functions

2024-09-12 Thread Florents Tselai
Hi,

The documentation on extending using C functions leaves a blank on how to call other internal Postgres functions. This can leave first-timers hanging. I think it'd be helpful to spend some paragraphs discussing the DirectFunctionCall API. Here's an attempt (also a PR [0]).

Here's the relevant HTML snippet for convenience:

To call another version-1 function, you can use DirectFunctionCalln(func, arg1, ..., argn). This is particularly useful when you want to call functions defined in the standard internal library, using an interface similar to their SQL signature.

Different flavors of similar macros can be found in fmgr.h. The main point, though, is that they expect a C function name to call as their first argument (or its Oid in some cases), and the actual arguments should be supplied as Datums. They always return Datum.

For example, to call starts_with(text, text) from C, you can search through the catalog and find out that its C implementation is based on the Datum text_starts_with(PG_FUNCTION_ARGS) function.

In fmgr.h there are also macros available that facilitate conversions between C types and Datum. For example, to get a text * back from a Datum, you can use DatumGetTextPP(X); in the other direction, PointerGetDatum() turns a pointer into a Datum. If your extension defines additional types, it is usually convenient to define similar macros for these types too.

I've also added the example function below:

PG_FUNCTION_INFO_V1(t_starts_with);

Datum
t_starts_with(PG_FUNCTION_ARGS)
{
Datum t1 = PG_GETARG_DATUM(0);
Datum t2 = PG_GETARG_DATUM(1);
bool  bool_res;

Datum datum_res = DirectFunctionCall2(text_starts_with, t1, t2);
bool_res = DatumGetBool(datum_res);

PG_RETURN_BOOL(bool_res);
}

PS1: I was not sure if src/tutorial is still relevant to this part of the documentation. If so, it needs updating too.

[0] https://github.com/Florents-Tselai/postgres/pull/1/commits/1651b7bb68e0f9c2b61e1462367295d846d253ec
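
PS2: To also show the conversion macros in action, here is a second, purely
illustrative variant of the same function (the name t_starts_with_typed is
made up) that works with typed pointers instead of raw Datums:

#include "postgres.h"
#include "fmgr.h"
#include "utils/fmgrprotos.h"   /* declares text_starts_with() */

PG_FUNCTION_INFO_V1(t_starts_with_typed);

Datum
t_starts_with_typed(PG_FUNCTION_ARGS)
{
    /* Fetch the arguments as (possibly packed) text pointers. */
    text   *t1 = PG_GETARG_TEXT_PP(0);
    text   *t2 = PG_GETARG_TEXT_PP(1);
    Datum   result;

    /* Wrap the pointers back into Datums for DirectFunctionCall2(). */
    result = DirectFunctionCall2(text_starts_with,
                                 PointerGetDatum(t1),
                                 PointerGetDatum(t2));

    PG_RETURN_BOOL(DatumGetBool(result));
}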

0001-Add-some-documentation-on-how-to-call-internal-funct.patch
Description: Binary data


Re: Incremental Sort Cost Estimation Instability

2024-09-12 Thread Andrei Lepikhov

On 12/9/2024 03:05, David Rowley wrote:

On Thu, 27 Jun 2024 at 03:00, Andrei Lepikhov  wrote:

I tried to invent a simple solution to fight this minor case. But the
most clear and straightforward way here is to save a reference to the
expression that triggered the PathKey creation inside the PathKey itself.
See the sketch of the patch in the attachment.
I'm not sure this instability is worth fixing this way, but the
dependence of the optimisation outcome on the query text looks buggy.


I don't think that's going to work as that'll mean it'll just choose
whichever expression was used when the PathKey was first created.  For
your example query, both PathKey's are first created for the GROUP BY
clause in standard_qp_callback(). I only have to change the GROUP BY
in your query to use the equivalent column in the other table to get
it to revert back to the plan you complained about.

Yes, that is true. It is not an ideal solution so far; I am looking for better ideas.


Maybe doing something with estimate_num_groups() to find the
EquivalenceClass member with the least distinct values would be
better. I just can't think how that could be done in a performant way.
Initial problem causes wrong cost_sort estimation. Right now I think 
about providing cost_sort() the sort clauses instead of (or in addition 
to) the pathkeys.


--
regards, Andrei Lepikhov





Re: Add 64-bit XIDs into PostgreSQL 15

2024-09-12 Thread Aleksander Alekseev
Michael, Maxim,

> Apparently, the original thread will inevitably disintegrate into many 
> separate ones.
> For me, looks like some kind of road map. One for 64-bit GUCs, another one to 
> remove
> short file names from SLRUs and, to make things more complicated, [1] for cf 
> entry [0],
> to get rid of MultiXactOffset wraparound by switching to 64 bits.
>
> [0] https://commitfest.postgresql.org/49/5205/
> [1] 
> https://www.postgresql.org/message-id/flat/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw%40mail.gmail.com

Agree.

To me it seems like we didn't reach a consensus regarding switching to
64-bit XIDs. Given that and the fact that the patchset is rather
difficult to rebase (and review) I suggest focusing on something we
reached a consensus for. I'm going to close a CF entry for this
particular thread as RwF unless anyone objects. We can always return
to this later, preferably knowing that there is a particular committer
who has time and energy for merging this.

-- 
Best regards,
Aleksander Alekseev




Re: Add has_large_object_privilege function

2024-09-12 Thread Fujii Masao




On 2024/09/10 17:45, Yugo NAGATA wrote:

On Mon, 9 Sep 2024 22:59:34 +0900
Fujii Masao  wrote:




On 2024/07/02 16:34, Yugo NAGATA wrote:

So, I would like to propose to add
has_large_object_function for checking if a user has the privilege on a large
object.


+1


Thank you for your looking into this!
I've attached a updated patch.


Thanks for updating the patches! LGTM, except for a couple of minor things:

+okui chiba * has_largeobject_privilege_id

"okui chiba" seems to be a garbage text accidentally added.

+ * role by Oid, large object by id, and privileges as AclMode.

"Oid" seems better than "id" in "large object by id".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Incremental Sort Cost Estimation Instability

2024-09-12 Thread David Rowley
On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov  wrote:
> Initial problem causes wrong cost_sort estimation. Right now I think
> about providing cost_sort() the sort clauses instead of (or in addition
> to) the pathkeys.

I'm not quite sure why the sort clauses matter any more than the
EquivalenceClass.  If the EquivalenceClass defines that all members
will have the same value for any given row, then, if we had to choose
any single member to drive the n_distinct estimate from, isn't the
most accurate distinct estimate from the member with the smallest
n_distinct estimate?  (That assumes the less distinct member has every
value the more distinct member has, which might not be true)

David




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-09-12 Thread jian he
On Mon, Sep 9, 2024 at 10:34 PM Jim Jones  wrote:
>
>
> Hi there
>
> On 26.08.24 02:00, jian he wrote:
> > hi all.
> > patch updated.
> > simplified the code a lot.
> >
> > idea is same:
> > COPY t_on_error_null FROM STDIN WITH (on_error set_to_null);
> >
> > If the STDIN number of columns is the same as the target table, then
> > InputFunctionCallSafe
> > call failure will make that column values be null.
> >
> >
> > If the STDIN number of columns is not the same as the target table, then 
> > error
> > ERROR:  missing data for column \"%s\"
> > ERROR:  extra data after last expected column
> > which is status quo.
>
> I wanted to give it another try, but the patch does not apply ...
>

here we are.
please check the attached file.
From 3d6b3d8b0393b5bc4950e85c40c69c0da46c8035 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 12 Sep 2024 17:07:02 +0800
Subject: [PATCH v4 1/1] on_error set_to_null

extent "on_error action", introduce new option: on_error set_to_null.

Due to current grammar, we cannot use "on_error null",
so I choose on_error set_to_null.

any data type conversion errors while the COPY FROM process will set that column value to be NULL.
this will only work with COPY FROM and non-binary format.

However this will respect the not-null constraint, meaning, if you actually converted error to null,
but the column has not-null constraint, not-null constraint violation ERROR will be raised.

discussion: https://postgr.es/m/CAKFQuwawy1e6YR4S=j+y7pXqg_Dw1WBVrgvf=bp3d1_asfe...@mail.gmail.com
---
 doc/src/sgml/ref/copy.sgml   |  1 +
 src/backend/commands/copy.c  |  4 ++-
 src/backend/commands/copyfrom.c  |  9 ++---
 src/backend/commands/copyfromparse.c | 11 +++
 src/include/commands/copy.h  |  1 +
 src/test/regress/expected/copy2.out  | 49 
 src/test/regress/sql/copy2.sql   | 44 +
 7 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..b6bdf45e7e 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -394,6 +394,7 @@ COPY { table_name [ ( error_action value of
   stop means fail the command, while
   ignore means discard the input row and continue with the next one.
+  set_to_null means the input value will set to null and continue with the next one.
   The default is stop.
  
  
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3bb579a3a4..05b152a090 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
  parser_errposition(pstate, def->location)));
 
 	/*
-	 * Allow "stop", or "ignore" values.
+	 * Allow "stop", or "ignore", "set_to_null" values.
 	 */
 	if (pg_strcasecmp(sval, "stop") == 0)
 		return COPY_ON_ERROR_STOP;
 	if (pg_strcasecmp(sval, "ignore") == 0)
 		return COPY_ON_ERROR_IGNORE;
+	if (pg_strcasecmp(sval, "set_to_null") == 0)
+		return COPY_ON_ERROR_NULL;
 
 	ereport(ERROR,
 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 2d3462913e..1669fac444 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1319,7 +1319,7 @@ CopyFrom(CopyFromState cstate)
 	/* Done, clean up */
 	error_context_stack = errcallback.previous;
 
-	if (cstate->opts.on_error != COPY_ON_ERROR_STOP &&
+	if (cstate->opts.on_error == COPY_ON_ERROR_IGNORE &&
 		cstate->num_errors > 0)
 		ereport(NOTICE,
 errmsg_plural("%llu row was skipped due to data type incompatibility",
@@ -1471,10 +1471,11 @@ BeginCopyFrom(ParseState *pstate,
 		cstate->escontext->error_occurred = false;
 
 		/*
-		 * Currently we only support COPY_ON_ERROR_IGNORE. We'll add other
-		 * options later
+		 * Currently we only support COPY_ON_ERROR_IGNORE, COPY_ON_ERROR_NULL.
+		 * We'll add other options later
 		 */
-		if (cstate->opts.on_error == COPY_ON_ERROR_IGNORE)
+		if (cstate->opts.on_error == COPY_ON_ERROR_IGNORE ||
+			cstate->opts.on_error == COPY_ON_ERROR_NULL)
 			cstate->escontext->details_wanted = false;
 	}
 	else
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 97a4c387a3..3fe32b76ac 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -969,6 +969,17 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 			{
 Assert(cstate->opts.on_error != COPY_ON_ERROR_STOP);
 
+if (cstate->opts.on_error == COPY_ON_ERROR_NULL)
+{
+	values[m] = (Datum) 0;
+	nulls[m] = true;
+	/*
+	 * set error_occurred to false, so next
+	 * InputFunctionCallSafe call behave sane.
+	*/
+	cstate->escontext->error_occurred = false;
+	continue;
+}
 cstate->num_errors++;
 
 if (cstate->opts.log_verbosity == COPY_LOG_VERBOSITY_VERBOSE)
diff --git a/

Re: Make printtup a bit faster

2024-09-12 Thread Andy Fan

>> ...  I'd put parts of out(print) function
>> refactor in the next 2 days. I think it deserves a double check before
>> working on *all* the out function.
>
> Well, sure.  You *cannot* write a patch that breaks existing output
> functions.  Not at the start, and not at the end either.  You
> should focus on writing the infrastructure and, for starters,
> converting just a few output functions as a demonstration.  If
> that gets accepted then you can work on converting other output
> functions a few at a time.  But they'll never all be done, because
> we can't realistically force extensions to convert.
>
> There are lots of examples of similar incremental conversions in our
> project's history.  I think the most recent example is the "soft error
> handling" work (d9f7f5d32, ccff2d20e, and many follow-on patches).

Thank you for this example! What I want is a smaller step than you said.
Our goal is to make the out functions take an extra StringInfo input to avoid
the extra palloc, memcpy and strlen. So the *final state* is:

1). implement all the out functions with (datum, StringInfo) as inputs.
2). change all the callers, like printtup or any other function.
3). any extension that is not in core has to change the out functions
for its data types. The patch in this thread can't help in this area,
but I guess it would not be very hard for extension authors.

The current (intermediate) stage is:
- I finished part of step (1), 17 functions in total, and named them
print functions; the function bodies are exactly the same as the out
functions in the final stage, so this part is reviewable.
- I use them in the printtup use case, so it is testable (for correctness
and performance testing purposes).

So I would like some of you to double-check these function bodies; if
anything is wrong, I can change it more easily now (vs. making the same
effort on all the type functions). Does that make sense?

Patches 0001 ~ 0003 are related and can be reviewed or committed
separately; 0004 is the main part of the above.
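
For reference, the shape of such a print function is roughly as follows (a
hand-written sketch to show the signature, not copied from the patches; it
assumes pg_ltoa() returning the string length, as it does in recent
branches):

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/builtins.h"

/*
 * Sketch: like int4out(), but appends the text form directly into the
 * caller-supplied StringInfo, so no separate palloc/strlen/memcpy.
 */
static void
int4_print(Datum value, StringInfo buf)
{
    char    str[12];        /* sign + 10 digits + '\0' */
    int     len;

    len = pg_ltoa(DatumGetInt32(value), str);
    appendBinaryStringInfo(buf, str, len);
}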

-- 
Best Regards
Andy Fan

>From 4fa462d02902e7ac278a312ad60f43c52f403753 Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Wed, 11 Sep 2024 12:25:52 +0800
Subject: [PATCH v20240912 3/4] add unlikely hint for enlargeStringInfo.

enlargeStringInfo has a noticeable ratio in a perf report with a
"select * from pg_class" workload. So add an unlikely() hint in
enlargeStringInfo to avoid some overhead.
---
 src/common/stringinfo.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/common/stringinfo.c b/src/common/stringinfo.c
index eb9d6502fc..838d9b80d0 100644
--- a/src/common/stringinfo.c
+++ b/src/common/stringinfo.c
@@ -297,7 +297,7 @@ enlargeStringInfo(StringInfo str, int needed)
 	 * Guard against out-of-range "needed" values.  Without this, we can get
 	 * an overflow or infinite loop in the following.
 	 */
-	if (needed < 0)/* should not happen */
+	if (unlikely(needed < 0))/* should not happen */
 	{
 #ifndef FRONTEND
 		elog(ERROR, "invalid string enlargement request size: %d", needed);
@@ -306,7 +306,7 @@ enlargeStringInfo(StringInfo str, int needed)
 		exit(EXIT_FAILURE);
 #endif
 	}
-	if (((Size) needed) >= (MaxAllocSize - (Size) str->len))
+	if (unlikely(((Size) needed) >= (MaxAllocSize - (Size) str->len)))
 	{
 #ifndef FRONTEND
 		ereport(ERROR,
-- 
2.45.1

>From dc1475195f1350745e45a5b7db381354f99d83da Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Wed, 11 Sep 2024 12:19:57 +0800
Subject: [PATCH v20240912 1/4] Refactor float8out_internal for better
 performance

Some callers, like cube and geo, need to call float8out_internal to get a
string and then copy it into their own StringInfo. In this commit, we let
the caller provide a buffer to float8out_internal so that it can put the
data into the buffer directly. This commit also reuses the existing string
length to avoid another strlen call in appendStringInfoString.
---
 contrib/cube/cube.c | 17 +++--
 src/backend/utils/adt/float.c   | 14 --
 src/backend/utils/adt/geo_ops.c | 34 ++---
 src/include/catalog/pg_type.h   |  1 +
 src/include/utils/float.h   |  2 +-
 5 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index 1fc447511a..a239acf35c 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -12,6 +12,7 @@
 
 #include "access/gist.h"
 #include "access/stratnum.h"
+#include "catalog/pg_type.h"
 #include "cubedata.h"
 #include "libpq/pqformat.h"
 #include "utils/array.h"
@@ -295,26 +296,38 @@ cube_out(PG_FUNCTION_ARGS)
 	StringInfoData buf;
 	int			dim = DIM(cube);
 	int			i;
+	int str_len;
 
 	initStringInfo(&buf);
 
 	appendStringInfoChar(&buf, '(');
+
+	/* 3 for ", " and 1 for '\0'. */
+	enlargeStringInfo(&buf, (MAXFLOAT8LEN + 4) * dim);
 	for (i = 0; i < dim; i++)
 	{
 		if (i > 0)
 			appendStringInfoString(&buf, ", ");
-		appendStringInfoString(&buf, float8out_internal(LL_COORD(cube, i)));
+		float8out_internal(LL

Re: Docs: Order of json aggregate functions

2024-09-12 Thread David Rowley
On Thu, 20 Jun 2024 at 05:50, Wolfgang Walther  wrote:
> json_arrayagg and json_agg_strict are out of place.
>
> Attached patch puts them in the right spot. This is the same down to v16.

Thank you.  I've pushed this and ended up backpatching to 16 too. It's
quite hard to unsee the broken order once seen.

It seems worth the backpatch both to reduce pain for any future
backpatches and also because the aggregates seemed rather badly
placed.

David




Re: not null constraints, again

2024-09-12 Thread Alvaro Herrera
On 2024-Sep-12, jian he wrote:

> ---exampleA
> drop table pp1,cc1, cc2;
> create table pp1 (f1 int);
> create table cc1 (f2 text, f3 int) inherits (pp1);
> create table cc2(f4 float) inherits(pp1,cc1);
> alter table pp1 alter column f1 set not null;
> execute constr_meta('{pp1,cc1, cc2}');
> 
> ---exampleB
> drop table pp1,cc1, cc2;
> create table pp1 (f1 int not null);
> create table cc1 (f2 text, f3 int) inherits (pp1);
> create table cc2(f4 float) inherits(pp1,cc1);
> execute constr_meta('{pp1,cc1, cc2}');
> 
> Should exampleA and exampleB
> return  same pg_constraint->coninhcount
> for not-null constraint "cc2_f1_not_null"
> ?

Yes, they should be identical.  In this case example A is wrong:
the constraint in cc2 should have inhcount=2 (which example B has), but
it has inhcount=1.  This becomes obvious if you do ALTER TABLE NO
INHERIT of both parents -- in example A, it fails the second one with
 ERROR:  relation 43823 has non-inherited constraint "cc2_f1_not_null"
because the inhcount was set wrong by SET NOT NULL.  Will fix.  (I think
the culprit is the "readyRels" stuff I had added -- I should nuke that
and add a CommandCounterIncrement in the right spot ...)


> We only have this Synopsis
> ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL

Yeah, this syntax is intended to add a "normal" not-null constraint,
i.e. one that inherits.

> --tests from src/test/regress/sql/inherit.sql
> CREATE TABLE inh_nn_parent (a int, NOT NULL a NO INHERIT);
> ALTER TABLE inh_nn_parent ALTER a SET NOT NULL;
> current fail at ATExecSetNotNull
> ERROR:  cannot change NO INHERIT status of NOT NULL constraint
> "inh_nn_parent_a_not_null" on relation "inh_nn_parent"

This is correct, because here you want a normal not-null constraint but
the table already has the weird ones that don't inherit.

> seems we translate
> ALTER TABLE inh_nn_parent ALTER a SET NOT NULL;
> to
> ALTER TABLE inh_nn_parent ALTER a SET NOT NULL INHERIT

Well, we don't "translate" it as such.  It's just what's normal.

> but we cannot (syntax error)
> ALTER TABLE inh_nn_parent ALTER a SET NOT NULL NO INHERIT;

I don't feel a need to support this syntax.  You can do it with the
ADD CONSTRAINT syntax if you need it.


>  /*
> - * Return the address of the modified column.  If the column was already NOT
> - * NULL, InvalidObjectAddress is returned.
> + * ALTER TABLE ALTER COLUMN SET NOT NULL
> + *
> + * Add a not-null constraint to a single table and its children.  Returns
> + * the address of the constraint added to the parent relation, if one gets
> + * added, or InvalidObjectAddress otherwise.
> + *
> + * We must recurse to child tables during execution, rather than using
> + * ALTER TABLE's normal prep-time recursion.
>   */
>  static ObjectAddress
> -ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
> - const char *colName, LOCKMODE lockmode)
> +ATExecSetNotNull(List **wqueue, Relation rel, char *conName, char *colName,
> + bool recurse, bool recursing, List **readyRels,
> + LOCKMODE lockmode)
> 
> you introduced two boolean "recurse", "recursing", don't have enough
> explanation.
> That is too much to comprehend.

Apologies.  I think it's a well-established pattern in tablecmds.c.
"bool recurse" is for the caller (ATRewriteCatalogs) to request
recursion.  "bool recursing" is for the function informing itself that
it is calling itself recursively, i.e. "I'm already recursing".  This is
mostly (?) used to skip things like permission checks.


> " * We must recurse to child tables during execution, rather than using
> " * ALTER TABLE's normal prep-time recursion.
> What does this sentence mean for these two boolean "recurse", "recursing"?

Here "recurse during execution" means ALTER TABLE's phase 2, that is,
ATRewriteCatalogs (which means some ATExecFoo function needs to
implement recursion internally), and "normal prep-time recursion" means
the recursion set up by phase 1 (ATPrepCmd), which creates separate
AlterTableCmd nodes for each child table.  See the comments for
AlterTable and the code for ATController.
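
To illustrate the pattern in general terms, a sketch with a made-up
ATExecSomething (not the actual patch code):

static ObjectAddress
ATExecSomething(List **wqueue, Relation rel, bool recurse, bool recursing,
                LOCKMODE lockmode)
{
    /* One-time work (e.g. permission checks) is skipped when recursing. */
    if (!recursing)
    {
        /* ... checks that must run only for the top-level relation ... */
    }

    /* ... perform the catalog change for "rel" itself ... */

    /* Phase-2 recursion: process the children here, if the caller asked. */
    if (recurse)
    {
        List       *children = find_inheritance_children(RelationGetRelid(rel),
                                                          lockmode);
        ListCell   *lc;

        foreach(lc, children)
        {
            Relation    childrel = table_open(lfirst_oid(lc), NoLock);

            /* recursing = true: the child invocation skips the one-time work */
            ATExecSomething(wqueue, childrel, true, true, lockmode);
            table_close(childrel, NoLock);
        }
    }

    return InvalidObjectAddress;
}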

> Finally, I did some cosmetic changes, also improved error message
> in MergeConstraintsIntoExisting

Thanks, will incorporate.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php




Re: CI, macports, darwin version problems

2024-09-12 Thread Joe Conway

On 9/12/24 02:05, Thomas Munro wrote:

On Tue, Sep 10, 2024 at 3:04 PM Thomas Munro  wrote:

On Mon, Sep 9, 2024 at 1:37 PM Joe Conway  wrote:

Seems the mounted drive got unmounted somehow ¯\_(ツ)_/¯

Please check it out and let me know if it is working properly now.


Looks good, thanks!


... but it's broken again.



The external volume somehow got unmounted again :-(
I have rebooted it and restarted the process now.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add has_large_object_privilege function

2024-09-12 Thread Yugo NAGATA
On Thu, 12 Sep 2024 19:07:22 +0900
Fujii Masao  wrote:

> 
> 
> On 2024/09/10 17:45, Yugo NAGATA wrote:
> > On Mon, 9 Sep 2024 22:59:34 +0900
> > Fujii Masao  wrote:
> > 
> >>
> >>
> >> On 2024/07/02 16:34, Yugo NAGATA wrote:
> >>> So, I would like to propose to add
> >>> has_large_object_function for checking if a user has the privilege on a 
> >>> large
> >>> object.
> >>
> >> +1
> > 
> > Thank you for your looking into this!
> > I've attached a updated patch.
> 
> Thanks for updating the patches! LGTM, except for a couple of minor things:
> 
> +okui chiba * has_largeobject_privilege_id
> 
> "okui chiba" seems to be a garbage text accidentally added.
> 
> + *   role by Oid, large object by id, and privileges as AclMode.
> 
> "Oid" seems better than "id" in "large object by id".

Thank you for pointing it out.
I've fixed them and attached the updated patch, v3.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
>From 97bc1f94c9e8be00f2649660f1f44d78fcc78653 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Tue, 2 Jul 2024 15:12:17 +0900
Subject: [PATCH v3 2/2] Add has_largeobject_privilege function

This function is for checking whether a user has a privilege on a
large object. There are three variations whose arguments are combinations
of a large object OID with a user name, a user OID, or the implicit user
(current_user). It returns NULL if a non-existing large object OID is
specified, and raises an error if a non-existing user name is specified.
This behavior is similar to has_table_privilege.
---
 doc/src/sgml/func.sgml   |  18 +++
 src/backend/utils/adt/acl.c  | 140 
 src/include/catalog/pg_proc.dat  |  13 ++
 src/test/regress/expected/privileges.out | 162 +++
 src/test/regress/sql/privileges.sql  |  42 ++
 5 files changed, 375 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1bde4091ca..53b602b6e3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25117,6 +25117,24 @@ SELECT has_function_privilege('joeuser', 'myfunc(int, text)', 'execute');

   
 
+  
+   
+
+ has_largeobject_privilege
+
+has_largeobject_privilege (
+   user name or oid, 
+  largeobject oid,
+  privilege text )
+boolean
+   
+   
+Does user have privilege for large object?
+Allowable privilege types are
+SELECT and UPDATE.
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index d7b39140b3..35ecef6eaa 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -26,6 +26,7 @@
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_language.h"
+#include "catalog/pg_largeobject.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_tablespace.h"
@@ -39,6 +40,7 @@
 #include "lib/bloomfilter.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
+#include "storage/large_object.h"
 #include "utils/acl.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
@@ -46,6 +48,7 @@
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
 
@@ -124,6 +127,7 @@ static AclMode convert_tablespace_priv_string(text *priv_type_text);
 static Oid	convert_type_name(text *typename);
 static AclMode convert_type_priv_string(text *priv_type_text);
 static AclMode convert_parameter_priv_string(text *priv_text);
+static AclMode convert_largeobject_priv_string(text *priv_text);
 static AclMode convert_role_priv_string(text *priv_type_text);
 static AclResult pg_role_aclcheck(Oid role_oid, Oid roleid, AclMode mode);
 
@@ -4669,6 +4673,142 @@ convert_parameter_priv_string(text *priv_text)
 	return convert_any_priv_string(priv_text, parameter_priv_map);
 }
 
+/*
+ * has_largeobject_privilege variants
+ *		These are all named "has_largeobject_privilege" at the SQL level.
+ *		They take various combinations of large object OID with
+ *		user name, user OID, or implicit user = current_user.
+ *
+ *		The result is a boolean value: true if user has the indicated
+ *		privilege, false if not, or NULL if object doesn't exist.
+ */
+
+/*
+ * has_lo_priv_byid
+ *
+ *		Helper function to check user privileges on a large object given the
+ *		role by Oid, large object by Oid, and privileges as AclMode.
+ */
+static bool
+has_lo_priv_byid(Oid roleid, Oid lobjId, AclMode priv, bool *is_missing)
+{
+	Snapshot	snapshot = NULL;
+	AclResult	aclresult;
+
+	if (priv & ACL_UPDATE)
+		snapshot = NULL;
+	else
+		snapshot = GetActiveSnapshot();
+
+	if (!LargeObjectExistsWithSnapshot(lobjId, snapshot))
+	{
+		Assert(is_missing != NULL);
+		*is_missing = true;
+		return false;
+	}
+
+	if (lo_compat_privileges)
+		return true;
+
+	aclresult = pg_largeobject_aclcheck_snapshot(lo

Re: Disallow altering invalidated replication slots

2024-09-12 Thread Amit Kapila
On Wed, Sep 11, 2024 at 8:41 AM shveta malik  wrote:
>
> On Tue, Sep 10, 2024 at 11:24 PM Bharath Rupireddy
>  wrote:
> >
> >
> > Please find the attached v2 patch also having Shveta's review comments
> > addressed.
>
> The v2 patch looks good to me.
>

LGTM as well. I'll push this tomorrow morning unless there are more
comments or suggestions.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Add CANONICAL option to xmlserialize

2024-09-12 Thread Jim Jones

On 10.09.24 19:43, Tom Lane wrote:
> How about instead introducing a plain function along the lines of
> "xml_canonicalize(xml, bool keep_comments) returns text" ?  The SQL
> committee will certainly never do that, but we won't regret having
> created a plain function whenever they get around to doing something
> in the same space.
A second function to serialize xml documents may sound a bit redundant,
but I totally understand the concern about possibly conflicting with the
SQL/XML spec in the future. I guess we can always come back here and
extend xmlserialize when the SQL committee moves in this direction.

v14 attached adds the function xmlcanonicalize, as suggested.

Thanks

-- 
Jim
From 08850417c9f0e1504a5e0cfbbd815c3a7aaaf7e8 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Thu, 12 Sep 2024 12:23:34 +0200
Subject: [PATCH v14] Add xmlcanonicalize function

This patch introduces the function xmlcanonicalize, which
serializes xml documents in their canonical form - as described in
the W3C Canonical XML Version 1.1 specification.

xmlcanonicalize(doc xml, keep_comments boolean) -> xml

doc: the XML document to be canonicalized
keep_comments: keeps or removes xml comments from doc

This feature is based on the function xmlC14NDocDumpMemory from the
C14N module of libxml2.
---
 doc/src/sgml/func.sgml  | 48 
 src/backend/utils/adt/xml.c | 40 +
 src/include/catalog/pg_proc.dat |  3 ++
 src/test/regress/expected/xml.out   | 70 +
 src/test/regress/expected/xml_1.out | 69 
 src/test/regress/expected/xml_2.out | 70 +
 src/test/regress/sql/xml.sql| 49 
 7 files changed, 349 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1bde4091ca..f63787d633 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -14325,6 +14325,54 @@ SELECT xmltext('< foo & bar >');
 

 
+
+xmlcanonicalize
+
+
+ xmlcanonicalize
+
+
+
+xmlcanonicalize ( doc xml, keep_comments boolean ) xml
+
+
+
+ This function converts a given XML document into its https://www.w3.org/TR/xml-c14n11/#Terminology";>canonical form
+ based on the https://www.w3.org/TR/xml-c14n11/";>W3C Canonical XML 1.1 Specification.
+ It is basically designed to provide applications the ability to compare xml documents or test if they
+ have been changed. The parameter keep_comments, specifies if the XML comments from the given document should be kept or not.
+
+
+
+ Example:
+
+
+   
+

 xmlcomment
 
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 1a07876cd5..3d9ca2e040 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * We used to check for xmlStructuredErrorContext via a configure test; but
@@ -545,6 +546,45 @@ xmltext(PG_FUNCTION_ARGS)
 }
 
 
+Datum
+xmlcanonicalize(PG_FUNCTION_ARGS)
+{
+#ifdef USE_LIBXML
+	xmltype*arg = PG_GETARG_XML_P(0);
+	bool		keep_comments = PG_GETARG_BOOL(1);
+	text	   *result;
+	int			nbytes;
+	xmlDocPtr	doc;
+	xmlChar*xmlbuf = NULL;
+
+	doc = xml_parse(arg, XMLOPTION_DOCUMENT, false,
+	GetDatabaseEncoding(), NULL, NULL, NULL);
+
+	/*
+	 * This dumps the canonicalized XML doc into the xmlChar* buffer.
+	 * mode = 2 means the doc will be canonicalized using the C14N 1.1 standard.
+	 */
+	nbytes = xmlC14NDocDumpMemory(doc, NULL, 2, NULL, keep_comments, &xmlbuf);
+
+	if(doc)
+		xmlFreeDoc(doc);
+
+	if(nbytes < 0)
+		ereport(ERROR,
+			(errcode(ERRCODE_INTERNAL_ERROR),
+			errmsg("could not canonicalize the given XML document")));
+
+	result = cstring_to_text_with_len((const char *) xmlbuf, nbytes);
+
+	xmlFree(xmlbuf);
+
+	PG_RETURN_XML_P(result);
+#else
+	NO_XML_SUPPORT();
+	return 0;
+#endif			/* not USE_LIBXML */
+}
+
 /*
  * TODO: xmlconcat needs to merge the notations and unparsed entities
  * of the argument values.  Not very important in practice, though.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ff5436acac..1a177647ef 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8905,6 +8905,9 @@
 { oid => '3813', descr => 'generate XML text node',
   proname => 'xmltext', prorettype => 'xml', proargtypes => 'text',
   prosrc => 'xmltext' },
+{ oid => '3814', descr => 'generate the canonical form of an XML document',
+  proname => 'xmlcanonicalize', prorettype => 'xml', proargtypes => 'xml bool',
+  prosrc => 'xmlcanonicalize' },
 
 { oid => '2923', descr => 'map table contents to XML',
   proname => 'table_to_xml', procost => '100', provolatile => 's',
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index 361a6f9b27..4994f31778 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1866,3 +1866,73

Re: PG_TEST_EXTRA and meson

2024-09-12 Thread Nazir Bilal Yavuz
Hi,

On Thu, 12 Sept 2024 at 12:35, Ashutosh Bapat
 wrote:
>
> Here's what I understand, please correct me: The code in meson.build
> is only called at the time of setup; not during meson test. Hence we
> can not check the existence of a runtime environment variable in that
> file. The things in test_env override those set at run time. So we
> save it as an argument to --pg_test_extra and then use it if
> PG_TEST_EXTRA is not set at run time. I tried to find if there's some
> other place to store "environment variables that can be overriden at
> runtime" but I can not find it. So it looks like this is the best we
> can do for now.

Yes, that is exactly what is happening.

> If it comes to a point where there are more such environment variables
> that need to be passed, probably we will pass a key-value string of
> those to testwrap. For now, for a single variable, this looks ok.

Yes, that would be better.

> > > +# If PG_TEST_EXTRA is not set in the environment, then look for its Meson
> > > +# configure option.
> > > +if "PG_TEST_EXTRA" not in os.environ and args.pg_test_extra:
> > > + env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
> > > +
> > >
> > > If somebody looks at just these lines, it's not clear how
> > > PG_TEST_EXTRA is passed to the test environment if it's available in
> > > os.environ. So one has to understand that env_dict is the test
> > > environment. If that's so, the code and comment rewritten as below
> > > makes more sense to me. What do you think?
> > >
> > > # If PG_TEST_EXTRA is not already part of the test environment, check if 
> > > it's
> > > # passed via program argument --pg_test_extra. Thus we also override
> > > # configuration time value by run time value of PG_TEST_EXTRA.
> > > if "PG_TEST_EXTRA" not in env_dict and args.pg_test_extra:
> > > env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
> >
> > I think your suggestion is better, done.
>
> I didn't see the expected change. I was talking about something like attached.
>
> Also
> 1. I have also made changes to the comment,
> 2. renamed the argument --pg_test_extra to --pg-test-extra using
> convention similar to other arguments.
> 3. few other cosmetic changes.
>
> Please review and incorporate those in the respective patches and
> tests. Sorry for a single diff.
>
> Once this is done, I think we can mark this CF entry as RFC.

Thanks for the changes. I applied all of them in respective patches.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 035fe94107fe2c03647a4bb3ec4f0b3d1673e004 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 11 Sep 2024 15:41:27 +0300
Subject: [PATCH v5 1/2] Make PG_TEST_EXTRA env variable override its Meson
 configure option

Since the PG_TEST_EXTRA configure option is set while configuring Meson,
each change to PG_TEST_EXTRA needs a reconfigure, and this is costly. So,
first try to use PG_TEST_EXTRA from the environment. If it does not exist,
then try to use it from its configure option in Meson builds.

After the above change, Meson builds are able to get PG_TEST_EXTRA from the
environment, and this overrides the configure option. The PG_TEST_EXTRA
environment variable is set at the top of the .cirrus.tasks.yml file. So,
PG_TEST_EXTRA in the configure scripts became useless and was removed
because of that.
---
 doc/src/sgml/installation.sgml |  6 --
 .cirrus.tasks.yml  |  6 +-
 meson.build| 11 ++-
 meson_options.txt  |  2 +-
 src/tools/testwrap | 10 ++
 5 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index ff9abd4649d..7c481e07e98 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -3073,8 +3073,10 @@ ninja install
   

 Enable test suites which require special software to run.  This option
-accepts arguments via a whitespace-separated list.  See  for details.
+accepts arguments via a whitespace-separated list.  Please note that
+this configure option is overridden by PG_TEST_EXTRA environment
+variable if it is set.  See  for
+details.

   
  
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 90cb95c8681..fc413eb11ef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -175,7 +175,6 @@ task:
 --buildtype=debug \
 -Dcassert=true -Dinjection_points=true \
 -Duuid=bsd -Dtcl_version=tcl86 -Ddtrace=auto \
--DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 -Dextra_lib_dirs=/usr/local/lib -Dextra_include_dirs=/usr/local/include/ \
 build
 EOF
@@ -364,7 +363,6 @@ task:
 --buildtype=debug \
 -Dcassert=true -Dinjection_points=true \
 ${LINUX_MESON_FEATURES} \
--DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 build
 EOF
 
@@ -380,7 +378,6 @@ task:
 -Dllvm=disabled \
 --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \

Re: pg_verifybackup: TAR format backup verification

2024-09-12 Thread Amul Sul
On Tue, Sep 10, 2024 at 1:31 AM Robert Haas  wrote:
>
> I would rather that you didn't add simple_ptr_list_destroy_deep()
> given that you don't need it for this patch series.
>
Done.

> +
> "\"%s\" unexpected file in the tar format backup",
>
> This doesn't seem grammatical to me. Perhaps change this to: file
> \"%s\" is not expected in a tar format backup
>
Ok, updated in the attached version.

> +   /* We are only interested in files that are not in the ignore list. */
> +   if (member->is_directory || member->is_link ||
> +   should_ignore_relpath(mystreamer->context, member->pathname))
> +   return;
>
> Doesn't this need to happen after we add pg_tblspc/$OID to the path,
> rather than before? I bet this doesn't work correctly for files in
> user-defined tablespaces, compared to the way it work for a
> directory-format backup.
>
> +   chartemp[MAXPGPATH];
> +
> +   /* Copy original name at temporary space */
> +   memcpy(temp, member->pathname, MAXPGPATH);
> +
> +   snprintf(member->pathname, MAXPGPATH, "%s/%d/%s",
> +"pg_tblspc", mystreamer->tblspc_oid, temp);
>
> I don't like this at all. This function doesn't have any business
> modifying the astreamer_member, and it doesn't need to. It can just do
> char *pathname; char tmppathbuf[MAXPGPATH] and then set pathname to
> either member->pathname or tmppathbuf depending on
> OidIsValid(tblspcoid). Also, shouldn't this be using %u, not %d?
>
True, fixed in the attached version.
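
For the archives, a minimal sketch of that approach (field names are taken
from the snippets quoted above, so treat them as assumptions):

```c
char	   *pathname;
char		tmppathbuf[MAXPGPATH];

if (OidIsValid(mystreamer->tblspc_oid))
{
	/* file lives in a user-defined tablespace: prefix with pg_tblspc/<oid> */
	snprintf(tmppathbuf, MAXPGPATH, "pg_tblspc/%u/%s",
			 mystreamer->tblspc_oid, member->pathname);
	pathname = tmppathbuf;
}
else
	pathname = member->pathname;
```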

> +   mystreamer->mfile = (void *) m;
>
> Either the cast to void * isn't necessary, or it indicates that
> there's a type mismatch that should be fixed.
>
Fixed -- was added in the very first version and forgotten in later updates.

> +* We could have these checks while receiving contents. However, since
> +* contents are received in multiple iterations, this would result in
> +* these lengthy checks being performed multiple times. Instead, 
> setting
> +* flags here and using them before proceeding with verification will 
> be
> +* more efficient.
>
> Seems unnecessary to explain this.
>
Removed.

> +   Assert(mystreamer->verify_checksum);
> +
> +   /* Should have came for the right file */
> +   Assert(strcmp(member->pathname, m->pathname) == 0);
> +
> +   /*
> +* The checksum context should match the type noted in the backup
> +* manifest.
> +*/
> +   Assert(checksum_ctx->type == m->checksum_type);
>
> What do you think about:
>
> Assert(m != NULL && !m->bad);
> Assert(checksum_ctx->type == m->checksum_type);
> Assert(strcmp(member->pathname, m->pathname) == 0);
>
> Or possibly change the first one to Assert(should_verify_checksum(m))?
>
LGTM.

> +   memcpy(&control_file, streamer->bbs_buffer.data,
> sizeof(ControlFileData));
>
> This probably doesn't really hurt anything, but it's a bit ugly. You
> first use astreamer_buffer_until() to force the entire file into a
> buffer. And then here, you copy the entire file into a second buffer
> which is exactly the same except that it's guaranteed to be properly
> aligned. It would be possible to include a ControlFileData in
> astreamer_verify and copy the bytes directly into it (you'd need a
> second astreamer_verify field for the number of bytes already copied
> into that structure). I'm not 100% sure that's worth the code, but it
> seems like it wouldn't take more than a few lines, so perhaps it is.
>
I think we could skip this memcpy() and directly cast
streamer->bbs_buffer.data to ControlFileData *, as we already ensure
that the correct length is being read just before this memcpy(). Did
the same in the attached version.

> +/*
> + * Verify plain backup.
> + */
> +static void
> +verify_plain_backup(verifier_context *context)
> +{
> +   Assert(context->format == 'p');
> +   verify_backup_directory(context, NULL, context->backup_directory);
> +}
> +
>
> This seems like a waste of space.
>
Yeah, but I aim to keep the function name self-explanatory and
consistent with the naming style.

> +verify_tar_backup(verifier_context *context)
>
> I like this a lot better now! I'm still not quite sure about the
> decision to have the ignore list apply to both the backup directory
> and the tar file contents -- but given the low participation on this
> thread, I don't think we have much chance of getting feedback from
> anyone else right now, so let's just do it the way you have it and we
> can change it later if someone shows up to complain.
>
Ok.

> +verify_all_tar_files(verifier_context *context, SimplePtrList *tarfiles)
>
> I think this code could be moved into its only caller instead of
> having a separate function. And then if you do that, maybe
> verify_one_tar_file could be renamed to just verify_tar_file. Or
> perhaps that function could also be removed and just move the code
> into the caller. It's not much cod

Re: Psql meta-command conninfo+

2024-09-12 Thread Jim Jones



On 11.09.24 13:35, Hunaid Sohail wrote:
> Hi Jim,
>
> On Wed, Sep 11, 2024 at 3:03 PM Jim Jones 
> wrote:
>
> Thanks for working on this.
>
> Any particular reason for the design change? In v28 it returned a
> table
> with a single row and multiple columns --- one column per
> attribute. But
> now it returns multiple rows. In this case, I was expecting 1 row
> with 7
> columns instead of 7 rows with 2 columns.
>
>
> I am not sure which design you are referring to.
> I haven't applied the v28 patch but the original author in thread [1]
> provided sample output. The output is in tabular form with 2 columns
> and multiple rows.
>
> [1] 
> https://www.postgresql.org/message-id/CP8P284MB249615AED23882E1E185C8ABEC3C2%40CP8P284MB2496.BRAP284.PROD.OUTLOOK.COM
>  
>
It may look like this, but it is a single record --- mind the header "-[
RECORD 1 ]+-".
psql was called in expanded mode:

> $ /home/pgsql-17devel/bin/psql -x -p 5432

"-x" or "--expanded"

Example:

$ psql postgres -xc "SELECT 'foo' col1, 'bar' col2"
-[ RECORD 1 ]
col1 | foo
col2 | bar

-- 
Jim





[PATCH] Support Int64 GUCs

2024-09-12 Thread Aleksander Alekseev
Hi,

Attached is a self-sufficient patch extracted from a larger patchset
[1]. The entire patchset probably will not proceed further in the
nearest future. Since there was interest in this particular patch it
deserves being discussed in a separate thread.

Currently we support 32-bit integer values in GUCs, but don't support
64-bit ones. The proposed patch adds this support.

Firstly, it adds DefineCustomInt64Variable() which can be used by the
extension authors.

Secondly, the following core GUCs are made 64-bit:

```
autovacuum_freeze_min_age
autovacuum_freeze_max_age
autovacuum_freeze_table_age
autovacuum_multixact_freeze_min_age
autovacuum_multixact_freeze_max_age
autovacuum_multixact_freeze_table_age
```

I see several open questions with the patch in its current state.

Firstly, I'm not sure if it is beneficial to affect the named GUCs out
of the context of the larger patchset. Perhaps we have better GUCs
that could benefit from being 64-bit? Or should we just leave alone
the core GUCs and focus on providing DefineCustomInt64Variable() ?

Secondly, DefineCustomInt64Variable() is not test-covered. Turned out
it was not even defined (although declared) in the original patch.
This was fixed in the attached version. Maybe one of the test modules
could use it even if it makes little sense for this particular module?
For instance, test/modules/worker_spi/ could use it for
worker_spi.naptime.
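
To illustrate, a hypothetical registration in worker_spi could look roughly
like this (assuming the new function mirrors the DefineCustomIntVariable()
signature with int64 value and bounds; I haven't cross-checked every
argument against the patch):

```c
static int64 worker_spi_naptime = 10;

void
_PG_init(void)
{
	DefineCustomInt64Variable("worker_spi.naptime",
							  "Duration between each check (in seconds).",
							  NULL,
							  &worker_spi_naptime,
							  10,				/* boot value */
							  1,				/* min */
							  PG_INT64_MAX,		/* max */
							  PGC_SIGHUP,
							  0,
							  NULL, NULL, NULL);
}
```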

Last but not least, large values like 12345678912345 could be
difficult to read. Perhaps we should also support 12_345_678_912_345
syntax? This is not implemented in the attached patch and arguably
could be discussed separately when and if we merge it.

Thoughts?

[1]: 
https://www.postgresql.org/message-id/flat/CAJ7c6TP9Ce021ebJ%3D5zOhMjiG3Wqg4hO6Mg0WsccErvAD9vZYA%40mail.gmail.com

--
Best regards,
Aleksander Alekseev


v1-0001-Support-64-bit-integer-GUCs.patch
Description: Binary data


Re: Add 64-bit XIDs into PostgreSQL 15

2024-09-12 Thread Aleksander Alekseev
Hi,

> Agree.
>
> To me it seems like we didn't reach a consensus regarding switching to
> 64-bit XIDs. Given that and the fact that the patchset is rather
> difficult to rebase (and review) I suggest focusing on something we
> reached a consensus for. I'm going to close a CF entry for this
> particular thread as RwF unless anyone objects. We can always return
> to this later, preferably knowing that there is a particular committer
> who has time and energy for merging this.

I started a new thread and opened a new CF entry for Int64 GUCs:

https://commitfest.postgresql.org/50/5253/

-- 
Best regards,
Aleksander Alekseev




Re: [PATCH] Support Int64 GUCs

2024-09-12 Thread Pavel Borisov
Hi, Alexander!
Thank you for working on this!

On Thu, 12 Sept 2024 at 15:08, Aleksander Alekseev 
wrote:

> Hi,
>
> Attached is a self-sufficient patch extracted from a larger patchset
> [1]. The entire patchset probably will not proceed further in the
> nearest future. Since there was interest in this particular patch it
> deserves being discussed in a separate thread.
>
> Currently we support 32-bit integer values in GUCs, but don't support
> 64-bit ones. The proposed patch adds this support.
>
> Firstly, it adds DefineCustomInt64Variable() which can be used by the
> extension authors.
>
> Secondly, the following core GUCs are made 64-bit:
>
> ```
> autovacuum_freeze_min_age
> autovacuum_freeze_max_age
> autovacuum_freeze_table_age
> autovacuum_multixact_freeze_min_age
> autovacuum_multixact_freeze_max_age
> autovacuum_multixact_freeze_table_age
> ```
>
> I see several open questions with the patch in its current state.
>
> Firstly, I'm not sure if it is beneficial to affect the named GUCs out
> of the context of the larger patchset. Perhaps we have better GUCs
> that could benefit from being 64-bit? Or should we just leave alone
> the core GUCs and focus on providing DefineCustomInt64Variable() ?
>
I think the direction is good and delivering 64-bit GUCs is very much worth
committing.
The patch itself looks good, but we might need to add locks against
concurrently modifying 64-bit values, which could be non-atomic on older
architectures.


> Secondly, DefineCustomInt64Variable() is not test-covered. Turned out
> it was not even defined (although declared) in the original patch.
> This was fixed in the attached version. Maybe one of the test modules
> could use it even if it makes little sense for this particular module?
> For instance, test/modules/worker_spi/ could use it for
> worker_spi.naptime.
>
> Last but not least, large values like 12345678912345 could be
> difficult to read. Perhaps we should also support 12_345_678_912_345
> syntax? This is not implemented in the attached patch and arguably
> could be discussed separately when and if we merge it.
>

I think 12345678912345 is good enough. Underscore dividers make reading a
little bit easier but look weird overall. I can't remember other places
where we output long numbers with dividers.

Regards,
Pavel Borisov
Supabase


Re: [PATCH] Support Int64 GUCs

2024-09-12 Thread Aleksander Alekseev
Hi Pavel,

> I think the direction is good and delivering 64-bit GUCs is very much worth 
> committing.
> The patch itself looks good, but we could need to add locks against 
> concurrently modifying 64-bit values, which could be non-atomic on older 
> architectures.

Thanks for the feedback.

> I think 12345678912345 is good enough. Underscore dividers make reading 
> little bit easier but look weird overall. I can't remember other places where 
> we output long numbers with dividers.

We already support this in SQL:

psql (18devel)
=# SELECT 123_456;
 ?column?
--
   123456

-- 
Best regards,
Aleksander Alekseev




Re: Avoid dead code (contrib/pg_visibility/pg_visibility.c)

2024-09-12 Thread Ranier Vilela
Em qui., 12 de set. de 2024 às 02:18, Michael Paquier 
escreveu:

> On Wed, Sep 04, 2024 at 01:50:24PM -0300, Ranier Vilela wrote:
> > I think that commit c582b75
> > (https://github.com/postgres/postgres/commit/c582b75851c2d096ce050d753494505a957cee75)
> > left an oversight.
> >
> > The report is:
> > CID 1559993: (#1 of 1): Logically dead code (DEADCODE)
> >
> > Trivial patch attached.
>
> I am not sure to understand what you mean here and if this is still
> relevant as of Noah's latest commit in 65c310b310a6.
>
Sorry Michael, but this patch became irrelevant after ddfc556

Note the omission to connect the dots from the commit.

best regards,
Ranier Vilela


Re: POC: make mxidoff 64 bits

2024-09-12 Thread Pavel Borisov
Hi, Maxim!

Previously we accessed offsets in shared MultiXactState without locks as
32-bit read is always atomic. But I'm not sure it's so when the offset
becomes 64-bit.
E.g. GetNewMultiXactId():

nextOffset = MultiXactState->nextOffset;
is outside lock.

There might be other places we do the same as well.

Regards,
Pavel Borisov
Supabase


Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-12 Thread Tatsuo Ishii
Here is the v3 patch. This time I only include the patch for the Window
Aggregate node. Patches for other node types will come after this
patch gets committed or comes close to a committable state.

David,
In this patch I refactored show_material_info. I divided it into
show_material_info and show_storage_info so that the latter can be
used by other node types including window aggregate node. What do you
think?

I also added a test case in explain.sql per discussion with Maxim
Orlov.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
From 74ab7724bfdb49dc111eca5d96abed9cd29abdc7 Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii 
Date: Thu, 12 Sep 2024 16:15:43 +0900
Subject: [PATCH v3] Add memory/disk usage for Window aggregate nodes in
 EXPLAIN.

This commit is similar to 1eff8279d and expands the idea to Window
aggregate nodes so that users can know how much memory or disk the
tuplestore used.

This commit uses newly introduced tuplestore_get_stats() to query this
information and add some additional output in EXPLAIN ANALYZE to
display the information for the Window aggregate node.
---
 src/backend/commands/explain.c| 31 +++
 src/test/regress/expected/explain.out | 11 ++
 src/test/regress/sql/explain.sql  |  3 +++
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2819e479f8..ce6792be8a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -126,7 +126,9 @@ static void show_sort_info(SortState *sortstate, ExplainState *es);
 static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_storage_info(Tuplestorestate *tupstore, ExplainState *es);
 static void show_material_info(MaterialState *mstate, ExplainState *es);
+static void show_windowagg_info(WindowAggState *winstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 			  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
@@ -2231,6 +2233,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		   planstate, es);
 			show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
 			"Run Condition", planstate, ancestors, es);
+			show_windowagg_info(castNode(WindowAggState, planstate), es);
 			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
@@ -3343,13 +3346,11 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 }
 
 /*
- * Show information on material node, storage method and maximum memory/disk
- * space used.
+ * Show information on storage method and maximum memory/disk space used.
  */
 static void
-show_material_info(MaterialState *mstate, ExplainState *es)
+show_storage_info(Tuplestorestate *tupstore, ExplainState *es)
 {
-	Tuplestorestate *tupstore = mstate->tuplestorestate;
 	char	   *maxStorageType;
 	int64		maxSpaceUsed,
 maxSpaceUsedKB;
@@ -3379,6 +3380,28 @@ show_material_info(MaterialState *mstate, ExplainState *es)
 	}
 }
 
+/*
+ * Show information on material node, storage method and maximum memory/disk
+ * space used.
+ */
+static void
+show_material_info(MaterialState *mstate, ExplainState *es)
+{
+	Tuplestorestate *tupstore = mstate->tuplestorestate;
+
+	show_storage_info(tupstore, es);
+}
+
+/*
+ * Show information on WindowAgg node, storage method and maximum memory/disk
+ * space used.
+ */
+static void
+show_windowagg_info(WindowAggState *winstate, ExplainState *es)
+{
+	show_storage_info(winstate->buffer, es);
+}
+
 /*
  * Show information on memoize hits/misses/evictions and memory usage.
  */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 6585c6a69e..d27fbdfebb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -691,3 +691,14 @@ select explain_filter('explain (analyze,serialize) create temp table explain_tem
  Execution Time: N.N ms
 (4 rows)
 
+-- Test tuplestore storage usage in Window aggregate
+select explain_filter('explain (analyze) select sum(n) over() from generate_series(1,100) a(n)');
+ explain_filter 
+
+ WindowAgg  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
+   Storage: Memory  Maximum Storage: NkB
+   ->  Function Scan on generate_series a  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index c7055f850c..50eaf554eb 100644
--- a/src/test/r

Re: POC: make mxidoff 64 bits

2024-09-12 Thread Pavel Borisov
On Thu, 12 Sept 2024 at 16:09, Pavel Borisov  wrote:

> Hi, Maxim!
>
> Previously we accessed offsets in shared MultiXactState without locks as
> 32-bit read is always atomic. But I'm not sure it's so when offset become
> 64-bit.
> E.g. GetNewMultiXactId():
>
> nextOffset = MultiXactState->nextOffset;
> is outside lock.
>
> There might be other places we do the same as well.
>
I think the replacement of plain assignments by
pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.

(The same I think is needed for the patchset [1])
[1]
https://www.postgresql.org/message-id/flat/CAJ7c6TMvPz8q+nC=jokniy7yxpzqycctnnfymsdp-nnwsao...@mail.gmail.com
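
Just to be concrete, what I have in mind is something like this (a sketch
only; it assumes the field is redeclared as pg_atomic_uint64 in
MultiXactStateData):

```c
uint64		nextOffset;

/* lock-free read instead of the plain assignment */
nextOffset = pg_atomic_read_u64(&MultiXactState->nextOffset);

/* ... and writers, still under the exclusive lock, would use ... */
pg_atomic_write_u64(&MultiXactState->nextOffset, nextOffset);
```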

Regards,
Pavel Borisov


Re: POC: make mxidoff 64 bits

2024-09-12 Thread Alvaro Herrera
On 2024-Sep-12, Pavel Borisov wrote:

> Hi, Maxim!
> 
> Previously we accessed offsets in shared MultiXactState without locks as
> 32-bit read is always atomic. But I'm not sure it's so when offset become
> 64-bit.
> E.g. GetNewMultiXactId():
> 
> nextOffset = MultiXactState->nextOffset;
> is outside lock.

Good thought.  But fortunately I think it's not a problem.  The read you
mention happens with MultiXactGenLock held in shared mode -- and that works
OK, as the assignment (in line 1263 at the bottom of the same routine) is
done with the lock held in exclusive mode.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Switch PgStat_HashKey.objoid from Oid to uint64

2024-09-12 Thread Bertrand Drouvot
Hi,

On Thu, Aug 29, 2024 at 08:56:59AM +0900, Michael Paquier wrote:
> On Mon, Aug 26, 2024 at 09:32:54AM +0300, Heikki Linnakangas wrote:
> > Currently, we rely on the fact that all the xl_xact_* structs require
> > sizeof(int) alignment. See comment above struct xl_xact_xinfo.
> 
> Thanks, I have missed this part.  So that explains the alignment I'd
> better use in the record.
> 
> > One idea is to store the uint64 as two uint32's.
> 
> Nice, we could just do that.  This idea makes me feel much better than
> sticking more aligment macros in the paths where the record is built.
> 
> Attached is an updated patch doing that.  ubsan is silent with that.

Thanks!

Yeah, indeed, with "COPT=-fsanitize=alignment -fno-sanitize-recover=all", then

"make -C src/test/modules/test_custom_rmgrs check":

- Is fine on master
- Fails with v1 applied due to things like:

xactdesc.c:91:28: runtime error: member access within misaligned address 
0x5d29d22cc13c for type 'struct xl_xact_stats_items', which requires 8 byte 
alignment
0x5d29d22cc13c: note: pointer points here
  7f 06 00 00 02 00 00 00  fe 7f 00 00 03 00 00 00  05 00 00 00 05 40 00 00  00 
00 00 00 03 00 00 00

- Is fine with v2

So v2 does fix the alignment issue and I also think that's a nice way to fix it.

A lot of the stuff that this patch does is mechanical changes:

- replace "objoid" by "objid" in *stats* files 
- change the related type from Oid to uint64
- make use of hash_bytes_extended() instead of hash_bytes when needed

and I don't see any issues here.

There is also some manipulation around the 2 new uint32 fields (objid_hi and
objid_lo) in the xactdesc.c and pgstat_xact.c files that looks good to me.
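
For readers of the archives, the hi/lo trick amounts to something like this
(a sketch, not the patch's exact code):

```c
static inline void
split_objid(uint64 objid, uint32 *hi, uint32 *lo)
{
	*hi = (uint32) (objid >> 32);
	*lo = (uint32) objid;
}

static inline uint64
join_objid(uint32 hi, uint32 lo)
{
	return ((uint64) hi << 32) | lo;
}
```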

But now we end up having functions that accept Oid as parameters to call
functions that accept uint64 as parameter (for the exact same parameter), for
example:

"
void
pgstat_create_function(Oid proid)
{
pgstat_create_transactional(PGSTAT_KIND_FUNCTION,
MyDatabaseId,
proid);
}
"

as pgstat_create_transactional is now:

-pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
+pgstat_create_transactional(PgStat_Kind kind, Oid dboid, uint64 objid)

That's not an issue as both are unsigned and as we do those calls in that
order (Oid -> uint64).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Support Int64 GUCs

2024-09-12 Thread Nathan Bossart
On Thu, Sep 12, 2024 at 02:08:15PM +0300, Aleksander Alekseev wrote:
> Secondly, the following core GUCs are made 64-bit:
> 
> ```
> autovacuum_freeze_min_age
> autovacuum_freeze_max_age
> autovacuum_freeze_table_age
> autovacuum_multixact_freeze_min_age
> autovacuum_multixact_freeze_max_age
> autovacuum_multixact_freeze_table_age
> ```
> 
> I see several open questions with the patch in its current state.
> 
> Firstly, I'm not sure if it is beneficial to affect the named GUCs out
> of the context of the larger patchset. Perhaps we have better GUCs
> that could benefit from being 64-bit? Or should we just leave alone
> the core GUCs and focus on providing DefineCustomInt64Variable() ?

I don't understand why we would want to make these GUCs 64-bit.  All of the
allowed values fit in an int32, so AFAICT this would only serve to mislead
users into thinking they could set these much higher than they can/should.

TBH I'm quite skeptical that this would even be particularly useful for
extension authors.  In what cases would a floating point value not suffice?
I'm not totally opposed to the idea of 64-bit GUCs, but I'd like more
information about the motivation.

-- 
nathan




Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2024-09-12 Thread Matthias van de Meent
On Mon, 9 Sept 2024 at 21:55, Peter Geoghegan  wrote:
>
> On Sat, Sep 7, 2024 at 11:27 AM Tomas Vondra  wrote:
> > I started looking at this patch today.
>
> Thanks for taking a look!
>
> > The first thing I usually do for
> > new patches is a stress test, so I did a simple script that generates
> > random table and runs a random query with IN() clause with various
> > configs (parallel query, index-only scans, ...). And it got stuck on a
> > parallel query pretty quick.
>
> I can reproduce this locally, without too much difficulty.
> Unfortunately, this is a bug on master/Postgres 17. Some kind of issue
> in my commit 5bf748b8.
[...]
> In short, one or two details of how backends call _bt_parallel_seize
> to pick up BTPARALLEL_NEED_PRIMSCAN work likely need to be rethought.

Thanks to Peter for the description; that helped me debug the issue. I
think I found a fix for the issue: regression tests for 811af978
consistently got stuck on my macbook before the attached patch 0001,
and after applying that patch they completed just fine.

The issue to me seems to be the following:

Only _bt_first can start a new primitive scan, so _bt_parallel_seize
only assigns a new primscan if the process is indeed in _bt_first (as
provided with _b_p_s(first=true)). All other backends that hit a
NEED_PRIMSCAN state will currently pause until a backend in _bt_first
does the next primitive scan.

A backend that hasn't requested the next primitive scan will likely
hit _bt_parallel_seize from code other than _bt_first, thus pausing.
If this is the leader process, it'll stop consuming tuples from
follower processes.

If the follower process finds that a new primitive scan is required after
finishing reading results from a page, it will first request a new
primitive scan, and only then start producing the tuples.

As such, we can have a follower process that just finished reading a
page, had issued a new primitive scan, and now tries to send tuples to
its primary process before getting back to _bt_first, but its
primary process won't acknowledge any tuples because it's waiting for
that process to start the next primitive scan - now we're deadlocked.

---

The fix in 0001 is relatively simple: we stop backends from waiting
for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
instead move our local state machine so that we'll hit _bt_first
ourselves, so that we may be able to start the next primitive scan.
Also attached is 0002, which adds tracking of responsible backends to
parallel btree scans, thus allowing us to assert we're never waiting
for our own process to move the state forward. I found this patch
helpful while working on solving this issue, even if it wouldn't have
found the bug as reported.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v1-0001-Fix-stuck-parallel-btree-scans.patch
Description: Binary data


v1-0002-nbtree-add-tracking-of-processing-responsibilitie.patch
Description: Binary data


Re: Incremental Sort Cost Estimation Instability

2024-09-12 Thread Tomas Vondra
On 9/12/24 12:12, David Rowley wrote:
> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov  wrote:
>> Initial problem causes wrong cost_sort estimation. Right now I think
>> about providing cost_sort() the sort clauses instead of (or in addition
>> to) the pathkeys.
> 
> I'm not quite sure why the sort clauses matter any more than the
> EquivalenceClass.  If the EquivalanceClass defines that all members
> will have the same value for any given row, then, if we had to choose
> any single member to drive the n_distinct estimate from, isn't the
> most accurate distinct estimate from the member with the smallest
> n_distinct estimate?  (That assumes the less distinct member has every
> value the more distinct member has, which might not be true)
> 
> David
> 

How large can the cost difference get? My assumption was it's correlated
with how different the ndistincts are for the two sides, so I tried

  CREATE TABLE t1(x integer, y integer,z text);
  CREATE TABLE t2(x integer, y integer,z text);

  INSERT INTO t1 (x,y) SELECT x, 1
FROM generate_series(1,100) AS x;
  INSERT INTO t2 (x,y) SELECT mod(x,1000), 1
FROM generate_series(1,100) AS x;

  CREATE INDEX ON t1(x);
  CREATE INDEX ON t2(x);
  CREATE INDEX ON t1(y);
  CREATE INDEX ON t2(y);

  VACUUM ANALYZE;

Which changes the ndistinct for t2.x from 1M to 1k. I expected the
cost difference to get much larger, but it does not really change:

GroupAggregate  (cost=38.99..37886.88 rows=992 width=16) (actual rows=1
loops=1)

GroupAggregate  (cost=37874.26..37904.04 rows=992 width=16) (actual
rows=1 loops=1)

That is pretty significant - the total cost difference is tiny, I'd even
say it does not matter in practice (seems well within possible impact of
collecting a different random sample).

But the startup cost changes in a rather absurd way, while the rest of the
plan is exactly the same. We even know this:

   ->  Incremental Sort  (cost=38.99..37869.52 rows=992 width=8)
 Sort Key: t1.x, t1.y
 Presorted Key: t1.x

in both cases. There's literally no other difference between these plans
visible in the explain ...


I'm not sure how to fix this, but it seems estimate_num_groups() needs
to do things differently. And I agree that looking for the minimum ndistinct
seems like the right approach. But isn't estimate_num_groups()
supposed to do that already? The comment says:

 * 3.  If the list contains Vars of different relations that are known equal
 * due to equivalence classes, then drop all but one of the Vars from each
 * known-equal set, keeping the one with smallest estimated # of values
 * (since the extra values of the others can't appear in joined rows).
 * Note the reason we only consider Vars of different relations is that
 * if we considered ones of the same rel, we'd be double-counting the
 * restriction selectivity of the equality in the next step.

I haven't debugged this, but how come this doesn't do the trick?


regards

-- 
Tomas Vondra




may be a mismatch between the construct_array and construct_md_array comments

2024-09-12 Thread Alena Rybakina
While working on the "POC, WIP: OR-clause support for indexes" project 
[0], it was suggested to use the construct_array function to form a 
one-dimensional array.


I noticed that there is a comment saying that values with NULL are not
processed there, but in fact this function calls the construct_md_array
function, which contains a comment saying that it can handle NULL values.

/*
 * construct_array	--- simple method for constructing an array object
 *
 * elems: array of Datum items to become the array contents
 *		  (NULL element values are not supported).
 */


/*
 * construct_md_array	--- simple method for constructing an array object
 *			with arbitrary dimensions and possible NULLs
 */

In the places where construct_md_array is called, I did not see any check
for NULL values or any restriction on the use of the function.


The tests run during the check did not show any problem with this [1].


Is this comment correct, or should we update it?


[0] https://commitfest.postgresql.org/49/4450/

[1] 
https://www.postgresql.org/message-id/CACJufxHCJvC3X8nUK-jRvRru-ZEXp16EBPADOwTGaqmOYM1Raw%40mail.gmail.com


Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2024-09-12 Thread Christoph Berg
Re: To Tom Lane
> Let's keep it like it is now in PG17.

Late followup news: This feature has actually found a bug in
postgresql-debversion:

 CREATE OPERATOR > (
   LEFTARG = debversion,
   RIGHTARG = debversion,
   COMMUTATOR = <,
-  NEGATOR = >=,
+  NEGATOR = <=,
   RESTRICT = scalargtsel,
   JOIN = scalargtjoinsel
 );

https://salsa.debian.org/postgresql/postgresql-debversion/-/commit/8ef08ccbea1438468249b0e94048b1a8a25fc625#000e84a71f8a28b762658375c194b25d529336f3

So, thanks!

Christoph




Re: may be a mismatch between the construct_array and construct_md_array comments

2024-09-12 Thread Tom Lane
Alena Rybakina  writes:
> I noticed that there is a comment saying that values with NULL are not
> processed there, but in fact this function calls the construct_md_array
> function, which contains a comment saying that it can handle NULL values.

Right.  construct_md_array has a "bool *nulls" argument, but
construct_array doesn't --- it passes NULL for that to
construct_md_array, which will therefore assume there are no null
array elements.
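
In other words, construct_array() is essentially just a thin wrapper,
roughly:

```c
ArrayType *
construct_array(Datum *elems, int nelems,
				Oid elmtype, int elmlen, bool elmbyval, char elmalign)
{
	int			dims[1] = {nelems};
	int			lbs[1] = {1};

	/* nulls == NULL, so no null elements are expected */
	return construct_md_array(elems, NULL, 1, dims, lbs,
							  elmtype, elmlen, elmbyval, elmalign);
}
```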

regards, tom lane




Re: tiny step toward threading: reduce dependence on setlocale()

2024-09-12 Thread Jeff Davis
On Mon, 2024-09-09 at 15:37 +0200, Peter Eisentraut wrote:
> In the end, I figured the best thing to do here is to add an explicit
> locale argument to MATCH_LOWER() and GETCHAR() so one can actually 
> understand this code by reading it.  New patch attached.

Looks good to me, thank you.

Regards,
Jeff Davis





Re: query_id, pg_stat_activity, extended query protocol

2024-09-12 Thread Sami Imseih
> Do you think that we'd better replace the calls reporting the query ID
> in execMain.c by some assertions on HEAD? This won't work for
> ExecutorStart() because PREPARE statements (or actually EXECUTE,
> e.g. I bumped on that yesterday but I don't recall which one) would

Yes, adding the asserts in execMain.c is better, but there are complications
there due to the issue you mention. I think the issue you are bumping into
is that when pg_stat_statements.track_utility = on (the default), the assert in
ExecutorStart will fail on EXECUTE. I believe it's because (need to verify)
pg_stat_statements.c sets the queryId = 0 in the ProcessUtility hook [1].
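
For reference, the kind of assertion being discussed would look roughly like
this (placement and exact condition are my assumption, not settled):

```c
/* e.g. near the top of ExecutorStart() */
Assert(!IsQueryIdEnabled() ||
	   queryDesc->plannedstmt->queryId != UINT64CONST(0));
```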

> So your point would be to force this rule within the core executor on
> HEAD? I would not object to that in case we're missing more spots
> with the extended query protocol, actually. That would help us detect
> cases where we're still missing the query ID to be set and the
> executor should know about that.

Yes, but looking at how pg_stat_statements works with PREPARE/EXECUTE,
I am now thinking it's better to just keep the tests in pg_stat_statements.
Having test coverage in pg_stat_statements is better than nothing, and
check-world (or similar) will be able to catch such failures.


> Note that I'm not much worried about the dependency with
> pg_stat_statements. We already rely on it for query jumbling
> normalization for some parse node patterns like DISCARD, and query
> jumbling requires query IDs to be around. So that's not new.

Good point.

[1] 
https://github.com/postgres/postgres/blob/master/contrib/pg_stat_statements/pg_stat_statements.c#L1127-L1128

Regards,

Sami 






Re: tiny step toward threading: reduce dependence on setlocale()

2024-09-12 Thread Andreas Karlsson

On 9/4/24 11:45 PM, Jeff Davis wrote:

Committed v2-0001.


> [...]


I fixed this by replacing the assert with an elog(ERROR, ...), so that
it will consistently show a "cache lookup failed for collation 0"
regardless of whether it's a debug build or not. It's not expected that
the error will be encountered.


Thanks!

Andreas





Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-12 Thread David Rowley
On Fri, 13 Sept 2024 at 00:12, Tatsuo Ishii  wrote:
> In this patch I refactored show_material_info. I divided it into
> show_material_info and show_storage_info so that the latter can be
> used by other node types including window aggregate node. What do you
> think?

Yes, I think it's a good idea to move that into a helper function. If
you do the other node types, without that helper the code would have
to be repeated quite a few times. Maybe show_storage_info() can be
moved up with the other helper functions, say below
show_sortorder_options() ?  It might be a good idea to keep the "if
(!es->analyze || tupstore == NULL)" checks in the calling function
rather than the helper too.

I thought about the location of the test for a while and read the
"This file is concerned with testing EXPLAIN in its own right."
comment at the top of that explain.out.  I was trying to decide if
testing output of a specific node type met this or not. I can't pick
out any other tests there which are specific to a node type, so I'm
unsure if this is the location for it or not. However, to put it
anywhere else means having to add a plpgsql function to mask out the
unstable parts of EXPLAIN, so maybe the location is good as it saves
from having to do that. I'm 50/50 on this, so I'm happy to let you
decide.  You could also shrink that 100 rows into a smaller number for
the generate_series without losing any coverage.

Aside from that, I think the patch is good.  Thanks for working on it.

David




Mutable foreign key constraints

2024-09-12 Thread Tom Lane
I happened to notice that Postgres will let you do

regression=# create table foo (id timestamp primary key);
CREATE TABLE
regression=# create table bar (ts timestamptz references foo);
CREATE TABLE

This strikes me as a pretty bad idea, because whether a particular
timestamp is equal to a particular timestamptz depends on your
timezone setting.  Thus the constraint could appear to be violated
after a timezone change.
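
For example (contrived, but it shows the hazard):

```sql
SET timezone = 'UTC';
INSERT INTO foo VALUES ('2024-01-01 00:00');
INSERT INTO bar VALUES ('2024-01-01 00:00+00');  -- FK check passes

SET timezone = 'America/New_York';
-- Now '2024-01-01 00:00'::timestamp is interpreted as local time, so it no
-- longer compares equal to '2024-01-01 00:00+00'::timestamptz, and bar's
-- row appears to have no matching row in foo.
```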

I'm inclined to propose rejecting FK constraints if the comparison
operator is not immutable.  Among the built-in opclasses, the only
instances of non-immutable btree equality operators are

regression=# select amopopr::regoperator from pg_amop join pg_operator o on 
o.oid = amopopr join pg_proc p on p.oid = oprcode where amopmethod=403 and 
amopstrategy=3 and provolatile != 'i';
 amopopr 
-
 =(date,timestamp with time zone)
 =(timestamp without time zone,timestamp with time zone)
 =(timestamp with time zone,date)
 =(timestamp with time zone,timestamp without time zone)
(4 rows)

A possible objection is that if anybody has such a setup and
hasn't noticed a problem because they never change their
timezone setting, they might not appreciate us breaking it.
So I certainly wouldn't propose back-patching this.  But
maybe we should add it as a foot-gun defense going forward.

Thoughts?

regards, tom lane




Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Nathan Bossart
On Thu, Sep 12, 2024 at 11:19:00AM +0900, Michael Paquier wrote:
> On Wed, Sep 11, 2024 at 11:02:43PM +0100, Matthias van de Meent wrote:
>> As for patch v3, that seems fine with me.
> 
> +1 from here as well, after looking at what v3 is doing for these two
> modules.

Committed.

-- 
nathan




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-12 Thread Tomas Vondra
Hi,

I've spent quite a bit of time trying to identify cases where having
more fast-path lock slots could be harmful, without any luck. I started
with the EPYC machine I used for the earlier tests, but found nothing,
except for a couple of cases unrelated to this patch (they affect even
runs without the patch applied at all) - more likely random noise or
maybe some issue with the VM (or differences from the VM used earlier). I
pushed the results to github [1] anyway, if anyone wants to look.

So I switched to my smaller machines, and ran a simple test on master,
with the hard-coded arrays, and with the arrays moves out of PGPROC (and
sized per max_locks_per_transaction).

I was looking for regressions, so I wanted to test a case that can't
benefit from fast-path locking, while paying the costs. So I decided to
do pgbench -S with 4 partitions, because that fits into the 16 slots we
had before, and scale 1 to keep everything in memory. And then did a
couple read-only runs, first with 64 locks/transaction (default), then
with 1024 locks/transaction.

Attached is a shell script I used to collect this - it creates and
removes clusters, so be careful. Should be fairly obvious what it tests
and how.

The results for max_locks_per_transaction=64 look like this (the numbers
are throughput):


  machine  mode      clients   master   built-in   with-guc
  ----------------------------------------------------------
       i5  prepared        1    14970      14991      14981
                           4    51638      51615      51388
           simple          1    14042      14136      14008
                           4    48705      48572      48457
  ----------------------------------------------------------
     xeon  prepared        1    13213      13330      13170
                           4    49280      49191      49263
                          16   151413     152268     151560
           simple          1    12250      12291      12316
                           4    45910      46148      45843
                          16   141774     142165     142310

And compared to master

  machine  mode      clients   built-in   with-guc
  -------------------------------------------------
       i5  prepared        1    100.14%    100.08%
                           4     99.95%     99.51%
           simple          1    100.67%     99.76%
                           4     99.73%     99.49%
  -------------------------------------------------
     xeon  prepared        1    100.89%     99.68%
                           4     99.82%     99.97%
                          16    100.56%    100.10%
           simple          1    100.34%    100.54%
                           4    100.52%     99.85%
                          16    100.28%    100.38%

So, no difference whatsoever - it's +/- 0.5%, well within random noise.
And with max_locks_per_transaction=1024 the story is exactly the same:

  machine  mode      clients   master   built-in   with-guc
  ----------------------------------------------------------
       i5  prepared        1    15000      14928      14948
                           4    51498      51351      51504
           simple          1    14124      14092      14065
                           4    48531      48517      48351
     xeon  prepared        1    13384      13325      13290
                           4    49257      49309      49345
                          16   151668     151940     152201
           simple          1    12357      12351      12363
                           4    46039      46126      46201
                          16   141851     142402     142427


  machine  mode      clients   built-in   with-guc
  -------------------------------------------------
       i5  prepared        1     99.52%     99.65%
                           4     99.71%    100.01%
           simple          1     99.77%     99.58%
                           4     99.97%     99.63%
     xeon  prepared        1     99.56%     99.30%
                           4    100.11%    100.18%
                          16    100.18%    100.35%
           simple          1     99.96%    100.05%
                           4    100.19%    100.35%
                          16    100.39%    100.41%

with max_locks_per_transaction=1024, it's fair to expect the fast-path
locking to be quite beneficial. Of course, it's possible the GUC is set
this high because of some rare issue (say, to run pg_dump, which needs
to lock everything).

I did look at docs if anything needs updating, but I don't think so. The
SGML docs only talk about fast-path locking at fairly high level, not
about how many we have etc. Same for src/backend/storage/lmgr/README,
which is focusing on the correctness of fast-path locking, and that's
not changed by this patch.

I also cleaned up (removed) some of the Asserts checking that we got a
valid group / slot index. I don't think this really helped in practice,
once I added asserts to the macr

Re: AIO v2.0

2024-09-12 Thread Robert Pang
Hi Andres

Thanks for the AIO patch update. I gave it a try and ran into a FATAL
in bgwriter when executing a benchmark.

2024-09-12 01:38:00.851 PDT [2780939] PANIC:  no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG:  background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG:  terminating any other
active server processes

I debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers like BufferSync() for checkpoint.
Here is a small patch to fix it.

Best regards
Robert




On Fri, Sep 6, 2024 at 12:47 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-09-05 01:37:34 +0800, 陈宗志 wrote:
> > I hope there can be a high-level design document that includes a
> > description, high-level architecture, and low-level design.
> > This way, others can also participate in reviewing the code.
>
> Yep, that was already on my todo list. The version I just posted includes
> that.
>
>
> > For example, which paths were modified in the AIO module?
> > Is it the path for writing WAL logs, or the path for flushing pages, etc.?
>
> I don't think it's good to document this in a design document - that's just
> bound to get out of date.
>
> For now the patchset causes AIO to be used for
>
> 1) all users of read_stream.h, e.g. sequential scans
>
> 2) bgwriter / checkpointer, mainly to have way to exercise the write path. As
>mentioned in my email upthread, the code for that is in a somewhat rough
>shape as Thomas Munro is working on a more general abstraction for some of
>this.
>
> The earlier patchset added a lot more AIO uses because I needed to know all
> the design constraints. It e.g. added AIO use in WAL. While that allowed me to
> learn a lot, it's not something that makes sense to continue working on for
> now, as it requires a lot of work that's independent of AIO.  Thus I am
> focusing on the above users for now.
>
>
> > Also, I recommend keeping this patch as small as possible.
>
> Yep. That's my goal (as mentioned upthread).
>
>
> > For example, the first step could be to introduce libaio only, without
> > considering io_uring, as that would make it too complex.
>
> Currently the patchset doesn't contain libaio support and I am not planning to
> work on using libaio. Nor do I think it makes sense for anybody else to do so
> - libaio doesn't work for buffered IO, making it imo not particularly useful
> for us.
>
> The io_uring specific code isn't particularly complex / large compared to the
> main AIO infrastructure.
>
> Greetings,
>
> Andres Freund
>
>
From bd04bd18ce62cf3f88568d3578503d4efeeb6603 Mon Sep 17 00:00:00 2001
From: Robert Pang 
Date: Thu, 12 Sep 2024 14:36:16 -0700
Subject: [PATCH] Fix BgBufferSync to limit batch size under io_bounce_buffers
 for bgwriter.

---
 src/backend/storage/buffer/bufmgr.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 44b1b6fb31..4cd959b295 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3396,6 +3396,7 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
 	uint32		new_recent_alloc;
 
 	BuffersToWrite to_write;
+	int			max_combine;
 
 	/*
 	 * Find out where the freelist clock sweep currently is, and how many
@@ -3417,6 +3418,8 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
 		return true;
 	}
 
+	max_combine = Min(io_bounce_buffers, io_combine_limit);
+
 	/*
 	 * Compute strategy_delta = how many buffers have been scanned by the
 	 * clock sweep since last time.  If first time through, assume none. Then
@@ -3604,7 +3607,7 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
 		{
 			Assert(sync_state & BUF_REUSABLE);
 
-			if (to_write.nbuffers == io_combine_limit)
+			if (to_write.nbuffers == max_combine)
 			{
 WriteBuffers(&to_write, ioq, wb_context);
 			}
-- 
2.46.0.662.g92d0881bb0-goog



Re: Mutable foreign key constraints

2024-09-12 Thread David G. Johnston
On Thursday, September 12, 2024, Tom Lane  wrote:

>
> A possible objection is that if anybody has such a setup and
> hasn't noticed a problem because they never change their
> timezone setting, they might not appreciate us breaking it.
> So I certainly wouldn't propose back-patching this.  But
> maybe we should add it as a foot-gun defense going forward.
>

I’m disinclined to begin enforcing this.  If they got a volatile data type
in a key column and don’t attempt to index the key, which would fail on the
volatile side, I’d be mighty surprised.  I don’t really have much sympathy
for anyone who got themselves into the described position, but I don’t see
this as unsafe enough to force a potentially large table rewrite on those who
managed to build a fragile but functioning model.

I suggest adding the commentary and queries used to check for just such a
situation to the “don’t do this” page of the wiki, and there explain that
while this is allowed for backward compatibility, it is definitely not a
recommended setup.

David J.


Re: Mutable foreign key constraints

2024-09-12 Thread Tom Lane
"David G. Johnston"  writes:
> On Thursday, September 12, 2024, Tom Lane  wrote:
>> A possible objection is that if anybody has such a setup and
>> hasn't noticed a problem because they never change their
>> timezone setting, they might not appreciate us breaking it.
>> So I certainly wouldn't propose back-patching this.  But
>> maybe we should add it as a foot-gun defense going forward.

> I’m disinclined to begin enforcing this.  If they got a volatile data type
> in a key column and don’t attempt to index the key, which would fail on the
> volatile side, I’d be mighty surprised.

Um, neither type is "volatile" and each can be indexed just fine.
It's the cross-type comparison required by the FK that brings the
hazard.

> I suggest adding the commentary and queries used to check for just such a
> situation to the “don’t do this page” of the wiki and there just explain
> while allowed for backward compatibility it is definitely not a recommended
> setup.

Yeah, that's a possible approach.

regards, tom lane




Re: Switch PgStat_HashKey.objoid from Oid to uint64

2024-09-12 Thread Michael Paquier
On Thu, Sep 12, 2024 at 01:37:52PM +, Bertrand Drouvot wrote:
> There is also some manipulation around the 2 new uint32 fields (objid_hi and
> objid_lo) in the xactdesc.c and pgstat_xact.c files that look good to me.

Thanks for the reviews.  The high and low manipulations are still kind
of OK to me as a solution for the record constructions.

> But now we end up having functions that accept Oid as parameters to call
> functions that accept uint64 as parameter (for the exact same parameter), for
> example:
> 
> "
> void
> pgstat_create_function(Oid proid)
> {
> pgstat_create_transactional(PGSTAT_KIND_FUNCTION,
> MyDatabaseId,
> proid);
> }
> "
> 
> as pgstat_create_transactional is now:
> 
> -pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
> +pgstat_create_transactional(PgStat_Kind kind, Oid dboid, uint64 objid)
> 
> That's not an issue as both are unsigned and as we do those calls in that
> order (Oid -> uint64).

Yes, that's intentional.  All the pgstats routines associated to a
particular object that depends on an OID should still use an OID, and
anything that's generic enough to be used for all stats kinds had
better use a uint64.  I was wondering if it would be better hiding
that behind a dedicated type, but decided to stick with uint64.
--
Michael


signature.asc
Description: PGP signature


Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-12 Thread Masahiko Sawada
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch  wrote:
>
> On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> > Masahiko Sawada  writes:
> > > On Tue, Sep 10, 2024 at 11:57 AM Tom Lane  wrote:
> > >> Yeah, that seems like it could work.  But are we sure that replicas
> > >> get a copy of the primary's control file rather than creating their
> > >> own?
> >
> > > Yes, I think so. Since at least the system identifiers of primary and
> > > replicas must be identical for physical replication, if replicas use
> > > their own control files then they cannot start the replication.
> >
> > Got it.  So now I'm wondering if we need all the complexity of storing
> > stuff in the GIN metapages.  Could we simply read the (primary's)
> > signedness out of pg_control and use that?
>
> Yes.

Indeed.

I've attached a PoC patch for this idea. We write  the default char
signedness to the control file at initdb time. Then when comparing two
trgms, pg_trgm opclasses use a comparison function based on the char
signedness of the cluster. I've confirmed that the patch fixes the
reported case at least.
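
The idea, in sketch form (not the patch's actual code), is simply to select
the comparison semantics from the cluster-wide default recorded in
pg_control:

```c
static inline int
cmp_trgm_char(char a, char b, bool default_char_signedness)
{
	if (default_char_signedness)
		return (int) (signed char) a - (int) (signed char) b;
	else
		return (int) (unsigned char) a - (int) (unsigned char) b;
}
```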

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


poc_char_signedness.patch
Description: Binary data


Re: [PATCH] audo-detect and use -moutline-atomics compilation flag for aarch64

2024-09-12 Thread Thomas Munro
Hi,

FYI, people interested in this thread might be interested in
pgsql-bugs #18610.  There are two related issues here:

1.  Some people want to use LSE on modern ARM servers so they want to
use -moutline-atomics, which IIUC adds auto-detection logic with
fallback code so it can still run on the first generation of aarch64
ARMv8 (without .1) hardware.  That was being discussed as a feature
proposal for master.  (People could already access that if they
compile from source by using -march=something_modern, but the big
distributions are in charge of what they target and AFAIK mostly still
choose ARMv8, so this outline atomics idea is a nice workaround to
make everyone happy, I haven't studied exactly how it works.)

2.  Clang has started assuming -moutline-atomics in some version, so
it's already compiling .bc files that way, which breaks if our JIT
system decides to inline SQL-callable functions, so we'll need to
decide what to do about that and back-patch something.  The conservative
choice would be to stop it from doing that with -mno-outline-atomics,
until this thread makes progress, but perhaps people closer to the
subject have another idea...

[1] 
https://www.postgresql.org/message-id/flat/18610-37bf303f904fede3%40postgresql.org




Re: type cache cleanup improvements

2024-09-12 Thread Alexander Korotkov
On Sat, Aug 31, 2024 at 10:33 PM Andrei Lepikhov  wrote:
> On 29/8/2024 11:01, Alexander Korotkov wrote:
> > On Mon, Aug 26, 2024 at 11:26 AM Alexander Korotkov
> >> Secondly, I'm not terribly happy with current state of type cache.
> >> The caller of lookup_type_cache() might get already invalidated data.
> >> This probably OK, because caller probably hold locks on dependent
> >> objects to guarantee that relevant properties of type actually
> >> persists.  At very least this should be documented, but it doesn't
> >> seem so.  Setting of tupdesc is sensitive to its order of execution.
> >> That feels quite fragile to me, and not documented either.  I think
> >> this area needs improvements before we push additional functionality
> >> there.
> >
> > I see fdd965d074 added a proper handling for concurrent invalidation
> > for relation cache. If a concurrent invalidation occurs, we retry
> > building a relation descriptor.  Thus, we end up with returning of a
> > valid relation descriptor to caller.  I wonder if we can take the same
> > approach to type cache.  That would make the whole type cache more
> > consistent and less fragile.  Also, this patch will be simpler.
> I think I understand the solution from the commit fdd965d074.
> Just for the record, you mentioned invalidation inside the
> lookup_type_cache above. Passing through the code, I found the only
> place for such a case - the call of the GetDefaultOpClass, which
> triggers the opening of the relation pg_opclass, which can cause an
> AcceptInvalidationMessages call. Did you mean this case, or does a wider
> field of cases exist here?

I've tried to implement handling of concurrent invalidation similar to
commit fdd965d074.  However, that appears to be more difficult than I
thought, because for some datatypes like arrays, ranges etc. we might
need to fill the element type and reference it.  So, I decided to
continue with the current approach while borrowing some ideas from
fdd965d074.  The revised patchset is attached.

0001 - adds a comment about concurrent invalidation handling
0002 - revised c14d4acb8.  Now we track type oids whose TypeCacheEntry
filling is in progress.  An entry is added to RelIdToTypeIdCacheHash at
the end of lookup_type_cache() or on transaction abort.  During
invalidation we don't assert that a RelIdToTypeIdCacheHash entry is
present if the TypeCacheEntry is still in progress.

--
Regards,
Alexander Korotkov
Supabase
From 0ed4be04b42acab74efc59ea35772fd51f1b954c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Fri, 13 Sep 2024 02:10:04 +0300
Subject: [PATCH v11 1/2] Update header comment for lookup_type_cache()

Describe the way we handle concurrent invalidation messages.
---
 src/backend/utils/cache/typcache.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index 2ec136b7d30..11382547ec0 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -351,6 +351,15 @@ type_cache_syshash(const void *key, Size keysize)
  * invalid.  Note however that we may fail to find one or more of the
  * values requested by 'flags'; the caller needs to check whether the fields
  * are InvalidOid or not.
+ *
+ * Note that while filling TypeCacheEntry we might process concurrent
+ * invalidation messages, causing our not-yet-filled TypeCacheEntry to be
+ * invalidated.  In this case, we typically only clear flags while values are
+ * still available for the caller.  It's expected that the caller holds
+ * enough locks on type-depending objects that the values are still relevant.
+ * It's also important that the tupdesc is filled in the last for
+ * TYPTYPE_COMPOSITE. So, it can't get invalidated during the
+ * lookup_type_cache() call.
  */
 TypeCacheEntry *
 lookup_type_cache(Oid type_id, int flags)
-- 
2.39.3 (Apple Git-146)

From 9735f1266f3dad2955758a28814c4b8a593d834b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Tue, 10 Sep 2024 23:25:04 +0300
Subject: [PATCH v11 2/2] Avoid looping over all type cache entries in
 TypeCacheRelCallback()

Currently, when a single relcache entry gets invalidated,
TypeCacheRelCallback() has to loop over all type cache entries to find
the appropriate typentry to invalidate.  Unfortunately, using the syscache
here is impossible, because this callback could be called outside a
transaction, which makes catalog lookups impossible.  This is why the
present commit introduces RelIdToTypeIdCacheHash to map a relation OID to
its composite type OID.

We keep the RelIdToTypeIdCacheHash entry as long as the corresponding type
cache entry has something to clean.  Therefore, RelIdToTypeIdCacheHash
shouldn't get bloated by a flood of temporary tables.

Discussion: https://postgr.es/m/5812a6e5-68ae-4d84-9d85-b443176966a1%40sigaev.ru
Author: Teodor Sigaev
Reviewed-by: Aleksander Alekseev, Tom Lane, Michael Paquier, Roman Zharkov
Reviewed-by: Andrei Lepikhov, Pavel Borisov
---
 src/backend/access/transam/xact.c  |  10 +
 src/backend/ut

Re: scalability bottlenecks with (many) partitions (and more)

2024-09-12 Thread Tomas Vondra
Turns out there was a bug in EXEC_BACKEND mode, causing failures on the
Windows machine in CI. AFAIK the reason is pretty simple - the backends
don't see the number of fast-path groups postmaster calculated from
max_locks_per_transaction.

Fixed that by calculating it again in AttachSharedMemoryStructs, which
seems to have done the trick. With this the CI builds pass just fine,
but I'm not sure if EXEC_BACKEND builds may have some other issues with the
PGPROC changes. Could it happen that the shared memory gets mapped
differently, in which case the pointers might need to be adjusted?


regards

-- 
Tomas Vondra
From 45a2111c51016386a31d766700b23d9d88ff6c0b Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Thu, 12 Sep 2024 23:09:41 +0200
Subject: [PATCH v20240913 1/2] Increase the number of fast-path lock slots

The fast-path locking introduced in 9.2 allowed each backend to acquire
up to 16 relation locks cheaply, provided the lock level allows that.
If a backend needs to hold more locks, it has to insert them into the
regular lock table in shared memory. This is considerably more
expensive, and on many-core systems may be subject to contention.

The limit of 16 entries was always rather low, even with simple queries
and schemas with only a few tables. We have to lock all relations - not
just tables, but also indexes, views, etc. Moreover, for planning we
need to lock all relations that might be used in the plan, not just
those that actually get used in the final plan. It only takes a couple
tables with multiple indexes to need more than 16 locks. It was quite
common to fill all fast-path slots.

As partitioning gets used more widely, with more and more partitions,
this limit is trivial to hit, with complex queries easily using hundreds
or even thousands of locks. For workloads doing a lot of I/O this is not
noticeable, but on large machines with enough RAM to keep the data in
memory, the access to the shared lock table may be a serious issue.

This patch improves this by increasing the number of fast-path slots
from 16 to 1024. The slots remain in PGPROC, and are organized as an
array of 16-slot groups (each group being effectively a clone of the
original fast-path approach). Instead of accessing this as a big hash
table with open addressing, we treat this as a 16-way set associative
cache. Each relation (identified by a "relid" OID) is mapped to a
particular 16-slot group by calculating a hash

h(relid) = ((relid * P) mod N)

where P is a hard-coded prime, and N is the number of groups. This is
not a great hash function, but it works well enough - the main purpose
is to prevent "hot groups" with runs of consecutive OIDs, which might
fill some of the fast-path groups. The multiplication by P ensures that.
If the OIDs are already spread out, the hash should not group them.

The groups are processed by linear search. With only 16 entries this is
cheap, and the groups have very good locality.
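
To get a feel for how the multiplication spreads runs of consecutive OIDs
across the groups, the mapping can be eyeballed with a query like the sketch
below (P = 49157 is just an illustrative prime, not necessarily the one used
in the patch; N = 64 groups as above):

```
SELECT oid::bigint AS relid,
       (oid::bigint * 49157) % 64 AS fp_group
  FROM pg_class
 ORDER BY oid
 LIMIT 10;
-- with this P, consecutive OIDs land 49157 % 64 = 5 groups apart, so a run
-- of freshly assigned OIDs does not pile up in a single group
```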

Treating this as a simple hash table with open addressing would not be
efficient, especially once the hash table is getting almost full. The
usual solution is to grow the table, but for hash tables in shared
memory that's not trivial. It would also have worse locality, due to
more random access.

Luckily, fast-path locking already has a simple solution to deal with a
full hash table. The lock can be simply inserted into the shared lock
table, just like before. Of course, if this happens too often, that
reduces the benefit of fast-path locking.

This patch hard-codes the number of groups to 64, which means 1024
fast-path locks. As all the information is still stored in PGPROC, this
grows PGPROC by about 4.5kB (from ~840B to ~5kB). This is a trade-off,
exchanging memory for cheaper locking.

Ultimately, the number of fast-path slots should not be hard-coded, but
adjustable based on what the workload does, perhaps using a GUC. That
however means it can't be stored in PGPROC directly.
---
 src/backend/storage/lmgr/lock.c | 118 ++--
 src/include/storage/proc.h  |   8 ++-
 2 files changed, 102 insertions(+), 24 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 83b99a98f08..d053ae0c409 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -167,7 +167,7 @@ typedef struct TwoPhaseLockRecord
  * our locks to the primary lock table, but it can never be lower than the
  * real value, since only we can acquire locks on our own behalf.
  */
-static int	FastPathLocalUseCount = 0;
+static int	FastPathLocalUseCounts[FP_LOCK_GROUPS_PER_BACKEND];
 
 /*
  * Flag to indicate if the relation extension lock is held by this backend.
@@ -184,23 +184,53 @@ static int	FastPathLocalUseCount = 0;
  */
 static bool IsRelationExtensionLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
 
+/*
+ * Macros to calculate the group and index for a relation.
+ *
+ * The formula is a simple hash function, designed to spread the OIDs a bit,
+ * so that even co

Re: Avoid dead code (contrib/pg_visibility/pg_visibility.c)

2024-09-12 Thread Michael Paquier
On Thu, Sep 12, 2024 at 09:10:44AM +0300, Nazir Bilal Yavuz wrote:
> This should be fixed in 65c310b310a6.

Thanks for the confirmation.
--
Michael




Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Nathan Bossart
On Thu, Sep 12, 2024 at 04:39:14PM -0500, Nathan Bossart wrote:
> Committed.

Ugh, the buildfarm is unhappy with the new tests [0] [1].  Will fix.

[0] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2024-09-12%2022%3A54%3A45
[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skimmer&dt=2024-09-12%2022%3A38%3A13

-- 
nathan




Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Michael Paquier
On Thu, Sep 12, 2024 at 07:41:30PM -0500, Nathan Bossart wrote:
> On Thu, Sep 12, 2024 at 04:39:14PM -0500, Nathan Bossart wrote:
> > Committed.
> 
> Ugh, the buildfarm is unhappy with the new tests [0] [1].  Will fix.

I'd suggest switching the test to return a count() and making sure that
one record exists.  The data in the page does not really matter.
--
Michael




Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Nathan Bossart
On Fri, Sep 13, 2024 at 09:47:36AM +0900, Michael Paquier wrote:
> On Thu, Sep 12, 2024 at 07:41:30PM -0500, Nathan Bossart wrote:
>> Ugh, the buildfarm is unhappy with the new tests [0] [1].  Will fix.
> 
> I'd suggest switching the test to return a count() and making sure that
> one record exists.  The data in the page does not really matter.

That's what I had in mind.  I see that skimmer is failing with this error:

ERROR:  cannot access temporary tables during a parallel operation

This makes sense because that machine has
debug_parallel_query/force_parallel_mode set to "regress", but this test
file has used a temporary table for a couple of years without issue...

-- 
nathan




Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Nathan Bossart
On Thu, Sep 12, 2024 at 07:56:56PM -0500, Nathan Bossart wrote:
> I see that skimmer is failing with this error:
> 
>   ERROR:  cannot access temporary tables during a parallel operation
> 
> This makes sense because that machine has
> debug_parallel_query/force_parallel_mode set to "regress", but this test
> file has used a temporary table for a couple of years without issue...

Oh, the answer seems to be commits aeaaf52 and 47a22dc.  In short, we can't
use a temporary sequence in this test for versions older than v15.

-- 
nathan




Re: ANALYZE ONLY

2024-09-12 Thread torikoshia

On 2024-09-11 16:47, Michael Harris wrote:

Thanks for the feedback.

On Tue, 10 Sept 2024 at 22:03, torikoshia wrote:

Regarding the addition of partition descendant tables, should we also
update the below comment on expand_vacuum_rel? Currently it refers only
to partitions:

|  * Given a VacuumRelation, fill in the table OID if it wasn't specified,
|  * and optionally add VacuumRelations for partitions of the table.



Well spotted! I have updated the comment to read:

 * Given a VacuumRelation, fill in the table OID if it wasn't specified,
 * and optionally add VacuumRelations for partitions or descendant tables
 * of the table.


Thanks for the update!

I've switched the status on the commitfest to 'ready for committer'.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Nathan Bossart
On Thu, Sep 12, 2024 at 08:42:09PM -0500, Nathan Bossart wrote:
> Oh, the answer seems to be commits aeaaf52 and 47a22dc.  In short, we can't
> use a temporary sequence in this test for versions older than v15.

Here's a patch to make the sequence permanent and to make the output of
tuple_data_split() not depend on endianness.
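
For the record, the reason this particular start value sidesteps the
endianness issue: 72057594037927937 is 2^56 + 1, so its eight bytes form a
palindrome and the on-page image is the same on little- and big-endian
machines:

```
SELECT to_hex(72057594037927937);
-- 100000000000001, i.e. bytes 01 00 00 00 00 00 00 01 in either byte order
```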

-- 
nathan
diff --git a/contrib/pageinspect/expected/page.out 
b/contrib/pageinspect/expected/page.out
index 04fd9dee4b..3fd3869c82 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -240,11 +240,12 @@ SELECT page_checksum(decode(repeat('00', :block_size), 
'hex'), 1);
 (1 row)
 
 -- tests for sequences
-create temporary sequence test_sequence;
+create sequence test_sequence start 72057594037927937;
 select tuple_data_split('test_sequence'::regclass, t_data, t_infomask, 
t_infomask2, t_bits)
   from heap_page_items(get_raw_page('test_sequence', 0));
                    tuple_data_split
 --------------------------------------------------------
- {"\\x0100000000000000","\\x0000000000000000","\\x00"}
+ {"\\x0100000000000001","\\x0000000000000000","\\x00"}
 (1 row)
 
+drop sequence test_sequence;
diff --git a/contrib/pageinspect/sql/page.sql b/contrib/pageinspect/sql/page.sql
index 59784fc7cc..346e4ee142 100644
--- a/contrib/pageinspect/sql/page.sql
+++ b/contrib/pageinspect/sql/page.sql
@@ -100,6 +100,7 @@ SELECT page_header(decode(repeat('00', :block_size), 
'hex'));
 SELECT page_checksum(decode(repeat('00', :block_size), 'hex'), 1);
 
 -- tests for sequences
-create temporary sequence test_sequence;
+create sequence test_sequence start 72057594037927937;
 select tuple_data_split('test_sequence'::regclass, t_data, t_infomask, 
t_infomask2, t_bits)
   from heap_page_items(get_raw_page('test_sequence', 0));
+drop sequence test_sequence;


Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Tom Lane
Nathan Bossart  writes:
> Here's a patch to make the sequence permanent and to make the output of
> tuple_data_split() not depend on endianness.

+1 --- I checked this on mamba's host and it does produce
"\\x0100000000000001" regardless of endianness.

regards, tom lane




Re: Mutable foreign key constraints

2024-09-12 Thread Laurenz Albe
On Thu, 2024-09-12 at 17:33 -0400, Tom Lane wrote:
> I happened to notice that Postgres will let you do
> 
> regression=# create table foo (id timestamp primary key);
> CREATE TABLE
> regression=# create table bar (ts timestamptz references foo);
> CREATE TABLE
> 
> This strikes me as a pretty bad idea, because whether a particular
> timestamp is equal to a particular timestamptz depends on your
> timezone setting.  Thus the constraint could appear to be violated
> after a timezone change.
> 
> I'm inclined to propose rejecting FK constraints if the comparison
> operator is not immutable.

I think that is the only sane thing to do.  Consider

  test=> SHOW timezone;
      TimeZone
  ═══════════════
   Europe/Vienna
  (1 row)

  test=> INSERT INTO foo VALUES ('2024-09-13 12:00:00');
  INSERT 0 1
  test=> INSERT INTO bar VALUES ('2024-09-13 12:00:00+02');
  INSERT 0 1
  test=> SELECT * FROM foo JOIN bar ON foo.id = bar.ts;
           id          │            ts
  ═════════════════════╪════════════════════════
   2024-09-13 12:00:00 │ 2024-09-13 12:00:00+02
  (1 row)

  test=> SET timezone = 'Asia/Kolkata';
  SET
  test=> SELECT * FROM foo JOIN bar ON foo.id = bar.ts;
   id │ ts 
  ════╪════
  (0 rows)

  test=> INSERT INTO foo VALUES ('2024-09-14 12:00:00');
  INSERT 0 1
  test=> INSERT INTO bar VALUES ('2024-09-14 12:00:00+02');
  ERROR:  insert or update on table "bar" violates foreign key constraint "bar_ts_fkey"
  DETAIL:  Key (ts)=(2024-09-14 15:30:00+05:30) is not present in table "foo".

That's very broken and should not be allowed.

> A possible objection is that if anybody has such a setup and
> hasn't noticed a problem because they never change their
> timezone setting, they might not appreciate us breaking it.

I hope that there are few cases of that in the field, and I think it
is OK to break them.  After all, it can be fixed with a simple

  ALTER TABLE foo ALTER id TYPE timestamptz;

If the session time zone is UTC, that wouldn't even require a rewrite.

I agree that it cannot be backpatched.
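
For anyone who wants to check whether an existing database would be
affected, something along these lines should list the foreign keys whose
PK = FK comparison operator is not immutable (a sketch, untested):

```
SELECT c.conname,
       c.conrelid::regclass  AS referencing_table,
       c.confrelid::regclass AS referenced_table
  FROM pg_constraint c
       CROSS JOIN LATERAL unnest(c.conpfeqop) AS eq(opr)
       JOIN pg_operator o ON o.oid = eq.opr
       JOIN pg_proc p ON p.oid = o.oprcode
 WHERE c.contype = 'f'
   AND p.provolatile <> 'i';
```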

Yours,
Laurenz Albe




Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-12 Thread Nathan Bossart
On Thu, Sep 12, 2024 at 10:40:11PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> Here's a patch to make the sequence permanent and to make the output of
>> tuple_data_split() not depend on endianness.
> 
> +1 --- I checked this on mamba's host and it does produce
> "\\x0100000000000001" regardless of endianness.

Thanks for checking.  I'll commit this fix in the morning.

-- 
nathan




Re: Switch PgStat_HashKey.objoid from Oid to uint64

2024-09-12 Thread Bertrand Drouvot
Hi,

On Fri, Sep 13, 2024 at 07:34:21AM +0900, Michael Paquier wrote:
> On Thu, Sep 12, 2024 at 01:37:52PM +, Bertrand Drouvot wrote:
> > There is also some manipulation around the 2 new uint32 fields (objid_hi and
> > objid_lo) in the xactdesc.c and pgstat_xact.c files that look good to me.
> 
> Thanks for the reviews.  The high and low manipulations are still kind
> of OK to me as a solution for the record constructions.

Agree.

> > But now we end up having functions that accept Oid as parameters to call
> > functions that accept uint64 as parameter (for the exact same parameter), 
> > for
> > example:
> > 
> > "
> > void
> > pgstat_create_function(Oid proid)
> > {
> > pgstat_create_transactional(PGSTAT_KIND_FUNCTION,
> > MyDatabaseId,
> > proid);
> > }
> > "
> > 
> > as pgstat_create_transactional is now:
> > 
> > -pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
> > +pgstat_create_transactional(PgStat_Kind kind, Oid dboid, uint64 objid)
> > 
> > That's not an issue as both are unsigned and as we do those calls in that
> > order (Oid -> uint64).
> 
> Yes, that's intentional.  All the pgstats routines associated to a
> particular object that depends on an OID should still use an OID, and
> anything that's generic enough to be used for all stats kinds had
> better use a uint64.

Yeah, that sounds good to me.

> I was wondering if it would be better hiding
> that behind a dedicated type, but decided to stick with uint64.

That makes sense to me.

Overall, the patch LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Why don't we consider explicit Incremental Sort?

2024-09-12 Thread Richard Guo
On Mon, Sep 9, 2024 at 6:22 PM Tomas Vondra  wrote:
> I think we intentionally added incremental sort ... incrementally ;-)

Haha, right.

> I think one challenge with this case is that create_mergejoin_plan
> creates these Sort plans explicitly - it's not "pathified" so it doesn't
> go through the usual cost comparison etc. And it certainly is not the
> case that incremental sort would win always, so we'd need to replicate
> the cost comparison logic too.
>
> I have not thought about this particular case too much, but how likely
> is it that mergejoin will win for this plan in practice? If we consider
> a query that only needs a fraction of rows (say, thanks to LIMIT),
> aren't we likely to pick a nested loop (which can do incremental sort
> for the outer plan)? For joins that need to run to completion it might
> be a win, but there's also the higher risk of a poor costing.

Yeah, currently mergejoin path always assumes that full sort would be
used on top of the outer path and inner path if they are not already
ordered appropriately.  So initial_cost_mergejoin directly charges the
cost of full sort into outer/inner path's cost, without going through
the usual cost comparison with incremental sort.

It seems to me that some parts of our code assume that, for a given
input path that is partially ordered, an incremental sort is always
preferable to a full sort (see commit 4a29eabd1).  Am I getting this
correctly?  If this is the case, then I think using the following
outer path for the merge join:

   ->  Incremental Sort
 Sort Key: t1.a, t1.b
 Presorted Key: t1.a
 ->  Index Scan using t1_a_idx on t1

... would be an immediate improvement over the current path, which is:

   ->  Sort
 Sort Key: t1.a, t1.b
 ->  Index Scan using t1_a_idx on t1


The shaky cost estimates for incremental sort that you mentioned are
indeed a concern.  Fortunately we have the enable_incremental_sort GUC
already.  As in many other parts of the code (such as in the function
add_paths_to_grouping_rel), we can always disable incremental sort to
fall back to a full sort if needed.

> I'm not saying it's not worth exploring, I'm just trying recall reasons
> why it might not work. Also I don't think it's fundamentally impossible
> to do mark/restore for incremental sort - I just haven't tried doing it
> because it wasn't necessary. In the worst case we could simply add a
> Materialize node on top, no?

Yeah, perhaps we could support mark/restore for incremental sort
someday.  This would allow us to apply incremental sort to the inner
path of a merge join too.  But if we apply a Material node on top of
the inner due to the lack of mark/restore support for incremental
sort, then we will need to compare two paths:

partially sorted path -> incremental sort -> material

VS.

partially sorted path -> full sort

I think it's hard to tell which is cheaper without a cost comparison,
which we do not have for now.


To help with the discussion, I drafted a WIP patch as attached, which
chooses an incremental sort over a full sort on the given outer path
of a mergejoin whenever possible.  There is one ensuing plan diff in
the regression tests (not included in the patch).  It seems that some
query in the tests begins to use incremental sort for mergejoin.
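
In case it helps others experiment with the WIP patch, here is a minimal
sketch of the scenario (all names made up; the join strategy is forced so
the outer-path sort is visible):

```
CREATE TABLE t1 (a int, b int);
CREATE TABLE t2 (a int, b int);
CREATE INDEX t1_a_idx ON t1 (a);
INSERT INTO t1 SELECT i % 100, i FROM generate_series(1, 50000) i;
INSERT INTO t2 SELECT i % 100, i FROM generate_series(1, 50000) i;
ANALYZE t1, t2;

SET enable_hashjoin = off;
SET enable_nestloop = off;
EXPLAIN (COSTS OFF)
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
-- On master the outer side gets an explicit full Sort on (a, b); the idea of
-- the patch is that, when the partially sorted index path on t1(a) is chosen,
-- this becomes an Incremental Sort with "Presorted Key: t1.a" instead.
```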

Thanks
Richard


v1-0001-Consider-explicit-incremental-sort-for-mergejoins.patch
Description: Binary data


Re: per backend I/O statistics

2024-09-12 Thread Bertrand Drouvot
Hi,

On Fri, Sep 06, 2024 at 03:03:17PM +, Bertrand Drouvot wrote:
> The struct size is 1040 Bytes and that's much more reasonable than the size
> needed for per backend I/O stats in v1 (about 16KB).

One could think that's a high increase of shared memory usage with a high
number of connections (due to the fact that it's implemented as "fixed amount"
stats based on the max number of backends).

So out of curiosity, I did some measurements with:

dynamic_shared_memory_type=sysv
shared_memory_type=sysv
max_connections=20000

On my lab, ipcs -m gives me:

- on master

key        shmid      owner      perms  bytes        nattch  status
0x00113a04 51347487   postgres   600    1149394944   6
0x4bc9f2fa 51347488   postgres   600    4006976      6
0x46790800 51347489   postgres   600    1048576      2

- with v3 applied

key        shmid      owner      perms  bytes        nattch  status
0x00113a04 51347477   postgres   600    1170227200   6
0x08e04b0a 51347478   postgres   600    4006976      6
0x74688c9c 51347479   postgres   600    1048576      2

So, with 20000 max_connections (not advocating that's a reasonable number in
all cases), that's a 1170227200 - 1149394944 = ~20 MB increase of shared
memory (expected with 20K max_connections and the new struct size of 1040 bytes)
as compared to master, which is about 1096 MB.  It means that v3 produces about
a 2% shared memory increase with 20000 max_connections.

Out of curiosity with max_connections=100000, then:

- on master:

------ Shared Memory Segments --------
key        shmid      owner      perms  bytes        nattch  status
0x00113a04 52953134   postgres   600    5053915136   6
0x37abf5ce 52953135   postgres   600    20006976     6
0x71c07d5c 52953136   postgres   600    1048576      2

- with v3:

------ Shared Memory Segments --------
key        shmid      owner      perms  bytes        nattch  status
0x00113a04 52953144   postgres   600    5157945344   6
0x7afb24de 52953145   postgres   600    20006976     6
0x2695af58 52953146   postgres   600    1048576      2

So, with 100000 max_connections (not advocating that's a reasonable number in
all cases), that's a 5157945344 - 5053915136 = ~100 MB increase of shared
memory (expected with 100K max_connections and the new struct size of 1040 bytes)
as compared to master, which is about 4800 MB. It means that v3 produces about
a 2% shared memory increase with 100000 max_connections.


Then, based on those numbers, I think that the "fixed amount" approach is a good
one, as 1/ the amount of shared memory increase is "relatively" small and 2/ it most
probably provides performance benefits as compared to the "non fixed" approach.
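
A quick cross-check of those expected increases, using the 1040-byte struct
size mentioned above:

```
SELECT pg_size_pretty(20000::bigint * 1040)  AS overhead_20k_connections,
       pg_size_pretty(100000::bigint * 1040) AS overhead_100k_connections;
-- roughly 20 MB and 99 MB, matching the ipcs deltas above
```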

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Psql meta-command conninfo+

2024-09-12 Thread Hunaid Sohail
Hi,

On Thu, Sep 12, 2024 at 4:08 PM Jim Jones  wrote:

> It may look like this, but it is a single record --- mind the header "-[
> RECORD 1 ]+-".
> psql was called in expanded mode:
>
> > $ /home/pgsql-17devel/bin/psql -x -p 5432
>
> "-x" or "--expanded"
>
> Example:
>
> $ psql postgres -xc "SELECT 'foo' col1, 'bar' col2"
> -[ RECORD 1 ]
> col1 | foo
> col2 | bar
>

I guess I missed the expanded mode. I have attached a new patch. Please
check the output now.

```
$ bin/psql --port=5430 postgres
psql (18devel)
Type "help" for help.

postgres=# \conninfo+
You are connected to database "postgres" as user "hunaid" via socket in
"/tmp" at port "5430".
                                                    Connection Information
 Protocol Version | SSL Connection | GSSAPI Authenticated | Client Encoding | Server Encoding | Session User | Backend PID
------------------+----------------+----------------------+-----------------+-----------------+--------------+-------------
 3                | no             | no                   | UTF8            | UTF8            | hunaid       | 55598
(1 row)

postgres=# \x
Expanded display is on.
postgres=# \conninfo+
You are connected to database "postgres" as user "hunaid" via socket in
"/tmp" at port "5430".
Connection Information
-[ RECORD 1 ]+---
Protocol Version | 3
SSL Connection   | no
GSSAPI Authenticated | no
Client Encoding  | UTF8
Server Encoding  | UTF8
Session User | hunaid
Backend PID  | 55598
```

Regards,
Hunaid Sohail
From b011b1cc780fee4030147070db84dcc62edd10a9 Mon Sep 17 00:00:00 2001
From: Hunaid Sohail 
Date: Fri, 13 Sep 2024 09:37:10 +0500
Subject: [PATCH v33] Add psql meta command conninfo+

---
 doc/src/sgml/ref/psql-ref.sgml | 26 +++--
 src/bin/psql/command.c | 67 +++---
 src/bin/psql/help.c|  2 +-
 3 files changed, 85 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 3fd9959ed1..3f8d72b42e 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1060,11 +1060,29 @@ INSERT INTO tbls1 VALUES ($1, $2) \parse stmt1
   
 
   
-\conninfo
+\conninfo[+]
 
-
-Outputs information about the current database connection.
-
+  
+Outputs a string displaying information about the current
+database connection. When + is appended,
+more details about the connection are displayed in table
+format:
+
+  Protocol Version: The version of the PostgreSQL protocol used for
+this connection.
+  SSL Connection: True if the current connection to the server
+uses SSL, and false otherwise.
+  GSSAPI Authenticated: True if GSSAPI is in use, or false if
+GSSAPI is not in use on this connection.
+  Client Encoding: The encoding used by the client for this connection.
+  Server Encoding: The encoding used by the server for this connection.
+  Session User: The session user's name;
+see the session_user() function in
+ for more details.
+  Backend PID: The process ID of the backend for the
+connection.
+
+  
 
   
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 4dfc7b2d85..7ad28287c1 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -72,7 +72,8 @@ static backslashResult exec_command_cd(PsqlScanState scan_state, bool active_bra
 	   const char *cmd);
 static backslashResult exec_command_close(PsqlScanState scan_state, bool active_branch,
 		  const char *cmd);
-static backslashResult exec_command_conninfo(PsqlScanState scan_state, bool active_branch);
+static backslashResult exec_command_conninfo(PsqlScanState scan_state, bool active_branch,
+			 const char *cmd);
 static backslashResult exec_command_copy(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_copyright(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_crosstabview(PsqlScanState scan_state, bool active_branch);
@@ -328,8 +329,8 @@ exec_command(const char *cmd,
 		status = exec_command_cd(scan_state, active_branch, cmd);
 	else if (strcmp(cmd, "close") == 0)
 		status = exec_command_close(scan_state, active_branch, cmd);
-	else if (strcmp(cmd, "conninfo") == 0)
-		status = exec_command_conninfo(scan_state, active_branch);
+	else if (strcmp(cmd, "conninfo") == 0 || strcmp(cmd, "conninfo+") == 0)
+		status = exec_command_conninfo(scan_state, active_branch, cmd);
 	else if (pg_strcasecmp(cmd, "copy") == 0)
 		status = exec_command_copy(scan_state, active_branch);
 	else if (strcmp(cmd, "copyright") == 0)
@@ -739,11 +740,14 @@ exec_command_close(PsqlScanState scan_stat

Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-12 Thread torikoshia

On 2024-09-12 07:32, Masahiko Sawada wrote:

Thanks a lot for working on this!


Hi all,

We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.

As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.

We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds a memory usage reporting and includes the test. After running the
tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like follows:

LOG:  reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264

Which means that the logical decoding allocated 1GB memory in spite of
logical_decoding_work_mem being 256MB.

One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions because the transaction whose decoded
tuples data are spread across memory context blocks could be evicted
after all other larger transactions are evicted (note that we evict
transactions from largest to smallest). Therefore, if we want to do
that, we would need to change the eviction algorithm to for example
LSN order instead of transaction size order. Furthermore,
reorderbuffer's changes that are counted in txn->size (and therefore
rb->size too) are stored in different memory contexts depending on the
kinds. For example, decoded tuples are stored in rb->context,
ReorderBufferChange are stored in rb->change_context, and snapshot
data are stored in builder->context. So we would need to sort out
which data is stored into which memory context and which memory
context should be accounted for in the reorderbuffer's memory usage.
Which could be a large amount of work.

Another idea is to have memory context for storing decoded tuples per
transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think that this idea would be simpler and worth considering, so I
attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
decoded tuple data is created individually when the corresponding WAL
record is decoded but is released collectively when it is released
(e.g., transaction eviction), the bump memory context would fit the
best this case. One exception is that we immediately free the decoded
tuple data if the transaction is already aborted. However, in this
case, I think it is possible to skip the WAL decoding in the first
place.


I haven't read the patch yet, but it seems a reasonable approach.


One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high
since it has entries for both top-level transaction entries and
sub-transactions. To deal with that, the patch changes that decoded
tuples of a subtransaction are stored into its top-level transaction's
tuple context. We always evict sub-transactions and its top-level
transaction at the same time, I think it would not be a problem.
Checking performance degradation due to using many memory contexts
would have to be done.


Yeah, and I imagine there would be cases where the current
implementation shows better performance, such as when there are no long
transactions, but compared to unexpected memory bloat and OOM kills, I
think it's far better.
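
Incidentally, for anyone reproducing this, the per-context allocation totals
of a walsender (or any backend doing logical decoding) can be dumped to the
server log from another session, which makes it easy to see where the memory
went even on a stock build:

```
-- requires superuser by default; output goes to the target backend's log
SELECT pg_log_backend_memory_contexts(active_pid)
  FROM pg_replication_slots
 WHERE active_pid IS NOT NULL;
```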



Even with this idea, we would still have a mismatch between the actual
amount of allocated memory and rb->size, but it would be much better
than today. And something like the first idea would be necessary to
address this mismatch, and we can discuss it in a separate thread.

Regards,

[1] 
https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih%2BqNsWQPciEwos6g_AMbnz_peNoxfHwyw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/17974-f8c9d353a62f414d%4

Re: long-standing data loss bug in initial sync of logical replication

2024-09-12 Thread Shlok Kyal
On Mon, 9 Sept 2024 at 10:41, Shlok Kyal  wrote:
>
> On Mon, 2 Sept 2024 at 10:12, Amit Kapila  wrote:
> >
> > On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal  wrote:
> > >
> > > Next I am planning to test solely on the logical decoding side and
> > > will share the results.
> > >
> >
> > Thanks, the next set of proposed tests makes sense to me. It will also
> > be useful to generate some worst-case scenarios where the number of
> > invalidations is more to see the distribution cost in such cases. For
> > example, Truncate/Drop a table with 100 or 1000 partitions.
> >
> > --
> > With Regards,
> > Amit Kapila.
>
> Hi,
>
> I did some performance testing solely on the logical decoding side and
> found some degradation in performance, for the following testcase:
> 1. Created a publisher on a single table, say 'tab_conc1';
> 2. Created a second publisher on a single table say 'tp';
> 4. two sessions are running in parallel, let's say S1 and S2.
> 5. Begin a transaction in S1.
> 6. Now in a loop (this loop runs 'count' times):
>  S1: Insert a row in table 'tab_conc1'
>  S2: BEGIN;  Alter publication DROP/ ADD tp; COMMIT
> 7. COMMIT the transaction in S1.
> 8. run 'pg_logical_slot_get_binary_changes' to get the decoding changes.
>
> Observation:
> With fix a new entry is added in decoding. During debugging I found
> that this entry only comes when we do a 'INSERT' in Session 1 after we
> do 'ALTER PUBLICATION' in another session in parallel (or we can say
> due to invalidation). Also, I observed that this new entry is related
> to sending replica identity, attributes,etc as function
> 'logicalrep_write_rel' is called.
>
> Performance:
> We see a performance degradation as we are sending new entries during
> logical decoding. Results are an average of 5 runs.
>
count  | Head (sec) | Fix (sec) | Degradation (%)
-------+------------+-----------+----------------
10000  | 1.298      | 1.574     | 21.26348228
50000  | 22.892     | 24.997    | 9.195352088
100000 | 88.602     | 93.759    | 5.820410374
>
> I have also attached the test script here.
>

For the above case I tried to investigate the inconsistent degradation
and found out that serialization was happening for large values of
'count'. So, I tried adjusting 'logical_decoding_work_mem' to a large
value, so that we can avoid serialization here. I ran the above
performance test again and got the following results:

count  | Head (sec) | Fix (sec)   | Degradation (%)
-------+------------+-------------+----------------
10000  | 0.415446   | 0.53596167  | 29.00874482
50000  | 7.950266   | 10.37375567 | 30.48312685
75000  | 17.192372  | 22.246715   | 29.39875312
100000 | 30.555903  | 39.431542   | 29.04721552

These results are an average of 3 runs. Here the degradation is
consistently around ~30%.
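
A bare-bones SQL sketch of the loop described above (publication and slot
names are made up; the attached script is the authoritative version):

```
-- one-time setup
CREATE PUBLICATION pub_conc FOR TABLE tab_conc1;
CREATE PUBLICATION pub_tp FOR TABLE tp;
SELECT pg_create_logical_replication_slot('test_slot', 'pgoutput');

-- session S1
BEGIN;
INSERT INTO tab_conc1 VALUES (1);   -- repeated 'count' times, interleaved with S2

-- session S2, once per insert above
BEGIN; ALTER PUBLICATION pub_tp DROP TABLE tp; COMMIT;
BEGIN; ALTER PUBLICATION pub_tp ADD TABLE tp; COMMIT;

-- session S1
COMMIT;

-- measure the decoding step
SELECT count(*)
  FROM pg_logical_slot_get_binary_changes('test_slot', NULL, NULL,
           'proto_version', '1', 'publication_names', 'pub_conc');
```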

Thanks and Regards,
Shlok Kyal




Re: query_id, pg_stat_activity, extended query protocol

2024-09-12 Thread Michael Paquier
On Thu, Sep 12, 2024 at 03:58:27PM -0500, Sami Imseih wrote:
> Yes, adding the asserts in execMain.c is better, but there are complications
> there due to the issue you mention. I think the issue you are bumping into
> is when pg_stat_statements.track_utility = on ( default ), the assert in 
> ExecutorStart will fail on EXECUTE. I believe it's because ( need to verify )
> pg_stat_statements.c sets the queryId = 0 in the ProcessUtility hook [1].

Yes.

> I am now thinking it's better to Just keep the tests in pg_stat_statements. 
> Having test coverage in pg_stat_statements is better than nothing, and
> check-world ( or similar ) will be able to catch such failures.

I have begun things by applying a patch to add new tests in
pg_stat_statements.  It is just something that is useful on its own,
and we had nothing of the kind.

Then, please see attached two lightly-updated patches.  0001 is for a
backpatch down to v14.  This is yours to force things in the exec and
bind messages for all portal types, with the test (placed elsewhere in
14~15 branches).  0002 is for HEAD to add some sanity checks, blowing
up the tests of pg_stat_statements if one is not careful with the
query ID reporting.
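
For anyone who wants to see the symptom from psql, \bind drives the
extended query protocol, and pg_stat_activity can be watched from a second
session (this assumes psql 16 or newer and compute_query_id = on, or
pg_stat_statements being loaded):

```
-- session 1: send the query via bind/execute rather than a simple Query
SELECT pg_sleep(30) \bind \g

-- session 2: without the fix, query_id can show up as NULL for that backend
SELECT pid, query_id, query
  FROM pg_stat_activity
 WHERE query ~ 'pg_sleep' AND pid <> pg_backend_pid();
```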

I'm planning to look at that again at the beginning of next week.
--
Michael
From 6921a7b6e25c2471d64f05ce136af741598381c0 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 13 Sep 2024 14:54:17 +0900
Subject: [PATCH v6 1/2] Add missing query ID reporting in extended query
 protocol

This commit adds query ID reports for two code paths for the extended
query protocol:
- When receiving a bind message, setting it to the first Query retrieved
from the cached plan source.
- When receiving an execute message, setting it to the first PlannedStmt
stored in a portal.

An advantage of this method is that this is able to cover all the types
of portals handled in the extended query protocol, particularly these
two when the report done in ExecutorStart() is not enough:
- Multiple execute messages, with multiple ExecutorRun().
- Portal with execute/fetch messages, like a query with a RETURNING
clause and a fetch size that stores the tuples in a first execute
message going through ExecutorStart() and ExecutorRun(), followed by one
or more execute messages doing only fetches from the tuplestore created
in the first message.

Note that the query ID reporting done in ExecutorStart() is still
necessary, as an EXECUTE requires it.  Query ID reporting is optimistic
and more calls to pgstat_report_query_id() don't matter.  The comment in
ExecutorStart() is adjusted to reflect better the reality with the
extended query protocol.

The test added in pg_stat_statements is a courtesy of Robert Haas.

Author: Sami Imseih
Discussion: https://postgr.es/m/ca+427g8diw3az6popvgkpbqk97oubdf18vlihfesea2juk3...@mail.gmail.com
Discussion: https://postgr.es/m/CA+TgmoZxtnf_jZ=vqbsyau8hfukkwojcj6ufy4lgpxaunkr...@mail.gmail.com
Backpatch-through: 14
---
 src/backend/executor/execMain.c   | 10 
 src/backend/tcop/postgres.c   | 24 +++
 .../pg_stat_statements/expected/extended.out  |  8 +++
 contrib/pg_stat_statements/sql/extended.sql   |  4 
 4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 29e186fa73..7042ca6c60 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -119,10 +119,12 @@ void
 ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
 	/*
-	 * In some cases (e.g. an EXECUTE statement) a query execution will skip
-	 * parse analysis, which means that the query_id won't be reported.  Note
-	 * that it's harmless to report the query_id multiple times, as the call
-	 * will be ignored if the top level query_id has already been reported.
+	 * In some cases (e.g. an EXECUTE statement or an execute message with the
+	 * extended query protocol) the query_id won't be reported, so do it now.
+	 *
+	 * Note that it's harmless to report the query_id multiple times, as the
+	 * call will be ignored if the top level query_id has already been
+	 * reported.
 	 */
 	pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea113..e394f1419a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1654,6 +1654,7 @@ exec_bind_message(StringInfo input_message)
 	char		msec_str[32];
 	ParamsErrorCbData params_data;
 	ErrorContextCallback params_errcxt;
+	ListCell   *lc;
 
 	/* Get the fixed part of the message */
 	portal_name = pq_getmsgstring(input_message);
@@ -1689,6 +1690,17 @@ exec_bind_message(StringInfo input_message)
 
 	pgstat_report_activity(STATE_RUNNING, psrc->query_string);
 
+	foreach(lc, psrc->query_list)
+	{
+		Query	   *query = lfirst_node(Query, lc);
+
+		if (query->queryId != UINT64CONST(0))
+		{
+			pgstat_report_query_id(query->queryId, false);
+			break;
+		}
+	}
+
 	set_p

Re: Conflict detection for update_deleted in logical replication

2024-09-12 Thread Amit Kapila
On Wed, Sep 11, 2024 at 11:07 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, September 11, 2024 1:03 PM shveta malik 
>  wrote:
> >
> > > >
> > > > Another query is about 3 node setup. I couldn't figure out what
> > > > would be feedback_slots setting when it is not bidirectional, as in
> > > > consider the case where there are three nodes A,B,C. Node C is
> > > > subscribing to both Node A and Node B. Node A and Node B are the
> > > > ones doing concurrent "update" and "delete" which will both be
> > > > replicated to Node C. In this case what will be the feedback_slots
> > > > setting on Node C? We don't have any slots here which will be
> > > > replicating changes from Node C to Node A and Node C to Node B. This
> > > > is given in [3] in your first email ([1])
> > >
> > > Thanks for pointing this, the link was a bit misleading. I think the
> > > solution proposed in this thread is only used to allow detecting
> > > update_deleted reliably in a bidirectional cluster.  For non-
> > > bidirectional cases, it would be more tricky to predict the timing till 
> > > when
> > should we retain the dead tuples.
> > >
> >
> > So in brief, this solution is only for bidrectional setup? For 
> > non-bidirectional,
> > feedback_slots is non-configurable and thus irrelevant.
>
> Right.
>

One possible idea to address the non-bidirectional case raised by
Shveta is to use a time-based cut-off to remove dead tuples. As
mentioned earlier in my email [1], we can define a new GUC parameter
say vacuum_committs_age which would indicate that we will allow rows
to be removed only if the modified time of the tuple as indicated by
committs module is greater than the vacuum_committs_age. We could keep
this parameter a table-level option without introducing a GUC as this
may not apply to all tables. I checked and found that some other
replication solutions like GoldenGate also allowed similar parameters
(tombstone_deletes) to be specified at table level [2]. The other
advantage of allowing it at table level is that it won't hamper the
performance of hot-pruning or vacuum in general. Note, I am careful
here because to decide whether to remove a dead tuple or not we need
to compare its committs_time both during hot-pruning and vacuum.
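
To make that a bit more concrete: the committs module here is the
track_commit_timestamp infrastructure, so the tuple's modification time
would come from the commit timestamp of its xmin, and the knob could look
roughly like the sketch below (the reloption name is only the proposal
above; nothing like this exists yet):

```
-- existing infrastructure; needs track_commit_timestamp = on (and a restart)
SELECT pg_xact_commit_timestamp(xmin) AS modified_at, * FROM employees LIMIT 1;

-- hypothetical table-level knob as proposed, illustrative syntax only:
--   ALTER TABLE employees SET (vacuum_committs_age = 300);
```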

Note that tombstone_deletes is a general concept used by replication
solutions to detect the update_deleted conflict, and time-based purging is
recommended. See [3][4]. We previously discussed having tombstone
tables to keep the deleted records information but it was suggested to
prevent the vacuum from removing the required dead tuples as that
would be simpler than inventing a new kind of tables/store for
tombstone_deletes [5]. So, we came up with the idea of feedback slots
discussed in this email but that didn't work out in all cases and
appears difficult to configure as pointed out by Shveta. So, now, we
are back to one of the other ideas [1] discussed previously to solve
this problem.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com
[2] -
BEGIN
  DBMS_GOLDENGATE_ADM.ALTER_AUTO_CDR(
schema_name   => 'hr',
table_name=> 'employees',
tombstone_deletes => TRUE);
END;
/
[3] - https://en.wikipedia.org/wiki/Tombstone_(data_store)
[4] - 
https://docs.oracle.com/en/middleware/goldengate/core/19.1/oracle-db/automatic-conflict-detection-and-resolution1.html#GUID-423C6EE8-1C62-4085-899C-8454B8FB9C92
[5] - 
https://www.postgresql.org/message-id/e4cdb849-d647-4acf-aabe-7049ae170fbf%40enterprisedb.com

-- 
With Regards,
Amit Kapila.




Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-12 Thread Tatsuo Ishii
David,

Thank you for your review.

> Yes, I think it's a good idea to move that into a helper function. If
> you do the other node types, without that helper the code would have
> to be repeated quite a few times. Maybe show_storage_info() can be
> moved up with the other helper functions, say below
> show_sortorder_options() ?

Yeah, that makes sense. Looks less random.

> It might be a good idea to keep the "if
> (!es->analyze || tupstore == NULL)" checks in the calling function
> rather than the helper too.

I agree with this. This kind of check should be done in the calling
function.

> I thought about the location of the test for a while and read the
> "This file is concerned with testing EXPLAIN in its own right."
> comment at the top of that explain.out.  I was trying to decide if
> testing output of a specific node type met this or not. I can't pick
> out any other tests there which are specific to a node type, so I'm
> unsure if this is the location for it or not. However, to put it
> anywhere else means having to add a plpgsql function to mask out the
> unstable parts of EXPLAIN, so maybe the location is good as it saves
> from having to do that. I'm 50/50 on this, so I'm happy to let you
> decide.

Yeah. Maybe we should move the function elsewhere so that it can be
shared by other tests. However, in this case its purpose is testing an
additional output in an explain command. I think this is not far from
"This file is concerned with testing EXPLAIN in its own right.". So I
would like to keep the test in explain.sql.

> You could also shrink that 100 rows into a smaller number for
> the generate_series without losing any coverage.

Right. I will make the change.

> Aside from that, I think the patch is good.  Thanks for working on it.

Thanks. Attached is the v4 patch. I am going to push it if there's no
objection.

After this, I will work on remaining node types.
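
For reference, a query of the kind the new EXPLAIN output targets (a sketch
in the spirit of the regression test; the committed test may differ):

```
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT sum(n) OVER (PARTITION BY m ORDER BY n)
  FROM (SELECT n < 3 AS m, n FROM generate_series(1, 10) a(n)) s;
-- with the patch, the WindowAgg node also reports the tuplestore storage
-- type and the maximum memory/disk space it used
```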

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
>From e71e697f4ec399bc19af6f1aeac614185acc38c7 Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii 
Date: Fri, 13 Sep 2024 14:51:45 +0900
Subject: [PATCH v4] Add memory/disk usage for Window aggregate nodes in 
 EXPLAIN.

This commit is similar to 1eff8279d and expands the idea to Window
aggregate nodes so that users can know how much memory or disk the
tuplestore used.

This commit uses newly introduced tuplestore_get_stats() to query this
information and add some additional output in EXPLAIN ANALYZE to
display the information for the Window aggregate node.
---
 src/backend/commands/explain.c| 68 ---
 src/test/regress/expected/explain.out | 11 +
 src/test/regress/sql/explain.sql  |  3 ++
 3 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2819e479f8..aaec439892 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -120,6 +120,7 @@ static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
  List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
    Oid sortOperator, Oid collation, bool nullsFirst);
+static void show_storage_info(Tuplestorestate *tupstore, ExplainState *es);
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 			 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
@@ -127,6 +128,7 @@ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_material_info(MaterialState *mstate, ExplainState *es);
+static void show_windowagg_info(WindowAggState *winstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 			  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
@@ -2231,6 +2233,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		   planstate, es);
 			show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
 			"Run Condition", planstate, ancestors, es);
+			show_windowagg_info(castNode(WindowAggState, planstate), es);
 			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
@@ -2894,6 +2897,34 @@ show_sortorder_options(StringInfo buf, Node *sortexpr,
 	}
 }
 
+/*
+ * Show information on storage method and maximum memory/disk space used.
+ */
+static void
+show_storage_info(Tuplestorestate *tupstore, ExplainState *es)
+{
+	char	   *maxStorageType;
+	int64		maxSpaceUsed,
+maxSpaceUsedKB;
+
+	tuplestore_get_stats(tupstore, &maxStorageType, &maxSpaceUsed);
+	maxSpaceUsedKB = BYTES_TO_KILOBYTES(maxSpaceUsed);
+
+	if (es->format != EXPLAIN_FORMAT_TEXT)
+	{
+		ExplainPropertyText("Storage", maxStorageType, es);
+		ExplainPropertyIn

Re: Disallow altering invalidated replication slots

2024-09-12 Thread Amit Kapila
On Thu, Sep 12, 2024 at 4:24 PM Amit Kapila  wrote:
>
> On Wed, Sep 11, 2024 at 8:41 AM shveta malik  wrote:
> >
> > On Tue, Sep 10, 2024 at 11:24 PM Bharath Rupireddy
> >  wrote:
> > >
> > >
> > > Please find the attached v2 patch also having Shveta's review comments
> > > addressed.
> >
> > The v2 patch looks good to me.
> >
>
> LGTM as well. I'll push this tomorrow morning unless there are more
> comments or suggestions.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Add has_large_object_privilege function

2024-09-12 Thread Yugo Nagata
On Thu, 12 Sep 2024 19:51:33 +0900
Yugo NAGATA  wrote:

> On Thu, 12 Sep 2024 19:07:22 +0900
> Fujii Masao  wrote:
> 
> > 
> > 
> > On 2024/09/10 17:45, Yugo NAGATA wrote:
> > > On Mon, 9 Sep 2024 22:59:34 +0900
> > > Fujii Masao  wrote:
> > > 
> > >>
> > >>
> > >> On 2024/07/02 16:34, Yugo NAGATA wrote:
> > >>> So, I would like to propose to add
> > >>> has_large_object_function for checking if a user has the privilege on a 
> > >>> large
> > >>> object.
> > >>
> > >> +1
> > > 
> > > Thank you for your looking into this!
> > > I've attached a updated patch.
> > 
> > Thanks for updating the patches! LGTM, except for a couple of minor things:
> > 
> > +okui chiba * has_largeobject_privilege_id
> > 
> > "okui chiba" seems to be a garbage text accidentally added.
> > 
> > + * role by Oid, large object by id, and privileges as AclMode.
> > 
> > "Oid" seems better than "id" in "large object by id".
> 
> Thank you for pointing out it.
> I've fixed them and attached the updated patch, v3.

I confirmed the patches are committed in the master branch.
Thank you!

I've updated the commitfest status to "committed".

Regards,
Yugo Nagata

-- 
Yugo Nagata