Re: pg_rewind with --write-recovery-conf option doesn't write dbname to primary_conninfo value.

2025-02-03 Thread Masahiko Sawada
On Sun, Feb 2, 2025 at 9:50 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> > I think it's a good idea to support it at least on HEAD. I've attached
> > a patch for that.
>
> +1. I've confirmed that pg_rewind with -R can't output dbname for now,
> and your patch allows it to do so.
>
> A few comments on your patch.

Thank you for reviewing the patch!

>
> 1.
>
> pg_basebackup.sgml has below description. I feel this can be ported to
> pg_rewind.sgml as well.
>
> ```
> The dbname will be recorded only if the dbname was
> specified explicitly in the connection string or 
> environment variable.
> ```

Agreed, will fix.

>
> 2.
> I'm not sure whether recovery_gen.h/c is the correct place for the exported 
> function
> GetDbnameFromConnectionOptions(). The function itself does not handle
> the postgresql.auto.conf file.
> I looked through other header files and felt that connect_utils.h might be suitable.
>
> ```
> /*-
>  *
>  * Facilities for frontend code to connect to and disconnect from databases.
> ```

But this function doesn't connect to or disconnect from databases, either.

>
> Another idea is to change the third API to accept the whole connection 
> string, and
> extract dbname from it. In this approach we can make 
> GetDbnameFromConnectionOptions()
> a static function, which does not feel strange to me.

I'm concerned that it reduces the usability; users (or existing
extensions) would need to construct the whole connection string just
to pass the database name. If we want to avoid exposing
GetDbnameFromConnectionOptions(), I'd introduce another exposed
function for that, say GenerateRecoveryConfigWithConnStr().

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Prevent COPY FREEZE on Foreign tables

2025-02-03 Thread Nathan Bossart
On Mon, Feb 03, 2025 at 02:18:21PM -0600, Sami Imseih wrote:
> I think we should just block foreign tables as we do with
> partitioned tables. The attached patch does that.

Unless there's some existing way to specify a FREEZE option in the FDW API
(I assume there isn't), I agree with this.

> I was also looking at why we block a partitioned parent table from COPY
> FREEZE[1], but the comments do not convince me this is a good idea. I think
> there are good cases to allow this, considering there is a common use case
> in which a single COPY command loads a large amount of data, making the
> overhead of checking the partitions worth the value of the FREEZE
> optimization. I will probably start a separate thread for this.

Commit 5c9a551 and its thread [0] appear to have some additional details
about this.

[0] 
https://postgr.es/m/CAKJS1f9BbJ%2BFY3TbdCiap3qXHTFOiwtays9s36-oQkkM9_R5bg%40mail.gmail.com

-- 
nathan




Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-03 Thread Peter Smith
Hi Nisha.

Some review comments for patch v67-0001.

==
src/backend/replication/slot.c

ReportSlotInvalidation:

Please see my previous post [1] where I gave some reasons why I think
the _() macro should be used only for the *common* part of the hint
messages. If you agree, then the following code should be changed.

1.
+ /* translator: %s is a GUC variable name */
+ appendStringInfo(&err_hint, "You might need to increase \"%s\".",
+ "max_slot_wal_keep_size");
  break;

Change to:
_("You might need to increase \"%s\".")

~

2.
+ /* translator: %s is a GUC variable name */
+ appendStringInfo(&err_hint, "You might need to increase \"%s\".",
+ "idle_replication_slot_timeout");
+ break;

Change to:
_("You might need to increase \"%s\".")

~

3.
- hint ? errhint("You might need to increase \"%s\".",
"max_slot_wal_keep_size") : 0);
+ err_hint.len ? errhint("%s", _(err_hint.data)) : 0);

Change to:
err_hint.len ? errhint("%s", err_hint.data) : 0);

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPuKCv-S%2BPJ2iybZKiqu0GJ1fSuzy2CcvyRViLou98QpVA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Adding facility for injection points (or probe points?) for more advanced tests

2025-02-03 Thread Jeff Davis
On Tue, 2024-01-23 at 12:32 +0900, Michael Paquier wrote:
> Slightly off topic and while I don't forget about it..  Please find
> attached a copy of the patch posted around [1] to be able to define
> injection points with input arguments, so as it is possible to
> execute
> callbacks with values coming from the code path where the point is
> attached.
> 
> For example, a backend could use this kind of macro to have a
> callback
> attached to this point use some runtime value:
> INJECTION_POINT_1ARG("InjectionPointBoo", &some_value);

That sounds useful, but not necessarily required, for the HashAgg tests
I just posted[1].

One extra benefit of supporting arguments is that it would be a more
flexible way to change the local state around the injection point.
Right now the only way is by using IS_INJECTION_POINT_ATTACHED(), which
doesn't permit callback-defined conditions, etc.

If you do add support for arguments, would it make sense to just have
all callback functions take a single "void *" argument, rather than
adding branches for the zero-argument and one-argument cases?

+1 to the idea, but I'm fine waiting for additional use cases to get
the API right.

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/ff4e59305e5d689e03cd256a736348d3e7958f8f.ca...@j-davis.com





Re: injection points for hash aggregation

2025-02-03 Thread Jeff Davis
On Tue, 2025-02-04 at 12:50 +0900, Michael Paquier wrote:
> On Tue, Feb 04, 2025 at 09:21:49AM +0900, Michael Paquier wrote:
> > Being able to pass down some runtime states to a callback is
> > something
> > I've drafted a patch for, as of:
> > https://www.postgresql.org/message-id/z23zce4w1djuk...@paquier.xyz
> 
> Oops.  I've posted an incorrect link here, I meant this one instead:
> https://www.postgresql.org/message-id/za8zwmptnjk_i...@paquier.xyz

I replied in that thread. Thank you.

The injection point arguments are not required for $SUBJECT, at least
not yet, so we can proceed with this thread independently.

Regards,
Jeff Davis





Re: new commitfest transition guidance

2025-02-03 Thread Tom Lane
Peter Eisentraut  writes:
> During the developer meeting at FOSDEM last Thursday,

BTW, are there minutes available from that meeting?  In past years
some notes have been posted on the wiki, but I'm failing to find
anything right now.

regards, tom lane




Re: SQLFunctionCache and generic plans

2025-02-03 Thread Tom Lane
Pavel Stehule  writes:
> Did you do some performance checks?

This is a good question to ask ...

> I tried some worst case

> CREATE OR REPLACE FUNCTION fx(int)
> RETURNS int AS $$
> SELECT $1 + $1
> $$ LANGUAGE SQL IMMUTABLE;

... but I don't think tests like this will give helpful answers.
That function is simple enough to be inlined:

regression=# explain verbose select fx(f1) from int4_tbl;
  QUERY PLAN   
---
 Seq Scan on public.int4_tbl  (cost=0.00..1.06 rows=5 width=4)
   Output: (f1 + f1)
(2 rows)

So functions.c shouldn't have any involvement at all in the
actually-executed PERFORM expression, and whatever difference
you measured must have been noise.  (If the effect *is* real,
we'd better find out why.)

You need to test with a non-inline-able function.  Looking
at the inlining conditions in inline_function(), one simple
hack is to make the function return SETOF.  That'll only
exercise the returns-set path in functions.c though, so it'd
be advisable to check other inline-blocking conditions too.

regards, tom lane




Re: new commitfest transition guidance

2025-02-03 Thread Jelte Fennema-Nio
On Mon, 3 Feb 2025 at 13:56, Aleksander Alekseev
 wrote:
> > """
> > The status of this patch cannot be changed in this commitfest. You
> > must modify it in the one where it's open!
> > """
>
> Ooops, never mind.

I also thought that error was pretty silly. So it will be changed in
the next release[1] of the commitfest app by this commit[2].

[1]: 
https://www.postgresql.org/message-id/flat/CAGECzQTbHhaZMY-%3DY%3DOGMU_jik4MqCfsc7rDf6AnT%3DAq-B6cZA%40mail.gmail.com
[2]: 
https://github.com/postgres/pgcommitfest/commit/013eeb1befa3cf3f38b7ccb7977881f2c7c0f38d




Re: SQLFunctionCache and generic plans

2025-02-03 Thread Pavel Stehule
po 3. 2. 2025 v 17:00 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > Did you do some performance checks?
>
> This is a good question to ask ...
>
> > I tried some worst case
>
> > CREATE OR REPLACE FUNCTION fx(int)
> > RETURNS int AS $$
> > SELECT $1 + $1
> > $$ LANGUAGE SQL IMMUTABLE;
>
> ... but I don't think tests like this will give helpful answers.
> That function is simple enough to be inlined:
>
> regression=# explain verbose select fx(f1) from int4_tbl;
>   QUERY PLAN
> ---
>  Seq Scan on public.int4_tbl  (cost=0.00..1.06 rows=5 width=4)
>Output: (f1 + f1)
> (2 rows)
>
> So functions.c shouldn't have any involvement at all in the
> actually-executed PERFORM expression, and whatever difference
> you measured must have been noise.  (If the effect *is* real,
> we'd better find out why.)
>
> You need to test with a non-inline-able function.  Looking
> at the inlining conditions in inline_function(), one simple
> hack is to make the function return SETOF.  That'll only
> exercise the returns-set path in functions.c though, so it'd
> be advisable to check other inline-blocking conditions too.
>

I am sorry. I was wrong - I tested inlining on a different case:

(2025-02-03 17:24:25) postgres=# explain analyze verbose select fx(i) from
generate_series(1,10) g(i);
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                         QUERY PLAN                                                         │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Function Scan on pg_catalog.generate_series g  (cost=0.00..0.13 rows=10 width=4) (actual time=0.016..0.018 rows=10 loops=1) │
│   Output: (i + i)                                                                                                           │
│   Function Call: generate_series(1, 10)                                                                                     │
│ Planning:                                                                                                                   │
│   Buffers: shared hit=11                                                                                                    │
│ Planning Time: 0.190 ms                                                                                                     │
│ Execution Time: 0.066 ms                                                                                                    │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(7 rows)

(2025-02-03 17:25:06) postgres=# explain analyze verbose select
fx((random()*100)::int) from generate_series(1,10) g(i);
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                         QUERY PLAN                                                         │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Function Scan on pg_catalog.generate_series g  (cost=0.00..2.68 rows=10 width=4) (actual time=0.104..0.169 rows=10 loops=1) │
│   Output: fx(((random() * '100'::double precision))::integer)                                                               │
│   Function Call: generate_series(1, 10)                                                                                     │
│ Planning Time: 0.054 ms                                                                                                     │
│ Execution Time: 0.182 ms                                                                                                    │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)

I read https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions, and I
didn't remember the rule `if an actual argument to the function call is a
volatile expression, then it must not be referenced in the body more than
once` well, so I didn't apply it correctly. I'll recheck this test.

Regards

Pavel



> regards, tom lane
>


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-02-03 Thread chiranmoy.bhattacha...@fujitsu.com
Inlined the hex encode/decode functions in "src/include/utils/builtins.h",
similar to pg_popcount() in pg_bitutils.h.

---
Chiranmoy


v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch


Re: RFC: Additional Directory for Extensions

2025-02-03 Thread David E. Wheeler
Hi Peter,

> prefix= should only be set when running the "install" target, not when 
> building (make all).

I see. I confirm that works. Still, don’t all the other parameters work when 
passed to any/all targets? Should this one have an extension-specific name?

>> So I suspect the issue is that, when looking for SQL files, the patch needs 
>> to use the directory parameter[4] when it’s set --- and it can be an 
>> absolute path! Honestly I think there’s a case to be made for eliminating 
>> that parameter.
> 
> Possibly.  I didn't know why extensions would use that parameter, before you 
> showed an example.

ISTM it does more harm than good. The location of extension files should be 
highly predictable. I think the search path functionality mitigates the need 
for this parameter, and it should be dropped.

>> I thought it would put the files into /Users/david/pgsql-test/extension, not 
>> /Users/david/pgsql-test/share/extension? I guess that makes sense; but then 
>> perhaps the search path should be for prefixes, in which case I’d use a 
>> config like:
>> ``` ini
>> extension_control_path = '/Users/david/pgsql-test:$system'
>> dynamic_library_path = '/Users/david/pgsql-test/lib:$libdir'
>> ```
>> And the search path would append `share/extension` to each path. But then it 
>> varies from the behavior of `dynamic_library_path`. :-(
> 
> Yes exactly.  This is meant to be symmetrical.

IOW, `extension_control_path`s should always end in `share/extension` and 
`dynamic_library_path`s should always end in `lib`, yes?

> We could have a setting that just sets the top-level prefixes to search and 
> would thus supersede both of these settings.  But then you introduce another 
> layer of complications, such as the subdirectory structure under the prefix is 
> not necessarily fixed.  It could be lib, lib/postgresql, lib64, etc., similar 
> under share.  It's not even required that there is a common prefix.

I would say that, under those prefixes, the directory names have to be defined 
by PostgreSQL, not by compile-time options. It seems to me that by providing 
search paths there is less of a need to monkey with directory names.

>>> I have solved this by just stripping off "$libdir/" from the front of the 
>>> filename.  This works for now.  We can think about other ways to tweak 
>>> this, perhaps, but I don't see any drawback to this in practice.
>> 
>> I think this makes perfect sense. I presume it’s stripped out *before* 
>> replacing the MODULE_PATHNAME string in SQL files, yes?
> 
> No, actually it's done after.  Does it make a difference?

If you mean it strips it out at runtime, it just feels a little less clean to 
me than if it was formatted properly before `make install`ing.

>> Although the current path explicitly searches for control files, I’d like to 
>> suggest a more generic name, since the point is to find extensions; control 
>> files are just the (current) method for identifying them. I suggest 
>> `extension_path`. Although given the above, maybe it should be all about 
>> prefixes.
> 
> If we implemented a prefix approach, then 'extension_path' could be a good 
> name.  But right now we're not.

I’m not sure what difference it makes, especially since each directory in the 
path is expected to end in `share/extension`,  not `share/control`. Feels more 
symmetrical to me.

>> I like this idea, though I quibble with the term `$system`, which seems like 
>> it would identify SYSCONFDIR, not SHAREDIR/extension. How about `$extdir` 
>> or, if using prefixes, `$coreprefix` or something.
> 
> I'm not attached to '$system', but I don't see how you get from that to 
> SYSCONFDIR.

Because it’s the only pg_config item that includes the word “Sys” in it, 
meaning the operating system.

>  The other suggestions also have various ways of misinterpreting them.  We 
> can keep thinking about this one.

I like $extdir. Short, clear, and not conflicting with existing pg_config 
options. I guess it could be misinterpreted as “extra directory” or something, 
but since there is no such thing it seems like a minor risk. I am likely 
overlooking something though.

>>> Note that the path elements should typically end in
>>> extension if the normal installation layouts are
>>> followed.  (The value for $system already 
>>> includes
>>> the extension suffix.)
>> In fact, if we’re using `prefix` as currently implemented, it should end in 
>> `share/extension`, no?
> 
> It could be share/postgresql/extension, too.  So I didn't want to 
> overdocument this, because it could vary.  The examples just above this are 
> hopefully helpful.

Yeah, but if we can define it specifically, and disallow its modification, it 
simplifies things. And if I understand correctly, paths like SHAREDIR defined at 
compile time are absolute paths, not suffixes to a prefix, yes?

But perhaps our packaging friends object to disallowing customization of this 
suffix?

Best,

David




signature.asc
Description:

Prevent COPY FREEZE on Foreign tables

2025-02-03 Thread Sami Imseih
Hi,

I was looking at COPY FREEZE and found that it's
possible to run this command on a foreign table.
This really does not make sense, as the
optimization cannot be applied to a remote table, and it
can give the user a false impression that it was applied.

"""
postgres=# begin;
BEGIN
postgres=*# create foreign table t1 (id int) server r1;
CREATE FOREIGN TABLE
postgres=*#  copy t1 FROM '/tmp/copy_data' freeze;
COPY 99

-- on the foreign server

postgres=# SELECT count(*), all_visible, all_frozen, pd_all_visible
FROM pg_visibility('t1'::regclass)
group by all_visible, all_frozen, pd_all_visible;

 count | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     5 | f           | f          | f
(1 row)

"""

The other issue here is that one can use COPY FREEZE
on a foreign table only if the foreign table was created in the
current transaction. A truncate will not work, making the error
message wrong.

"""
postgres=# begin;
BEGIN
postgres=*# truncate table foreign_table_1;
TRUNCATE TABLE
postgres=*#  copy foreign_table_1 FROM 'copy_data' freeze;
ERROR:  cannot perform COPY FREEZE because the table was not created
or truncated in the current subtransaction
postgres=!#
"""

I think we should just block foreign tables as we do with
partitioned tables. The attached patch does that.

I was also looking at why we block a partitioned parent table from COPY
FREEZE[1], but the comments do not convince me this is a good idea. I think
there are good cases to allow this, considering there is a common use case
in which a single COPY command loads a large amount of data, making the
overhead of checking the partitions worth the value of the FREEZE
optimization. I will probably start a separate thread for this.

Regards,

Sami Imseih
Amazon Web Services (AWS)

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/commands/copyfrom.c#L727-L735


v1-0001-Disallow-Foreign-Tables-with-COPY-FREEZE.patch
Description: Binary data


Re: injection points for hash aggregation

2025-02-03 Thread Michael Paquier
On Mon, Feb 03, 2025 at 12:45:24PM -0800, Jeff Davis wrote:
> A couple questions on the injection points framework:
> 
> * The patch allows forcing the partition fan-out to one. I could
> imagine forcing it to a specific value, is there a way to do that?
> 
> * The injection_points extension offers the concept of a "local"
> injection point. While that makes sense for the callback function, it
> doesn't make sense when using the "local_var = 123" style, because that
> will happen regardless of the condition, right?
>
> * Callbacks are given very little context from the injection point
> itself, so it's hard for me to imagine what a callback might do other
> than logging the name of the injection point or waiting (as the
> extension implements). What else would callbacks be good for?

Being able to pass down some runtime states to a callback is something
I've drafted a patch for, as of:
https://www.postgresql.org/message-id/z23zce4w1djuk...@paquier.xyz

There was no real agreement on whether this facility was worth having,
because there was no use case for it.  Well, perhaps until now, based
on what you are saying.

If you want to force a callback to use a specific value or trigger
conditions based on that, you could add your own new callback in the
module injection_points and give it some custom data when attaching
the point, with a SQL function that uses InjectionPointAttach()'s
private_data.  Then you could rely on injection_point_allowed(), for
example, to decide if a callback is run or not.  So it depends on if
you want to use data known when attaching a callback or do something
based on a state given by a backend when it runs an attached callback.

Something worth mentioning.  Heikki and Peter G. have mentioned to me
at the last pgconf.dev that it could be nice to allow files to hold
their own callbacks rather than have an extension do that (don't
recall there is a footprint about that on the lists, that was an
informal offline discussion).  This would have the advantage to limit
the number of code-path-specific callbacks in the injection_points
module (or just limit the number of modules), with callbacks only
existing locally where they are used.  IS_INJECTION_POINT_ATTACHED()
is what I see as a simplified version of this idea.
--
Michael


signature.asc
Description: PGP signature


Question -- why is there no errhint_internal function?

2025-02-03 Thread Peter Smith
Hi,

For each of the logging functions (see elog.c) there is an associated
XXX_internal function which is equivalent to its partner, but doesn't
translate the fmt message parameter.

errmsg
errmsg_internal - same as errmsg, but doesn't translate the fmt string

errdetail
errdetail_internal - same but errdetail, doesn't translate the fmt string

errhint
errhint_internal - no such thing ???

I noticed today that there is no 'errhint_internal' function partner
for the 'errhint' function.

Now, it might seem that hints are always intended for user output, so
of course you'll always want them translated, but there are some
calls to this function (like below) where the actual hint message is
already built and translated before %s parameter substitution, so
AFAICT a translation (gettext lookup) of just a "%s" format string
doesn't really achieve anything.

$ grep -r . -e 'errhint("%s"' | grep .c:
./contrib/dblink/dblink.c: message_hint ? errhint("%s", message_hint) : 0,
./contrib/postgres_fdw/connection.c: message_hint ? errhint("%s",
message_hint) : 0,
./src/backend/commands/vacuum.c: hintmsg ? errhint("%s", _(hintmsg)) : 0));
./src/backend/commands/tablecmds.c: (wentry->kind != '\0') ?
errhint("%s", _(wentry->drophint_msg)) : 0));
./src/backend/commands/tablecmds.c: errhint("%s", _(view_updatable_error;
./src/backend/commands/view.c: errhint("%s", _(view_updatable_error;
./src/backend/utils/misc/guc.c: hintmsg ? errhint("%s", _(hintmsg)) : 0));
./src/backend/utils/misc/guc.c: hintmsg ? errhint("%s", _(hintmsg)) : 0));
./src/backend/utils/misc/guc.c: hintmsg ? errhint("%s", _(hintmsg)) : 0));
./src/backend/utils/misc/guc.c: errhint("%s", GUC_check_errhint_string) : 0));
./src/backend/utils/misc/guc.c: errhint("%s", GUC_check_errhint_string) : 0));
./src/backend/utils/misc/guc.c: errhint("%s", GUC_check_errhint_string) : 0));
./src/backend/utils/misc/guc.c: errhint("%s", GUC_check_errhint_string) : 0));
./src/backend/utils/misc/guc.c: errhint("%s", GUC_check_errhint_string) : 0));
./src/pl/plpgsql/src/pl_exec.c: (err_hint != NULL) ? errhint("%s",
err_hint) : 0,
./src/pl/plpython/plpy_elog.c: (hint) ? errhint("%s", hint) : 0,
./src/pl/plpython/plpy_plpymodule.c: (hint != NULL) ? errhint("%s", hint) : 0,

~

I wondered if such code as in those examples might prefer to call
errhint_internal to avoid making an unnecessary gettext lookup of
"%s".

OTOH, was an errhint_internal function deliberately omitted because
calling a superfluous gettext was not considered important enough to
bother?

==

Also, quite similar to this question --- I found a bunch of errmsg and
errdetail calls where the fmt string is just "%s". Are those correct,
or should those really be using the XXX_internal version of the
function instead?

$ grep -r . -e 'errmsg("%s"' | grep .c:
./contrib/pgcrypto/px.c: errmsg("%s", px_strerror(err;
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV))),
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV))),
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV))),
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV))),
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV))),
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV))),
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV))),
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV);
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV);
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV);
./src/pl/plperl/plperl.c: errmsg("%s", strip_trailing_ws(sv2cstr(ERRSV);
./src/pl/tcl/pltcl.c: errmsg("%s", emsg),
./src/pl/tcl/pltcl.c: errmsg("%s", UTF_U2E(Tcl_GetString(objv[2]);

$ grep -r . -e '\serrdetail("%s"' | grep .c:
./src/backend/executor/execExprInterp.c: errdetail("%s",
jsestate->escontext.error_data->message)));
./src/backend/executor/execExprInterp.c: errdetail("%s",
jsestate->escontext.error_data->message)));

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] Fix incorrect range in pg_regress comment

2025-02-03 Thread Michael Paquier
On Sun, Feb 02, 2025 at 10:38:32PM -0500, Tom Lane wrote:
> If I complain about it I gotta fix it, huh?  Okay, done.

Thanks.  :D
--
Michael


signature.asc
Description: PGP signature


Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-03 Thread Amit Kapila
On Mon, Feb 3, 2025 at 6:35 PM Shlok Kyal  wrote:
>
> I reviewed the v66 patch. I have few comments:
>
> 1. I also feel the default value should be set to '0' as suggested by
> Vignesh in 1st point of [1].
>

+1. This will ensure that the idle slots won't be invalidated by
default, the same as HEAD. We can change the default value based on
user inputs.

> 2. Should we allow copying of invalidated slots?
> Currently we are able to copy slots which are invalidated:
>
> postgres=# select slot_name, active, restart_lsn, wal_status,
> inactive_since , invalidation_reason from pg_replication_slots;
>  slot_name | active | restart_lsn | wal_status |          inactive_since          | invalidation_reason
> -----------+--------+-------------+------------+----------------------------------+---------------------
>  test1     | f      | 0/16FDDE0   | lost       | 2025-02-03 18:28:01.802463+05:30 | idle_timeout
> (1 row)
>
> postgres=# select pg_copy_logical_replication_slot('test1', 'test2');
>  pg_copy_logical_replication_slot
> --
>  (test2,0/16FDE18)
> (1 row)
>
> postgres=# select slot_name, active, restart_lsn, wal_status,
> inactive_since , invalidation_reason from pg_replication_slots;
>  slot_name | active | restart_lsn | wal_status |          inactive_since          | invalidation_reason
> -----------+--------+-------------+------------+----------------------------------+---------------------
>  test1     | f      | 0/16FDDE0   | lost       | 2025-02-03 18:28:01.802463+05:30 | idle_timeout
>  test2     | f      | 0/16FDDE0   | reserved   | 2025-02-03 18:29:53.478023+05:30 |
> (2 rows)
>

Is this related to this patch, or is it the behavior of HEAD? If this
behavior is not introduced by this patch then we should discuss it in
a separate thread. I can't think of why anyone would want to copy
invalid slots, so we should probably prohibit copying them, but that
is a matter for separate discussion unless it was introduced by this
patch.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-03 Thread Amit Kapila
On Tue, Feb 4, 2025 at 4:02 AM Peter Smith  wrote:
>
> On Mon, Feb 3, 2025 at 5:34 PM Amit Kapila  wrote:
> >
> > On Mon, Feb 3, 2025 at 6:16 AM Peter Smith  wrote:
> > >
> > >
> > > 2.
> > > + appendStringInfo(&err_hint, "You might need to increase \"%s\".",
> > > + "max_slot_wal_keep_size");
> > >   break;
> > > 2a.
> > > In this case, shouldn't you really be using macro _("You might need to
> > > increase \"%s\".") so that the common format string would be got using
> > > gettext()?
> > >
> > > ~
> > >
> > >
> > > ~~~
> > >
> > > 3.
> > > + appendStringInfo(&err_hint, "You might need to increase \"%s\".",
> > > + "idle_replication_slot_timeout");
> > > + break;
> > >
> > > 3a.
> > > Ditto above. IMO this common format string should be got using macro.
> > > e.g.: _("You might need to increase \"%s\".")
> > >
> > > ~
> >
> > Instead, we can directly use '_(' in errhint as we are doing in one
> > other similar place "errhint("%s", _(view_updatable_error;". I
> > think we didn't use it for errdetail because, in one of the cases, it
> > needs to use ngettext
> >
>
> -1 for this suggestion because this will end up causing a gettext() on
> the entire hint where the GUC has already been substituted.
>
> e.g. it is effectively doing
>
> _("You might need to increase \"max_slot_wal_keep_size\".")
> _("You might need to increase \"idle_replication_slot_timeout\".")
>
> But that is contrary to the goal of reducing the burden on translators
> by using *common* messages wherever possible. IMO we should only
> request translation of the *common* part of the hint message.
>

Fair point. So, we can ignore my suggestion.

-- 
With Regards,
Amit Kapila.




Re: SQLFunctionCache and generic plans

2025-02-03 Thread Tom Lane
Pavel Stehule  writes:
> I did multiple benchmarks, and it still looks like the proposed patch
> doesn't help and has significant overhead

Yeah, your fx() test case is clearly worse.  For me,

HEAD:

regression=# do $$
begin
  for i in 1..100 loop
perform fx((random()*100)::int); -- or fx2
  end loop;
end;
$$;
DO
Time: 5229.184 ms (00:05.229)

PATCH:

regression=# do $$
begin
  for i in 1..100 loop
perform fx((random()*100)::int); -- or fx2
  end loop;
end;
$$;
DO
Time: 6934.413 ms (00:06.934)

Adding some debug printout shows me that BuildCachedPlan is called to
construct a custom plan on every single execution, which is presumably
because the patch doesn't make any attempt to carry plancache state
across successive executions of the same query.  If we were saving that
state it would have soon switched to a generic plan and then won big.

So, even though I thought we could leave that for later, it seems like
maybe we have to have it before we'll have a committable patch.

There might be some residual inefficiency in there though.  In the
unpatched code we'd be calling pg_parse_query and pg_plan_query once
per execution.  You'd think that would cost more than BuildCachedPlan,
which can skip the raw-parsing part.  Even more interesting, the
patch gets slower yet if we use a new-style SQL function:

regression=# create or replace function fx3 (int) returns int immutable
regression-# begin atomic select $1 + $1; end;
CREATE FUNCTION
Time: 0.813 ms
regression=# do $$ 
begin
  for i in 1..100 loop
perform fx3((random()*100)::int); -- or fx2
  end loop;
end;
$$;
DO
Time: 8007.062 ms (00:08.007)

That makes no sense either, because with a new-style SQL
function we should be skipping parse analysis as well.

But wait: HEAD takes
Time: 6632.709 ms (00:06.633)
to do the same thing.  So somehow the new-style SQL function
stuff is very materially slower in this use-case, with or
without this patch.  I do not understand why.

Definitely some performance investigation needs to be done here.
Even without cross-query plan caching, I don't see why the
patch isn't better than it is.  It ought to be at least
competitive with the unpatched code.

(I've not read the v5 patch yet, so I have no theories.)

regards, tom lane




Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-02-03 Thread Sami Imseih
On Mon, Feb 3, 2025 at 11:29 AM Alex Friedman  wrote:
>
> Hi Sami,
>
> Thanks for the feedback.
>
> > 1/ Remove this as
> > "(50% of the maximum, which is about 20GB),"
> >
> > [1] tried to avoid explaining this level of detail, and I
> > agree with that.
>
> I feel it is critical for users to know what is the hard limit of
> multixact members. As PG doesn't (yet) expose how many multixact
> members are in use, the only way for users to know the distance to
> members wraparound is by monitoring the members directory space usage.
> So it seems to me that the 20 GB number is very important to have in
> the docs.

A few paragraphs up the docs, there is this mention:

". There is a separate storage area which holds the list of members
in each multixact, which also uses a 32-bit counter and which must
also be managed."

Maybe we can add more to this paragraph, such as:

"also be managed. This members area can grow to about 20GB"

And then in the proposed correction:

"
Also, if the storage occupied by the multixact members area exceeds 10GB
(50% of the maximum size the members area can grow to), aggressive vacuum
scans will occur more often for all tables.
"

What do you think?

> > 2/ c/"about 10GB"/"10GB" the "about" does not seem necessary here.
>
> The threshold is actually ~10.015 GiB (due to the 12 bytes wasted per
> 8KB page), or ~10.75 GB, so to avoid confusion by users when
> aggressive autovacuum doesn't trigger exactly at 10GB, I believe we
> should either be exact, or say that we are not being exact. Being
> exact is difficult as it depends on the block size. And as I looked
> through the doc page in question, I noticed there are already several
> cases using the "about" wording, e.g. "about 50MB of pg_xact storage"
> and "about 2GB of pg_commit_ts storage", so here I went for
> consistency with the rest of the doc.

I really don't have a strong opinion about this, except it seemed unnecessary.
But if it helps clarify that it's not an exact 10GB, I am ok with keeping it.
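For reference, the figures being discussed can be reproduced with a bit of arithmetic. This is only a sketch, assuming the default 8KB block size and the on-disk layout of 1636 five-byte members per page (leaving 12 bytes unused), as mentioned upthread:

```python
# Hedged sketch: approximate multixact members storage limits.
# Assumptions: default 8KB block size; 1636 members per page, each member
# taking 5 bytes (4-byte xid plus flag bits), so 12 bytes/page are unused.
BLCKSZ = 8192
MEMBERS_PER_PAGE = 1636
MAX_MEMBERS = 2**32              # members are addressed by a 32-bit counter

total_bytes = MAX_MEMBERS / MEMBERS_PER_PAGE * BLCKSZ
total_gib = total_bytes / 2**30
total_gb = total_bytes / 10**9
threshold_gib = total_gib / 2    # aggressive vacuum kicks in at 50%

print(f"max members storage: ~{total_gib:.3f} GiB (~{total_gb:.2f} GB)")
print(f"50% threshold: ~{threshold_gib:.3f} GiB")
```

Under those assumptions this comes out to roughly 20.03 GiB maximum and a ~10.015 GiB threshold, consistent with the numbers quoted earlier in the thread.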

Regards,

Sami




Re: Prevent COPY FREEZE on Foreign tables

2025-02-03 Thread Zhang Mingli
Hi,


Zhang Mingli
www.hashdata.xyz
On Feb 4, 2025 at 04:18 +0800, Sami Imseih , wrote:
>
> This really does not make sense as this
> optimization cannot be applied to a remote table and it
> can give a user a false impression that it was.
+1,

```
+ /*
+ * Raise an error on foreign tables as it's not possible to apply
+ * the COPY FREEZE optimization to a remote relation.
+ */
+ if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot perform COPY FREEZE on a foreign table")));
+ }
+
```
Instead of throwing an error, how about we turn that into a warning?
This way, if someone is batch-importing data for multiple tables, it won’t 
interrupt their script that generates a COPY for each table.
Is it better to give them a heads-up without making them modify their commands 
right away?


Re: SQLFunctionCache and generic plans

2025-02-03 Thread Tom Lane
I wrote:
> But wait: HEAD takes
> Time: 6632.709 ms (00:06.633)
> to do the same thing.  So somehow the new-style SQL function
> stuff is very materially slower in this use-case, with or
> without this patch.  I do not understand why.

"perf" tells me that in the fx3 test, a full third of the runtime
is spent inside stringToNode(), with about three-quarters of that
going into pg_strtok().  This makes no sense to me: we'll be reading
the prosqlbody of fx3(), sure, but that's not enormously long (about
1200 bytes).  And pg_strtok() doesn't look remarkably slow.  There's
no way this should be taking more time than raw parsing + parse
analysis, even for such a trivial query as "select $1 + $1".

There's been some talk of getting rid of our existing nodetree
storage format in favor of something more efficient.  Maybe we
should put a higher priority on getting that done.  But anyway,
that seems orthogonal to the current patch.

> Even without cross-query plan caching, I don't see why the
> patch isn't better than it is.  It ought to be at least
> competitive with the unpatched code.

This remains true.

regards, tom lane




Re: injection points for hash aggregation

2025-02-03 Thread Jeff Davis
On Mon, 2025-02-03 at 12:45 -0800, Jeff Davis wrote:
> * The patch allows forcing the partition fan-out to one. I could
> imagine forcing it to a specific value, is there a way to do that?

I hit "send" too quickly and this caused test failures in CI. Attaching
v2.

Changes:

  * a new injection point to force spilling at 1000 tuples so that the
test is deterministic (required some minor refactoring in
hash_agg_check_limits())
  * added a branch to guard against a shift-by-32, which could not
happen in the code before, because the number of partitions was a
minimum of 4
  * minor refactor of hash_agg_set_limits() to avoid an implicit
assumption. This is not directly related, so I added it as a separate
patch.


Regards,
Jeff Davis

From 513889206e048abc2493d2902787404f86ca45d7 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sun, 2 Feb 2025 05:26:58 -0800
Subject: [PATCH v2 1/2] Add injection points for hash aggregation.

---
 src/backend/executor/nodeAgg.c| 62 ++---
 src/test/modules/injection_points/Makefile|  2 +-
 .../injection_points/expected/hashagg.out | 68 +++
 src/test/modules/injection_points/meson.build |  1 +
 .../modules/injection_points/sql/hashagg.sql  | 26 +++
 5 files changed, 150 insertions(+), 9 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/hashagg.out
 create mode 100644 src/test/modules/injection_points/sql/hashagg.sql

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3005b5c0e3b..35cf18e5282 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -269,6 +269,7 @@
 #include "utils/datum.h"
 #include "utils/dynahash.h"
 #include "utils/expandeddatum.h"
+#include "utils/injection_point.h"
 #include "utils/logtape.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -1489,6 +1490,14 @@ build_hash_tables(AggState *aggstate)
 		   perhash->aggnode->numGroups,
 		   memory);
 
+#ifdef USE_INJECTION_POINTS
+		if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-oversize-table"))
+		{
+			nbuckets = memory / sizeof(TupleHashEntryData);
+			INJECTION_POINT_CACHED("hash-aggregate-oversize-table");
+		}
+#endif
+
 		build_hash_table(aggstate, setno, nbuckets);
 	}
 
@@ -1860,17 +1869,36 @@ hash_agg_check_limits(AggState *aggstate)
 	 true);
 	Size		hashkey_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory,
 		true);
+	bool		do_spill = false;
 
 	/*
-	 * Don't spill unless there's at least one group in the hash table so we
-	 * can be sure to make progress even in edge cases.
+	 * Spill if exceeding memory limit. Don't spill unless there's at least
+	 * one group in the hash table so we can be sure to make progress even in
+	 * edge cases.
 	 */
-	if (aggstate->hash_ngroups_current > 0 &&
-		(meta_mem + hashkey_mem > aggstate->hash_mem_limit ||
-		 ngroups > aggstate->hash_ngroups_limit))
+	if (ngroups > 0 && (meta_mem + hashkey_mem > aggstate->hash_mem_limit))
+		do_spill = true;
+
+	/*
+	 * Spill if exceeding ngroups limit. This is important if transition
+	 * values may grow, e.g. for ARRAY_AGG().
+	 */
+	if (ngroups > aggstate->hash_ngroups_limit)
+		do_spill = true;
+
+#ifdef USE_INJECTION_POINTS
+	if (ngroups > 1000)
 	{
-		hash_agg_enter_spill_mode(aggstate);
+		if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-spill-1000"))
+		{
+			do_spill = true;
+			INJECTION_POINT_CACHED("hash-aggregate-spill-1000");
+		}
 	}
+#endif
+
+	if (do_spill)
+		hash_agg_enter_spill_mode(aggstate);
 }
 
 /*
@@ -1881,6 +1909,7 @@ hash_agg_check_limits(AggState *aggstate)
 static void
 hash_agg_enter_spill_mode(AggState *aggstate)
 {
+	INJECTION_POINT("hash-aggregate-enter-spill-mode");
 	aggstate->hash_spill_mode = true;
 	hashagg_recompile_expressions(aggstate, aggstate->table_filled, true);
 
@@ -2652,6 +2681,7 @@ agg_refill_hash_table(AggState *aggstate)
 	 */
 	hashagg_recompile_expressions(aggstate, true, true);
 
+	INJECTION_POINT("hash-aggregate-process-batch");
 	for (;;)
 	{
 		TupleTableSlot *spillslot = aggstate->hash_spill_rslot;
@@ -2900,6 +2930,15 @@ hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
 			 used_bits, &partition_bits);
 
+#ifdef USE_INJECTION_POINTS
+	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
+	{
+		npartitions = 1;
+		partition_bits = 0;
+		INJECTION_POINT_CACHED("hash-aggregate-single-partition");
+	}
+#endif
+
 	spill->partitions = palloc0(sizeof(LogicalTape *) * npartitions);
 	spill->ntuples = palloc0(sizeof(int64) * npartitions);
 	spill->hll_card = palloc0(sizeof(hyperLogLogState) * npartitions);
@@ -2908,7 +2947,10 @@ hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
 		spill->partitions[i] = LogicalTapeCreate(tapeset);
 
 	spill->shift = 32 - used_bits - partition_bits;
-	spill->mask = (

Re: SQL/JSON json_table plan clause

2025-02-03 Thread Amit Langote
Hi Nikita,

On Wed, Dec 18, 2024 at 12:11 AM Nikita Malakhov  wrote:
>
> Hi hackers!
>
> This thread is further work continued in [1], where Amit Langote
> suggested starting discussion on the remaining SQL/JSON feature
> 'PLAN clause for JSON_TABLE' anew.
>
> We'd like to help with merging SQL/JSON patches into vanilla,
> and have adapted PLAN clause to recent changes in JSON_TABLE
> function.

Thanks for working on this.

> While doing this we've found that some tests with the PLAN clause
> were incorrect, along with JSON_TABLE behavior with this clause.
> We've corrected this behavior, but these corrections required reverting
> some removed and heavily refactored code, so we'd be glad for review
> and feedback on this patch.

Sorry, I don't fully understand this paragraph.  Do you mean that
there might be bugs in the existing JSON_TABLE() functionality that
was committed into v17?

-- 
Thanks, Amit Langote




Re: Avoid updating inactive_since for invalid replication slots

2025-02-03 Thread Peter Smith
Hi Nisha.

Some review comments for patch v1-0001

==
GENERAL - Missing Test case?

1.
Should there be some before/after test case for 'active_since' value
with invalid slots to verify that the patch is doing what it says?

==
src/backend/replication/slot.c

ReplicationSlotAcquire:

2.
I saw some other code in ReplicationSlotAcquire doing:

/*
 * Reset the time since the slot has become inactive as the slot is active
 * now.
 */
SpinLockAcquire(&s->mutex);
s->inactive_since = 0;
SpinLockRelease(&s->mutex);

~

Should that be changed to use the new function like:
ReplicationSlotSetInactiveSince(s, 0, true);

Or (if not) then it probably needs a comment to say why not.
~~~

RestoreSlotFromDisk:

3.
+ if (now == 0)
+ now = GetCurrentTimestamp();
+
+ ReplicationSlotSetInactiveSince(slot, now, false);

3a.
The code from v65-0001 in the other thread [1] (where the bulk of this
v1 patch came from) used to have a RestoreSlotFromDisk function
comment saying "Avoid calling ReplicationSlotSetInactiveSince() here,
as it will not set the time for invalid slots."

In other words, yesterday we were deliberately NOT calling function
ReplicationSlotSetInactiveSince, but today we deliberately ARE calling
it  (???).

Why has it changed?

~~~

3b.
The other code that is similar to this deferred assignment of 'now'
(see function update_synced_slots_inactive_since) includes a comment
/* Use the same inactive_since time for all the slots. */.

Should this code do the same?

~


==
src/include/replication/slot.h

4.
+static inline void
+ReplicationSlotSetInactiveSince(ReplicationSlot *s, TimestampTz now,
+ bool acquire_lock)
+{
+ if (s->data.invalidated != RS_INVAL_NONE)
+ return;
+
+ if (acquire_lock)
+ SpinLockAcquire(&s->mutex);
+
+ s->inactive_since = now;
+
+ if (acquire_lock)
+ SpinLockRelease(&s->mutex);
+}
+

4a.
I think it might not be a good idea to call the parameter 'now',
because that might not always be the meaning -- e.g., in another
comment I suggested passing a value of 0.

A more generic TimestampTz name like 'ts' might be better here; just
the caller argument can be called 'now' when appropriate.

~

4b.
Since this is a static inline function I assume performance is
important (e.g. sometimes it is called within a spinlock). If that is
the case, then maybe writing this logic with fewer conditions would be
better.

SUGGESTION

if (s->data.invalidated == RS_INVAL_NONE)
{
  if (acquire_lock)
  {
SpinLockAcquire(&s->mutex);
s->inactive_since = ts;
SpinLockRelease(&s->mutex);
  }
  else
s->inactive_since = ts;
}

==
[1] 
https://www.postgresql.org/message-id/CABdArM7D9u1an6gzfArAL32Jn0QQkKs7JffUxcZ9EqzAaGrfvQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2025-02-03 Thread vignesh C
On Tue, 28 Jan 2025 at 08:40, Ajin Cherian  wrote:
>
> Here's a patch-set created based on the design changes proposed by Hou-san.

Few comments:
1) Shouldn't we do the same thing for other DecodeXXX functions?
@@ -915,6 +915,11 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
return;

+   if (ctx->reorder->can_filter_change &&
+   ReorderBufferFilterByRelFileLocator(ctx->reorder, XLogRecGetXid(r),
+   buf->origptr, &target_locator, true))
+   return;
+

2) Let's add some debug logs so that it will be easy to verify the
changes that are getting filtered, or else we will have to debug and
verify them:
@@ -915,6 +915,11 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
return;

+   if (ctx->reorder->can_filter_change &&
+   ReorderBufferFilterByRelFileLocator(ctx->reorder, XLogRecGetXid(r),
+   buf->origptr, &target_locator, true))
+   return;

3) Also there are no tests currently, probably if we add the above
mentioned debug logs we could add few tests and verify them based on
the logs.

4) Can you elaborate a bit in the commit message why we need to
capture if a transaction has snapshot changes:
Subject: [PATCH v12 1/3] Track transactions with internal snapshot changes

Track transactions which have snapshot changes with a new flag
RBTXN_HAS_SNAPSHOT_CHANGES

Regards,
Vignesh




RE: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2025-02-03 Thread Zhijie Hou (Fujitsu)
On Friday, January 31, 2025 9:43 PM Kuroda, Hayato/黒田 隼人 

> 
> Dear Ajin, Hou,
> 
> > - Snapshot construction
> 
> I understand the approach that we do not try to filter for all the
> workloads; we just do best-effort filtering.
> 
> 3.
> Both ReorderBuffer and RelFileLocatorFilterCache have the same lifetime, but
> RelFileLocatorFilterCache is provided as a global variable, which is quite
> strange. Can you somehow avoid this? I considered an idea but could not find
> one, because we do not have an API to unregister the relcache callback.

I think we already have precedent of this (e.g., RelationSyncCache in 
pgoutput.c),
so, I am OK for it unless we come up with better ideas.

> 
> 4.
> For output plugins which do not cache the catalog, it may not be able to do
> filtering without a transaction. For them, the early filter may degrade
> performance because it requires opening a transaction for every change.
> Is my understanding correct?

Partially correct. To be precise, I think we should recommend that an output
plugin that cannot filter without a transaction simply avoid implementing this
callback in the first place, so that it would not cause any degradation.

Best Regards,
Hou zj


Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-03 Thread Andrey Borodin



> On 3 Feb 2025, at 23:18, Tom Lane  wrote:
> 
> But I don't want to think of
> EEOP_PARAM_CALLBACK as being specifically tied to PL/pgSQL.

I've found one user on GitHub [0], and a lot of copied code like [1]. And there
might be some unindexed cases.
Let's keep paramarg2.


Best regards, Andrey Borodin.
[0] 
https://github.com/wanglinn/xadb/blob/7695b7edcb0a89f3173b648c0da5b953538f2aa9/src/backend/pgxc/pool/execRemote.c#L835
[1] 
https://github.com/babelfish-for-postgresql/babelfish_extensions/blob/376cf488804fa02f9b1db5bbfbe74e98627fe96c/contrib/babelfishpg_tsql/src/pl_exec.c#L8030



injection points for hash aggregation

2025-02-03 Thread Jeff Davis
Attached is a patch that adds a few injection points for hash
aggregation.

A couple questions on the injection points framework:

* The patch allows forcing the partition fan-out to one. I could
imagine forcing it to a specific value, is there a way to do that?

* The injection_points extension offers the concept of a "local"
injection point. While that makes sense for the callback function, it
doesn't make sense when using the "local_var = 123" style, because that
will happen regardless of the condition, right?

* Callbacks are given very little context from the injection point
itself, so it's hard for me to imagine what a callback might do other
than logging the name of the injection point or waiting (as the
extension implements). What else would callbacks be good for?

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

From f98ee96920de46a5bba4da4079395d5e959472a4 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sun, 2 Feb 2025 05:26:58 -0800
Subject: [PATCH v1] Add injection points for hash aggregation.

---
 src/backend/executor/nodeAgg.c| 20 ++
 src/test/modules/injection_points/Makefile|  2 +-
 .../injection_points/expected/hashagg.out | 68 +++
 src/test/modules/injection_points/meson.build |  1 +
 .../modules/injection_points/sql/hashagg.sql  | 22 ++
 5 files changed, 112 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/injection_points/expected/hashagg.out
 create mode 100644 src/test/modules/injection_points/sql/hashagg.sql

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3005b5c0e3b..267a3229145 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -269,6 +269,7 @@
 #include "utils/datum.h"
 #include "utils/dynahash.h"
 #include "utils/expandeddatum.h"
+#include "utils/injection_point.h"
 #include "utils/logtape.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -1489,6 +1490,14 @@ build_hash_tables(AggState *aggstate)
 		   perhash->aggnode->numGroups,
 		   memory);
 
+#ifdef USE_INJECTION_POINTS
+		if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-oversize-table"))
+		{
+			nbuckets = memory / sizeof(TupleHashEntryData);
+			INJECTION_POINT_CACHED("hash-aggregate-oversize-table");
+		}
+#endif
+
 		build_hash_table(aggstate, setno, nbuckets);
 	}
 
@@ -1881,6 +1890,7 @@ hash_agg_check_limits(AggState *aggstate)
 static void
 hash_agg_enter_spill_mode(AggState *aggstate)
 {
+	INJECTION_POINT("hash-aggregate-enter-spill-mode");
 	aggstate->hash_spill_mode = true;
 	hashagg_recompile_expressions(aggstate, aggstate->table_filled, true);
 
@@ -2652,6 +2662,7 @@ agg_refill_hash_table(AggState *aggstate)
 	 */
 	hashagg_recompile_expressions(aggstate, true, true);
 
+	INJECTION_POINT("hash-aggregate-process-batch");
 	for (;;)
 	{
 		TupleTableSlot *spillslot = aggstate->hash_spill_rslot;
@@ -2900,6 +2911,15 @@ hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
 			 used_bits, &partition_bits);
 
+#ifdef USE_INJECTION_POINTS
+	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
+	{
+		npartitions = 1;
+		partition_bits = 0;
+		INJECTION_POINT_CACHED("hash-aggregate-single-partition");
+	}
+#endif
+
 	spill->partitions = palloc0(sizeof(LogicalTape *) * npartitions);
 	spill->ntuples = palloc0(sizeof(int64) * npartitions);
 	spill->hll_card = palloc0(sizeof(hyperLogLogState) * npartitions);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 4f0161fd33a..e680991f8d4 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points hashagg reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/hashagg.out b/src/test/modules/injection_points/expected/hashagg.out
new file mode 100644
index 000..795a53210c4
--- /dev/null
+++ b/src/test/modules/injection_points/expected/hashagg.out
@@ -0,0 +1,68 @@
+-- Test for hash aggregation
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+
+ 
+(1 row)
+
+SELECT injection_points_attach('hash-aggregate-enter-spill-mode', 'notice');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_attach('hash-aggregate-process-batch', 'notice');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_attach('hash-aggregate-single-partition', 'notice');
+ injection_points_attach 
+-
+ 
+(1 r

Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Greg Sabino Mullane
I like the ntypes idea: please find attached version 3 which implements it.
I wrote it slightly differently as I did not want to leave any chance of it
escaping the if else blocks without a match.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


0003-Show-more-intuitive-titles-for-psql-commands.patch
Description: Binary data


Re: NOT ENFORCED constraint feature

2025-02-03 Thread Ashutosh Bapat
On Mon, Feb 3, 2025 at 5:55 PM Alvaro Herrera  wrote:
>
> On 2025-Feb-03, Ashutosh Bapat wrote:
>
> > ```
> >   If the
> >   constraint is NOT ENFORCED, the database system 
> > will
> >   not check the constraint.  It is then up to the application code to
> >   ensure that the constraints are satisfied.  The database system might
> >   still assume that the data actually satisfies the constraint for
> >   optimization decisions where this does not affect the correctness of 
> > the
> >   result.
> > ```
>
> IMO the third sentence should be removed because it is bogus.  There's
> no situation in which a not-enforced constraint can be used for any
> query optimizations -- you cannot know if a constraint remains valid
> after it's been turned NOT ENFORCED, because anyone could insert data
> that violates it milliseconds after it stops being enforced.  I think
> the expectation that the application is going to correctly enforce the
> constraint after it's told the database server not to enforce it, is
> going to be problematic.  As I recall, we already do this in FDWs for
> instance and it's already a problem.

What's the use of a NOT ENFORCED constraint then?

To me NOT ENFORCED looks similar to informational constraints of IBM
DB2 [1]. It seems that they have TRUSTED/NOT TRUSTED similar to
PostgreSQL's VALID/NOT VALID [2].

[1] https://www.ibm.com/docs/en/db2/11.1?topic=constraints-informational
[2] https://www.ibm.com/docs/en/db2/11.1?topic=statements-create-table

-- 
Best Wishes,
Ashutosh Bapat




Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Greg Sabino Mullane
On Mon, Feb 3, 2025 at 1:07 PM Tom Lane  wrote:

> As presented, I think this fails the translatability guideline


Thanks, good point.  Please find attached a slightly more verbose version
that should be better for translations.

For myself, if we were going to do something like this, I'd kind of like to
> cover more cases:
>
> greg=# \dti
> List of tables and indexes
>

I toyed with that for a bit, but as you say, without generating a ton of
combinations to translate, we'd have to fall back on run-time
constructions. Neither is ideal. I also realized that I almost never type
"\dti". Very common for me are \d and \dt and \dv etc. but combinations are
something I never bother with. At that point, I just do a \d. I think given
how rare (granted, anecdotally) those combinations are, it's okay if we
expose people to the "r" word.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


0002-Show-more-intuitive-titles-for-psql-commands.patch
Description: Binary data


Re: pg_rewind with --write-recovery-conf option doesn't write dbname to primary_conninfo value.

2025-02-03 Thread Masahiko Sawada
On Sun, Feb 2, 2025 at 10:00 PM Peter Smith  wrote:
>
> On Mon, Feb 3, 2025 at 4:50 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Sawada-san,
> >
> > > I think it's a good idea to support it at least on HEAD. I've attached
> > > a patch for that.
> >
> > +1. I've confirmed that pg_rewind and -R can't output dbname for now,
> > and your patch allows to do it.
> >
> > Few comments for your patch.
> >
> > 1.
> >
> > pg_basebackup.sgml has below description. I feel this can be ported to
> > pg_rewind.sgml as well.
> >
> > ```
> > The dbname will be recorded only if the dbname was
> > specified explicitly in the connection string or
> > environment variable.
> > ```
> >
> > 2.
> > I'm not sure whether recovery_gen.h/c is a correct place for the exported 
> > function
> > GetDbnameFromConnectionOptions(). The function itself does not handle
> > postgresql.cuto.conf file.
> > I seeked other header files and felt that connect_utils.h might be.
> >
> > ```
> > /*-
> >  *
> >  * Facilities for frontend code to connect to and disconnect from databases.
> > ```
> >
> > Another idea is to change the third API to accept the whole connection 
> > string, and
> > it extracts dbname from it. In this approach we can make 
> > GetDbnameFromConnectionOptions()
> > to static function - which does not feel strange for me.
> >
> > Best regards,
> > Hayato Kuroda
> > FUJITSU LIMITED
> >
>
> Hi Sawada-san,
>
> Here are a few more minor comments in addition to what Kuroda-San
> already posted.

Thank you for reviewing the patch.

>
> ==
> typo in patch name /primary_conninfo/primary_connifo/

Will fix.

>
> ==
> Commit message
>
> no details.
> bad link.

Yeah, I cannot add the discussion link before sending the patch.

>
> ==
> src/fe_utils/recovery_gen.c
>
> 1.
> static char *
> FindDbnameInConnParams(PQconninfoOption *conn_opts)
>
> There is a missing forward declaration of this function. Better to add
> it for consistency because the other static function has one.

Will fix.

>
> ~~~
>
> 2.
> +static char *
> +FindDbnameInConnParams(PQconninfoOption *conn_opts)
> +{
> + PQconninfoOption *conn_opt;
> +
> + for (conn_opt = conn_opts; conn_opt->keyword != NULL; conn_opt++)
> + {
> + if (strcmp(conn_opt->keyword, "dbname") == 0 &&
> + conn_opt->val != NULL && conn_opt->val[0] != '\0')
> + return pg_strdup(conn_opt->val);
> + }
> + return NULL;
> +}
>
> 2a.
> I know you copied this, but it seems a bit strange that this function
> is named "Params" when everything about it including its parameter and
> comments and the caller talks about "_opts" or "Options"

Yes, it seems clearer to use ConnOpts instead.

>
> ~
>
> 2b.
> I know you copied this, but normally 'conn_opt' might have been
> written as a for-loop variable.

Will fix.

>
> ~~~
>
> 3.
> +/*
> + * GetDbnameFromConnectionOptions
> + *
> + * This is a special purpose function to retrieve the dbname from either the
> + * 'connstr' specified by the caller or from the environment variables.
> + *
> + * Returns NULL, if dbname is not specified by the user in the above
> + * mentioned connection options.
> + */
>
> What does "in the above mentioned connection options" mean? In the
> original function comment where this was copied from there was an
> extra sentence ("We follow ... from various connection options.") so
> this had more context, but now that the other sentence is removed
> maybe "above mentioned connection options" looks like it also needs
> rewording.

Agreed, will fix.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2025-02-03 Thread Nathan Bossart
Barring objections, I am planning to commit this one soon.  I might move
the addition of analyze_delay_point() to its own patch, but otherwise I
think it looks good to go.

-- 
nathan




Re: Commitfest app release on Feb 17 with many improvements

2025-02-03 Thread Nathan Bossart
On Fri, Jan 31, 2025 at 03:23:56PM +0100, Jelte Fennema-Nio wrote:
> At the FOSDEM dev meeting we discussed potential improvements to the
> commitfest app and how to handle deploying future changes with minimal
> disruption to existing workflows. We're going to try a new approach:
> announcing a new commitfest release ~2 weeks in advance, including
> release notes. That way people can try out the changes on the staging
> environment and raise concerns/objections. So hereby such an
> announcement:

Thanks for doing this!

> The next commitfest app release will take place on February 17 and
> will contain the following user facing changes:
> 
> 1. Showing CFBot status on patch and commitfest pages
> 2. Showing a link to the CFBot github diff on patch and commitfest pages
> 3. Showing total additions/deletions of the most recent patchset on
> patch and commitfest pages
> 4. Showing additions/deletions of the first patch of the most recent
> patchset on the patch page
> 5. Showing the version and number of patches of the most recent
> patchset on the patch page
> 6. Canonical links for the patch page are now /patch/{patchid} instead
> of /{cfid}/{patchid}
> 7. If you subscribed for emails or patches where you're an author, you
> will now get an email when your patch needs a rebase.
> 8. Clicking headers will toggle sorting, instead of always enabling it
> 9. Allow sorting patches by name
> 10. Allow sorting patches by total additions/deletions (added together
> for total lines changed)
> 11. Add ID column to commitfest page (allows sorting)
> 12. Allow sorting using the header of closed patches too
> 13. Fix dropdown direction for dropdowns at the bottom of the page
> (they now drop up)
> 14. The previous GitHub and CirrusCI links are now integrated into the
> new CFBot status row of the patch page
> 15. The previous GitHub checkout instructions are now integrated into
> the new CFBot status row as a button that copies the commands to the
> clipboard

I'm not sure #4 will be very helpful, but the rest looks reasonable to me.

-- 
nathan




Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-03 Thread Peter Smith
On Mon, Feb 3, 2025 at 5:34 PM Amit Kapila  wrote:
>
> On Mon, Feb 3, 2025 at 6:16 AM Peter Smith  wrote:
> >
> >
> > 2.
> > + appendStringInfo(&err_hint, "You might need to increase \"%s\".",
> > + "max_slot_wal_keep_size");
> >   break;
> > 2a.
> > In this case, shouldn't you really be using macro _("You might need to
> > increase \"%s\".") so that the common format string would be got using
> > gettext()?
> >
> > ~
> >
> >
> > ~~~
> >
> > 3.
> > + appendStringInfo(&err_hint, "You might need to increase \"%s\".",
> > + "idle_replication_slot_timeout");
> > + break;
> >
> > 3a.
> > Ditto above. IMO this common format string should be got using macro.
> > e.g.: _("You might need to increase \"%s\".")
> >
> > ~
>
> Instead, we can directly use '_(' in errhint as we are doing in one
> other similar place "errhint("%s", _(view_updatable_error))". I
> think we didn't use it for errdetail because, in one of the cases, it
> needs to use ngettext
>

-1 for this suggestion because this will end up causing a gettext() on
the entire hint where the GUC has already been substituted.

e.g. it is effectively doing

_("You might need to increase \"max_slot_wal_keep_size\".")
_("You might need to increase \"idle_replication_slot_timeout\".")

But that is contrary to the goal of reducing the burden on translators
by using *common* messages wherever possible. IMO we should only
request translation of the *common* part of the hint message.

e.g.
_("You might need to increase \"%s\".")

~~~

We always do GUC name substitution into a *common* fmt message because
then translators only need to maintain a single translated string
instead of many. You can find examples of this everywhere. For
example, notice the GUC is always substituted into the translated fmt
msgs below; they never have the GUC name included explicitly. The
result is just a single fmt message is needed.

$ grep -r . -e 'errhint("You might need to increase' | grep '.c:'
./src/backend/replication/logical/launcher.c: errhint("You might need
to increase \"%s\".", "max_logical_replication_workers")));
./src/backend/replication/logical/launcher.c: errhint("You might need
to increase \"%s\".", "max_worker_processes")));
./src/backend/storage/lmgr/predicate.c: errhint("You might need to
increase \"%s\".", "max_pred_locks_per_transaction")));
./src/backend/storage/lmgr/predicate.c: errhint("You might need to
increase \"%s\".", "max_pred_locks_per_transaction")));
./src/backend/storage/lmgr/predicate.c: errhint("You might need to
increase \"%s\".", "max_pred_locks_per_transaction")));
./src/backend/storage/lmgr/lock.c: errhint("You might need to increase
\"%s\".", "max_locks_per_transaction")));
./src/backend/storage/lmgr/lock.c: errhint("You might need to increase
\"%s\".", "max_locks_per_transaction")));
./src/backend/storage/lmgr/lock.c: errhint("You might need to increase
\"%s\".", "max_locks_per_transaction")));
./src/backend/storage/lmgr/lock.c: errhint("You might need to increase
\"%s\".", "max_locks_per_transaction")));
./src/backend/storage/lmgr/lock.c: errhint("You might need to increase
\"%s\".", "max_locks_per_transaction")));

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Tom Lane
Greg Sabino Mullane  writes:
> I like the ntypes idea: please find attached version 3 which implements it.
> I wrote it slightly differently as I did not want to leave any chance of it
> escaping the if else blocks without a match.

Hmm, I was envisioning being a little more in-your-face about the
no-match case, because that would surely represent somebody failing
to update the code correctly when adding a new relation kind.
So I was imagining something like the attached.

One problem with it is that while we can leave "List of ???" out
of the set of translatable strings easily, we can't currently
do that for the argument of pg_log_error because it's automatically
a gettext trigger.  I'd rather not burden translators with figuring
out what to do with that.  Is it worth creating pg_log_error_internal,
equivalently to elog and errmsg_internal in the backend?

BTW, I updated the regression tests for this, and that bears out
your argument that one-type commands are the majority.  There
are still a couple "List of relations", but not many.

regards, tom lane

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index aa4363b200..b314fb5d71 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4011,14 +4011,18 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	bool		showSeq = strchr(tabtypes, 's') != NULL;
 	bool		showForeign = strchr(tabtypes, 'E') != NULL;
 
+	int			ntypes;
 	PQExpBufferData buf;
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	int			cols_so_far;
 	bool		translate_columns[] = {false, false, true, false, false, false, false, false, false};
 
-	/* If tabtypes is empty, we default to \dtvmsE (but see also command.c) */
-	if (!(showTables || showIndexes || showViews || showMatViews || showSeq || showForeign))
+	/* Count the number of explicitly-requested relation types */
+	ntypes = showTables + showIndexes + showViews + showMatViews +
+		showSeq + showForeign;
+	/* If none, we default to \dtvmsE (but see also command.c) */
+	if (ntypes == 0)
 		showTables = showViews = showMatViews = showSeq = showForeign = true;
 
 	initPQExpBuffer(&buf);
@@ -4169,14 +4173,55 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	if (PQntuples(res) == 0 && !pset.quiet)
 	{
 		if (pattern)
-			pg_log_error("Did not find any relation named \"%s\".",
-						 pattern);
+		{
+			if (ntypes != 1)
+				pg_log_error("Did not find any relations named \"%s\".", pattern);
+			else if (showTables)
+				pg_log_error("Did not find any tables named \"%s\".", pattern);
+			else if (showIndexes)
+				pg_log_error("Did not find any indexes named \"%s\".", pattern);
+			else if (showViews)
+				pg_log_error("Did not find any views named \"%s\".", pattern);
+			else if (showMatViews)
+				pg_log_error("Did not find any materialized views named \"%s\".", pattern);
+			else if (showSeq)
+				pg_log_error("Did not find any sequences named \"%s\".", pattern);
+			else if (showForeign)
+				pg_log_error("Did not find any foreign tables named \"%s\".", pattern);
+			else				/* should not get here */
+				pg_log_error("Did not find any ??? named \"%s\".", pattern);
+		}
 		else
-			pg_log_error("Did not find any relations.");
+		{
+			if (ntypes != 1)
+				pg_log_error("Did not find any relations.");
+			else if (showTables)
+				pg_log_error("Did not find any tables.");
+			else if (showIndexes)
+				pg_log_error("Did not find any indexes.");
+			else if (showViews)
+				pg_log_error("Did not find any views.");
+			else if (showMatViews)
+				pg_log_error("Did not find any materialized views.");
+			else if (showSeq)
+				pg_log_error("Did not find any sequences.");
+			else if (showForeign)
+				pg_log_error("Did not find any foreign tables.");
+			else				/* should not get here */
+				pg_log_error("Did not find any ??? relations.");
+		}
 	}
 	else
 	{
-		myopt.title = _("List of relations");
+		myopt.title =
+			(ntypes != 1) ? _("List of relations") :
+			(showTables) ? _("List of tables") :
+			(showIndexes) ? _("List of indexes") :
+			(showViews) ? _("List of views") :
+			(showMatViews) ? _("List of materialized views") :
+			(showSeq) ? _("List of sequences") :
+			(showForeign) ? _("List of foreign tables") :
+			"List of ???";		/* should not get here */
 		myopt.translate_header = true;
 		myopt.translate_columns = translate_columns;
 		myopt.n_translate_columns = lengthof(translate_columns);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 74d9ff2998..75a078ada9 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -116,7 +116,7 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 RESET SESSION AUTHORIZATION;
 REASSIGN OWNED BY regress_dep_user1 TO regress_dep_user2;
 \dt deptest
-  List of relations
+List of tables
  Schema |  Name   | Type  | 

Re: Avoid updating inactive_since for invalid replication slots

2025-02-03 Thread Amit Kapila
On Tue, Feb 4, 2025 at 8:37 AM Peter Smith  wrote:
>
> Some review comments for patch v1-0001
>
> ==
> GENERAL - Missing Test case?
>
> 1.
> Should there be some before/after test case for 'active_since' value
> with invalid slots to verify that the patch is doing what it says?
>

I think the existing tests should be sufficient. Adding a new test
just for invalid slots means we need to consider the other properties
of slots as well because in general the invalid slots shouldn't be
updated.

>
> RestoreSlotFromDisk:
>
> 3.
> + if (now == 0)
> + now = GetCurrentTimestamp();
> +
> + ReplicationSlotSetInactiveSince(slot, now, false);
>
> 3a.
> The code from v65-0001 in the other thread [1] (where the bulk of this
> v1 patch came from) used to have a RestoreSlotFromDisk function
> comment saying "Avoid calling ReplicationSlotSetInactiveSince() here,
> as it will not set the time for invalid slots."
>
> In other words, yesterday we were deliberately NOT calling function
> ReplicationSlotSetInactiveSince, but today we deliberately ARE calling
> it  (???).
>
> Why has it changed?
>

See my last comment in that thread (1). In short, inactive_since should
never be updated for invalid slots, and the same should be documented
in the docs.

> ~~~
>
> 3b.
> The other code that is similar to this deferred assignment of 'now'
> (see function update_synced_slots_inactive_since) includes a comment
> /* Use the same inactive_since time for all the slots. */.
>
> Should this code do the same?
>

This makes sense.

>
> 4b.
> Since this is a static inline function I assume performance is
> important (e.g. sometimes it is called within a spinlock). If that is
> the case, then maybe writing this logic with fewer conditions would be
> better.
>
> SUGGESTION
>
> if (s->data.invalidated == RS_INVAL_NONE)
> {
>   if (acquire_lock)
>   {
> SpinLockAcquire(&s->mutex);
> s->inactive_since = ts;
> SpinLockRelease(&s->mutex);
>   }
>   else
> s->inactive_since = ts;
> }
>

I prefer the current coding in the patch as it is simpler to follow.

(1) - 
https://www.postgresql.org/message-id/CAA4eK1L6A-RC2PkqAFjgmmmS6_vKxrLsG33vdJzeeRKBP8RbOA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: Avoid updating inactive_since for invalid replication slots

2025-02-03 Thread Zhijie Hou (Fujitsu)
On Monday, February 3, 2025 8:03 PM Nisha Moond  
wrote:
> 
> Hi Hackers,
> (CC people involved in the earlier discussion)
> 
> Right now, it is possible for the 'inactive_since' value of an invalid 
> replication
> slot to be updated multiple times, which is unexpected behavior.
> As suggested in the ongoing thread [1], this patch introduces a new dedicated
> function to update the inactive_since for a given replication slot, and 
> ensures
> that inactive_since is not updated for an invalid replication slot.
> 
> 
> The v1 patch is attached. Any feedback would be appreciated.

Thanks for sharing the patch, and I agree we should avoid updating
inactive_since for invalid slots.

But I have one question for the code:

+/*
+ * Set slot's inactive_since property unless it was previously invalidated.
+ */
+static inline void
+ReplicationSlotSetInactiveSince(ReplicationSlot *s, TimestampTz now,
+								bool acquire_lock)
+{
+	if (s->data.invalidated != RS_INVAL_NONE)
+		return;

I am wondering whether it is safe to access the 'invalidated' flag without
taking the spinlock. Strictly speaking, I think it's only safe to access a
slot property without the lock when the slot has been acquired by the
current process. So, if it's safe here, could you please add some comments
to clarify that?

Best Regards,
Hou zj



Re: injection points for hash aggregation

2025-02-03 Thread Michael Paquier
On Tue, Feb 04, 2025 at 09:21:49AM +0900, Michael Paquier wrote:
> Being able to pass down some runtime states to a callback is something
> I've drafted a patch for, as of:
> https://www.postgresql.org/message-id/z23zce4w1djuk...@paquier.xyz

Oops.  I've posted an incorrect link here, I meant this one instead:
https://www.postgresql.org/message-id/za8zwmptnjk_i...@paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: [fixed] Trigger test

2025-02-03 Thread Dmitrii Bondar

Dmitrii Bondar wrote on 2025-01-29 16:53:

Hi, Hackers!

I was testing a connection pooler with `make installcheck` and noticed 
that `check_foreign_key()` from the `refint` library reuses the same 
cached plan for cascade `UPDATE`/`DELETE` operations. As a result, a 
cascade `DELETE` is applied after an `UPDATE` command on the primary 
key table (which should not happen after the commit 
https://github.com/postgres/postgres/commit/d489fdfc7f4ccf0010fe0397e7272bdfc257e8f2). 
I have attached a file (test.sql) to reproduce an issue and the 
solution.


Regards,
Dmitrii Bondar.
Found a mistake. Now it should work even if the SPI call fails (v2 
attachment). However, the whole solution looks awkward because if 
`check_primary_key` is triggered by a function other than 
check_foreign_key, we still encounter invalid behavior. The root of the 
problem is the inability to see the row that triggered the initial 
`check_foreign_key`.


I am also considering another solution (v3 attachment): instead of using 
static variables, restrict the use of the `check_primary_key` and 
`check_foreign` functions in BEFORE triggers so that the 
`check_primary_key` trigger can find the new row. This still doesn't 
solve the problem (a user could create their own BEFORE trigger that
makes an `UPDATE` and triggers `check_primary_key`), but it adds less new
code, at least.


Regards,
Dmitrii Bondar

From 054b2945dc7d47328fb972c725a3a89a0af300a9 Mon Sep 17 00:00:00 2001
From: Bondar Dmitrii 
Date: Mon, 3 Feb 2025 15:45:21 +0700
Subject: [PATCH v2] Triggers test fix

---
 contrib/spi/refint.c   | 29 +++---
 src/test/regress/expected/triggers.out | 11 +-
 2 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/contrib/spi/refint.c b/contrib/spi/refint.c
index e1aef7cd2a..7344f6c742 100644
--- a/contrib/spi/refint.c
+++ b/contrib/spi/refint.c
@@ -29,6 +29,9 @@ static int	nFPlans = 0;
 static EPlan *PPlans = NULL;
 static int	nPPlans = 0;
 
+static bool update_in_progress = false;
+static TransactionId update_tx = InvalidTransactionId;
+
 static EPlan *find_plan(char *ident, EPlan **eplan, int *nplans);
 
 /*
@@ -98,6 +101,10 @@ check_primary_key(PG_FUNCTION_ARGS)
 	nargs = trigger->tgnargs;
 	args = trigger->tgargs;
 
+	/* Do not check if it's a cascade update from check_foreign_key */
+	if (update_in_progress && update_tx == GetCurrentTransactionId())
+		return PointerGetDatum(tuple);
+
 	if (nargs % 2 != 1)			/* odd number of arguments! */
 		/* internal error */
 		elog(ERROR, "check_primary_key: odd number of arguments should be specified");
@@ -264,7 +271,6 @@ check_foreign_key(PG_FUNCTION_ARGS)
 #ifdef DEBUG_QUERY
 	elog(DEBUG4, "check_foreign_key: Enter Function");
 #endif
-
 	/*
 	 * Some checks first...
 	 */
@@ -335,10 +341,10 @@ check_foreign_key(PG_FUNCTION_ARGS)
 	kvals = (Datum *) palloc(nkeys * sizeof(Datum));
 
 	/*
-	 * Construct ident string as TriggerName $ TriggeredRelationId and try to
-	 * find prepared execution plan(s).
+	 * Construct ident string as TriggerName $ TriggeredRelationId $ OperationType
+	 * and try to find prepared execution plan(s).
 	 */
-	snprintf(ident, sizeof(ident), "%s$%u", trigger->tgname, rel->rd_id);
+	snprintf(ident, sizeof(ident), "%s$%u$%c", trigger->tgname, rel->rd_id, is_update ? 'U' : 'D');
 	plan = find_plan(ident, &FPlans, &nFPlans);
 
 	/* if there is no plan(s) then allocate argtypes for preparation */
@@ -560,6 +566,9 @@ check_foreign_key(PG_FUNCTION_ARGS)
 	/*
 	 * Ok, execute prepared plan(s).
 	 */
+	if (is_update)
+		update_tx = GetCurrentTransactionId();
+
 	for (r = 0; r < nrefs; r++)
 	{
 		/*
@@ -570,10 +579,11 @@ check_foreign_key(PG_FUNCTION_ARGS)
 
 		relname = args[0];
 
-		snprintf(ident, sizeof(ident), "%s$%u", trigger->tgname, rel->rd_id);
 		plan = find_plan(ident, &FPlans, &nFPlans);
+		update_in_progress = (is_update == 1);
 		ret = SPI_execp(plan->splan[r], kvals, NULL, tcount);
 		/* we have no NULLs - so we pass     here */
+		update_in_progress = false;
 
 		if (ret < 0)
 			ereport(ERROR,
@@ -592,10 +602,15 @@ check_foreign_key(PG_FUNCTION_ARGS)
 		}
 		else
 		{
+			const char* operation;
+
+			if (action == 'c')
+				operation = is_update ? "updated" : "deleted";
+			else
+				operation = "set to null";
 #ifdef REFINT_VERBOSE
 			elog(NOTICE, "%s: " UINT64_FORMAT " tuple(s) of %s are %s",
-				 trigger->tgname, SPI_processed, relname,
-				 (action == 'c') ? "deleted" : "set to null");
+				 trigger->tgname, SPI_processed, relname, operation);
 #endif
 		}
 		args += nkeys + 1;		/* to the next relation */
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index e38304bee5..cc02b81bb2 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -116,12 +116,13 @@ delete from pkeys where pkey1 = 40 and pkey2 = '4';
 NOTICE:  check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
 NOTICE:  check_pkeys_fkey_cascade: 1 tuple(

Re: Optimize scram_SaltedPassword performance

2025-02-03 Thread Michael Paquier
On Mon, Feb 03, 2025 at 12:17:27PM +0100, Daniel Gustafsson wrote:
> The numbers are quite outdated but the gist of it holds.  If we speed it up
> serverside we need to counter that with a higher iteration count, and for a
> rogue client it's unlikely to matter since it wont use our code anyways.
> Speeding up HMAC for other usecases is a different story (but also likely to
> have less measurable performance impact).

Thanks, I've managed to somewhat forget this point.  Note that this
has come up as well when adding the GUC scram_iterations in
b577743000cd, where a lower count can be used for a faster connection.

> Storing any part of a cryptograhic calculation, let alone a key, in a static
> variable doesn't seem entirely like a best practice, and it also wont be
> threadsafe.

Agreed.

By the way, something I have forgotten to mention yesterday..  If a
fast-path gets implemented in these HMAC APIs (perhaps not), it sounds
to me that it would be better to achieve the same for the fallback
non-OpenSSL implementation in src/common/hmac.c, as it would reflect
things properly in the interface.  My 2c.
--
Michael


signature.asc
Description: PGP signature


Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Greg Sabino Mullane
On Mon, Feb 3, 2025 at 5:38 PM Tom Lane  wrote:

> One problem with it is that while we can leave "List of ???" out of the
> set of translatable strings easily, we can't currently do that for the
> argument of pg_log_error because it's automatically a gettext trigger.  I'd
> rather not burden translators with figuring out what to do with that.  Is
> it worth creating pg_log_error_internal, equivalently to elog and
> errmsg_internal in the backend?
>

I don't think so, unless there are other uses of it waiting. For this
particular item, I'm still of the opinion that leaving it as "List of
relations" is a pretty good default for some future somebody who forgets to
update the lists of relation types. New types are certainly not going to
happen often. Better "List of relations" than a release where everybody and
their cousin starts asking what "List of ???" means.

BTW, I updated the regression tests for this, and that bears out your
> argument that one-type commands are the majority.  There are still a couple
> "List of relations", but not many.
>

Thank you for that, I should have done that in my original patch.

Cheers,
Greg

P.S. I found it amusing to see in some quick grepping that pg_controldata
has a msgstr of "???". So far no language has managed to translate it into
something else.

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: Question -- why is there no errhint_internal function?

2025-02-03 Thread Tom Lane
Peter Smith  writes:
> I noticed today that there is no 'errhint_internal' function partner
> for the 'errhint' function.

> Now, it might seem that hints are always intended for user output so
> of course, you'll always want them translated...

Yeah, I think that was the reasoning.  If it needs a hint it must be
a user-facing error.

> but there are some
> calls to this function (like below) where the actual hint message is
> already built and translated before %s parameter substitution, so
> AFAICT translation aka gettext lookup of just a "%s" format string
> doesn't really achieve anything.

There are lots of places where we don't worry about that overhead.
I'm not excited about trying to get rid of it everywhere --- surely
these paths are not very performance-critical?

regards, tom lane




Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Tom Lane
Greg Sabino Mullane  writes:
> ... For this
> particular item, I'm still of the opinion that leaving it as "List of
> relations" is a pretty good default for some future somebody who forgets to
> update the lists of relation types. New types are certainly not going to
> happen often. Better "List of relations" than a release where everybody and
> their cousin starts asking what "List of ???" means.

Hopefully an oversight like that wouldn't make it to release.  But
hiding it by putting out an ordinary-looking title might help it
avoid detection for a long time.

> P.S. I found it amusing to see in some quick grepping that pg_controldata
> has a msgstr of "???". So far no language has managed to translate it into
> something else.

There are several pre-existing "???" strings in describe.c too,
for similar this-shouldn't-happen cases.  I'm not proposing
something that's far outside our existing practice.

regards, tom lane




Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Tom Lane
Greg Sabino Mullane  writes:
> I toyed with that for a bit, but as you say, without generating a ton of
> combinations to translate, we'd have to fall back on run-time
> constructions. Neither is ideal. I also realized that I almost never type
> "\dti". Very common for me are \d and \dt and \dv etc. but combinations are
> something I never bother with. At that point, I just do a \d. I think given
> how rare (granted, anecdotally) those combinations are, it's okay if we
> expose people to the "r" word.

Fair.  This is already a step forward, so it doesn't have to be
perfect.

Looking at the code, I'm not thrilled with the strspn() coding
method.  I'd much rather it relied on the bool flags we already
computed (showTables etc).  I'm wondering about adding a step like

int ntypes = (int) showTables + (int) showIndexes + ...

(the explicit coercions to int probably aren't necessary)
and then the code could look like

(ntypes != 1) ? _("List of relations") :
(showTables) ? _("List of tables") :
(showIndexes) ? _("List of indexes") :
...

"ntypes" could also be used to simplify the logic that forces
all the flags on, up at the top of the function.

regards, tom lane




Re: New GUC autovacuum_max_threshold ?

2025-02-03 Thread wenhui qiu
Hi Nathan,
> I had the opportunity to bring this patch up for discussion at the
> developer meeting at FOSDEM PGDay last week [0].  We discussed a subset
> of the topics folks have already written about in this thread, and AFAICT
> there was general approval among the attendees for proceeding with the
> "hard cap" approach due to its user-friendliness.  Given that, I am
> planning to commit the attached patch in the near future (although I may
> fiddle with the commit message a bit more).
Thanks for your work on this, that is good news. Any method that solves the
issue of vacuum triggering too rarely on large tables is a good one. When
more comprehensive vacuum statistics become available in the future, we can
improve the calculation method then.


On Tue, Feb 4, 2025 at 3:51 AM Nathan Bossart 
wrote:

> I had the opportunity to bring this patch up for discussion at the
> developer meeting at FOSDEM PGDay last week [0].  We discussed a subset
> of the topics folks have already written about in this thread, and AFAICT
> there was general approval among the attendees for proceeding with the
> "hard cap" approach due to its user-friendliness.  Given that, I am
> planning to commit the attached patch in the near future (although I may
> fiddle with the commit message a bit more).
>
> On Tue, Jan 14, 2025 at 11:35:17PM +0300, Alena Rybakina wrote:
> > #autovacuum_vacuum_max_threshold = 1# max number of row
> updates
> > # before vacuum; -1 disables max
> > # threshold
> >
> > I think instead of "# threshold" should be "#vacuum"?
>
> That would more closely match the description of
> autovacuum_vacuum_insert_threshold, which refers to "insert vacuums," but I
> felt it would be weird to refer to "max vacuums."  IMHO it is clearer to
> say that -1 disables the maximum threshold here.
>
> > There is a typo:
> >
> > * if (threshold > vac_max_thresh)
> > * threshold = vac_max_thres; - here
>
> Fixed.
>
> > I think you should add more information to the description of the
> > Relations_needs_vacanalyze function: what is vac_max_thresh and how is it
> > calculated. It is not clear what the below condition means.
> >
> > /* -1 is used to disable max threshold */
> vac_max_thresh = (relopts && relopts->vacuum_max_threshold >= -1)
> 	? relopts->vacuum_max_threshold
> 	: autovacuum_vac_max_thresh;
>
> I looked at the commentary for this function and felt that the comments for
> this new parameter are in line with the comments for all the adjacent
> parameters.  There may be an opportunity to improve this commentary, but
> IMHO that would be better handled in a separate patch that improved it for
> all these parameters.
>
> [0] https://2025.fosdempgday.org/devmeeting
>
> --
> nathan
>


Re: Non-text mode for pg_dumpall

2025-02-03 Thread jian he
hi.

just a quick response for v15.

The pg_restore man page describes the --list option as "List the table of
contents of the archive",
but
$BIN10/pg_restore --format=directory --list --file=1.sql dir10
also outputs the contents of "global.dat"; we should not output that.

in restoreAllDatabases, we can do the following change:
```
/* Open global.dat file and execute/append all the global sql commands. */
if (!opts->tocSummary)
process_global_sql_commands(conn, dumpdirpath, opts->filename);
```


what should happen with
$BIN10/pg_restore --format=directory --globals-only --verbose dir10 --list

Should we error out saying "--globals-only" and "--list" are conflicting options?
if so then in main function we can do the following change:

```
if (globals_only)
{
process_global_sql_commands(conn, inputFileSpec, opts->filename);
if (conn)
PQfinish(conn);
pg_log_info("databases restoring is skipped as -g/--globals-only option is specified");
}
```


in restoreAllDatabases, if num_db_restore == 0, we will still call
process_global_sql_commands.
I am not sure this is what we expected.




Re: Eagerly scan all-visible pages to amortize aggressive vacuum

2025-02-03 Thread Andres Freund
Hi,

On 2025-01-29 14:12:52 -0500, Melanie Plageman wrote:
> From 71f32189aad510b73d221fb0478ffd916e5e5dde Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Mon, 27 Jan 2025 12:23:00 -0500
> Subject: [PATCH v14] Eagerly scan all-visible pages to amortize aggressive
>  vacuum
> 
> Amortize the cost of an aggressive vacuum by eagerly scanning some
> all-visible but not all-frozen pages during normal vacuums.

I think it'd be good to explain the problem this is trying to address a bit
more (i.e. the goal is to avoid the situation in which a lot of work is deferred
until an aggressive vacuum, which is then very expensive).


> Because the goal is to freeze these all-visible pages, all-visible pages
> that are eagerly scanned and set all-frozen in the visibility map are
> counted as successful eager freezes and those not frozen are considered
> failed eager freezes.

I don't really understand this sentence. "Because the goal is to freeze these
all-visible pages" doesn't really relate with the rest of the sentence.



> +  xreflabel="vacuum_max_eager_freeze_failure_rate">
> +  vacuum_max_eager_freeze_failure_rate 
> (floating point)
> +  
> +   vacuum_max_eager_freeze_failure_rate 
> configuration parameter
> +  
> +  
> +  
> +   
> +Specifies the maximum fraction of pages that
> +VACUUM may scan and fail to 
> set
> +all-frozen in the visibility map before disabling eager scanning. A
> +value of 0 disables eager scanning altogether. The
> +default is 0.03 (3%).
> +   

Fraction of what?


> +   
> +Note that when eager scanning is enabled, successful page freezes
> +do not count against this limit, although they are internally
> +capped at 20% of the all-visible but not all-frozen pages in the
> +relation. Capping successful page freezes helps amortize the
> +overhead across multiple normal vacuums.
> +   

What does it mean that they are not counted, but are capped?


> +   
> +If a table is building up a backlog of all-visible but not all-frozen
> +pages, a normal vacuum may choose to scan skippable pages in an effort to
> +freeze them. Doing so decreases the number of pages the next aggressive
> +vacuum must scan. These are referred to as eagerly
> +scanned pages. Eager scanning can be tuned to attempt
> +to freeze more all-visible pages by increasing
> +. Even if eager
> +scanning has kept the number of all-visible but not all-frozen pages to a
> +minimum, most tables still require periodic aggressive vacuuming.
> +   

Maybe mention that the aggressive vacuuming will often be cheaper than without
eager freezing, even if necessary?


> + * Normal vacuums count all-visible pages eagerly scanned as a success when
> + * they are able to set them all-frozen in the VM and as a failure when they
> + * are not able to set them all-frozen.

Maybe some more punctuation would make this more readable? Or a slight
rephrasing?


> + * Because we want to amortize the overhead of freezing pages over multiple
> + * vacuums, normal vacuums cap the number of successful eager freezes to
> + * MAX_EAGER_FREEZE_SUCCESS_RATE of the number of all-visible but not
> + * all-frozen pages at the beginning of the vacuum. Once the success cap has
> + * been hit, eager scanning is disabled for the remainder of the vacuum of 
> the
> + * relation.

It also caps the maximum "downside" of freezing eagerly, right? Seems worth
mentioning.



> + /*
> +  * Now calculate the eager scan start block. Start at a random spot
> +  * somewhere within the first eager scan region. This avoids eager
> +  * scanning and failing to freeze the exact same blocks each vacuum of 
> the
> +  * relation.
> +  */

If I understand correctly, we're not really choosing a spot inside the first
eager scan region, we determine the bounds of the first region?



> @@ -930,16 +1188,21 @@ lazy_scan_heap(LVRelState *vacrel)
>   vacrel->current_block = InvalidBlockNumber;
>   vacrel->next_unskippable_block = InvalidBlockNumber;
>   vacrel->next_unskippable_allvis = false;
> + vacrel->next_unskippable_eager_scanned = false;
>   vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>  
> - while (heap_vac_scan_next_block(vacrel, &blkno, 
> &all_visible_according_to_vm))
> + while (heap_vac_scan_next_block(vacrel, &blkno, 
> &all_visible_according_to_vm,
> + 
> &was_eager_scanned))

Pedantic^3: Is past tense really appropriate? We *will* be scanning that page
in the body of the loop, right?


> @@ -1064,7 +1327,45 @@ lazy_scan_heap(LVRelState *vacrel)
>   if (got_cleanup_lock)
>   lazy_scan_prune(vacrel, buf, blkno, page,
>   vmbuffer, 
> all_visible_according_to_vm,
> -

Re: Prevent COPY FREEZE on Foreign tables

2025-02-03 Thread Sami Imseih
> Unless there's some existing way to specify a FREEZE option in the FDW API
> (I assume there isn't), I agree with this.

I am not aware of any way to do this either as postgres_fdw is simply
using libpq.

> > I was also looking at why we block a parent from COPY FREEZE[1], but

> Commit 5c9a551 and its thread [0] appear to have some additional details
> about this.
>
> [0] 
> https://postgr.es/m/CAKJS1f9BbJ%2BFY3TbdCiap3qXHTFOiwtays9s36-oQkkM9_R5bg%40mail.gmail.com

Thanks for the thread. I will go through it.
I have not done much investigation here yet, but I can imagine it might
be worthwhile to look into the possibility of removing this limitation
on a partitioned table.

Regards,

Sami




Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Nathan Bossart
On Mon, Feb 03, 2025 at 03:12:57PM -0500, Tom Lane wrote:
> Greg Sabino Mullane  writes:
>> I toyed with that for a bit, but as you say, without generating a ton of
>> combinations to translate, we'd have to fall back on run-time
>> constructions. Neither is ideal. I also realized that I almost never type
>> "\dti". Very common for me are \d and \dt and \dv etc. but combinations are
>> something I never bother with. At that point, I just do a \d. I think given
>> how rare (granted, anecdotally) those combinations are, it's okay if we
>> expose people to the "r" word.
> 
> Fair.  This is already a step forward, so it doesn't have to be
> perfect.

+1.  I don't use combinations like \dti regularly, either.

-- 
nathan




Re: New GUC autovacuum_max_threshold ?

2025-02-03 Thread Nathan Bossart
I had the opportunity to bring this patch up for discussion at the
developer meeting at FOSDEM PGDay last week [0].  We discussed a subset
of the topics folks have already written about in this thread, and AFAICT
there was general approval among the attendees for proceeding with the
"hard cap" approach due to its user-friendliness.  Given that, I am
planning to commit the attached patch in the near future (although I may
fiddle with the commit message a bit more).

On Tue, Jan 14, 2025 at 11:35:17PM +0300, Alena Rybakina wrote:
> #autovacuum_vacuum_max_threshold = 1    # max number of row updates
>                     # before vacuum; -1 disables max
>                     # threshold
> 
> I think instead of "# threshold" should be "#vacuum"?

That would more closely match the description of
autovacuum_vacuum_insert_threshold, which refers to "insert vacuums," but I
felt it would be weird to refer to "max vacuums."  IMHO it is clearer to
say that -1 disables the maximum threshold here.

> There is a typo:
> 
> * if (threshold > vac_max_thresh)
> * threshold = vac_max_thres; - here

Fixed.

> I think you should add more information to the description of the
> Relations_needs_vacanalyze function: what is vac_max_thresh and how is it
> calculated. It is not clear what the below condition means.
> 
> /* -1 is used to disable max threshold */
> vac_max_thresh = (relopts && relopts->vacuum_max_threshold >= -1)
> ? relopts->vacuum_max_threshold
> : autovacuum_vac_max_thresh;

I looked at the commentary for this function and felt that the comments for
this new parameter are in line with the comments for all the adjacent
parameters.  There may be an opportunity to improve this commentary, but
IMHO that would be better handled in a separate patch that improved it for
all these parameters.

[0] https://2025.fosdempgday.org/devmeeting

-- 
nathan
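For readers following the commit message, the capping behavior it describes can be sketched as follows. This is a minimal illustration using the documented defaults (base threshold 50, scale factor 0.2, cap 100,000,000; -1 disables the cap), with hypothetical Python names, not the patch's actual C code:

```python
def autovacuum_vacuum_threshold(reltuples,
                                base_threshold=50,          # autovacuum_vacuum_threshold
                                scale_factor=0.2,           # autovacuum_vacuum_scale_factor
                                max_threshold=100_000_000):  # the new GUC
    # threshold = base + scale_factor * table size, capped by the new GUC.
    threshold = base_threshold + scale_factor * reltuples
    if max_threshold >= 0:  # -1 disables the cap
        threshold = min(threshold, max_threshold)
    return threshold

# A 10-billion-row table would otherwise need ~2 billion dead tuples
# before autovacuum triggers; the cap brings that down to 100 million.
assert autovacuum_vacuum_threshold(10_000_000_000) == 100_000_000
```

Small tables are unaffected, since their computed threshold stays well below the cap.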
>From 392b07ea4d1a14d045937894d29e6679199a513f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 3 Feb 2025 13:20:46 -0600
Subject: [PATCH v7 1/1] Introduce autovacuum_vacuum_max_threshold.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

One way we trigger an autovacuum on a table is to compare the
number of updated or deleted tuples with a value calculated using
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
The threshold provides the base value for comparison, and the scale
factor provides the fraction of the table size to add to that
value.  This strategy ensures that smaller tables are vacuumed
after fewer updates/deletes than larger tables, which is reasonable
in many cases but can result in infrequent vacuums of very large
tables.  This is undesirable for a couple of reasons, such as very
large tables acquiring a huge amount of bloat between vacuums.

This new parameter provides a way to set a cap on the value
calculated with autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor, which is intended to help trigger
more frequent vacuums on very large tables.  By default, it is set
to 100,000,000 tuples, but it can be disabled by setting it to -1.
It can also be adjusted for individual tables by changing storage
parameters.

Co-authored-by: Frédéric Yhuel 
Reviewed-by: Melanie Plageman 
Reviewed-by: Robert Haas 
Reviewed-by: Laurenz Albe 
Reviewed-by: Michael Banck 
Reviewed-by: Joe Conway 
Reviewed-by: Sami Imseih 
Reviewed-by: David Rowley 
Reviewed-by: wenhui qiu 
Reviewed-by: Robert Treat 
Reviewed-by: Alena Rybakina 
Discussion: 
https://postgr.es/m/956435f8-3b2f-47a6-8756-8c54ded61802%40dalibo.com
---
 doc/src/sgml/config.sgml  | 24 +++
 doc/src/sgml/maintenance.sgml |  6 +++--
 doc/src/sgml/ref/create_table.sgml| 15 
 src/backend/access/common/reloptions.c| 11 +
 src/backend/postmaster/autovacuum.c   | 12 ++
 src/backend/utils/misc/guc_tables.c   |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  3 +++
 src/bin/psql/tab-complete.in.c|  2 ++
 src/include/postmaster/autovacuum.h   |  1 +
 src/include/utils/rel.h   |  1 +
 10 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a782f109982..1e4bb613a98 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8580,6 +8580,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;

   
 
+  
+   autovacuum_vacuum_max_threshold 
(integer)
+   
+autovacuum_vacuum_max_threshold
+configuration parameter
+   
+   
+   
+
+ Specifies the maximum number of updated or deleted tuples needed to
+ trigger a VACUUM in any one table, i.e., a cap on
+ the value calculated with
+ autovacuum_vacuum_threshold and
+ autovacuum_vacuum_scale_factor.  The default is
+ 100,000,000 tuples.  

Re: why there is not VACUUM FULL CONCURRENTLY?

2025-02-03 Thread Nathan Bossart
On Thu, Jan 30, 2025 at 04:29:35PM +0100, Alvaro Herrera wrote:
> A bunch of people discussed this patch in today's developer meeting in
> Brussels.  There's pretty much a consensus on using the verb REPACK
> CONCURRENTLY for this new command -- where unadorned REPACK would be
> VACUUM FULL, and we'd have something like REPACK WITH INDEX or maybe
> REPACK USING INDEX to take the CLUSTER place.

+1

One small thing I thought of after the meeting was that this effectively
forces users to always specify an index if they want to REPACK WITH INDEX.
Today, CLUSTER will use the same index as before if one is not specified.
IMHO requiring users to specify the index is entirely reasonable, but I
figured I'd at least note the behavior change.

-- 
nathan




Re: improve DEBUG1 logging of parallel workers for CREATE INDEX?

2025-02-03 Thread Sami Imseih
At this point I am planning on withdrawing this patch
from the March commitfest. I don't think fixing the REINDEX
debug1 output makes a whole lot of sense. I still think more logging
for (CREATE|ALTER) (INDEX|TABLE) would be good to have, but there
needs to be more discussion about the best approach.

Regards,

Sami




Re: POC, WIP: OR-clause support for indexes

2025-02-03 Thread Alena Rybakina

Thank you for the updated version! I agree with your version of the code.

On 02.02.2025 21:00, Alexander Korotkov wrote:

On Fri, Jan 31, 2025 at 4:31 PM Alena Rybakina
  wrote:

I started reviewing the patch and saw an "ERROR" in the output of
the test; is it okay here?

SELECT * FROM tenk1 t1
WHERE t1.thousand = 42 OR t1.thousand = (SELECT t2.tenthous FROM tenk1 t2 WHERE 
t2.thousand = t1.tenthous);
ERROR: more than one row returned by a subquery used as an expression

The output is correct for this query.  But the query is very
unfortunate for the regression test.  I've revised the query in the v47
revision [1].

Links.
1.https://www.postgresql.org/message-id/CAPpHfdsBZmNt9qUoJBqsQFiVDX1%3DyCKpuVAt1YnR7JCpP%3Dk8%2BA%40mail.gmail.com
While analyzing the modified query plan from the regression test, I
noticed that despite using a full seqscan for table t2 in the original
plan, its results are cached by a Materialize node, and this can
significantly speed up the execution of the NestedLoop algorithm.

For example, after running the query several times, I got results
showing that the query execution time was twice as bad.


Original plan:

EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2
WHERE t1.a=t2.b OR t1.a=1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..70067.00 rows=2502499 width=24) (actual time=0.015..1123.247 rows=2003000 loops=1)
   Join Filter: ((t1.a = t2.b) OR (t1.a = 1))
   Rows Removed by Join Filter: 1997000
   Buffers: shared hit=22
   ->  Seq Scan on bitmap_split_or t1  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.006..0.372 rows=2000 loops=1)
         Buffers: shared hit=11
   ->  Materialize  (cost=0.00..41.00 rows=2000 width=12) (actual time=0.000..0.111 rows=2000 loops=2000)
         Storage: Memory  Maximum Storage: 110kB
         Buffers: shared hit=11
         ->  Seq Scan on bitmap_split_or t2  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.003..0.188 rows=2000 loops=1)
               Buffers: shared hit=11
 Planning Time: 0.118 ms
 Execution Time: 1204.874 ms
(13 rows)


Query plan after the patch:

EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2
WHERE t1.a=t2.b OR t1.a=1;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..56369.00 rows=2502499 width=24) (actual time=0.121..2126.606 rows=2003000 loops=1)
   Buffers: shared hit=16009 read=2
   ->  Seq Scan on bitmap_split_or t2  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.017..0.652 rows=2000 loops=1)
         Buffers: shared hit=11
   ->  Index Scan using t_a_b_idx on bitmap_split_or t1  (cost=0.28..18.15 rows=1002 width=12) (actual time=0.044..0.627 rows=1002 loops=2000)
         Index Cond: (a = ANY (ARRAY[t2.b, 1]))
         Buffers: shared hit=15998 read=2
 Planning Time: 0.282 ms
 Execution Time: 2344.367 ms
(9 rows)


I'm afraid that we may lose this with this optimization. Maybe this can
be taken into account somehow; what do you think?


--
Regards,
Alena Rybakina
Postgres Professional


Re: Truncate logs by max_log_size

2025-02-03 Thread Jim Jones
Hi Kirill

On 31.01.25 11:46, Kirill Gavrilov wrote:
> Sorry for the long silence.  I fixed the indentation and a trailing
> whitespace. Should look fine now.


The patch applies cleanly, the documentation is clear, and all tests pass.

It is possible to change this new parameter session-wise, which is nice!

postgres=# SET max_log_size TO 7;
SET
postgres=# SHOW max_log_size;
 max_log_size
--
 7B
(1 row)


The default value now is clear, and it corresponds to the value set in
postgresql.conf:

#max_log_size = 0   # max size of logged statement 

postgres=# SHOW max_log_size;
 max_log_size
--
 0
(1 row)


Logs are truncated as expected:

postgres=# SET max_log_size TO 6;
SET
postgres=# SELECT length('CALL xyz;');
 length

  9
(1 row)

postgres=# CALL xyz;
ERROR:  syntax error at or near ";"
LINE 1: CALL xyz;


log entry:

2025-02-03 10:58:19.975 CET [123945] ERROR:  syntax error at or near ";"
at character 9
2025-02-03 10:58:19.975 CET [123945] STATEMENT:  CALL x


The issue with log entry sizes for queries containing special characters
was resolved by setting the unit to bytes.
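A minimal Python sketch of the byte-based truncation behavior demonstrated above (an illustration only, not the patch's actual implementation) shows both the truncation and why the byte unit matters for multibyte input:

```python
def truncate_statement(stmt: str, max_log_size: int) -> str:
    """Truncate a logged statement to max_log_size bytes (0 disables)."""
    if max_log_size == 0:
        return stmt
    raw = stmt.encode("utf-8")[:max_log_size]
    # Drop any partial multibyte sequence left at the cut point.
    return raw.decode("utf-8", errors="ignore")

# Matches the log entry above: "CALL xyz;" truncated at 6 bytes -> "CALL x".
assert truncate_statement("CALL xyz;", 6) == "CALL x"
```

With a special character at the boundary, the cut falls inside a multibyte sequence, and the partial bytes are dropped rather than producing invalid output.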

Overall, everything LGTM.

The new status of this patch is: Ready for Committer

Jim





Re: Show WAL write and fsync stats in pg_stat_io

2025-02-03 Thread Bertrand Drouvot
On Mon, Feb 03, 2025 at 08:50:15AM +, Bertrand Drouvot wrote:
> === 6
> 
> I'll also do some benchmark on my side.

So, I did some tests using:

c=1 && psql -c checkpoint -c 'select pg_switch_wal()' &&
pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT 
pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 2

With two message sizes: 8192 and 10.

Here are the results (outliers removed and tsc clock source):

++-+-+
| Version| Msg Size 10 | Msg Size 8K |
++-+-+
| With PATCH |980 TPS  |910 TPS  |
| On Master  |980 TPS  |910 TPS  |
++-+-+

So the patch does not produce perf regression according to those tests.

Out of curiosity I also played a bit with the IO tracking (and hpet clock 
source)
and got:

+-------------------------+---------+---------+----------+
| Test Configuration      |  PATCH  | MASTER  | % Change |
+-------------------------+---------+---------+----------+
| track_io_timing         |         |         |          |
|   Message size 8192     | 805 TPS | 810 TPS |   -0.6%  |
|   Message size 10       | 860 TPS | 860 TPS |    0.0%  |
+-------------------------+---------+---------+----------+
| track_wal_io_timing     |         |         |          |
|   Message size 8192     | 810 TPS | 810 TPS |    0.0%  |
|   Message size 10       | 860 TPS | 860 TPS |    0.0%  |
+-------------------------+---------+---------+----------+
| track_wal_io + track_io |         |         |          |
|   Message size 8192     | 800 TPS | 800 TPS |    0.0%  |
|   Message size 10       | 855 TPS | 860 TPS |   -0.6%  |
+-------------------------+---------+---------+----------+

Based on those results, the patch does not show a noticeable impact when IO
timing tracking is enabled.

FYI, it's also worth noting that if hpet is set, performance is affected
negatively even when no timing tracking is enabled. It means that when IO
timing tracking is enabled, the perf regressions seen above are not fully
related to having it enabled but also (for a large part) to hpet vs tsc.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Increased work_mem for "logical replication tablesync worker" only?

2025-02-03 Thread Dmitry Koterov
Hi.

1. Those are temp files on the destination node (where the logical
subscription exists and tablesync worker runs), not on the source. On the
source, it’s all clear.

2. No “spill” suffix/substring in the file names. I tried to look at the
content of these temp files, and I saw some text fragments from the original
table’s text column there. I.e. it looks like, for some reason, the stream
received from the source node’s COPY command goes to those temp files (at
least partially).

3. I made several more experiments, increasing work_mem to several GB (for
the role which tablesync worker uses when copying) definitely helps with
temp files.

Thanks!

On Sun, Feb 2, 2025 at 19:10 Amit Kapila  wrote:

> On Sun, Feb 2, 2025 at 5:13 PM Dmitry Koterov 
> wrote:
> >
> > Trying to monitor perf during the initial tablesync phase (COPY) right
> after CREATE SUBSCRIPTION. I noticed that the size of
> 17/main/base/pgsql_tmp on the destination node grows (tens of gigabytes) as
> the COPY command (running internally on the publisher) progresses. Then in
> the end (when its "EXPLAIN SELECT 1 FROM tbl" on the destination shows the
> approximate number of rows equals to the number of rows on the source node)
> it hangs for several minutes, and then 17/main/base/pgsql_tmp empties, and
> the subscription progresses.
> >
> > It seems like if I increase work_mem to several GB, then the growth of
> 17/main/base/pgsql_tmp becomes less significant.
> >
> > Questions:
> >
> > 1. Are there some diagnostics commands that would allow me to figure out
> what is in those tmp files? Why does the subscriber create those tmp files
> and not just write directly to the data files and WAL? (The table has 2
> bytea columns, i.e. it's TOASTed for sure.)
> >
>
> We do write spill files (ending with '.spill') if the changes are
> large. Can you please share the name of tmp files to avoid any
> assumptions?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Optimize scram_SaltedPassword performance

2025-02-03 Thread Michael Paquier
On Mon, Feb 03, 2025 at 03:11:48PM +0800, Zixuan Fu wrote:
> With this change, the performance improves by approximately **4x** (reducing
> execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation
> does not support context reuse, and modifying it would require substantial
> changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that
> case, maintaining the existing logic.

I suspect that keeping an interface like pg_hmac_reuse() is going to
encourage incorrect practices in the long term, for a behavior that's
hidden in OpenSSL itself.  In short, while I agree that the
performance could be better based on the information you are providing
(did not test nor check OpenSSL across the branches we support), your
proposal to address the problem makes me wonder that we'd better limit
this change localized into scram_SaltedPassword() rather than increase
the footprint of the APIs in the two HMAC-specific files of
src/common/.
--
Michael




Re: Non-text mode for pg_dumpall

2025-02-03 Thread Srinath Reddy
here's the whole version of the delta patch

diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index 42c4fe3ce2..90e6b71a50 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -84,7 +84,7 @@ static int restoreAllDatabases(PGconn *conn, const char
*dumpdirpath,
SimpleStringList db_exclude_patterns, RestoreOptions *opts, int
numWorkers);
 static void execute_global_sql_commands(PGconn *conn, const char
*dumpdirpath,
  const char *outfile);
-static void copy_global_file_to_out_file(const char *outfile, FILE *pfile);
+static void copy_global_file(const char *outfile, FILE *pfile);
 static int filter_dbnames_for_restore(PGconn *conn,
   SimpleDatabaseOidList *dbname_oid_list,
   SimpleStringList db_exclude_patterns);
@@ -1178,7 +1178,7 @@ execute_global_sql_commands(PGconn *conn, const char
*dumpdirpath, const char *o
  */
  if (outfile)
  {
- copy_global_file_to_out_file(outfile, pfile);
+ copy_global_file(outfile, pfile);
  return;
  }

@@ -1207,24 +1207,35 @@ execute_global_sql_commands(PGconn *conn, const
char *dumpdirpath, const char *o
 }

 /*
- * copy_global_file_to_out_file
+ * copy_global_file
  *
- * This will copy global.dat file into out file.
+ * This will copy global.dat file into out file, if file is given
+ * else copies to stdout.
+ *
  */
 static void
-copy_global_file_to_out_file(const char *outfile, FILE *pfile)
+copy_global_file(const char *outfile, FILE *pfile)
 {
  char out_file_path[MAXPGPATH];
  FILE *ofile;
  int c;

- snprintf(out_file_path, MAXPGPATH, "%s", outfile);
- ofile = fopen(out_file_path, PG_BINARY_W);
+ if (strcmp(outfile, "-") == 0)
+ {
+ int fn = fileno(stdout);
+ ofile = fdopen(dup(fn), PG_BINARY_W);
+ }
+ else
+ {
+ snprintf(out_file_path, MAXPGPATH, "%s", outfile);
+ ofile = fopen(out_file_path, PG_BINARY_W);
+ }
+

  if (ofile == NULL)
  {
  fclose(pfile);
- pg_fatal("could not open file: \"%s\"", out_file_path);
+ pg_fatal("could not open file: \"%s\"", outfile);
  }

  /* Now append global.dat into out file. */


> Regards,
> Srinath Reddy Sadipiralla,
> EDB: https://www.enterprisedb.com 
>
>>


Re: Optimize scram_SaltedPassword performance

2025-02-03 Thread Fu Zixuan
I initially intended to restrict the modification to the
scram_SaltedPassword(), but I'm unsure how to determine
whether the current implementation is OpenSSL.

Alternatively, could we modify pg_hmac_init() to allow `key=NULL`
in the native HMAC implementation, achieving semantics similar to
OpenSSL? However, without carefully designing the logic for reusing
the context, the performance difference would be minimal. In my tests,
I skipped the initialization of `i_pad` and `o_pad` when `key=NULL`,
but the execution time remained around 10ms. That said, we can avoid
using the pg_hmac_reuse(), though we may need to add additional
explanations in the comments of pg_hmac_init().
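For readers following along, the context-reuse idea under discussion can be sketched in Python, where `hmac`'s `copy()` of a once-keyed context plays the role of reusing the precomputed i_pad/o_pad instead of re-keying on every iteration. This is a sketch of the SaltedPassword algorithm (PBKDF2 with a single block), not the patch itself:

```python
import hashlib
import hmac

def scram_salted_password(password: bytes, salt: bytes, iterations: int) -> bytes:
    # Key the HMAC context once; copy() then reuses the precomputed
    # inner/outer pads instead of re-keying on every iteration -- the
    # same idea as reusing an OpenSSL HMAC context.
    keyed = hmac.new(password, digestmod=hashlib.sha256)

    ctx = keyed.copy()
    ctx.update(salt + b"\x00\x00\x00\x01")  # U1 = HMAC(password, salt || INT(1))
    u = ctx.digest()
    result = bytearray(u)
    for _ in range(iterations - 1):
        ctx = keyed.copy()
        ctx.update(u)                       # Ui = HMAC(password, U(i-1))
        u = ctx.digest()
        for j, b in enumerate(u):
            result[j] ^= b
    return bytes(result)
```

Since SaltedPassword is PBKDF2 with dkLen equal to the hash length, the result can be checked against `hashlib.pbkdf2_hmac("sha256", password, salt, iterations)`.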

--

regards,
Zixuan

On Mon, Feb 3, 2025 at 4:16 PM Michael Paquier  wrote:
>
> On Mon, Feb 03, 2025 at 03:11:48PM +0800, Zixuan Fu wrote:
> > With this change, the performance improves by approximately **4x** (reducing
> > execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation
> > does not support context reuse, and modifying it would require substantial
> > changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that
> > case, maintaining the existing logic.
>
> I suspect that keeping an interface like pg_hmac_reuse() is going to
> encourage incorrect practices in the long term, for a behavior that's
> hidden in OpenSSL itself.  In short, while I agree that the
> performance could be better based on the information you are providing
> (did not test nor check OpenSSL across the branches we support), your
> proposal to address the problem makes me wonder that we'd better limit
> this change localized into scram_SaltedPassword() rather than increase
> the footprint of the APIs in the two HMAC-specific files of
> src/common/.
> --
> Michael




Re: Non-text mode for pg_dumpall

2025-02-03 Thread jian he
hi.

git clean -fdx && $BIN10/pg_dumpall --format=directory --file=dir10
$BIN10/pg_restore --format=directory --file=1.sql --verbose dir10 >
dir_format 2>&1

There is no "\connect dbname" command.
Piping 1.sql to psql will execute all the database dumps into a single
database, which is not good.
We need "\connect dbname" in file 1.sql.


<<<>>>--
$BIN10/pg_dumpall --format=directory --exclude-database=src10 --file=dir12_temp
drop table t from database x
$BIN10/pg_restore --format=directory --dbname=x --verbose dir12_temp >
dir_format 2>&1
log info--
pg_restore: found database "template1" (OID: 1) in map.dat file while restoring.
pg_restore: found database "x" (OID: 19554) in map.dat file while restoring.
pg_restore: found total 2 database names in map.dat file
pg_restore: needs to restore 2 databases out of 2 databases
pg_restore: restoring dump of pg_dumpall without -C option, there
might be multiple databases in directory.
pg_restore: restoring database "template1"
pg_restore: connecting to database for restore
pg_restore: implied data-only restore
pg_restore: restoring database "x"
pg_restore: connecting to database for restore
pg_restore: processing data for table "public.t"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 3374; 0 19555 TABLE DATA t jian
pg_restore: error: could not execute query: ERROR:  relation
"public.t" does not exist
Command was: COPY public.t (a) FROM stdin;
pg_restore: warning: errors ignored on restore: 1
pg_restore: number of restored databases are 2

$BIN10/pg_restore --format=directory --list dir12_temp
selected output:

; Selected TOC Entries:
;
217; 1259 19555 TABLE public t jian
3374; 0 19555 TABLE DATA public t jian
3228; 2606 19560 CONSTRAINT public t t_pkey jian

As you can see, dir12_temp has TABLE and TABLE DATA entries,
so the above log message, "pg_restore: implied data-only restore", is
not what we expected.

BTW, with the --create option added, it works as I expected,
like:
$BIN10/pg_restore --format=directory --create --dbname=x --verbose
dir12_temp > dir_format 2>&1
The output is what I expected.

<<<>>>--
With the changes in filter_dbnames_for_restore,
--exclude-database=pattern
will behave differently depending on whether you specify the --file option.

* --file option specified:
--exclude-database=pattern does not allow any special wildcard characters;
it does not behave the same as the doc mentioned.
* --file option not specified: it behaves the same as the doc mentioned.

That's kind of tricky; either add more words in the doc explaining the
scenario where the --file option is specified,
or disallow the --file option when --exclude-database is specified.


Do we need to update pg_restore.sgml about MAX_ON_EXIT_NICELY 100?


There are some corner cases like num_db_restore == 0 or num_db_restore >= 100.
In those scenarios, execute_global_sql_commands has already executed,
which is not ideal, since you hit pg_fatal after some sql commands
have already executed.
Maybe we can call execute_global_sql_commands and restoreAllDatabases
only if 0 < num_db_restore < 100.


The attached patch tries to do that.
It also makes some cosmetic changes.


v14-0001-pg_restore-dump-global-objects-at-least-one-d.no-cfbot
Description: Binary data


Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-03 Thread Andrey Borodin



> On 3 Feb 2025, at 02:56, Tom Lane  wrote:
> 
> I decided to see what would happen if we tried to avoid the code
> duplication in pl_funcs.c by making some "walker" infrastructure
> akin to expression_tree_walker.  While that doesn't seem useful
> for the dump_xxx functions, it works very nicely for the free_xxx
> functions and now for the mark_xxx ones as well.  pl_funcs.c
> nets out about 400 lines shorter than in the v4 patch.  The
> code coverage score for the file is still awful :-(, but that's
> because we're not testing the dump_xxx functions at all.
> 
> PFA v5.  The new 0001 patch refactors the free_xxx infrastructure
> to create plpgsql_statement_tree_walker(), and then in what's now
> 0003 we can use that instead of writing a lot of duplicate code.


Pre-preliminary refactoring looks good to me, as the rest of the patch set.

(Well, maybe paramarg2 resonates a bit, just from similarity with varchar2)

ecpg tests seem to fail on Windows [0], but it looks like that's not related
to this thread.


Best regards, Andrey Borodin.

[0] https://cirrus-ci.com/task/4835794898124800



Re: Unsafe access BufferDescriptors array in BufferGetLSNAtomic()

2025-02-03 Thread Tender Wang
Xuneng Zhou  wrote on Wed, Jan 8, 2025 at 13:35:

> Hi Tender,
>
> I’ve looked through the patch, and I believe there is a potential issue.
> The default size for BufferDescriptors appears to be 16,384. Passing and
> casting a negative buffer ID to a large unsigned integer in
> GetBufferDescriptor, and then using it as an array subscript, could
> potentially lead to an overflow.
>
> void
> BufferManagerShmemInit(void)
> {
> boolfoundBufs,
> foundDescs,
> foundIOCV,
> foundBufCkpt;
>
> /* Align descriptors to a cacheline boundary. */
> BufferDescriptors = (BufferDescPadded *)
> ShmemInitStruct("Buffer Descriptors",
> NBuffers *
> sizeof(BufferDescPadded),
> &foundDescs);
>
> int NBuffers = 16384;
>
> The changes proposed in the patch seem reasonable to me, but it might be
> helpful to include an explanation of the error case and how it’s handled.
>

Thanks for reviewing.
BufferGetLSNAtomic() with this patch doesn't look complex. I don't think
more explanation is needed here.


> Best regards,
> [Xuneng]
>
> The new status of this patch is: Waiting on Author
>

I changed the status to Ready for Committer
-- 
Thanks,
Tender Wang
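As a side note for readers, the hazard Xuneng describes, where a negative buffer ID is converted to a huge unsigned value and used as an array subscript in C, can be illustrated with a minimal sketch (the uint32 width and -1 value are illustrative assumptions):

```python
NBUFFERS = 16384  # default size of the BufferDescriptors array

def to_uint32(x: int) -> int:
    # Mimics C's conversion of a signed int to an unsigned 32-bit integer.
    return x & 0xFFFFFFFF

invalid_buf_id = -1  # e.g. a buffer number of InvalidBuffer, minus one
subscript = to_uint32(invalid_buf_id)
assert subscript == 4294967295   # far beyond NBUFFERS - 1
assert subscript >= NBUFFERS     # would be an out-of-bounds access in C
```

In C the wraparound is silent, so validating the buffer ID before indexing, as the patch does, is the safe approach.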


Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-03 Thread Nisha Moond
On Mon, Feb 3, 2025 at 12:04 PM Amit Kapila  wrote:
>
> On Mon, Feb 3, 2025 at 6:16 AM Peter Smith  wrote:
> >
> >
> > 2.
> > + appendStringInfo(&err_hint, "You might need to increase \"%s\".",
> > + "max_slot_wal_keep_size");
> >   break;
> > 2a.
> > In this case, shouldn't you really be using macro _("You might need to
> > increase \"%s\".") so that the common format string would be got using
> > gettext()?
> >
> > ~
> >
> >
> > ~~~
> >
> > 3.
> > + appendStringInfo(&err_hint, "You might need to increase \"%s\".",
> > + "idle_replication_slot_timeout");
> > + break;
> >
> > 3a.
> > Ditto above. IMO this common format string should be got using macro.
> > e.g.: _("You might need to increase \"%s\".")
> >
> > ~
>
> Instead, we can directly use '_(' in errhint as we are doing in one
> other similar place: errhint("%s", _(view_updatable_error)). I
> think we didn't use it for errdetail because, in one of the cases, it
> needs to use ngettext
>

Please find the v67 patches:
 - The patches have been rebased after separating the inactive_since
related changes.
 - Patches 001 and 002 incorporate the above comments and the comments
from [1] and [2].
 - No change in patch-003 since the last version.


[1] 
https://www.postgresql.org/message-id/CALDaNm0FS%2BFqQk2dadiJFCMM_MhKROMsJUb%3Db8wtRH6isScQsQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAHut%2BPs_6%2BNBOt%2BKpQQaBG2R3T-FLS93TbUC27uzyDMu%3D37n-Q%40mail.gmail.com

--
Thanks,
Nisha
From 26c4c85a709221d59e9911b693c910f489672ade Mon Sep 17 00:00:00 2001
From: Nisha Moond 
Date: Mon, 3 Feb 2025 15:20:40 +0530
Subject: [PATCH v67 1/3] Introduce inactive_timeout based replication slot
 invalidation

Till now, postgres has the ability to invalidate inactive
replication slots based on the amount of WAL (set via
max_slot_wal_keep_size GUC) that will be needed for the slots in
case they become active. However, choosing a default value for
this GUC is a bit tricky. Because the amount of WAL a database
generates, and the allocated storage per instance will vary
greatly in production, making it difficult to pin down a
one-size-fits-all value.

It is often easy for users to set a timeout of say 1 or 2 or n
days, after which all the inactive slots get invalidated. This
commit introduces a GUC named idle_replication_slot_timeout.
When set, postgres invalidates slots (during non-shutdown
checkpoints) that are idle for longer than this amount of
time.

Note that the idle timeout invalidation mechanism is not applicable
for slots that do not reserve WAL or for slots on the standby server
that are being synced from the primary server (i.e., standby slots
having 'synced' field 'true'). Synced slots are always considered to be
inactive because they don't perform logical decoding to produce changes.
---
 doc/src/sgml/config.sgml  |  40 
 doc/src/sgml/logical-replication.sgml |   5 +
 doc/src/sgml/system-views.sgml|   7 +
 src/backend/replication/slot.c| 171 --
 src/backend/utils/adt/timestamp.c |  18 ++
 src/backend/utils/misc/guc_tables.c   |  14 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/pg_basebackup/pg_createsubscriber.c   |   4 +
 src/bin/pg_upgrade/server.c   |   7 +
 src/include/replication/slot.h|   3 +
 src/include/utils/guc_hooks.h |   2 +
 src/include/utils/timestamp.h |   3 +
 12 files changed, 260 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a782f10998..a065fbbaab 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4423,6 +4423,46 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows

   
 
+ 
+  idle_replication_slot_timeout (integer)
+  
+   idle_replication_slot_timeout configuration parameter
+  
+  
+  
+   
+Invalidate replication slots that have remained idle longer than this
+duration. If this value is specified without units, it is taken as
+minutes. A value of zero disables the idle timeout invalidation
+mechanism. The default is one day. This parameter can only be set in
+the postgresql.conf file or on the server command
+line.
+   
+
+   
+Slot invalidation due to idle timeout occurs during checkpoint.
+Because checkpoints happen at checkpoint_timeout
+intervals, there can be some lag between when the
+idle_replication_slot_timeout was exceeded and when
+the slot invalidation is triggered at the next checkpoint.
+To avoid such lags, users can force a checkpoint to promptly invalidate
+inactive slots. The duration of slot inactivity is calculated using the
+slot's pg_replication_slots.inactive_since
+value.
+   
+
+   
+Note that the idle timeout invalidation mechanism is no

Re: Skip collecting decoded changes of already-aborted transactions

2025-02-03 Thread Masahiko Sawada
On Wed, Jan 29, 2025 at 11:12 PM Peter Smith  wrote:
>
> On Tue, Jan 28, 2025 at 9:26 PM Masahiko Sawada  wrote:
> >
> > On Mon, Jan 27, 2025 at 7:01 PM Peter Smith  wrote:
> > >
> ...
>
> > > To be honest, I didn't understand the "CLEAR" part of that name. It
> > > seems more like it should've been called something like
> > > RBTXN_IS_SERIALIZED_ALREADY or RBTXN_IS_SERIALIZED_PREVIOUSLY or
> > > whatever  instead of something that appears to be saying "has the
> > > RBTXN_IS_SERIALIZED bitflag been cleared?" I understand the reluctance
> > > to over-comment everything but OTOH currently there is no way really
> > > to understand what these flags mean without looking through all the
> > > code to try to figure them out from the usage.
> > >
> > > My recurring gripe about these flags is simply that their meanings and
> > > how to use them should be apparent just by looking at reorderbuffer.h
> > > and not having to guess anything or look at how they get used in the
> > > code. It doesn't matter if that is achieved by better constant names,
> > > by more comments or by enhanced macros/functions with asserts but
> > > currently just looking at that file still leaves the reader with lots
> > > of unanswered questions.
> >
> > I see your point. IIUC we have the comments about what the checks with
> > the flags means but not have the description about the relationship
> > among the flags. I think we can start a new thread for clarifying
> > these flags and their usage. We can also discuss renaming
> > RBTXN_IS_SERIALIZED[_CLEARE] there too.
> >
>
> OK.
>
> ==
>
> Some comments for patch v17-0002.

Thank you for reviewing the patch.

>
> ==
> .../replication/logical/reorderbuffer.c
>
> ReorderBufferSkipPrepare:
>
> 1.
> + /* txn must have been marked as a prepared transaction */
> + Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
> +
>   txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
>
> Should this also be asserting that the _SENT_PREPARE flag is false,
> because we cannot be skipping it if we already sent the prepare.
>
> ~~~
>
> ReorderBufferFinishPrepared:
>
> 2.
>
> - txn->txn_flags |= RBTXN_PREPARE;
> -
>   /*
> - * The prepare info must have been updated in txn even if we skip
> - * prepare.
> + * txn must have been marked as a prepared transaction and skipped but
> + * not sent a prepare. Also, the prepare info must have been updated
> + * in txn even if we skip prepare.
>   */
> + Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)) != 0);
> + Assert((txn->txn_flags & RBTXN_SENT_PREPARE) == 0);
>   Assert(txn->final_lsn != InvalidXLogRecPtr);
>
> 2a.
> If it must have been prepared *and* skipped (as the comment says) then
> the first assert should be written as:
> Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE))
> == (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));
>
> or easier to just have 2 asserts:
> Assert(txn->txn_flags & RBTXN_IS_PREPARED);
> Assert(txn->txn_flags & RBTXN_SKIPPED_PREPARE);
>

Agreed with all the above comments. Since checking the
prepared-transaction-related flags is getting complicated, I've
introduced RBTXN_PREPARE_STATUS_FLAGS so that we can check the desired
prepared-transaction status easily.
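
To illustrate the idea of a combined status mask (the flag values below are illustrative only, not PostgreSQL's actual bit assignments), a sketch in Python:

```python
from enum import IntFlag

class TxnFlags(IntFlag):
    # Illustrative bit values only -- not PostgreSQL's actual assignments.
    IS_PREPARED     = 0x01
    SKIPPED_PREPARE = 0x02
    SENT_PREPARE    = 0x04

# Analogous to the proposed RBTXN_PREPARE_STATUS_FLAGS combined mask.
PREPARE_STATUS_FLAGS = (TxnFlags.IS_PREPARED
                        | TxnFlags.SKIPPED_PREPARE
                        | TxnFlags.SENT_PREPARE)

def prepare_status(txn_flags: TxnFlags) -> TxnFlags:
    """Extract just the prepare-related bits, mirroring
    (txn->txn_flags & RBTXN_PREPARE_STATUS_FLAGS) in C."""
    return txn_flags & PREPARE_STATUS_FLAGS

# A prepared transaction whose prepare was skipped but never sent:
flags = TxnFlags.IS_PREPARED | TxnFlags.SKIPPED_PREPARE
assert prepare_status(flags) == TxnFlags.IS_PREPARED | TxnFlags.SKIPPED_PREPARE
assert not (flags & TxnFlags.SENT_PREPARE)
```

With a single mask, asserts like "prepared and skipped but not sent" become one comparison against the masked value instead of several separate bit tests.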

> ~
>
> 2b.
> later in the same function there is code:
>
> if (is_commit)
>   rb->commit_prepared(rb, txn, commit_lsn);
> else
>   rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
>
> So it is OK to do a commit_prepared/rollback_prepared even though no
> prepare() has been sent?

IIUC ReorderBufferReplay() is responsible for sending a prepare
message in this case. See the comment around there:

/*
 * By this time the txn has the prepare record information and it is
 * important to use that so that downstream gets the accurate
 * information. If instead, we have passed commit information here
 * then downstream can behave as it has already replayed commit
 * prepared after the restart.
 */

I've attached the updated patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v18-0001-Skip-logical-decoding-of-already-aborted-transac.patch
Description: Binary data


v18-0002-Rename-RBTXN_PREPARE-to-RBTXN_IS_PREPARE-for-bet.patch
Description: Binary data


Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-03 Thread Tom Lane
I wrote:
> Admittedly this is all moot unless some other extension starts
> using EEOP_PARAM_CALLBACK, and I didn't find any evidence of that
> using Debian Code Search.  But I don't want to think of
> EEOP_PARAM_CALLBACK as being specifically tied to PL/pgSQL.

However ... given that I failed to find any outside users today,
I'm warming to the idea of "void *paramarg[2]".  That does look
less random than two separate fields.  There are probably not
any extensions that would need to change their code, and even
if there are, we impose bigger API breaks than this one in
every major release.

So I'm willing to do that if it satisfies your concern.

regards, tom lane




Re: Skip collecting decoded changes of already-aborted transactions

2025-02-03 Thread Masahiko Sawada
On Thu, Jan 30, 2025 at 7:07 PM Peter Smith  wrote:
>
> On Fri, Jan 31, 2025 at 11:04 AM Masahiko Sawada  
> wrote:
> >
> > On Wed, Jan 29, 2025 at 9:32 PM Peter Smith  wrote:
> > >
> > > ==
> > > .../replication/logical/reorderbuffer.c
> > >
> > > ReorderBufferCheckAndTruncateAbortedTXN:
> > >
> > > 2.
> > > It seemed tricky that the only place that is setting the
> > > RBTXN_IS_COMMITTED flag is the function
> > > ReorderBufferCheckAndTruncateAbortedTXN because neither the function
> > > name nor the function comment gives any indication that it should be
> > > having this side effect
> >
> > Hmm, it doesn't seem so tricky to me that a function with the name
> > ReorderBufferCheckAndTruncateAbortedTXN() checks the transaction
> > status to truncate an aborted transaction and caches the transaction
> > status as a side effect.
> >
>
> I was coming at this from a different perspective, asking myself the
> question "When can I know the RBTXN_IS_COMMITTED bit setting?" -- aka
> rbtxn_is_committed()?
>
> AFAICT it turns out we can only have confidence in that result when
> know ReorderBufferCheckAndTruncateAbortedTXN was called already for
> this tx. But this happens only when ReorderBufferCheckMemoryLimit()
> gets called. So, these bitflags are getting set as a side-effect of
> calling unrelated functions. (e.g. the fact we can't test if a tx was
> aborted/committed unless ReorderBufferCheckMemoryLimit is called
> seemed unusual to me).

I'm not sure if ReorderBufferCheckMemoryLimit() is an unrelated
function because the whole idea (also mentioned in the commit message)
is that we check the transaction status only for large transactions to
avoid CLOG lookup overheads. TBH I'm not sure why readers expect these
transaction status flags to always be set. Also in the function
comment we have:

 * the transaction is aborted. The transaction status is cached in
 * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
 * next call.

which describes the side-effect of the function that it caches the
transaction status.

> I don't know what the solution is; maybe some
> more comments would be enough.

I'm not sure how we can improve the comment TBH.

>
> > >
> > > ~~~
> > >
> > > ReorderBufferProcessTXN:
> > >
> > > 3.
> > >   if (rbtxn_prepared(txn))
> > > + {
> > >   rb->prepare(rb, txn, commit_lsn);
> > > + txn->txn_flags |= RBTXN_SENT_PREPARE;
> > > + }
> > >
> > > In ReorderBufferStreamCommit there is an assertion that we are not
> > > trying to do another prepare() if the _SENT_PREPARE flag is already
> > > set. Should this code have a similar assert?
> >
> > We can have a similar assert there but why do you think it's needed there?
>
> No particular reason, other than for consistency to have similar
> assertions everywhere that the RBTXN_SENT_PREPARE flag is set.

Okay, addressed in the v18 patch I've just sent[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoDmYZtLnPLuiERT6Cibv1Gf1DwDjzBevtqKYn0ZzMQqBQ%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Optimize scram_SaltedPassword performance

2025-02-03 Thread Daniel Gustafsson
> On 3 Feb 2025, at 08:45, Yura Sokolov  wrote:
> 
> 03.02.2025 10:11, Zixuan Fu пишет:
>> Hi Hackers,  
>> 
>> While profiling a program with `perf`, I noticed that `scram_SaltedPassword`
>> consumed more CPU time than expected. After some investigation, I found
>> that the function performs many HMAC iterations (4096 rounds for
>> SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing
>> excessive overhead.

While I don't disagree with speeding up this in general, the whole point of the
SCRAM iterations is to take a lot of time as a way to combat brute forcing.
Any attacker is likely to use a patched libpq (or not use libpq at all) so
clientside it doesn't matter as much but if we make it 4x faster serverside we
don't really achieve much other than making attacks more feasible.

The relevant portion from RFC 7677 §4 is:

The strength of this mechanism is dependent in part on the hash
iteration-count, as denoted by "i" in [RFC5802].  As a rule of thumb,
the hash iteration-count should be such that a modern machine will take
0.1 seconds to perform the complete algorithm; however, this is
unlikely to be practical on mobile devices and other relatively low-
performance systems.  At the time this was written, the rule of thumb
gives around 15,000 iterations required; however, a hash iteration-
count of 4096 takes around 0.5 seconds on current mobile handsets.
This computational cost can be avoided by caching the ClientKey
(assuming the Salt and hash iteration-count is stable).  Therefore, the
recommendation of this specification is that the hash iteration-count
SHOULD be at least 4096, but careful consideration ought to be given to
using a significantly higher value, particularly where mobile use is
less important.

The numbers are quite outdated but the gist of it holds.  If we speed it up
serverside we need to counter that with a higher iteration count, and for a
rogue client it's unlikely to matter since it won't use our code anyway.
Speeding up HMAC for other usecases is a different story (but also likely to
have less measurable performance impact).
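
For context, the iterated computation under discussion is RFC 5802's Hi(), which is equivalent to PBKDF2 with HMAC-SHA-256 producing a single output block. A minimal Python sketch of that loop (illustrative only, not the libpq implementation):

```python
import hashlib
import hmac

def scram_hi(password: bytes, salt: bytes, iterations: int) -> bytes:
    """RFC 5802 Hi(): U1 = HMAC(pw, salt || INT(1)); Uk = HMAC(pw, Uk-1);
    result = U1 xor U2 xor ... xor Ui.  Each round keys a fresh HMAC,
    which is the per-iteration cost the thread is discussing."""
    u = hmac.new(password, salt + b"\x00\x00\x00\x01", hashlib.sha256).digest()
    result = bytearray(u)
    for _ in range(iterations - 1):
        u = hmac.new(password, u, hashlib.sha256).digest()
        result = bytearray(a ^ b for a, b in zip(result, u))
    return bytes(result)

# Hi() is by construction the same as single-block PBKDF2-HMAC-SHA-256:
expected = hashlib.pbkdf2_hmac("sha256", b"secret", b"salt", 4096)
assert scram_hi(b"secret", b"salt", 4096) == expected
```

The deliberate cost is exactly this loop of 4096 (or more) keyed HMAC rounds, which is why speeding it up cuts both ways.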

>> OpenSSL has an optimization for this case: when the key remains the
>> same, the HMAC context can be reused with a lightweight state reset by
>> passing NULL as the key. To take advantage of this, I introduced
>> `pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used.

> Where `prev_key`, `prev_len` and `prev_hash` are static variables, filled
> in `pg_hmac_init`.

Storing any part of a cryptographic calculation, let alone a key, in a static
variable doesn't seem entirely like a best practice, and it also won't be
thread-safe.
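
The keyed-state-reuse technique itself can be sketched in Python, whose hmac objects expose the same idea via copy(): clone the already-keyed state instead of re-deriving it each round (a sketch of the concept, not of the proposed C patch):

```python
import hashlib
import hmac

# Key the HMAC state once...
base = hmac.new(b"key", digestmod=hashlib.sha256)

def fast_hmac(msg: bytes) -> bytes:
    # Cheap state reset, analogous to OpenSSL re-init with a NULL key.
    h = base.copy()
    h.update(msg)
    return h.digest()

# ...and each call matches a freshly keyed HMAC.
assert fast_hmac(b"msg") == hmac.new(b"key", b"msg", hashlib.sha256).digest()
```

Note that keeping `base` in a module-level variable has the same thread-safety caveat raised above about static variables in the C patch.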

--
Daniel Gustafsson





Re: Index AM API cleanup

2025-02-03 Thread Peter Eisentraut

On 30.01.25 04:33, Mark Dilger wrote:

The previously committed patch (v19.1) already changed the gist strategy number 
mapping to use the (Row)CompareType, as in Mark's proposal.

In this patch set, I'm attaching the existing standalone gist translate 
function as the index AM API function for the gist index AM.  And then all 
existing callers are changed to call through the index AM API functions 
provided by Mark's patch set.

Patches 0001, 0002, 0003 are some preparatory renaming and refactoring patches.


These all look sensible.  I like that they reduce code duplication. No 
objection.


Patches 0004 and 0005 are patch v19-0008 from Mark's (via Andrew) v19 patch 
set, split into two patches, and with some function renaming from my side.


0004 needs the copyright notice updated to 2025.


Patch 0006 then pulls it all together.  The key change is that we also need to pass the 
opclass to the index AM API functions, so that access methods like gist can use it.  
Actually, I changed that to pass opfamily and opcintype instead.  I think this matches 
better with the rest of the "Index AM API cleanup" patch set, because it's more 
common to have the opfamily and type handy than the opclass.  (And internally, the gist 
support function is attached to the opfamily anyway, so it's actually simpler that way.)

I think this fits together quite well now.  Several places where gist was 
hardcoded are now fully (or mostly) independent of gist.  Also, the somewhat 
hackish get_equal_strategy_number() in the logical replication code disappears 
completely and is replaced by a proper index AM API function.  (This also takes 
care of patch v19-0011 from Mark's patch set.)


I find the hardcoded GIST_AM_OID in GetOperatorFromCompareType() antithetical 
to the purpose of this patch-set.


Yeah, actually this turned out to be unnecessary, because you can easily 
obtain the AM OID from the passed-in opclass ID.  So I have fixed it 
that way.  I have committed this whole patch set now with the mentioned 
adjustments.  I'll post a rebased version of the main patch set soon.






new commitfest transition guidance

2025-02-03 Thread Peter Eisentraut
During the developer meeting at FOSDEM last Thursday, there was strong 
consensus that we should no longer move patches to the next commitfest 
in a semi-automatic, mechanical way, as commitfest managers have
traditionally done.  Patches should only be moved forward by someone 
involved in the patch who knows that the patch is actually being worked 
on.  That way, there is a higher likelihood that the patches arriving in 
the next commitfest actually have someone currently invested in them.


My interpretation of this is that patches should be moved forward by 
either an author, possibly a reviewer, possibly a committer signed up 
for the patch, or maybe even a colleague of an author who knows that the 
author is on vacation and will get back to it in a couple of weeks, or 
some similar situation.


CF 2025-01 has just ended, so I suggest that everyone try this now.  We 
can check in perhaps two weeks whether this results in lots of stuff 
falling through the cracks or still too much stuff with unclear status 
being moved forward, and then see what that might mean going forward.






RE: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-02-03 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> I'm concerned that users could be confused if two different names
> refer to substantially the same thing.
> 
> Having said that, I guess that we need to drastically change the
> messages. For example, I think that the wal_level worker should say
> something like "successfully made 'logical' wal_level effective"
> instead of saying something like "changed wal_level value". Also,
> users might not need gradual messages when increasing 'minimal' to
> 'logical' or decreasing 'logical' to 'minimal'.

+1 for something like "successfully made 'logical' wal_level effective", and
removing gradual messages.

> > 6.
> > With the patch present, the wal_level can be changed to the minimal even 
> > when
> the
> > streaming replication is going. If we do that, the walsender exits 
> > immediately
> and
> > the below FATAL appears periodically until the standby stops. Same things 
> > can
> be
> > said for the logical replication:
> >
> > ```
> > FATAL:  streaming replication receiver "walreceiver" could not connect to 
> > the
> primary server:
> > connection to server on socket "/tmp/.s.PGSQL." failed:
> > FATAL:  WAL senders require "wal_level" to be "replica" or "logical
> > ```
> >
> > I know this is not perfect, but can we avoid the issue by rejecting the
> > GUC update if a walsender exists? Another approach is not to update the
> > value when replication slots need to be invalidated.
> 
> Does it mean that we reject the config file from being reloaded in
> that case? I have no idea how to reject it in a case where the
> wal_level in postgresql.conf changed and the user did 'pg_ctl reload'.

I imagined like attached. When I modified wal_level to minimal and send SIGHUP,
postmaster reported below lines and failed to update wal_level.

```
LOG:  received SIGHUP, reloading configuration files
LOG:  wal_level cannot be set to "minimal" while walsender exists
LOG:  configuration file "...postgresql.conf" contains errors; unaffected 
changes were applied
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED



add_check_hook.patch
Description: add_check_hook.patch


Re: NOT ENFORCED constraint feature

2025-02-03 Thread Ashutosh Bapat
On Mon, Feb 3, 2025 at 1:20 PM Alvaro Herrera  wrote:
>
> On 2025-Feb-03, Ashutosh Bapat wrote:
>
> > VALID, NOT ENFORCED changed to VALID, ENFORCED - data validation
> > required, constraint is enforced
>
> There's no such thing as a VALID NOT ENFORCED constraint.  It just
> cannot exist.

The document in the patch says
```
If the
  constraint is NOT ENFORCED, the database system will
  not check the constraint.  It is then up to the application code to
  ensure that the constraints are satisfied.  The database system might
  still assume that the data actually satisfies the constraint for
  optimization decisions where this does not affect the correctness of the
  result.
```
If a constraint is NOT VALID, NOT ENFORCED it can't be used for
optimization. Constraints which are VALID, NOT ENFORCED can be used
for optimization. That's a correct state if the application is
faithfully making sure that the constraint is satisfied, as suggested
in our documentation. Otherwise, I don't see how NOT ENFORCED
constraints would be useful.

>
> > NOT VALID, NOT ENFORCED changed to NOT_VALID, ENFORCED - no data
> > validation required, constraint is enforced on the new tuples/changes
>
> This may make sense, but it needs special nonstandard syntax.  If you
> start with a NOT VALID NOT ENFORCED constraint (which is the only way to
> have a NOT ENFORCED constraint) and apply ALTER TABLE ALTER CONSTRAINT
> ENFORCE, you will end up with a VALID ENFORCED constraint, therefore
> validation must be run.
>
> If you wanted to add a nonstandard command
> ALTER TABLE ALTER CONSTRAINT ENFORCE NO VALIDATE

Which state transition needs it? ALTER TABLE ALTER CONSTRAINT ENFORCE
is enough to change NOT VALID, NOT ENFORCED constraint to NOT VALID,
ENFORCED constraint; it does not need NO VALIDATE.

-- 
Best Wishes,
Ashutosh Bapat




Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2025-02-03 Thread Tatsuo Ishii
> I've looked at it again and I think the code is correct, but I
> miswrote that the array needs to be sorted. The above query returns:
>  x | y | nth_value
> ---+---+---
>  1 | 1 | 2
>  2 | 2 | 1
>  3 |   | 2
>  4 | 4 |
>  5 |   | 4
>  6 | 6 | 7
>  7 | 7 | 6
> (7 rows)
> 
> This is correct, for values of x:
> 
> 1: The first non-null value of y is at position 0, however we have
> EXCLUDE CURRENT ROW so it picks the next non-null value at position 1
> and stores it in the array, returning 2.
> 2: We can now take the first non-null value of y at position 0 and
> store it in the array, returning 1.
> 3. We take 1 preceding, using the position stored in the array, returning 2.
> 4. 1 preceding and 1 following are both null, and we exclude the
> current row, so returning null.
> 5. 1 preceding is at position 3, store it in the array, returning 4.
> 6. 1 preceding is null and we exclude the current row, so store
> position 6 in the array, returning 7.
> 7. 1 preceding is at position 5, store it in the array and return 6.
> 
> It will be unordered when the EXCLUDE clause is used but the code
> should handle this correctly.

I ran this query (not using IGNORE NULLS) and got this result:

SELECT
  x,
  nth_value(x,2) OVER w
FROM generate_series(1,5) g(x)
WINDOW w AS (ORDER BY x ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE 
CURRENT ROW);
 x | nth_value 
---+---
 1 | 3
 2 | 3
 3 | 2
 4 | 3
 5 | 4
(5 rows)

Since there's no NULL in x column, I expected the same result using
IGNORE NULLS, but it was not:

SELECT
  x,
nth_value(x,2) IGNORE NULLS OVER w
FROM generate_series(1,5) g(x)
WINDOW w AS (ORDER BY x ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE 
CURRENT ROW);
 x | nth_value 
---+---
 1 | 3
 2 | 4
 3 | 4
 4 | 3
 5 | 4
(5 rows)

I suspect the difference is in the code path of
ignorenulls_getfuncarginframe and the code path in
WinGetFuncArgInFrame, which takes care of EXCLUDE like this.

case FRAMEOPTION_EXCLUDE_CURRENT_ROW:
if (abs_pos >= winstate->currentpos &&
winstate->currentpos >= 
winstate->frameheadpos)
abs_pos++;
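
For reference, the expected frame semantics can be modeled directly. This sketch (an illustration, not the server's implementation) reproduces the non-IGNORE NULLS result above — which is also what IGNORE NULLS should produce here, since the column contains no NULLs:

```python
def nth_value_exclude_current(xs, n, preceding, following):
    """nth_value(x, n) over ROWS BETWEEN <preceding> PRECEDING AND
    <following> FOLLOWING EXCLUDE CURRENT ROW (no NULLs to ignore)."""
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - preceding), min(len(xs) - 1, i + following)
        # Frame rows in order, skipping the current row per EXCLUDE CURRENT ROW.
        frame = [xs[j] for j in range(lo, hi + 1) if j != i]
        out.append(frame[n - 1] if len(frame) >= n else None)
    return out

# Matches the non-IGNORE NULLS psql output in this message: 3, 3, 2, 3, 4
assert nth_value_exclude_current([1, 2, 3, 4, 5], 2, 2, 2) == [3, 3, 2, 3, 4]
```

The divergent 3, 4, 4, 3, 4 result with IGNORE NULLS therefore looks like the ignorenulls code path is not applying the EXCLUDE adjustment.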

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: POC, WIP: OR-clause support for indexes

2025-02-03 Thread Alexander Korotkov
On Mon, Feb 3, 2025 at 12:22 PM Alena Rybakina
 wrote:
>
> Thank you for updated version! I agree for your version of the code.
>
> On 02.02.2025 21:00, Alexander Korotkov wrote:
>
> On Fri, Jan 31, 2025 at 4:31 PM Alena Rybakina
>  wrote:
>
> I started reviewing the patch and saw an "ERROR" in the output of
> the test; is it okay here?
>
> SELECT * FROM tenk1 t1
> WHERE t1.thousand = 42 OR t1.thousand = (SELECT t2.tenthous FROM tenk1 t2 
> WHERE t2.thousand = t1.tenthous);
> ERROR: more than one row returned by a subquery used as an expression
>
> The output is correct for this query.  But the query is very
> unfortunate for the regression test.  I've revised the query in the v47
> revision [1].
>
> Links.
> 1. 
> https://www.postgresql.org/message-id/CAPpHfdsBZmNt9qUoJBqsQFiVDX1%3DyCKpuVAt1YnR7JCpP%3Dk8%2BA%40mail.gmail.com
>
> While analyzing the modified query plan from the regression test, I noticed
> that despite using a full seqscan for table t2 in the original plan,
> its results are cached by a Materialize node, and this can significantly
> speed up the execution of the NestedLoop algorithm.
>
> For example, after running the query several times, I got results that show 
> that the query execution time was twice as bad.
>
> Original plan:
>
> EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE 
> t1.a=t2.b OR t1.a=1; QUERY PLAN 
> 
>  Nested Loop (cost=0.00..70067.00 rows=2502499 width=24) (actual 
> time=0.015..1123.247 rows=2003000 loops=1) Join Filter: ((t1.a = t2.b) OR 
> (t1.a = 1)) Rows Removed by Join Filter: 1997000 Buffers: shared hit=22 -> 
> Seq Scan on bitmap_split_or t1 (cost=0.00..31.00 rows=2000 width=12) (actual 
> time=0.006..0.372 rows=2000 loops=1) Buffers: shared hit=11 -> Materialize 
> (cost=0.00..41.00 rows=2000 width=12) (actual time=0.000..0.111 rows=2000 
> loops=2000) Storage: Memory Maximum Storage: 110kB Buffers: shared hit=11 -> 
> Seq Scan on bitmap_split_or t2 (cost=0.00..31.00 rows=2000 width=12) (actual 
> time=0.003..0.188 rows=2000 loops=1) Buffers: shared hit=11 Planning Time: 
> 0.118 ms Execution Time: 1204.874 ms (13 rows)
>
> Query plan after the patch:
>
> EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2
> WHERE t1.a=t2.b OR t1.a=1;
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Nested Loop  (cost=0.28..56369.00 rows=2502499 width=24) (actual time=0.121..2126.606 rows=2003000 loops=1)
>    Buffers: shared hit=16009 read=2
>    ->  Seq Scan on bitmap_split_or t2  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.017..0.652 rows=2000 loops=1)
>          Buffers: shared hit=11
>    ->  Index Scan using t_a_b_idx on bitmap_split_or t1  (cost=0.28..18.15 rows=1002 width=12) (actual time=0.044..0.627 rows=1002 loops=2000)
>          Index Cond: (a = ANY (ARRAY[t2.b, 1]))
>          Buffers: shared hit=15998 read=2
>  Planning Time: 0.282 ms
>  Execution Time: 2344.367 ms
> (9 rows)
>
> I'm afraid that we may lose this with this optimization. Maybe this can be 
> taken into account somehow, what do you think?

The important aspect is that the second plan has a lower cost than the
first one.  So, that's a question for the cost model.  The patch just
lets optimizer consider more comprehensive plurality of paths.  You
can let optimizer select the first plan by tuning *_cost params.  For
example, setting cpu_index_tuple_cost = 0.02 makes first plan win for
me.

Other than that, the test query is quite unfortunate, as t1.a=1 is very
frequent. I've adjusted the query so that the nested loop with index scan
wins both in cost and execution time.

I've also adjusted another test query as proposed by Andrei.

I'm going to push this patch if there are no more notes.

Links.
1. 
https://www.postgresql.org/message-id/fc1017ca-877b-4f86-b491-154cf123eedd%40gmail.com

--
Regards,
Alexander Korotkov
Supabase


v48-0001-Revise-the-header-comment-for-match_clause_to_in.patch
Description: Binary data


v48-0002-Allow-usage-of-match_orclause_to_indexcol-for-jo.patch
Description: Binary data


Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-02-03 Thread Alex Friedman
Hi Sami,

Thanks for the feedback.

> 1/ Remove this as
> "(50% of the maximum, which is about 20GB),"
>
> [1] tried to avoid explaining this level of detail, and I
> agree with that.

I feel it is critical for users to know what is the hard limit of
multixact members. As PG doesn't (yet) expose how many multixact
members are in use, the only way for users to know the distance to
members wraparound is by monitoring the members directory space usage.
So it seems to me that the 20 GB number is very important to have in
the docs.

> 2/ c/"about 10GB"/"10GB" the "about" does not seem necessary here.

The threshold is actually ~10.015 GiB (due to the 12 bytes wasted per
8KB page), or ~10.75 GB, so to avoid confusion by users when
aggressive autovacuum doesn't trigger exactly at 10GB, I believe we
should either be exact, or say that we are not being exact. Being
exact is difficult as it depends on the block size. And as I looked
through the doc page in question, I noticed there are already several
cases using the "about" wording, e.g. "about 50MB of pg_xact storage"
and "about 2GB of pg_commit_ts storage", so here I went for
consistency with the rest of the doc.
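
For anyone checking the arithmetic, the ~10 GiB figure follows from the multixact member layout, assuming the standard 8 KB block size and the 20-bytes-per-4-members group format:

```python
BLCKSZ = 8192                           # assumed default block size
GROUP_SIZE = 4 * 1 + 4 * 4              # 4 flag bytes + 4 four-byte xids = 20 bytes
groups_per_page = BLCKSZ // GROUP_SIZE  # 409 groups; 12 bytes wasted per page
members_per_page = groups_per_page * 4  # 1636 members per 8KB page
threshold = 2 ** 31                     # 50% of the 2^32 member address space
pages = -(-threshold // members_per_page)  # ceiling division
size_gib = pages * BLCKSZ / 2 ** 30
print(round(size_gib, 3))               # ~10.015 GiB; the full 2^32 space is ~20 GiB
```

This is where both the "about 10GB" threshold and the per-page 12-byte waste mentioned above come from.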

Thanks,

Alex




Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-03 Thread Tom Lane
Andrey Borodin  writes:
> (Well, maybe paramarg2 resonates a bit, just from similarity with varchar2)

I'm not wedded to that name; do you have a better idea?

Another idea could be to make it an array:

-void   *paramarg;/* private data for same */
+void   *paramarg[2]; /* private data for same */

That would require touching other code using that field, but there
probably isn't much --- at least within our own tree, plpgsql itself
is the only user of paramarg.  Still, possibly breaking code that
didn't need to be broken doesn't seem like an improvement.

> ecpg tests seem to fail on Windows[0], but looks like it's not related to 
> this thread.

Yeah, known problem with Meson dependencies[1]; it's breaking
pretty much all the cfbot's windows builds right now, although
maybe there's a timing issue that allows some to pass.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CAGECzQSvM3iSDmjF%2B%3DKof5an6jN8UbkP_4cKKT9w6GZavmb5yQ%40mail.gmail.com




Better title output for psql \dt \di etc. commands

2025-02-03 Thread Greg Sabino Mullane
Please find attached a patch to enable more intuitive titles when using
psql and doing the following actions:

\dt \di \dv \dm \ds \dE

In other words, everything in listTables()

We still default to "List of relations", but if we know that the user has
requested exactly one specific type of relation, we use that instead. So
instead of

greg=# \dv
                 List of relations
 Schema |          Name           | Type | Owner
--------+-------------------------+------+-------
 public | pg_stat_statements      | view | greg
 public | pg_stat_statements_info | view | greg

we get:

greg=# \dv
                   List of views
 Schema |          Name           | Type | Owner
--------+-------------------------+------+-------
 public | pg_stat_statements      | view | greg
 public | pg_stat_statements_info | view | greg
(2 rows)

The same applies for "misses" as well:

greg=# \di
Did not find any indexes.

greg=# \di foo*
Did not find any indexes named "foo*".

I thought about keeping that last one as singular, as it is replacing "Did
not find any relation named", but not only does a plural make more sense
given that wildcards can produce multiple relations, it also makes the
code cleaner. :)

Inspired by a recent thread[0] over in pgsql-bugs entitled "Bug in psql".
(while not a bug, it is certainly a welcome enhancement, IMHO).

Cheers,
Greg

[0]
https://www.postgresql.org/message-id/CADrHaBEVSYwemoJWtry2%2B82KHy9tZirH2PVfi-uD96R5J8FCsw%40mail.gmail.com
From eaf1097eb0ac02d2ab093e4e4261d0a15adf8ac2 Mon Sep 17 00:00:00 2001
From: Greg Sabino Mullane 
Date: Mon, 3 Feb 2025 12:19:14 -0500
Subject: [PATCH] Show more intuitive titles for psql commands \dt \di \dv \dm
 \ds and \dE

This applies to the modified versions as well e.g. \dt+ \dtS+

Rather than say "List of relations", if someone has entered in a \di, we say "List of indexes"

Per complaint on the bugs list on Jan 29, 2025 from xpusostomos@gmail.com
---
 src/bin/psql/describe.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index aa4363b200..19366be2c8 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4012,9 +4012,11 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	bool		showForeign = strchr(tabtypes, 'E') != NULL;
 
 	PQExpBufferData buf;
+	PQExpBufferData title;
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	int			cols_so_far;
+	char	   *item_type;
 	bool		translate_columns[] = {false, false, true, false, false, false, false, false, false};
 
 	/* If tabtypes is empty, we default to \dtvmsE (but see also command.c) */
@@ -4161,6 +4163,15 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	if (!res)
 		return false;
 
+	item_type = psprintf("%s",
+		 (strspn(tabtypes, "tS+") == strlen(tabtypes)) ? "tables" :
+		 (strspn(tabtypes, "iS+") == strlen(tabtypes)) ? "indexes" :
+		 (strspn(tabtypes, "vS+") == strlen(tabtypes)) ? "views" :
+		 (strspn(tabtypes, "mS+") == strlen(tabtypes)) ? "materialized view" :
+		 (strspn(tabtypes, "sS+") == strlen(tabtypes)) ? "sequences" :
+		 (strspn(tabtypes, "ES+") == strlen(tabtypes)) ? "foreign tables" :
+		 "relations");
+
 	/*
 	 * Most functions in this file are content to print an empty table when
 	 * there are no matching objects.  We intentionally deviate from that
@@ -4169,14 +4180,16 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	if (PQntuples(res) == 0 && !pset.quiet)
 	{
 		if (pattern)
-			pg_log_error("Did not find any relation named \"%s\".",
-		 pattern);
+			pg_log_error("Did not find any %s named \"%s\".", item_type, pattern);
 		else
-			pg_log_error("Did not find any relations.");
+			pg_log_error("Did not find any %s.", item_type);
 	}
 	else
 	{
-		myopt.title = _("List of relations");
+		initPQExpBuffer(&title);
+		printfPQExpBuffer(&title, _("List of %s"), item_type);
+
+		myopt.title = title.data;
 		myopt.translate_header = true;
 		myopt.translate_columns = translate_columns;
 		myopt.n_translate_columns = lengthof(translate_columns);
-- 
2.30.2
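
The strspn()-based classification in the patch — "every character of tabtypes is in the allowed set" — can be sketched in Python like so (an illustration of the logic, not a proposal to change the patch):

```python
def item_type(tabtypes: str) -> str:
    """Pick a title noun when exactly one relation type was requested,
    loosely mirroring strspn(tabtypes, set) == strlen(tabtypes) in the patch."""
    mapping = [
        ("tS+", "tables"),
        ("iS+", "indexes"),
        ("vS+", "views"),
        ("mS+", "materialized views"),
        ("sS+", "sequences"),
        ("ES+", "foreign tables"),
    ]
    for allowed, name in mapping:
        # True only when every character of tabtypes is in the allowed set.
        if tabtypes and all(c in allowed for c in tabtypes):
            return name
    return "relations"

assert item_type("v") == "views"        # \dv
assert item_type("iS") == "indexes"     # \diS
assert item_type("tv") == "relations"   # mixed types keep the generic title
```

Any mix of relation types (e.g. the default \dtvmsE expansion) falls through to the generic "relations" title, which preserves the current behavior.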



Re: Non-text mode for pg_dumpall

2025-02-03 Thread jian he
On Mon, Feb 3, 2025 at 5:14 PM jian he  wrote:
>
> there are some corner cases, like num_db_restore == 0 or num_db_restore >= 100.
> in those scenarios, execute_global_sql_commands has already executed,
> which is not ideal, since pg_fatal is reached after some sql commands
> have already executed.
> maybe we should call execute_global_sql_commands and restoreAllDatabases
> only if 0 < num_db_restore < 100.
>
>
> the attached patch tries to do that.
> the attached patch also makes some cosmetic changes.

hi.
please ignore the previous patch; see the patch attached to this email.
previously, in [1], I complained that ``pg_restore --list`` needed a db
connection and also called execute_global_sql_commands.
the patch attached to this email fixes the problem: now pg_restore --list
needs no db connection.

now the logic is:
if the num_db_restore value is OK (0 < num_db_restore < MAX_ON_EXIT_NICELY)
*AND* the --list option was not specified,
then call execute_global_sql_commands.

[1] 
https://postgr.es/m/CACJufxHUDGWe=2ZukvMfuwEcSK8CsVYm=9+rtpnrw7crcfo...@mail.gmail.com


v14-0001-fix-pg_restore-list-option-and-handle-invoke-.no-cfbot
Description: Binary data


Re: Increased work_mem for "logical replication tablesync worker" only?

2025-02-03 Thread Dmitry Koterov
What's also suspicious is that on the destination node, after copying
finishes,

# explain select 1 from mytable;
 Seq Scan on mytable (cost=0.00..6821514.12 rows=250410012 width=4)

# SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE relname =
'mytable';
   relname   | estimated_rows
-------------+----------------
 mytable     |      150342468

Notice the discrepancy between how many rows EXPLAIN thinks there are in the
table and how many rows pg_stat_user_tables reports (250M vs. 150M).

On the source node, those two numbers are almost the same. It's only on the
destination that they differ, right after the copying. (Not a lot of writes
to this table are happening on the source while copying, BTW.)

Maybe that could hint on why temp files are used?


On Mon, Feb 3, 2025 at 4:21 AM Dmitry Koterov 
wrote:

> Here is the list of tmp files:
>
> postgres@pg-101a:~/17/main/base/pgsql_tmp$ ls -la
> total 5422297
> drwx-- 2 postgres postgres  9 Feb  3 04:08 .
> drwx-- 8 postgres postgres  8 Jan 29 01:27 ..
> -rw--- 1 postgres postgres 1073741824 Feb  3 04:05 pgsql_tmp196534.0
> -rw--- 1 postgres postgres 1073741824 Feb  3 04:05 pgsql_tmp196534.1
> -rw--- 1 postgres postgres 1073741824 Feb  3 04:05 pgsql_tmp196534.2
> -rw--- 1 postgres postgres 1073741824 Feb  3 04:06 pgsql_tmp196534.3
> -rw--- 1 postgres postgres 1073741824 Feb  3 04:07 pgsql_tmp196534.4
> -rw--- 1 postgres postgres 1073741824 Feb  3 04:08 pgsql_tmp196534.5
> -rw--- 1 postgres postgres  819396608 Feb  3 04:08 pgsql_tmp196534.6
>
> With work_mem=4GB, all those files *on the destination node* seemed to
> appear immediately with 4GB size and keep growing since then, while COPY
> progresses on the source node (i.e. it looked like PG tried hard to utilize
> work_mem, but after reaching the limit, dumped everything to pgsql_tmp
> still).
>
> The table structure being copied (just 1 index there):
>
> CREATE TABLE mytable (
> id bigint NOT NULL PRIMARY KEY,
> snippet bytea,
> title bytea,
> updated_at timestamp with time zone,
> rich_snippet bytea
> );
>
> Directories sizes on the destination node while tablesync is working (it's
> copied in to an almost empty database):
>
> $ watch du -sh 17/main/base/* 17/main/pg_wal
> 2.2M17/main/base/1
> 14G 17/main/base/16385
> 2.3M17/main/base/16387
> 2.2M17/main/base/4
> 2.3M17/main/base/5
> 12G 17/main/base/pgsql_tmp
> 6.3G17/main/pg_wal
>
> So the question is: why does it use temp files? Why not just write directly
> to WAL+data?
>
>
> On Mon, Feb 3, 2025 at 3:04 AM Dmitry Koterov 
> wrote:
>
>> Hi.
>>
>> 1. Those are temp files on the destination node (where the logical
>> subscription exists and tablesync worker runs), not on the source. On the
>> source, it’s all clear.
>>
>> 2. No “spill” suffix/substring in the file names. I tried to look at the
>> content of these temp files, I I saw some text fragments from the original
>> table’s text column there. I.e. it looks like for some reason, the stream
>> received from the source node’s COPY command goes to that temp files (at
>> least partially).
>>
>> 3. I made several more experiments, increasing work_mem to several GB
>> (for the role which tablesync worker uses when copying) definitely helps
>> with temp files.
>>
>> Thanks!
>>
>> On Sun, Feb 2, 2025 at 19:10 Amit Kapila  wrote:
>>
>>> On Sun, Feb 2, 2025 at 5:13 PM Dmitry Koterov 
>>> wrote:
>>> >
>>> > Trying to monitor perf during the initial tablesync phase (COPY) right
>>> after CREATE SUBSCRIPTION. I noticed that the size of
>>> 17/main/base/pgsql_tmp on the destination node grows (tens of gigabytes) as
>>> the COPY command (running internally on the publisher) progresses. Then in
>>> the end (when its "EXPLAIN SELECT 1 FROM tbl" on the destination shows the
>>> approximate number of rows equals to the number of rows on the source node)
>>> it hangs for several minutes, and then 17/main/base/pgsql_tmp empties, and
>>> the subscription progresses.
>>> >
>>> > It seems like if I increase work_mem to several GB, then the growth of
>>> 17/main/base/pgsql_tmp becomes less significant.
>>> >
>>> > Questions:
>>> >
>>> > 1. Are there some diagnostics commands that would allow me to figure
>>> out what is in those tmp files? Why does the subscriber create those tmp
>>> files and not just write directly to the data files and WAL? (The table has
>>> 2 bytea columns, i.e. it's TOASTed for sure.)
>>> >
>>>
>>> We do write spill files (ending with '.spill') if the changes are
>>> large. Can you please share the name of tmp files to avoid any
>>> assumptions?
>>>
>>> --
>>> With Regards,
>>> Amit Kapila.
>>>
>>


Re: new commitfest transition guidance

2025-02-03 Thread Aleksander Alekseev
Hi Peter,

> CF 2025-01 has just ended, so I suggest that everyone try this now.  We
> can check in perhaps two weeks whether this results in lots of stuff
> falling through the cracks or still too much stuff with unclear status
> being moved forward, and then see what that might mean going forward.

This doesn't work unfortunately.

When I try to move my patches from CF 2025-01 to CF 2025-03 I get an error:

"""
The status of this patch cannot be changed in this commitfest. You
must modify it in the one where it's open!
"""

-- 
Best regards,
Aleksander Alekseev




Re: new commitfest transition guidance

2025-02-03 Thread Aleksander Alekseev
Hi,

> This doesn't work unfortunately.
>
> When I try to move my patches from CF 2025-01 to CF 2025-03 I get an error:
>
> """
> The status of this patch cannot be changed in this commitfest. You
> must modify it in the one where it's open!
> """

Oops, never mind. The application asked me to log in, and I didn't
notice that when I did, it also moved the patch to the next CF. That
was the actual reason I got an error.

Sorry for the noise.

-- 
Best regards,
Aleksander Alekseev




Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-03 Thread Tom Lane
Pavel Borisov  writes:
> Minor notes on the patches:

> If dump_* functions could use the newly added walker, the code would
> look better. I suppose the main complication is that dump_* functions
> contain a lot of per-statement prints/formatting. So maybe a way to
> implement this is to put these statements into the existing tree
> walker i.e. plpgsql_statement_tree_walker_impl() and add an argument
> bool dump_debug into it. So in effect, called with dump_debug=false,
> plpgsql_statement_tree_walker_impl() would walk silently, and with
> dump_debug=true it would walk and print what is supposed to be
> printed currently in dump_* functions. Maybe there are other problems
> besides this?

I'm not thrilled with that idea, mainly because it would add overhead
to the performance-relevant cases (mark and free) to benefit a rarely
used debugging feature.  I'm content to leave the debug code out of
this for now --- it seems to me that it's serving a different master
and doesn't have to be unified with the other routines.

> For exec_check_rw_parameter():

> I think renaming expr->expr_simple_expr to sexpr saves a few bytes but
> doesn't make anything simpler, so if possible I'd prefer using just
> expr->expr_simple_expr with the necessary casts. Furthermore, in this
> function we mostly use the cast results fexpr, opexpr and sbsref of
> expr->expr_simple_expr, which already have separate names.

Hmm, I thought it looked cleaner like this, but I agree beauty
is in the eye of the beholder.  Anybody else have a preference?

> Transferring the target param as int paramid looks completely OK. But we
> have an unconditional check Assert(paramid == expr->target_param + 1),
> so it looks like a redundant split as of now. Do we plan a true split
> and removal of this assert in the future?

We've already fetched the target variable using the paramid (cf
plpgsql_param_eval_var_check), so I think checking that the
expression does match it seems like a useful sanity check.
Agreed, it shouldn't ever not match, but that's why that's just
an Assert.

> Thanks for creating and working on this patch!

Thanks for reviewing it!

regards, tom lane




Re: SQLFunctionCache and generic plans

2025-02-03 Thread Pavel Stehule
Hi

I did multiple benchmark runs, and it still looks like the proposed patch
doesn't help and has significant overhead.

testcase:

create or replace function fx(int) returns int as $$ select $1 + $1; $$
language sql immutable;
create or replace function fx2(int) returns int as $$ select 2 * $1; $$
language sql immutable;

I tested

do $$
begin
  for i in 1..100 loop
perform fx((random()*100)::int); -- or fx2
  end loop;
end;
$$;

Results (master, patched):
fx: 17067 ms, 22165 ms
fx2: 2234 ms, 2311 ms

The execution of dynamic SQL:

(2025-02-03 18:47:33) postgres=# do $$
begin
  for i in 1..100 loop
execute 'select $1 + $1' using (random()*100)::int;
  end loop;
end;
$$;
DO
Time: 13412.990 ms (00:13.413)

In the profiler I see significant parser overhead, so it looks like
there is some additional overhead, but the plan cache is not used.

Please, can somebody recheck my tests?

Regards

Pavel


Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-03 Thread Andrey Borodin


> On 3 Feb 2025, at 22:36, Tom Lane  wrote:
> 
> I'm not wedded to that name; do you have a better idea?

I'd propose something like the attached. But feel free to ignore my suggestion: I
do not understand the context of these structure members.


Best regards, Andrey Borodin.


rename.diff
Description: Binary data


Re: SIGSEGV, FPE fix in pg_controldata

2025-02-03 Thread Alexander Korotkov
Hi Ian,

On Thu, Dec 12, 2024 at 12:23 PM Ilyasov Ian  wrote:
> SIGSEGV was caused by two places in pg_controldata.c that work with
> localtime(&time_tmp). So I added a check for NULL.
>
> 
>
> Casting the second operand of % (XLogSegmentsPerXLogId(wal_segsz_bytes)) to
> unsigned seems enough. I would be glad to hear your thoughts.

Thank you for catching this.  I think catching invalid timestamps is
good, except we could use the already existing string indicating this and
not bother translators.  Also, I don't think we should change the
segment size to uint32, as it's already defined as int in an awful lot
of places.  Additionally, WalSegMaxSize is clearly within the 32-bit
integer maximum.  So, I think we can just adjust the check before
XLByteToSeg().  What do you think?

--
Regards,
Alexander Korotkov
Supabase


v2-0001-Fix-possible-pg_control_data-errors-on-corrupted-.patch
Description: Binary data


Re: Non-text mode for pg_dumpall

2025-02-03 Thread Mahendra Singh Thalor
On Mon, 3 Feb 2025 at 14:23, Srinath Reddy  wrote:
>
> Hi,
> I found a bug while using "./pg_restore pdd -f -". It is supposed to copy
> everything (global SQL commands + the remaining dump) to stdout, as per
> "-f, --file=FILENAME  output file name (- for stdout)", but it copies the
> global SQL commands to a file literally named "-", and the remaining dump
> is written to stdout without those global SQL commands. "-" is not an
> output file; it signifies stdout in terminal commands, so we have to handle
> this case. For the same reason, "./pg_restore pdd -g -f -" also creates a
> file named "-" and writes the globals to that file instead of stdout.

I also tested this; in my testing, all globals were printed to the
console as well, but the patch was creating a "-" file, which was wrong.
Yes, we should treat "-" as stdout. In the latest patch, I have
fixed this issue.

On Mon, 3 Feb 2025 at 14:44, jian he  wrote:
>
> hi.
>
> git clean -fdx && $BIN10/pg_dumpall --format=directory --file=dir10
> $BIN10/pg_restore --format=directory --file=1.sql --verbose dir10 >
> dir_format 2>&1
>
> There is no "\connect dbname" command.
> Piping 1.sql to psql will restore all of the database dumps into a single
> database, which is not good.
> We need "\connect dbname" in file 1.sql.

We can't add this command directly to the dump file; we need to add a
TOC entry for it. I will try to make one.

>
>
> <<<>>>--
> $BIN10/pg_dumpall --format=directory --exclude-database=src10 
> --file=dir12_temp
> drop table t from database x
> $BIN10/pg_restore --format=directory --dbname=x --verbose dir12_temp >
> dir_format 2>&1
> log info--
> pg_restore: found database "template1" (OID: 1) in map.dat file while 
> restoring.
> pg_restore: found database "x" (OID: 19554) in map.dat file while restoring.
> pg_restore: found total 2 database names in map.dat file
> pg_restore: needs to restore 2 databases out of 2 databases
> pg_restore: restoring dump of pg_dumpall without -C option, there
> might be multiple databases in directory.
> pg_restore: restoring database "template1"
> pg_restore: connecting to database for restore
> pg_restore: implied data-only restore
> pg_restore: restoring database "x"
> pg_restore: connecting to database for restore
> pg_restore: processing data for table "public.t"
> pg_restore: while PROCESSING TOC:
> pg_restore: from TOC entry 3374; 0 19555 TABLE DATA t jian
> pg_restore: error: could not execute query: ERROR:  relation
> "public.t" does not exist
> Command was: COPY public.t (a) FROM stdin;
> pg_restore: warning: errors ignored on restore: 1
> pg_restore: number of restored databases are 2
> 
> $BIN10/pg_restore --format=directory --list dir12_temp
> selected output:
>
> ; Selected TOC Entries:
> ;
> 217; 1259 19555 TABLE public t jian
> 3374; 0 19555 TABLE DATA public t jian
> 3228; 2606 19560 CONSTRAINT public t t_pkey jian
>
> As you can see, dir12_temp has TABLE and TABLE DATA.
> so the above log message: "pg_restore: implied data-only restore" is
> not what we expected.

I will do some tests with pg_dump and -t option.

>
> BTW, add --create option, it works as i expected.
> like
> $BIN10/pg_restore --format=directory --create --dbname=x --verbose
> dir12_temp > dir_format 2>&1
> output is what i expected.
>
> <<<>>>--
> With the changes in filter_dbnames_for_restore,
> --exclude-database=pattern
> will behave differently depending on whether you specify the --file option.
>
> * --file option specified:
> --exclude-database=pattern does not allow any special wildcard characters;
> it does not behave as the docs describe.
> * --file option not specified: it behaves as the docs describe.
>
> That's kind of tricky; either add more words to the docs explaining the
> scenario where the --file option is specified,
> or disallow the --file option when --exclude-database is specified.

We will do some more doc changes for this in next versions.

>
> Do we need to update pg_restore.sgml about the MAX_ON_EXIT_NICELY limit of 100?

Temporarily, we increased this size. Based on other opinions, we will
make more changes here.

>
> There are some corner cases like num_db_restore == 0 and num_db_restore >= 100.
> In those scenarios, execute_global_sql_commands has already executed,
> which is not ideal, since you hit pg_fatal with some SQL commands
> already executed.
> Maybe we can call execute_global_sql_commands and restoreAllDatabases
> only if 0 < num_db_restore < 100.

Got it. Fixed it as per delta patch and added some extra condition to
the IF clause.

Here, I am attaching an updated patch for review and testing.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


v15_pg_dumpall-with-non-text_format-4th_feb.patch
Description: Binary data


Re: Better title output for psql \dt \di etc. commands

2025-02-03 Thread Tom Lane
Greg Sabino Mullane  writes:
> Please find attached a patch to enable more intuitive titles when using
> psql and doing the following actions:
> \dt \di \dv \dm \ds \dE
> In other words, everything in listTables()
> We still default to "List of relations", but if we know that the user has
> requested exactly one specific type of relation, we use that
> instead.

As presented, I think this fails the translatability guideline about
"Do not construct sentences at run-time" [1].  You could get around
that by selecting one of several whole title strings, though.

For myself, if we were going to do something like this, I'd kind
of like to cover more cases:

greg=# \dti
List of tables and indexes

But I'm really not sure how to maintain translatability without
a combinatorial explosion.  It's fairly easy to see how we might
produce output like

greg=# \dtvi
List of tables, views, indexes

but "List of", "tables", "views", and "indexes" would all have
to be separately translated, and I fear that might not hang
together well.

regards, tom lane

[1] https://www.postgresql.org/docs/devel/nls-programmer.html#NLS-GUIDELINES




Re: Add “FOR UPDATE NOWAIT” lock details to the log.

2025-02-03 Thread Yuki Seino

Thank you for your comment. Sorry for being late.


SELECT FOR UPDATE SKIP LOCKED might skip a large number of rows, 
leading to
a high volume of log messages from log_lock_failure in your current 
patch.

Could this be considered unintended behavior? Would it be better to limit
the number of log messages per query?
It is necessary to suppress the generation of a large volume of logs due
to SKIP LOCKED.
But I think that when SKIP LOCKED is used, the locks are often
intentionally bypassed.
It seems unnatural to log only the first lock failure or to log a specific
number of samples.


What do you think about simply not logging anything for SKIP LOCKED?


Regards,





sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum

2025-02-03 Thread Yura Sokolov
Good day,

Investigating some performance issues of a client, our engineers found
msgnumLock to be contended.

Looking closer, it is obvious it is not needed at all: it is used only as a
memory barrier. This is even stated in the comment at the start of the file:

* We deal with that by having a spinlock that readers must take for just
* long enough to read maxMsgNum, while writers take it for just long enough
* to write maxMsgNum.  (The exact rule is that you need the spinlock to
* read maxMsgNum if you are not holding SInvalWriteLock, and you need the
* spinlock to write maxMsgNum unless you are holding both locks.)
*
* Note: since maxMsgNum is an int and hence presumably atomically readable/
* writable, the spinlock might seem unnecessary.  The reason it is needed
* is to provide a memory barrier: we need to be sure that messages written
* to the array are actually there before maxMsgNum is increased, and that
* readers will see that data after fetching maxMsgNum.

So we changed maxMsgNum to be pg_atomic_uint32, and put appropriate memory
barriers:
- pg_write_barrier() before write to maxMsgNum (when only SInvalWriteLock
is held)
- pg_read_barrier() after read of maxMsgNum (when only SInvalReadLock is held)

It improved the situation for our client.

Note: pg_(write|read)_barrier() is chosen instead of
pg_atomic_(read|write)_membarrier_u32() because it is certainly cheaper, at
least on x86_64, where it translates to just a compiler barrier (an empty
asm statement).
Moreover, pg_atomic_read_membarrier_u32() is currently implemented as a
write operation, which is not good for a contended location.

-

regards
Yura Sokolov aka funny-falcon

From 351b3ecb5d085559beb13ffd6dd4416dab4b4a84 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 3 Feb 2025 11:58:33 +0300
Subject: [PATCH v0] sinvaladt.c: use atomic operations on maxMsgNum

msgnumLock spinlock could be highly contended.
Comment states it was used as memory barrier.
Lets use atomic ops with memory barriers directly instead.

Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
no full barrier semantic is required, and explicit read/write barriers
are cheaper at least on x86_64.
---
 src/backend/storage/ipc/sinvaladt.c | 56 -
 1 file changed, 23 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 2da91738c32..7d4912c6419 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -86,19 +86,10 @@
  * has no need to touch anyone's ProcState, except in the infrequent cases
  * when SICleanupQueue is needed.  The only point of overlap is that
  * the writer wants to change maxMsgNum while readers need to read it.
- * We deal with that by having a spinlock that readers must take for just
- * long enough to read maxMsgNum, while writers take it for just long enough
- * to write maxMsgNum.  (The exact rule is that you need the spinlock to
- * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
- * spinlock to write maxMsgNum unless you are holding both locks.)
- *
- * Note: since maxMsgNum is an int and hence presumably atomically readable/
- * writable, the spinlock might seem unnecessary.  The reason it is needed
- * is to provide a memory barrier: we need to be sure that messages written
- * to the array are actually there before maxMsgNum is increased, and that
- * readers will see that data after fetching maxMsgNum.  Multiprocessors
- * that have weak memory-ordering guarantees can fail without the memory
- * barrier instructions that are included in the spinlock sequences.
+ * We deal with that by using atomic operations with proper memory barriers.
+ * (The exact rule is that you need the read barrier after atomic read
+ * maxMsgNum if you are not holding SInvalWriteLock, and you need the
+ * write barrier before write maxMsgNum unless you are holding both locks.)
  */
 
 
@@ -168,11 +159,9 @@ typedef struct SISeg
 	 * General state information
 	 */
 	int			minMsgNum;		/* oldest message still needed */
-	int			maxMsgNum;		/* next message number to be assigned */
+	pg_atomic_uint32 maxMsgNum; /* next message number to be assigned */
 	int			nextThreshold;	/* # of messages to call SICleanupQueue */
 
-	slock_t		msgnumLock;		/* spinlock protecting maxMsgNum */
-
 	/*
 	 * Circular buffer holding shared-inval messages
 	 */
@@ -243,9 +232,8 @@ SharedInvalShmemInit(void)
 
 	/* Clear message counters, save size of procState array, init spinlock */
 	shmInvalBuffer->minMsgNum = 0;
-	shmInvalBuffer->maxMsgNum = 0;
+	pg_atomic_init_u32(&shmInvalBuffer->maxMsgNum, 0);
 	shmInvalBuffer->nextThreshold = CLEANUP_MIN;
-	SpinLockInit(&shmInvalBuffer->msgnumLock);
 
 	/* The buffer[] array is initially all unused, so we need not fill it */
 
@@ -303,7 +291,7 @@ SharedInvalBackendInit(bool sendOnly)
 
 	/* mark myself active, with all extant messages already read */

Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-03 Thread Nisha Moond
On Mon, Feb 3, 2025 at 2:55 PM Amit Kapila  wrote:
>
> On Fri, Jan 31, 2025 at 5:50 PM Nisha Moond  wrote:
> >
> > Please find the attached v66 patch set. The base patch(v65-001) is
> > committed now, so I have rebased the patches.
> >
>
> *
>
>  The time when the slot became inactive. NULL if 
> the
> -slot is currently being streamed.
> +slot is currently being streamed. If the slot becomes invalidated,
> +this value will remain unchanged until server shutdown.
> ...
> @@ -2408,7 +2527,9 @@ RestoreSlotFromDisk(const char *name)
>   /*
>   * Set the time since the slot has become inactive after loading the
>   * slot from the disk into memory. Whoever acquires the slot i.e.
> - * makes the slot active will reset it.
> + * makes the slot active will reset it. Avoid calling
> + * ReplicationSlotSetInactiveSince() here, as it will not set the time
> + * for invalid slots.
>   */
>   slot->inactive_since = GetCurrentTimestamp();
>
> It looks inconsistent to set inactive_since on restart for invalid
> slots but not at other times. We don't need to set inactive_since for
> invalid slots. The invalid slots should not be updated. Ideally, this
> should be taken care in the patch that introduces inactive_since but
> we can do that now. Let's do this as a separate patch altogether in a
> new thread.

Created a new thread [1] to address the inactive_since update for
invalid slots in a separate patch.

[1] 
https://www.postgresql.org/message-id/CABdArM7QdifQ_MHmMA%3DCc4v8%2BMeckkwKncm2Nn6tX9wSCQ-%2Biw%40mail.gmail.com

--
Thanks,
Nisha




Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-03 Thread Tom Lane
Andrey Borodin  writes:
>> On 3 Feb 2025, at 22:36, Tom Lane  wrote:
>> I'm not wedded to that name; do you have a better idea?

> I'd propose something like attached. But feel free to ignore my suggestion: I 
> do not understand context of these structure members.

Hmm, you're suggesting naming those field members after PL/pgSQL's
specific use of them.  But the intent was that they are generic
workspace for anything that provides a EEOP_PARAM_CALLBACK
callback --- that is, the "param" in the field name refers to the
fact that this is an expression step for some kind of Param, and
not to what PL/pgSQL happens to do with the field.

Admittedly this is all moot unless some other extension starts
using EEOP_PARAM_CALLBACK, and I didn't find any evidence of that
using Debian Code Search.  But I don't want to think of
EEOP_PARAM_CALLBACK as being specifically tied to PL/pgSQL.

regards, tom lane




Re: POC, WIP: OR-clause support for indexes

2025-02-03 Thread Alena Rybakina


On 03.02.2025 14:32, Alexander Korotkov wrote:

On Mon, Feb 3, 2025 at 12:22 PM Alena Rybakina
  wrote:

Thank you for updated version! I agree for your version of the code.

On 02.02.2025 21:00, Alexander Korotkov wrote:

On Fri, Jan 31, 2025 at 4:31 PM Alena Rybakina
  wrote:

I started reviewing the patch and saw an "ERROR" in the output of
the test; is it okay here?

SELECT * FROM tenk1 t1
WHERE t1.thousand = 42 OR t1.thousand = (SELECT t2.tenthous FROM tenk1 t2 WHERE 
t2.thousand = t1.tenthous);
ERROR: more than one row returned by a subquery used as an expression

The output is correct for this query.  But the query is very
unfortunate for the regression test.  I've revised the query in the v47
revision [1].

Links.
1.https://www.postgresql.org/message-id/CAPpHfdsBZmNt9qUoJBqsQFiVDX1%3DyCKpuVAt1YnR7JCpP%3Dk8%2BA%40mail.gmail.com

While analyzing the modified query plan from the regression test, I noticed
that despite using a full seqscan for table t2 in the original plan,
its results are cached by a Materialize node, and this can significantly speed
up the execution of the nested loop.

For example, after running the query several times, I got results showing
that the query execution time was twice as bad.

Original plan:

EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.a=t2.b OR t1.a=1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..70067.00 rows=2502499 width=24) (actual time=0.015..1123.247 rows=2003000 loops=1)
   Join Filter: ((t1.a = t2.b) OR (t1.a = 1))
   Rows Removed by Join Filter: 1997000
   Buffers: shared hit=22
   ->  Seq Scan on bitmap_split_or t1  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.006..0.372 rows=2000 loops=1)
         Buffers: shared hit=11
   ->  Materialize  (cost=0.00..41.00 rows=2000 width=12) (actual time=0.000..0.111 rows=2000 loops=2000)
         Storage: Memory  Maximum Storage: 110kB
         Buffers: shared hit=11
         ->  Seq Scan on bitmap_split_or t2  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.003..0.188 rows=2000 loops=1)
               Buffers: shared hit=11
 Planning Time: 0.118 ms
 Execution Time: 1204.874 ms
(13 rows)

Query plan after the patch:

EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.a=t2.b OR t1.a=1;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..56369.00 rows=2502499 width=24) (actual time=0.121..2126.606 rows=2003000 loops=1)
   Buffers: shared hit=16009 read=2
   ->  Seq Scan on bitmap_split_or t2  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.017..0.652 rows=2000 loops=1)
         Buffers: shared hit=11
   ->  Index Scan using t_a_b_idx on bitmap_split_or t1  (cost=0.28..18.15 rows=1002 width=12) (actual time=0.044..0.627 rows=1002 loops=2000)
         Index Cond: (a = ANY (ARRAY[t2.b, 1]))
         Buffers: shared hit=15998 read=2
 Planning Time: 0.282 ms
 Execution Time: 2344.367 ms
(9 rows)

I'm afraid we may lose this with this optimization. Maybe it can be
taken into account somehow; what do you think?

The important aspect is that the second plan has a lower cost than the
first one.  So that's a question for the cost model.  The patch just
lets the optimizer consider a more comprehensive set of paths.  You
can make the optimizer select the first plan by tuning *_cost params.  For
example, setting cpu_index_tuple_cost = 0.02 makes the first plan win for
me.

Other than that, the test query is quite unfortunate, as t1.a=1 is very
frequent. I've adjusted the query so that the nested loop with index scan
wins in both cost and execution time.

I've also adjusted another test query as proposed by Andrei.

I'm going to push this patch if there are no more notes.

Links.
1.https://www.postgresql.org/message-id/fc1017ca-877b-4f86-b491-154cf123eedd%40gmail.com



Okay. I agree with your code and have no more notes.

--
Regards,
Alena Rybakina
Postgres Professional


Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-03 Thread Shlok Kyal
On Fri, 31 Jan 2025 at 17:50, Nisha Moond  wrote:
>
> On Fri, Jan 31, 2025 at 2:32 PM Amit Kapila  wrote:
> >
> > On Fri, Jan 31, 2025 at 10:40 AM Peter Smith  wrote:
> > >
> > > ==
> > > src/backend/replication/slot.c
> > >
> > > ReportSlotInvalidation:
> > >
> > > 1.
> > > +
> > > + case RS_INVAL_IDLE_TIMEOUT:
> > > + Assert(inactive_since > 0);
> > > + /* translator: second %s is a GUC variable name */
> > > + appendStringInfo(&err_detail,
> > > + _("The slot has remained idle since %s, which is longer than the
> > > configured \"%s\" duration."),
> > > + timestamptz_to_str(inactive_since),
> > > + "idle_replication_slot_timeout");
> > > + break;
> > > +
> > >
> > > errdetail:
> > >
> > > I guess it is no fault of this patch because I see you've only copied
> > > nearby code, but AFAICT this function is still having an each-way bet
> > > by using a mixture of _() macro which is for strings intended be
> > > translated, but then only using them in errdetail_internal() which is
> > > for strings that are NOT intended to be translated. Isn't it
> > > contradictory? Why don't we use errdetail() here?
> > >
> >
> > Your question is valid and I don't have an answer. I encourage you to
> > start a new thread to clarify this.
> >
> > > errhint:
> > >
> > > Also, the way the 'hint' is implemented can only be meaningful for
> > > RS_INVAL_WAL_REMOVED. This is also existing code that IMO it was
> > > always strange, but now that this patch has added another kind of
> > > switch (cause) this hint implementation now looks increasingly hacky
> > > to me; it is also inflexible -- e.g. if you ever wanted to add
> > > different hints. A neater implementation would be to make the code
> > > more like how the err_detail is handled, so then the errhint string
> > > would only be assigned within the "case RS_INVAL_WAL_REMOVED:"
> > >
> >
> > This makes sense to me.
> >
> > +
> > + case RS_INVAL_IDLE_TIMEOUT:
> > + Assert(inactive_since > 0);
> > + /* translator: second %s is a GUC variable name */
> > + appendStringInfo(&err_detail,
> > + _("The slot has remained idle since %s, which is longer than the
> > configured \"%s\" duration."),
> > + timestamptz_to_str(inactive_since),
> > + "idle_replication_slot_timeout");
> >
> > I think the above message should be constructed on a model similar to
> > the following nearby message:"The slot's restart_lsn %X/%X exceeds the
> > limit by %llu bytes.". So, how about the following: "The slot's idle
> > time %s exceeds the configured \"%s\" duration"?
> >
> > Also, similar to max_slot_wal_keep_size, we should give a hint in this
> > case to increase idle_replication_slot_timeout.
> >
> > It is not clear why the injection point test is doing
> > pg_sync_replication_slots() etc. in the patch. The test should be
> > simple such that after creating a new physical or logical slot, enable
> > the injection point, then run the manual checkpoint command, and check
> > the invalidation status of the slot.
> >
>
> Thanks for the review! I have incorporated the above comments. The
> test in patch-002 has been optimized as suggested and now completes in
> less than a second.
> Please find the attached v66 patch set. The base patch(v65-001) is
> committed now, so I have rebased the patches.
>
> Thank you, Kuroda-san, for working on patch-002.
>

Hi Nisha,

I reviewed the v66 patch. I have few comments:

1. I also feel the default value should be set to '0' as suggested by
Vignesh in 1st point of [1].

2. Should we allow copying of invalidated slots?
Currently we are able to copy slots which are invalidated:

postgres=# select slot_name, active, restart_lsn, wal_status,
inactive_since , invalidation_reason from pg_replication_slots;
 slot_name | active | restart_lsn | wal_status |
inactive_since  | invalidation_reason
---++-++--+-
 test1 | f  | 0/16FDDE0   | lost   | 2025-02-03
18:28:01.802463+05:30 | idle_timeout
(1 row)

postgres=# select pg_copy_logical_replication_slot('test1', 'test2');
 pg_copy_logical_replication_slot
--
 (test2,0/16FDE18)
(1 row)

postgres=# select slot_name, active, restart_lsn, wal_status,
inactive_since , invalidation_reason from pg_replication_slots;
 slot_name | active | restart_lsn | wal_status |
inactive_since  | invalidation_reason
---++-++--+-
 test1 | f  | 0/16FDDE0   | lost   | 2025-02-03
18:28:01.802463+05:30 | idle_timeout
 test2 | f  | 0/16FDDE0   | reserved   | 2025-02-03
18:29:53.478023+05:30 |
(2 rows)

3. We have similar behaviour as above for physical slots.

[1]: 
https://www.postgresql.org/message-id/CALDaNm14QrW5j6su%2BEAqjwnHbiwXJwO%2Byk73_%3D7yvc5TVY-43g%40mail.gmail.com


Thanks and Regards,
Shlok Kyal




Re: Show WAL write and fsync stats in pg_stat_io

2025-02-03 Thread Nazir Bilal Yavuz
Hi,

On Mon, 3 Feb 2025 at 11:50, Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Mon, Feb 03, 2025 at 01:07:26PM +0900, Michael Paquier wrote:
> > On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:
> > > On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
> > >  wrote:
> > >> I think that's the main reason why ff99918c625 added this new GUC 
> > >> (looking at
> > >> the commit message). I'd feel more comfortable if we keep it.
> > >
> > > As Michael suggested, I will run a couple of benchmarks to see the
> > > actual effect of this change. Then let's see if this affects anything.
> >
> > I've looked a bit at all that today, and something like the attached
> > is what seems like the best streamlined version to me for the main
> > feature.  I am also planning to run some short benchmarks with
> > track_io_timing=on on HEAD and with the patch, then see the
> > difference, without any relationship to track_wal_io_timing.
>
> Thanks!
>
> I've a few comments:

Thank you both for the v13 and the review!

> === 1
>
> +   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
> +   io_start, 1, 
> wal_segment_size);
>
> In case wal_init_zero is false, we're only seeking to the end and writing a
> solitary byte. Then, is reporting "wal_segment_size" correct?

I think you are right. It would make sense to have two
pgstat_count_io_op_time() calls here: one for wal_segment_size and one
for the solitary byte.

> === 2
>
> + /*
> +  * Measure I/O timing to write WAL data, for pg_stat_wal
> +  * and/or pg_stat_io.
> +  */
> + start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing);
>
> I think that makes sense done that way (as track_wal_io_timing does not have
> any effect in pgstat_count_io_op_time()). Nit: maybe change the order in the
> comment to reflect the code ordering? (I mean to say re-word to "for 
> pg_stat_io
> and/or pg_stat_wal). The order is ok in issue_xlog_fsync() though.
>
> === 3
>
> What about adding a message in the doc as mentioned in [1]? (I'd not be 
> surprised
> if some people wonder why the "bytes" fields differ).
>
> === 4
>
> pgstat_tracks_io_object() starts to be hard to read. I wonder if it could be
> simplified with switch but that could be done after this one goes in.
>
> === 5
>
> I think this patch will help simplify the per-backend WAL related patch, 
> that's
> nice.

And I agree with the other comments you mentioned.

> === 6
>
> I'll also do some benchmark on my side.

Thanks!

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Optimize scram_SaltedPassword performance

2025-02-03 Thread Yura Sokolov
03.02.2025 14:17, Daniel Gustafsson пишет:
>> On 3 Feb 2025, at 08:45, Yura Sokolov  wrote:
>>
>> 03.02.2025 10:11, Zixuan Fu пишет:
>>> Hi Hackers,  
>>>
>>> While profiling a program with `perf`, I noticed that `scram_SaltedPassword`
>>> consumed more CPU time than expected. After some investigation, I found
>>> that the function performs many HMAC iterations (4096 rounds for
>>> SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing
>>> excessive overhead.
> 
> While I don't disagree with speeding up this in general, the whole point of 
> the
> SCRAM iterations is to take a lot of time as a way to combat brute forcing.
> Any attacker is likely to use a patched libpq (or not use libpq at all) so
> clientside it doesn't matter as much but if we make it 4x faster serverside we
> don't really achieve much other than making attacks more feasible.
> 
> The relevant portion from RFC 7677 §4 is:
> 
>   The strength of this mechanism is dependent in part on the hash
>   iteration-count, as denoted by "i" in [RFC5802].  As a rule of thumb,
>   the hash iteration-count should be such that a modern machine will take
>   0.1 seconds to perform the complete algorithm; however, this is
>   unlikely to be practical on mobile devices and other relatively low-
>   performance systems.  At the time this was written, the rule of thumb
>   gives around 15,000 iterations required; however, a hash iteration-
>   count of 4096 takes around 0.5 seconds on current mobile handsets.
>   This computational cost can be avoided by caching the ClientKey
>   (assuming the Salt and hash iteration-count is stable).  Therefore, the
>   recommendation of this specification is that the hash iteration-count
>   SHOULD be at least 4096, but careful consideration ought to be given to
>   using a significantly higher value, particularly where mobile use is
>   less important.
> 
> The numbers are quite outdated but the gist of it holds.  If we speed it up
> serverside we need to counter that with a higher iteration count, and for a
> rogue client it's unlikely to matter since it wont use our code anyways.
> Speeding up HMAC for other usecases is a different story (but also likely to
> have less measurable performance impact).

There is no sense in not speeding up the server if the client can be made
faster. The HMAC iteration count has to protect against an attacker with
highly optimized software on very powerful hardware, since it must also
resist brute force against a known ciphertext.

If 4096 iterations can be performed quickly in an attacker's environment,
there's no point in slowing down our own environment. It is meaningless.

>>> OpenSSL has an optimization for this case: when the key remains the
>>> same, the HMAC context can be reused with a lightweight state reset by
>>> passing NULL as the key. To take advantage of this, I introduced
>>> `pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used.
> 
>> Where `prev_key`, `prev_len` and `prev_hash` are static variables, filled
>> in `pg_hmac_init`.
> 
> Storing any part of a cryptograhic calculation, let alone a key, in a static
> variable doesn't seem entirely like a best practice, and it also wont be
> threadsafe.

`prev_key` stores not the key but a pointer to it, just to ensure
`pg_hmac_reuse` was called with exactly the same key. It doesn't leak the
key itself.

Storing `prev_hash` comes closer to "storing part of the key". I may agree
it is not exactly a good idea if `key` is "human-brain-generated".

Thread-safety is more or less trivial with `thread_local` and its synonyms.

--

Yura




Avoid updating inactive_since for invalid replication slots

2025-02-03 Thread Nisha Moond
Hi Hackers,
(CC people involved in the earlier discussion)

Right now, it is possible for the 'inactive_since' value of an invalid
replication slot to be updated multiple times, which is unexpected
behavior.
As suggested in the ongoing thread [1], this patch introduces a new
dedicated function to update the inactive_since for a given
replication slot, and ensures that inactive_since is not updated for
an invalid replication slot.


The v1 patch is attached. Any feedback would be appreciated.


[1] 
https://www.postgresql.org/message-id/CAA4eK1L6A-RC2PkqAFjgmmmS6_vKxrLsG33vdJzeeRKBP8RbOA%40mail.gmail.com

--
Thanks,
Nisha


v1-0001-Avoid-updating-inactive_since-for-invalid-replica.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2025-02-03 Thread torikoshia

Hi,

Rebased the patch since it couldn't be applied anymore.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
From 3f580c17214595eb8d6013674f5f054ec352ab7a Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Mon, 3 Feb 2025 21:22:40 +0900
Subject: [PATCH v40] Add function to log the plan of the currently running
 query

Currently, we have to wait for query execution to finish
to know its plan, whether using EXPLAIN ANALYZE or auto_explain.
This is inconvenient, for example, when investigating
long-running queries in production environments.
To improve this situation, this patch adds a pg_log_query_plan()
function that requests logging of the plan of the currently
executing query.

By default, only superusers are allowed to request logging of
the plans, because allowing any user to issue this request at an
unbounded rate would produce lots of log messages, which can
lead to denial of service.

On receipt of the request, the plan-logging code is wrapped
around every ExecProcNode, and when the executor runs one of
the ExecProcNodes, the plan is actually logged. The wrappers
are removed once the plan is logged.
In this way, we can safely avoid the overhead we would face
by adding CHECK_FOR_INTERRUPTS()-like mechanisms somewhere
in the executor code.
---
 contrib/auto_explain/auto_explain.c  |  23 +-
 doc/src/sgml/func.sgml   |  50 +++
 src/backend/access/transam/xact.c|   7 +
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/explain.c   | 372 ++-
 src/backend/executor/execMain.c  |  12 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/tcop/postgres.c  |   4 +
 src/backend/utils/init/globals.c |   2 +
 src/include/catalog/pg_proc.dat  |   6 +
 src/include/commands/explain.h   |  12 +
 src/include/miscadmin.h  |   1 +
 src/include/nodes/execnodes.h|   3 +
 src/include/storage/procsignal.h |   2 +
 src/include/tcop/pquery.h|   2 +-
 src/test/regress/expected/misc_functions.out |  54 ++-
 src/test/regress/sql/misc_functions.sql  |  41 +-
 17 files changed, 555 insertions(+), 42 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f1ad876e82..7f97c1fd51 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -399,26 +399,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			es->format = auto_explain_log_format;
 			es->settings = auto_explain_log_settings;
 
-			ExplainBeginOutput(es);
-			ExplainQueryText(es, queryDesc);
-			ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length);
-			ExplainPrintPlan(es, queryDesc);
-			if (es->analyze && auto_explain_log_triggers)
-ExplainPrintTriggers(es, queryDesc);
-			if (es->costs)
-ExplainPrintJITSummary(es, queryDesc);
-			ExplainEndOutput(es);
-
-			/* Remove last line break */
-			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
-es->str->data[--es->str->len] = '\0';
-
-			/* Fix JSON to output an object */
-			if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
-			{
-es->str->data[0] = '{';
-es->str->data[es->str->len - 1] = '}';
-			}
+			ExplainAssembleLogOutput(es, queryDesc, auto_explain_log_format,
+	 auto_explain_log_triggers,
+	 auto_explain_log_parameter_max_length);
 
 			/*
 			 * Note: we rely on the existing logging of context or
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7efc81936a..26fd78e926 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28409,6 +28409,25 @@ acl  | {postgres=arwdDxtm/postgres,foo=r/postgres}

   
 
+  
+   
+
+ pg_log_query_plan
+
+pg_log_query_plan ( pid integer )
+boolean
+   
+   
+Requests to log the plan of the query currently running on the
+backend with specified process ID.
+It will be logged at LOG message level and
+will appear in the server log based on the log
+configuration set (See 
+for more information), but will not be sent to the client
+regardless of .
+
+  
+
   

 
@@ -28527,6 +28546,37 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 because it may generate a large number of log messages.

 
+   
+pg_log_query_plan can be used
+to log the plan of a backend process. For example:
+
+postgres=# SELECT pg_log_query_plan(201116);
+ pg_log_query_plan
+---
+ t
+(1 row)
+
+The format of the query plan is the same as when VERBOSE,
+COSTS, SETTINGS and
+FORMAT TEXT are used in the EXPLAIN
+command. For example:
+
+LOG:  query plan running on backend with PID 201116 is:
+Quer

Re: Enhancing Memory Context Statistics Reporting

2025-02-03 Thread Rahila Syed
Hi,


> >
> > Just idea; as an another option, how about blocking new requests to
> > the target process (e.g., causing them to fail with an error or
> > returning NULL with a warning) if a previous request is still
> pending?
> > Users can simply retry the request if it fails. IMO failing quickly
> > seems preferable to getting stuck for a while in cases with
> concurrent
> > requests.
> >
> > Thank you for the suggestion. I agree that it is better to fail
> > early and avoid waiting for a timeout in such cases. I will add a
> > "pending request" tracker for this in shared memory. This approach
> > will help prevent sending a concurrent request if a request for the
> > same backend is still being processed.
> >


Please find attached a patch that adds a request_pending field in
shared memory. This allows us to detect concurrent requests early
and return a WARNING message immediately, avoiding unnecessary
waiting and potential timeouts. This is added in the v12-0002* patch.

I imagined it would work something like this:
>
> requesting backend:
> ---
> * set request_ts to current timestamp
> * signal the target process, to generate memory context info
> * wait until the DSA gets filled with stats_ts > request_ts
> * return the data, don't erase anything
>
> target backend
> --
> * clear the signal
> * generate the statistics
> * set stats_ts to current timestamp
> * wake all the backends waiting for the stats (through CV)
>

The attached v12-0002* patch implements this. We determine
the latest statistics based on the stats timestamp: if it is greater
than the timestamp when the request was sent, the statistics are
considered up to date and are returned immediately. Otherwise,
the client waits for the latest statistics to be published until the
timeout is reached.

With the latest changes, I don't see a dip in tps even when
concurrent requests are run in the pgbench script.

 pgbench -n -f monitoring.sql -P 1 postgres -T 60
pgbench (18devel)
progress: 1.0 s, 816.9 tps, lat 1.218 ms stddev 0.317, 0 failed
progress: 2.0 s, 821.9 tps, lat 1.216 ms stddev 0.177, 0 failed
progress: 3.0 s, 817.1 tps, lat 1.224 ms stddev 0.209, 0 failed
progress: 4.0 s, 791.0 tps, lat 1.262 ms stddev 0.292, 0 failed
progress: 5.0 s, 780.8 tps, lat 1.280 ms stddev 0.326, 0 failed
progress: 6.0 s, 675.2 tps, lat 1.482 ms stddev 0.503, 0 failed
progress: 7.0 s, 674.0 tps, lat 1.482 ms stddev 0.387, 0 failed
progress: 8.0 s, 821.0 tps, lat 1.217 ms stddev 0.272, 0 failed
progress: 9.0 s, 903.0 tps, lat 1.108 ms stddev 0.196, 0 failed
progress: 10.0 s, 886.9 tps, lat 1.128 ms stddev 0.160, 0 failed
progress: 11.0 s, 887.1 tps, lat 1.126 ms stddev 0.243, 0 failed
progress: 12.0 s, 871.0 tps, lat 1.147 ms stddev 0.227, 0 failed
progress: 13.0 s, 735.0 tps, lat 1.361 ms stddev 0.329, 0 failed
progress: 14.0 s, 655.9 tps, lat 1.522 ms stddev 0.331, 0 failed
progress: 15.0 s, 674.0 tps, lat 1.484 ms stddev 0.254, 0 failed
progress: 16.0 s, 659.0 tps, lat 1.517 ms stddev 0.289, 0 failed
progress: 17.0 s, 641.0 tps, lat 1.558 ms stddev 0.281, 0 failed
progress: 18.0 s, 707.8 tps, lat 1.412 ms stddev 0.324, 0 failed
progress: 19.0 s, 746.3 tps, lat 1.341 ms stddev 0.219, 0 failed
progress: 20.0 s, 659.9 tps, lat 1.513 ms stddev 0.372, 0 failed
progress: 21.0 s, 651.8 tps, lat 1.533 ms stddev 0.372, 0 failed
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again
progress: 22.0 s, 635.2 tps, lat 1.574 ms stddev 0.519, 0 failed
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again
progress: 23.0 s, 730.0 tps, lat 1.369 ms stddev 0.408, 0 failed
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again

where monitoring.sql is as follows:
SELECT * FROM pg_get_process_memory_contexts(
  (SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY random() LIMIT 1)
  , false);

I have split the patch into 2 patches with v12-0001* consisting of fixes
needed to allow using the MemoryContextStatsInternals for this
feature and
v12-0002* containing all the remaining changes for the feature.

A few outstanding issues are as follows:

1. Currently one DSA is created per backend when the first request for
statistics is made and remains for the lifetime of the server.
I think I should add logic to periodically destroy DSAs when memory
context statistics are not being *actively* queried from the backend,
as determined by the statistics timestamp.
2. The two issues reported by Fujii-san here: [1].
i. I have proposed a fix for the first issue here [2].
ii. I am able to reproduce the second issue. This happens when we try
to query statistics of a backend running infinite_recurse.sql. While I am
working on finding a root cause, I think it happens due to some memory
being overwritten 

Change log level for notifying hot standby is waiting non-overflowed snapshot

2025-02-03 Thread torikoshia

Hi,

When a hot standby is restarted in a state where subtransactions have 
overflowed, it may become inaccessible:


  $ psql: error: connection to server at "localhost" (::1), port 5433 
failed: FATAL:  the database system is not yet accepting connections

  DETAIL:  Consistent recovery state has not been yet reached.

However, the log message that indicates the cause of this issue seems to
be output only at the DEBUG1 level:


  elog(DEBUG1,
   "recovery snapshot waiting for non-overflowed snapshot or "
   "until oldest active xid on standby is at least %u (now %u)",
   standbySnapshotPendingXmin,
   running->oldestRunningXid);

I believe this message would be useful not only for developers but also 
for users.

How about changing the log level from DEBUG1 to NOTICE or similar?


Background:
One of our customers recently encountered an issue where the hot standby 
became inaccessible after a restart.
The issue resolved itself after some time, and I suspect it was caused by 
a subtransaction overflow.
If the log level had been higher, it would have been easier to 
diagnose the problem.
Even at NOTICE, it may be difficult to notice the cause if 
log_min_messages is set to the default WARNING, but still, a 
higher log level seems better than DEBUG1.


I would appreciate your thoughts.


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 2e54c11f88..20c505add7 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1125,7 +1125,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 "recovery snapshots are now enabled");
 			}
 			else
-elog(DEBUG1,
+elog(NOTICE,
 	 "recovery snapshot waiting for non-overflowed snapshot or "
 	 "until oldest active xid on standby is at least %u (now %u)",
 	 standbySnapshotPendingXmin,
@@ -1303,7 +1303,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	if (standbyState == STANDBY_SNAPSHOT_READY)
 		elog(DEBUG1, "recovery snapshots are now enabled");
 	else
-		elog(DEBUG1,
+		elog(NOTICE,
 			 "recovery snapshot waiting for non-overflowed snapshot or "
 			 "until oldest active xid on standby is at least %u (now %u)",
 			 standbySnapshotPendingXmin,


Re: POC, WIP: OR-clause support for indexes

2025-02-03 Thread Pavel Borisov
On Mon, 3 Feb 2025 at 15:54, Alena Rybakina  wrote:
>
>
> On 03.02.2025 14:32, Alexander Korotkov wrote:
>
> On Mon, Feb 3, 2025 at 12:22 PM Alena Rybakina
>  wrote:
>
> Thank you for updated version! I agree for your version of the code.
>
> On 02.02.2025 21:00, Alexander Korotkov wrote:
>
> On Fri, Jan 31, 2025 at 4:31 PM Alena Rybakina
>  wrote:
>
> I started reviewing the patch and saw an "ERROR" in the output of 
> the test; is it okay here?
>
> SELECT * FROM tenk1 t1
> WHERE t1.thousand = 42 OR t1.thousand = (SELECT t2.tenthous FROM tenk1 t2 
> WHERE t2.thousand = t1.tenthous);
> ERROR: more than one row returned by a subquery used as an expression
>
> The output is correct for this query.  But the query is very
> unfortunate for the regression test.  I've revised query in the v47
> revision [1].
>
> Links.
> 1. 
> https://www.postgresql.org/message-id/CAPpHfdsBZmNt9qUoJBqsQFiVDX1%3DyCKpuVAt1YnR7JCpP%3Dk8%2BA%40mail.gmail.com
>
> While analyzing the modified query plan from the regression test, I noticed 
> that despite using a full seqscan for table t2 in the original plan,
> its results are cached by a Materialize node, and this can significantly speed 
> up the execution of the NestedLoop algorithm.
>
> For example, after running the query several times, I got results that show 
> that the query execution time was twice as bad.
>
> Original plan:
>
> EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.a=t2.b OR t1.a=1;
>                                                       QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..70067.00 rows=2502499 width=24) (actual time=0.015..1123.247 rows=2003000 loops=1)
>    Join Filter: ((t1.a = t2.b) OR (t1.a = 1))
>    Rows Removed by Join Filter: 1997000
>    Buffers: shared hit=22
>    ->  Seq Scan on bitmap_split_or t1  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.006..0.372 rows=2000 loops=1)
>          Buffers: shared hit=11
>    ->  Materialize  (cost=0.00..41.00 rows=2000 width=12) (actual time=0.000..0.111 rows=2000 loops=2000)
>          Storage: Memory  Maximum Storage: 110kB
>          Buffers: shared hit=11
>          ->  Seq Scan on bitmap_split_or t2  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.003..0.188 rows=2000 loops=1)
>                Buffers: shared hit=11
>  Planning Time: 0.118 ms
>  Execution Time: 1204.874 ms
> (13 rows)
>
> Query plan after the patch:
>
> EXPLAIN ANALYZE SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.a=t2.b OR t1.a=1;
>                                                       QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.28..56369.00 rows=2502499 width=24) (actual time=0.121..2126.606 rows=2003000 loops=1)
>    Buffers: shared hit=16009 read=2
>    ->  Seq Scan on bitmap_split_or t2  (cost=0.00..31.00 rows=2000 width=12) (actual time=0.017..0.652 rows=2000 loops=1)
>          Buffers: shared hit=11
>    ->  Index Scan using t_a_b_idx on bitmap_split_or t1  (cost=0.28..18.15 rows=1002 width=12) (actual time=0.044..0.627 rows=1002 loops=2000)
>          Index Cond: (a = ANY (ARRAY[t2.b, 1]))
>          Buffers: shared hit=15998 read=2
>  Planning Time: 0.282 ms
>  Execution Time: 2344.367 ms
> (9 rows)
>
> I'm afraid that we may lose this with this optimization. Maybe this can be 
> taken into account somehow, what do you think?
>
> The important aspect is that the second plan has a lower cost than the
> first one.  So, that's a question for the cost model.  The patch just
> lets the optimizer consider a more comprehensive set of paths.  You
> can let the optimizer select the first plan by tuning *_cost params.  For
> example, setting cpu_index_tuple_cost = 0.02 makes the first plan win for
> me.
>
> Other than that the test query is quite unfortunate as t1.a=1 is very
> frequent. I've adjusted the query so that nested loop with index scan
> wins both in cost and execution time.
>
> I've also adjusted another test query as proposed by Andrei.
>
> I'm going to push this patch if there are no more notes.
>
> Links.
> 1. 
> https://www.postgresql.org/message-id/fc1017ca-877b-4f86-b491-154cf123eedd%40gmail.com
>
>
> Okay. I agree with your code and have no more notes.

Hi, Alexander!
I've looked at patchset v48 and it looks good to me.

Regards,
Pavel Borisov
Supabase



