Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-11-13 Thread vignesh C
On Thu, 7 Nov 2024 at 15:33, Nisha Moond  wrote:
>
> On Mon, Sep 16, 2024 at 3:31 PM Bharath Rupireddy
>  wrote:
> >
> > Please find the attached v46 patch having changes for the above review
> > comments and your test review comments and Shveta's review comments.
> >
> Hi,
>
> I’ve reviewed this thread and am interested in working on the
> remaining tasks and comments, as well as the future review comments.
> However, Bharath, please let me know if you'd prefer to continue with
> it.
>
> Attached the rebased v47 patch, which also addresses Peter’s comments
> #2, #3, and #4 at [1]. I will try addressing other comments as well in
> next versions.

The following crash occurs while upgrading:
2024-11-13 14:19:45.955 IST [44539] LOG:  checkpoint starting: time
TRAP: failed Assert("!(*invalidated && SlotIsLogical(s) &&
IsBinaryUpgrade)"), File: "slot.c", Line: 1793, PID: 44539
postgres: checkpointer (ExceptionalCondition+0xbb)[0x55e305bd]
postgres: checkpointer (+0x63ab04)[0x55b8eb04]
postgres: checkpointer
(InvalidateObsoleteReplicationSlots+0x149)[0x55b8ee5f]
postgres: checkpointer (CheckPointReplicationSlots+0x267)[0x55b8f125]
postgres: checkpointer (+0x1f3ee8)[0x55747ee8]
postgres: checkpointer (CreateCheckPoint+0x78f)[0x557475ee]
postgres: checkpointer (CheckpointerMain+0x632)[0x55b2f1e7]
postgres: checkpointer (postmaster_child_launch+0x119)[0x55b30892]
postgres: checkpointer (+0x5e2dc8)[0x55b36dc8]
postgres: checkpointer (PostmasterMain+0x14bd)[0x55b33647]
postgres: checkpointer (+0x487f2e)[0x559dbf2e]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x76c29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x76c29e40]
postgres: checkpointer (_start+0x25)[0x55634c25]
2024-11-13 14:19:45.967 IST [44538] LOG:  checkpointer process (PID
44539) was terminated by signal 6: Aborted

This can happen in the following case:
1) Set up a logical replication cluster with enough data so that it
will take at least a few minutes to upgrade
2) Stop the publisher node
3) Configure replication_slot_inactive_timeout and checkpoint_timeout
to 30 seconds
4) Upgrade the publisher node.

This is happening because logical replication slots are getting
invalidated during the upgrade, and there is an assertion which checks
that the slots have not been invalidated.
I feel this can be fixed by adding a check function similar to
check_max_slot_wal_keep_size which makes sure that
replication_slot_inactive_timeout is 0 during upgrade.
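
For reference, a minimal sketch of such a check hook (the hook name is mine
and the actual patch may handle this differently), following the usual GUC
check-hook pattern used by check_max_slot_wal_keep_size:

/*
 * Hypothetical GUC check hook: refuse a non-zero
 * replication_slot_inactive_timeout while in binary upgrade mode, so
 * logical slots cannot be invalidated mid-upgrade.
 */
bool
check_replication_slot_inactive_timeout(int *newval, void **extra,
                                        GucSource source)
{
    if (IsBinaryUpgrade && *newval != 0)
    {
        GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
        GUC_check_errdetail("\"%s\" must be set to 0 during binary upgrade mode.",
                            "replication_slot_inactive_timeout");
        return false;
    }
    return true;
}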

Regards,
Vignesh




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-11-13 Thread Nisha Moond
On Wed, Sep 18, 2024 at 12:22 PM shveta malik  wrote:
>
> On Mon, Sep 16, 2024 at 3:31 PM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> >
> > Please find the attached v46 patch having changes for the above review
> > comments and your test review comments and Shveta's review comments.
> >
>
> Thanks for addressing comments.
>
> Is there a reason that we don't support this invalidation on hot
> standby for non-synced slots? Shouldn't we support this time-based
> invalidation there too just like other invalidations?
>

I don’t see any reason to *not* support this invalidation on hot
standby for non-synced slots. Therefore, I’ve added the same in v48.

--
Thanks,
Nisha




Some dead code in get_param_path_clause_serials()

2024-11-13 Thread Richard Guo
The function get_param_path_clause_serials() is used to get the set of
pushed-down clauses enforced within a parameterized Path.  Since we
don't currently support parameterized MergeAppend paths, and it
doesn't look like that is going to change anytime soon (as explained
in the comments for generate_orderedappend_paths), we don't need to
consider MergeAppendPath in this function.  Is it worth removing the
related code, as attached?

This change won't make any measurable difference in performance; it's
just for clarity's sake.

Thanks
Richard


v1-0001-Remove-dead-code-in-get_param_path_clause_serials.patch
Description: Binary data


Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Tomas Vondra
On 11/13/24 05:38, Ashutosh Bapat wrote:
> ...
> 
> Here's way we can fix SnapBuildProcessRunningXacts() similar to
> DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide
> whether a given transaction should be decoded or not.
> /*
>  * Should the contents of transaction ending at 'ptr' be decoded?
>  */
> bool
> SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
> {
> return ptr < builder->start_decoding_at;
> }
> 
> Similar to SnapBuild::start_decoding_at we could maintain a field
> SnapBuild::start_reading_at to the LSN from which the WAL sender would
> start reading WAL. If candidate_restart_lsn produced by a running
> transactions WAL record is less than SnapBuild::start_reading_at,
> SnapBuildProcessRunningXacts() won't call
> LogicalIncreaseRestartDecodingForSlot() with that candiate LSN. We
> won't access the slot here and the solution will be inline with
> DecodeCommit() which skips the transactions.
> 

Could you maybe write a patch doing this? That would allow proper
testing etc.
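
For reference, a minimal sketch of the check you describe (the
start_reading_at field and the helper name are hypothetical, mirroring
SnapBuildXactNeedsSkip() quoted above) could look like:

/*
 * Hypothetical helper: should a candidate restart LSN produced by a
 * running_xacts record be ignored because it predates the point the
 * walsender started reading WAL from?
 */
static bool
SnapBuildRestartCandidateNeedsSkip(SnapBuild *builder, XLogRecPtr candidate_restart_lsn)
{
    return candidate_restart_lsn < builder->start_reading_at;
}

SnapBuildProcessRunningXacts() would call this before
LogicalIncreaseRestartDecodingForSlot() and simply return when it is true,
keeping it in line with how DecodeCommit() skips transactions.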


regards

-- 
Tomas Vondra





BitmapOr node not used in plan for ANY/IN but is for sequence of ORs ...

2024-11-13 Thread Jim Vanns
(sent to general users mailing list yesterday - but perhaps this is a more
suitable audience?)

In PG16.4, we have a table of key/pair data (around 30M rows) where there
are about 7 distinct keys and each has a conditional or partial index on
them (the distribution is different for each key/value pair combination).
I've found that when we have a query that uses an OR then those partial
indexes are used but not if the query is written to use ANY/IN, which is
more convenient from a programmer POV (especially any with 3rd party query
generators etc.). Naturally, the result sets returned by the queries are
identical due to the filter semantics of any of the 3 solution variants.

Here's a shareable MRP:

https://dbfiddle.uk/OKs_7HWv

Is there any trick I can do to get the planner to make use of the
conditional/partial index? Or is this simply an unoptimised code path yet
to be exploited!?

Cheers,

Jim

-- 
Jim Vanns
Principal Production Engineer
Industrial Light & Magic, London


Re: doc fail about ALTER TABLE ATTACH re. NO INHERIT

2024-11-13 Thread Amit Langote
On Tue, Nov 12, 2024 at 5:46 PM Alvaro Herrera  wrote:
> On 2024-Nov-11, Amit Langote wrote:
>
> > The restriction also exists in the CREATE TABLE PARTITION OF path:
> >
> > [...]
> >
> > I think we can remove the restriction at least for the leaf partition
> > case just like in the ATTACH PARTITION path.
>
> Sure, WFM.

I rebased my patch over 14e87ffa5c54 and noticed that NOT NULL
constraints can also (or not) be marked NO INHERIT.  I think we should
allow NO INHERIT NOT NULL constraints on leaf partitions just like
CHECK constraints, so changed AddRelationNotNullConstraints() that way
and added some tests.

However, I noticed that there is no clean way to do that for the following case:

ALTER TABLE leaf_partition ADD CONSTRAINT col_nn_ni NOT NULL col NO INHERIT;

If I change the new function AdjustNotNullInheritance() added by your
commit to not throw an error for leaf partitions, the above constraint
(col_nn_ni) is not stored at all, because the function returns true in
that case, which means the caller does nothing with the constraint.  I
am not sure what the right thing to do would be.  If we return false,
the caller will store the constraint the first time around, but
repeating the command will again do nothing, not even prevent the
addition of a duplicate constraint.

-- 
Thanks, Amit Langote


v2-0001-Allow-inherited-constraints-to-be-marked-NO-INHER.patch
Description: Binary data


Re: altering a column's collation leaves an invalid foreign key

2024-11-13 Thread jian he
On Thu, Nov 7, 2024 at 8:15 PM Peter Eisentraut  wrote:
>
> Apparently this is intentional.  It's the difference between RESTRICT
> and NO ACTION.  In ri_restrict(), there is a comment:
>
>  /*
>   * If another PK row now exists providing the old key values, we should
>   * not do anything.  However, this check should only be made in the NO
>   * ACTION case; in RESTRICT cases we don't wish to allow another
> row to be
>   * substituted.
>   */
>
> In any case, this patch does not change this behavior.  It exists in old
> versions as well.
>

https://stackoverflow.com/questions/14921668/difference-between-restrict-and-no-action
mentioned about the difference between "no action" and "restrict".

RI_FKey_restrict_upd comments also says:

 * The SQL standard intends that this referential action occur exactly when
 * the update is performed, rather than after.  This appears to be
 * the only difference between "NO ACTION" and "RESTRICT".  In Postgres
 * we still implement this as an AFTER trigger, but it's non-deferrable.


DROP TABLE IF EXISTS fktable, pktable;
CREATE TABLE pktable (x text COLLATE case_insensitive PRIMARY KEY);
CREATE TABLE fktable (x text collate case_insensitive REFERENCES
pktable on update restrict on delete restrict);
INSERT INTO pktable VALUES ('A'), ('Å');
INSERT INTO fktable VALUES ('a');
update pktable set x = 'a' where x = 'A';
--
DROP TABLE IF EXISTS fktable, pktable;
CREATE TABLE pktable (x text COLLATE case_insensitive PRIMARY KEY);
CREATE TABLE fktable (x text collate case_insensitive REFERENCES
pktable on update no action on delete no action);
INSERT INTO pktable VALUES ('A'), ('Å');
INSERT INTO fktable VALUES ('a');
update pktable set x = 'a' where x = 'A';

In the above two cases, the last queries behavior difference does not
look like "no action" vs "restrict" mentioned in the doc (create_table.sgml),
or the above stackoverflow link.
so this part, i am still confused.

-<<>>>-
CREATE TABLE pktable (x text COLLATE case_insensitive PRIMARY KEY);
INSERT INTO pktable VALUES ('A'), ('Å'), ('H');
INSERT INTO fktable VALUES ('a');

fktable foreign key variants:
collate case_insensitive REFERENCES pktable on update set default on
delete set default
collate case_insensitive REFERENCES pktable on update set null on
delete set null
collate case_insensitive REFERENCES pktable on update cascade on delete cascade

`update pktable set x = 'a' where x = 'A';`
will act as if the column "x" value has changed.
so it will do the action to fktable.

following the same logic,
maybe we should let "on update no action on delete no action"
fail for the following case:

DROP TABLE IF EXISTS fktable, pktable;
CREATE TABLE pktable (x text COLLATE case_insensitive PRIMARY KEY);
CREATE TABLE fktable (x text collate case_insensitive REFERENCES
pktable on update no action on delete no action);
INSERT INTO pktable VALUES ('A'), ('Å');
INSERT INTO fktable VALUES ('a');

---expect it to fail. since column "x " value changed and is still
being referenced.
update pktable set x = 'a' where x = 'A';
-<<>>>-

I added a "on update cascade, on delete cascade" tests on collate.icu.utf8.sql
for both foreign key and primary key are nondeterministic collation.


v8-0001-add-more-tests-for-pk-fk-tie-with-nondetermini.no-cfbot
Description: Binary data


Re: Some dead code in get_param_path_clause_serials()

2024-11-13 Thread Andrei Lepikhov

On 11/13/24 16:34, Richard Guo wrote:

The function get_param_path_clause_serials() is used to get the set of
pushed-down clauses enforced within a parameterized Path.  Since we
don't currently support parameterized MergeAppend paths, and it
doesn't look like that is going to change anytime soon (as explained
in the comments for generate_orderedappend_paths), we don't need to
consider MergeAppendPath in this function.  Is it worth removing the
related code, as attached?

This change won't make any measurable difference in performance; it's
just for clarity's sake.
I've gone through the logic of 
get_param_path_clause_serials/reparameterize_path_by_child/reparameterize_path.
I agree that parameterising ordered appends doesn't look useful in the near 
future. Even so, it would be better to introduce such changes, induced by a 
new feature, in a single commit.
By the way, do you know if anyone has given a talk on the current state of 
internal plan parameterisation and its possible development directions? 
It would be helpful to have such an explanation.


--
regards, Andrei Lepikhov




Re: Converting contrib SQL functions to new style

2024-11-13 Thread Ronan Dunklau
On Tuesday, 12 November 2024 at 09:30:30 CET, Michael Paquier wrote:
> On Thu, Nov 07, 2024 at 10:06:37AM +0900, Michael Paquier wrote:
> > Good point.  Checking all these contrib updates one-by-one is an ant's
> > work, but I'll see if I can get at least some of them done on HEAD.
> 
> I've begun looking at that a bit, and there are a couple of things
> that we could do better with xml2 in 0005 at least in the context of
> this patch: xpath_nodeset() and xpath_list() don't have any test
> coverage.  That's not an issue directly related to this patch, but
> perhaps we should add something for the functions that we are
> manipulating after this upgrade path at least?  That's one way to
> automatically make sure that these changes work the same way as the
> original.
> 
> The same argument comes up with lo_oid() in 0006.

Ok, please find attached a new complete patch series including tests for the 
uncovered functions. Tests pass both before and after the move to SQL-body 
functions.


> 
> 0004 for pg_freespace is fine regarding that for example as we have
> calls of pg_freespace(regclass) in its sql/.  I've applied it to begin
> with something.

Thank you for this one, removed from the new series.

> 
> Tomas Vondra has posted a patch for a bug fix with pageinspect, so
> this would create some conflicts noise for him if 0003 was applied
> today, so let's wait a bit:
> https://www.postgresql.org/message-id/3385a58f-5484-49d0-b790-9a198a0bf236@v
> ondra.me

Agreed, let's wait on this one. I did not include it in the series. 

> 
> 0001 and 0002 are much larger than the 4 others, and I'm lacking the
> > steam to check them in more details today.

Thanks !

--
Ronan Dunklau
>From 48f04a556536426fc7022ac33477f1629a31ca51 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Wed, 13 Nov 2024 08:28:37 +0100
Subject: [PATCH v4 1/6] Add tests for xpath_nodeset and xpath_list functions.

---
 contrib/xml2/expected/xml2.out | 30 ++
 contrib/xml2/sql/xml2.sql  |  8 
 2 files changed, 38 insertions(+)

diff --git a/contrib/xml2/expected/xml2.out b/contrib/xml2/expected/xml2.out
index eba6ae60364..5b292090982 100644
--- a/contrib/xml2/expected/xml2.out
+++ b/contrib/xml2/expected/xml2.out
@@ -222,3 +222,33 @@ $$
 $$);
 ERROR:  failed to apply stylesheet
+SELECT xpath_nodeset(article_xml::text, '/article/author|/article/pages', 'result', 'item') from articles;
+   xpath_nodeset   
+---
+ test37
+(1 row)
+
+SELECT xpath_nodeset(article_xml::text, '/article/author|/article/pages', 'item_without_toptag') from articles;
+xpath_nodeset 
+--
+ test37
+(1 row)
+
+SELECT xpath_nodeset(article_xml::text, '/article/author|/article/pages') from articles;
+ xpath_nodeset  
+
+ test37
+(1 row)
+
+SELECT xpath_list(article_xml::text, '/article/author|/article/pages') from articles;
+ xpath_list 
+
+ test,37
+(1 row)
+
+SELECT xpath_list(article_xml::text, '/article/author|/article/pages', '|') from articles;
+ xpath_list 
+
+ test|37
+(1 row)
+
diff --git a/contrib/xml2/sql/xml2.sql b/contrib/xml2/sql/xml2.sql
index ac49cfa7c52..5eae992b55b 100644
--- a/contrib/xml2/sql/xml2.sql
+++ b/contrib/xml2/sql/xml2.sql
@@ -137,3 +137,11 @@ $$
   
 $$);
+
+SELECT xpath_nodeset(article_xml::text, '/article/author|/article/pages', 'result', 'item') from articles;
+SELECT xpath_nodeset(article_xml::text, '/article/author|/article/pages', 'item_without_toptag') from articles;
+SELECT xpath_nodeset(article_xml::text, '/article/author|/article/pages') from articles;
+
+SELECT xpath_list(article_xml::text, '/article/author|/article/pages') from articles;
+SELECT xpath_list(article_xml::text, '/article/author|/article/pages', '|') from articles;
+
-- 
2.47.0

>From 5f027b8c4021c13c8e97d1dc56fa6548824c90c9 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Wed, 13 Nov 2024 08:45:13 +0100
Subject: [PATCH v4 2/6] Add test for lo_oid function

---
 contrib/lo/expected/lo.out | 6 ++
 contrib/lo/sql/lo.sql  | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/contrib/lo/expected/lo.out b/contrib/lo/expected/lo.out
index c63e4b1c704..65798205a5a 100644
--- a/contrib/lo/expected/lo.out
+++ b/contrib/lo/expected/lo.out
@@ -47,4 +47,10 @@ SELECT lo_get(43214);
 DELETE FROM image;
 SELECT lo_get(43214);
 ERROR:  large object 43214 does not exist
+SELECT lo_oid(1::lo);
+ lo_oid 
+
+  1
+(1 row)
+
 DROP TABLE image;
diff --git a/contrib/lo/sql/lo.sql b/contrib/lo/sql/lo.sql
index 77039509245..ca36cdb3098 100644
--- a/con

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-11-13 Thread Yugo NAGATA
On Tue, 12 Nov 2024 17:38:25 +0900
torikoshia  wrote:

> On 2024-11-12 14:17, Yugo Nagata wrote:
> > On Tue, 12 Nov 2024 14:03:50 +0900
> > Yugo Nagata  wrote:
> > 
> >> On Tue, 12 Nov 2024 01:27:53 +0500
> >> Kirill Reshke  wrote:
> >> 
> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia  
> >> > wrote:
> >> > >
> >> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> >> > >
> >> > > Thanks for working on this!
> >> >
> >> > Thanks for reviewing the v7 patch series!
> >> >
> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao 
> >> > > > 
> >> > > > wrote:
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the 
> >> > > >> > number of
> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > > >> > there WAS errors, and we should not tolerate more than $limit 
> >> > > >> > errors .
> >> > > >> > I do find this behavior to be consistent.
> >> > > >>
> >> > > >> +1
> >> > > >>
> >> > > >>
> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > > >> > replacements, as if REJECT LIMIT is inf.
> >> > > >>
> >> > > >> +1
> >> > > >
> >> > > > After thinking for a while, I'm now more opposed to this approach. I
> >> > > > think we should count rows with erroneous data as errors only if
> >> > > > null substitution for these rows failed, not the total number of rows
> >> > > > which were modified.
> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> >> > > > the limit. This is actually simpler approach IMHO. What do You think?
> >> > >
> >> > > IMHO I prefer the previous interpretation.
> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> >> > > is used to specify how many malformed rows are acceptable in the
> >> > > "original" data source.
> >> 
> >> I also prefer the previous version.
> >> 
> >> > I do like the first version of interpretation, but I have a struggle
> >> > with it. According to this interpretation, we will fail COPY command
> >> > if the number
> >> > of malformed data rows exceeds the limit, not the number of rejected
> >> > rows (some percentage of malformed rows are accepted with null
> >> > substitution)
> 
> I feel your concern is valid.
> Currently a 'reject' can occur only when converting a column's input value 
> to its data type, but if we introduce the set_to_null option a 'reject' can 
> also occur when inserting the substituted null violates a NOT NULL constraint.

I suppose "reject" here means failure of the COPY command, that is, a reject
of executing the command, not an error on data input. If so, we can interpret
REJECT_LIMIT as "the number of malformed rows allowed before the COPY command
is REJECTed" (not the number of rejected rows). In that case, I think we don't
have to rename the option.
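
Under that reading, and using the set_to_null behaviour proposed upthread
(the syntax is illustrative only, nothing here is committed), a command like

COPY t FROM '/tmp/data.csv' WITH (FORMAT csv, ON_ERROR set_to_null, REJECT_LIMIT 5);

would fail once more than five input rows contain a malformed value,
regardless of whether those rows were dropped or salvaged by null
substitution.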

Of course, if there is a more suitable name that makes it easier for users to
understand the behaviour of the option, renaming would be fine.

> >> The documentation says that REJECT_LIMIT "Specifies the maximum number 
> >> of errors",
> >> and there are no wording "reject" in the description, so I wonder it 
> >> is unclear
> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use 
> >> ERROR_LIMIT
> >> since it is supposed to be used with ON_ERROR.
> >> 
> >> Alternatively, if we emphasize that errors are handled other than 
> >> terminating
> >> the command,perhaps MALFORMED_LIMIT as proposed above or 
> >> TOLERANCE_LIMIT may be
> >> good, for example.
> > 
> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a 
> > limit on
> > the number of rows with any malformed value allowed before the COPY 
> > command is
> > rejected", we would not have to rename it.
> 
> The meaning of REJECT_LIMIT is what you described, and I think Kirill 
> worries about cases when malformed rows are accepted(=not REJECTed) with 
> null substitution. REJECT_LIMIT counts this case as REJECTed.

I am a bit confused. You mean "REJECT" is raising a soft error of data
input here instead of terminating COPY?

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-11-13 Thread Nisha Moond
Please find the v48 patch attached.

On Thu, Sep 19, 2024 at 9:40 AM shveta malik  wrote:
>
> When we promote hot standby with synced logical slots to become new
> primary, the logical slots are never invalidated with
> 'inactive_timeout' on new primary.  It seems the check in
> SlotInactiveTimeoutCheckAllowed() is wrong. We should allow
> invalidation of slots on primary even if they are marked as 'synced'.

fixed.

> I have raised 4 issues so far on v46, the first 3 are in [1],[2],[3].
> Once all these are addressed, I can continue reviewing further.
>

Fixed issues reported in [1], [2].

[1]: 
https://www.postgresql.org/message-id/CAJpy0uAwxc49Dz6t%3D-y_-z-MU%2BA4RWX4BR3Zri_jj2qgGMq_8g%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAJpy0uC6nN3SLbEuCvz7-CpaPdNdXxH%3DfeW5MhYQch-JWV0tLg%40mail.gmail.com


v48-0001-Introduce-inactive_timeout-based-replication-slo.patch
Description: Binary data


Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Amit Kapila
On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra  wrote:
>
> Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
> fix this particular case. But I'd be happier if we could also add
> asserts checking the LSN advances, to detect similar issues that we may
> be unaware of yet.
>

As most of us lean towards fixing
LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
and back branches. Separately we can consider other asserts just for
HEAD that you think will make the code robust and help avoid such bugs
in the future.

-- 
With Regards,
Amit Kapila.




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Tomas Vondra



On 11/13/24 10:38, Amit Kapila wrote:
> On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra  wrote:
>>
>> Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
>> fix this particular case. But I'd be happier if we could also add
>> asserts checking the LSN advances, to detect similar issues that we may
>> be unaware of yet.
>>
> 
> As most of us lean towards fixing
> LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
> and back branches. Separately we can consider other asserts just for
> HEAD that you think will make the code robust and help avoid such bugs
> in the future.
> 

+1 to that


regards

-- 
Tomas Vondra





Re: Add reject_limit option to file_fdw

2024-11-13 Thread torikoshia

On 2024-11-12 15:23, Kirill Reshke wrote:

Thanks for your review!

On Tue, 12 Nov 2024 at 06:17, torikoshia  
wrote:


On 2024-11-12 01:49, Fujii Masao wrote:
> On 2024/11/11 21:45, torikoshia wrote:
>>> Thanks for adding the comment. It clearly states that REJECT_LIMIT
>>> can be
>>> a single-quoted string. However, it might also be helpful to mention
>>> that
>>> it can be provided as an int64 in the COPY command option. How about
>>> updating it like this?
>>>
>>> 
>>> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY
>>> command
>>> option or as a single-quoted string for the foreign table option
>>> using
>>> file_fdw. Therefore this function needs to handle both formats.
>>> 
>>
>> Thanks! it seems better.
>>
>>
>> Attached v3 patch.
>
> Thanks for updating the patch! It looks like you forgot to attach it,
> though.

Oops, thanks for pointing it out.
Here it is.


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.


Hi!

A little question from me.

This is your doc for reject_limit:

+  reject_limit
+
+   Specifies the maximum number of errors tolerated while converting a
+   column's input value to its data type, the same as COPY's
+   REJECT_LIMIT option.
+

This is how it looks on the current HEAD for copy.


REJECT_LIMIT

  Specifies the maximum number of errors tolerated while converting a
  column's input value to its data type, when ON_ERROR is set to ignore.
  If the input causes more errors than the specified value, the COPY
  command fails, even with ON_ERROR set to ignore.
  This clause must be used with ON_ERROR=ignore
  and maxerror must be positive bigint.
  If not specified, ON_ERROR=ignore
  allows an unlimited number of errors, meaning COPY will
  skip all erroneous data.

There is a difference. Should we add the REJECT_LIMIT vs. ON_ERROR
clarification for file_fdw too? Or maybe we could put a reference to the
COPY docs.


As you may know, some options for file_fdw are the same as those for 
COPY. While the manual provides detailed descriptions of these options 
in the COPY section, the explanations in file_fdw are shorter and less 
detailed.


I intended to follow this approach, but do you think it should be 
changed?


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.




Re: Add reject_limit option to file_fdw

2024-11-13 Thread torikoshia

On 2024-11-12 14:51, Yugo Nagata wrote:

Thanks for your review!


On Tue, 12 Nov 2024 10:16:50 +0900
torikoshia  wrote:


On 2024-11-12 01:49, Fujii Masao wrote:
> On 2024/11/11 21:45, torikoshia wrote:
>>> Thanks for adding the comment. It clearly states that REJECT_LIMIT
>>> can be
>>> a single-quoted string. However, it might also be helpful to mention
>>> that
>>> it can be provided as an int64 in the COPY command option. How about
>>> updating it like this?
>>>
>>> 
>>> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY
>>> command
>>> option or as a single-quoted string for the foreign table option
>>> using
>>> file_fdw. Therefore this function needs to handle both formats.
>>> 
>>
>> Thanks! it seems better.
>>
>>
>> Attached v3 patch.
>
> Thanks for updating the patch! It looks like you forgot to attach it,
> though.

Oops, thanks for pointing it out.
Here it is.


I have one minor comment.

+   ereport(ERROR,
+   
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+errmsg("skipped more than 
REJECT_LIMIT (%lld) rows due to data
type incompatibility",
+   (long long) 
cstate->opts.reject_limit)));

Do we have to keep this message consistent with ones of COPY command?
With foreign data wrappers, it seems common that the option name is 
passed in
lowercase, so is it better to use "skipped more than reject_limit ..." 
rather

than using uppercase?


Agreed.
Attached v4 patch.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.From 114d8dd2c57bf8b9cf712b8fa401c7b8eaa8a958 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Wed, 13 Nov 2024 21:32:25 +0900
Subject: [PATCH v4] file_fdw: Add reject_limit option to file_fdw

Commit 4ac2a9bece introduced REJECT_LIMIT option for COPY. This patch
extends support for this option to file_fdw.

As well as REJECT_LIMIT option for COPY, this option limits the
maximum number of erroneous rows that can be skipped.
If more rows encounter data type conversion errors than allowed by
reject_limit, the access to the file_fdw foreign table will fail
with an error, even when on_error = 'ignore'.

Based on the synopsis of the CREATE/ALTER FOREIGN TABLE commands, the
value for the foreign table's option must be single-quoted.
To accommodate this, the patch modifies defGetCopyRejectLimitOption
to accept not only int64 but also strings.
---
 contrib/file_fdw/data/agg.bad  |  1 +
 contrib/file_fdw/expected/file_fdw.out | 16 ++--
 contrib/file_fdw/file_fdw.c|  8 
 contrib/file_fdw/sql/file_fdw.sql  |  6 +-
 doc/src/sgml/file-fdw.sgml | 12 
 src/backend/commands/copy.c| 16 +++-
 6 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/contrib/file_fdw/data/agg.bad b/contrib/file_fdw/data/agg.bad
index 3415b15007..04279ce55b 100644
--- a/contrib/file_fdw/data/agg.bad
+++ b/contrib/file_fdw/data/agg.bad
@@ -2,3 +2,4 @@
 100;@99.097@
 0;@aaa@
 42;@324.78@
+1;@bbb@
diff --git a/contrib/file_fdw/expected/file_fdw.out b/contrib/file_fdw/expected/file_fdw.out
index 593fdc782e..81efe63eef 100644
--- a/contrib/file_fdw/expected/file_fdw.out
+++ b/contrib/file_fdw/expected/file_fdw.out
@@ -206,10 +206,10 @@ SELECT * FROM agg_csv c JOIN agg_text t ON (t.a = c.a) ORDER BY c.a;
 SELECT * FROM agg_bad;   -- ERROR
 ERROR:  invalid input syntax for type real: "aaa"
 CONTEXT:  COPY agg_bad, line 3, column b: "aaa"
--- on_error and log_verbosity tests
+-- on_error, log_verbosity and reject_limit tests
 ALTER FOREIGN TABLE agg_bad OPTIONS (ADD on_error 'ignore');
 SELECT * FROM agg_bad;
-NOTICE:  1 row was skipped due to data type incompatibility
+NOTICE:  2 rows were skipped due to data type incompatibility
   a  |   b
 -+
  100 | 99.097
@@ -224,6 +224,18 @@ SELECT * FROM agg_bad;
   42 | 324.78
 (2 rows)
 
+ALTER FOREIGN TABLE agg_bad OPTIONS (ADD reject_limit '1'); -- ERROR
+SELECT * FROM agg_bad;
+ERROR:  skipped more than reject_limit (1) rows due to data type incompatibility
+CONTEXT:  COPY agg_bad, line 5, column b: "bbb"
+ALTER FOREIGN TABLE agg_bad OPTIONS (SET reject_limit '2');
+SELECT * FROM agg_bad;
+  a  |   b
+-+
+ 100 | 99.097
+  42 | 324.78
+(2 rows)
+
 ANALYZE agg_bad;
 -- misc query tests
 \t on
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 043204c3e7..6c64f81098 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -77,6 +77,7 @@ static const struct FileFdwOption valid_options[] = {
 	{"encoding", ForeignTableRelationId},
 	{"on_error", ForeignTableRelationId},
 	{"log_verbosity", ForeignTableRelationId},
+	{"reject_limit", ForeignTableRelationId},
 	{"force_not_null", Attrib

Re: BitmapOr node not used in plan for ANY/IN but is for sequence of ORs ...

2024-11-13 Thread Tomas Vondra
On 11/13/24 13:08, Jim Vanns wrote:
> (sent to general users mailing list yesterday - but perhaps this is a
> more suitable audience?)
> 
> In PG16.4, we have a table of key/pair data (around 30M rows) where
> there are about 7 distinct keys and each has a conditional or partial
> index on them (the distribution is different for each key/value pair
> combination).  I've found that when we have a query that uses an OR then
> those partial indexes are used but not if the query is written to use
> ANY/IN, which is more convenient from a programmer POV (especially any
> with 3rd party query generators etc.). Naturally, the result sets
> returned by the queries are identical due to the filter semantics of any
> of the 3 solution variants.
> 
> Here's a shareable, MRP;
> 
> https://dbfiddle.uk/OKs_7HWv 
> 
> Is there any trick I can do to get the planner to make use of the
> conditional/partial index? Or is this simply an unoptimised code path
> yet to be exploited!?
> 

I believe this is "simply" not implemented, so there's no way to
convince the planner to use these partial indexes.

The proximate cause is that the planner does not treat ANY()/IN() as
equivalent to an OR clause, and does not even consider building the
"bitmap OR" path for those queries. That's what happens at the very
beginning of generate_bitmap_or_paths().

Perhaps we could "expand" the ANY/IN clauses into an OR clause, so that
restriction_is_or_clause() returns "true". But I haven't tried and I'm
sure there'd be more stuff to fix to make this work.
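
To illustrate the workaround that exists today (table and index names are
made up, loosely following the setup described in the question):

-- hypothetical key/value table with one partial index per key
CREATE TABLE kv (id bigint, key text, value text);
CREATE INDEX kv_colour_idx ON kv (value) WHERE key = 'colour';
CREATE INDEX kv_size_idx   ON kv (value) WHERE key = 'size';

-- not matched against the partial indexes: a single ScalarArrayOpExpr
SELECT * FROM kv WHERE key IN ('colour', 'size') AND value = 'x';

-- matched: each OR arm can use its partial index via a BitmapOr
SELECT * FROM kv
 WHERE (key = 'colour' AND value = 'x')
    OR (key = 'size'   AND value = 'x');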


regards

-- 
Tomas Vondra





Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Tomas Vondra



On 11/13/24 11:59, Amit Kapila wrote:
> On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
>  wrote:
>>
>> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada  
>> wrote:
>>>
>>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra  wrote:


 But neither of those fixes prevents backwards move for confirmed_flush
 LSN, as enforced by asserts in the 0005 patch. I don't know if this
 assert is incorrect or not. It seems natural that once we get a
 confirmation for some LSN, we can't move before that position, but I'm
 not sure about that. Maybe it's too strict.
>>>
>>> Hmm, I'm concerned that it might be another problem. I think there are
>>> some cases where a subscriber sends a flush position older than slot's
>>> confirmed_flush as a feedback message.
>>>
> 
> Right, it can happen for cases where subscribers doesn't have to do
> anything (for example DDLs) like I have explained in one of my emails
> [1]
> 

Thanks. I admit not being entirely familiar with all the details, but
doesn't that email explain more "Why it currently happens?" rather than
"Is this what should be happening?"

Sure, the subscriber needs to confirm changes for which nothing needs to
be done, like DDL. But isn't there a better way to do that, rather than
allowing confirmed_lsn to go backwards?

>>> But it seems to be dangerous if
>>> we always accept it as a new confirmed_flush value. It could happen
>>> that confirm_flush could be set to a LSN older than restart_lsn.
>>>
> 
> Possible, though I haven't tried to reproduce such a case. But, will
> it create any issues? I don't know if there is any benefit in allowing
> to move confirmed_flush LSN backward. AFAIR, we don't allow such
> backward values to persist. They will temporarily be in memory. I
> think as a separate patch we should prevent it from moving backward.
> 
>>
>> If confirmed_flush LSN moves backwards, it means the transactions
>> which were thought to be replicated earlier are no longer considered
>> to be replicated. This means that the restart_lsn of the slot needs to
>> be at least far back as the oldest of starting points of those
>> transactions. Thus restart_lsn of slot has to be pushed further back.
>>
> 
> I don't see a reason to move restart_lsn backward. Why do you think so?
> 

I think what Ashutosh is saying is that if confirmed_flush is allowed to
move backwards, that may result in start_lsn moving backwards too. And
we need to be able to decode all transactions committed since start_lsn,
so if start_lsn moves backwards, maybe restart_lsn needs to move
backwards too. I have no idea if confirmed_flush/start_lsn can move
backwards enough to require restart_lsn to move, though.

Anyway, these discussions are a good illustration of why I think allowing
these LSNs to move backwards is a problem. It either causes bugs (like
with breaking replication slots) and/or it makes the reasoning about
correct behavior much harder.

regards

-- 
Tomas Vondra





Re: define pg_structiszero(addr, s, r)

2024-11-13 Thread Ranier Vilela
Em qua., 13 de nov. de 2024 às 04:50, Bertrand Drouvot <
bertranddrouvot...@gmail.com> escreveu:

> Hi,
>
> On Wed, Nov 13, 2024 at 09:25:37AM +0900, Michael Paquier wrote:
> > So that seems worth the addition, especially for
> > smaller sizes where this is 6 times faster here.
>
> So, something like v12 in pg_memory_is_all_zeros_v12() in allzeros_small.c
> attached?
>
I ran the latest version (allzeros_small.c) with v12.


>
> If so, that gives us:
>
> == with BLCKSZ 32
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 22421 nanoseconds
> size_t: done in 7269 nanoseconds (3.08447 times faster than byte per byte)
> SIMD v10: done in 6349 nanoseconds (3.53142 times faster than byte per
> byte)
> SIMD v11: done in 22080 nanoseconds (1.01544 times faster than byte per
> byte)
> SIMD v12: done in 5595 nanoseconds (4.00733 times faster than byte per
> byte)
>
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ;
./allzeros_small
byte per byte: done in 43882 nanoseconds
size_t: done in 8845 nanoseconds (4.96122 times faster than byte per byte)
SIMD v10: done in 10673 nanoseconds (4.1115 times faster than byte per byte)
SIMD v11: done in 29177 nanoseconds (1.50399 times faster than byte per
byte)
SIMD v12: done in 9992 nanoseconds (4.39171 times faster than byte per
byte)


> == with BLCKSZ 63
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 29525 nanoseconds
> size_t: done in 11232 nanoseconds (2.62865 times faster than byte per byte)
> SIMD v10: done in 10828 nanoseconds (2.72673 times faster than byte per
> byte)
> SIMD v11: done in 42056 nanoseconds (0.70204 times faster than byte per
> byte)
> SIMD v12: done in 10468 nanoseconds (2.8205 times faster than byte per
> byte)
>
gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 68887 nanoseconds
size_t: done in 20147 nanoseconds (3.41922 times faster than byte per byte)
SIMD v10: done in 21410 nanoseconds (3.21752 times faster than byte per
byte)
SIMD v11: done in 56987 nanoseconds (1.20882 times faster than byte per
byte)
SIMD v12: done in 25102 nanoseconds (2.74428 times faster than byte per
byte)


>
> == with BLCKSZ 256
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 120483 nanoseconds
> size_t: done in 23098 nanoseconds (5.21617 times faster than byte per byte)
> SIMD v10: done in 6737 nanoseconds (17.8838 times faster than byte per
> byte)
> SIMD v11: done in 6621 nanoseconds (18.1971 times faster than byte per
> byte)
> SIMD v12: done in 6519 nanoseconds (18.4818 times faster than byte per
> byte)
>
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ;
./allzeros_small
byte per byte: done in 211759 nanoseconds
size_t: done in 45879 nanoseconds (4.6156 times faster than byte per byte)
SIMD v10: done in 12262 nanoseconds (17.2695 times faster than byte per
byte)
SIMD v11: done in 12018 nanoseconds (17.6202 times faster than byte per
byte)
SIMD v12: done in 11993 nanoseconds (17.6569 times faster than byte per
byte)


>
> == with BLCKSZ 8192
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 3393459 nanoseconds
> size_t: done in 707304 nanoseconds (4.79774 times faster than byte per
> byte)
> SIMD v10: done in 233559 nanoseconds (14.5293 times faster than byte per
> byte)
> SIMD v11: done in 225951 nanoseconds (15.0186 times faster than byte per
> byte)
> SIMD v12: done in 225766 nanoseconds (15.0309 times faster than byte per
> byte)
>
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ;
./allzeros_small
byte per byte: done in 12786295 nanoseconds
size_t: done in 1071590 nanoseconds (11.9321 times faster than byte per
byte)
SIMD v10: done in 413219 nanoseconds (30.9431 times faster than byte per
byte)
SIMD v11: done in 423469 nanoseconds (30.1942 times faster than byte per
byte)
SIMD v12: done in 414106 nanoseconds (30.8769 times faster than byte per
byte


>
> That's better for small size but given the extra len checks that
> has been added I think we're back to David's point in [1]: What if the
> function
> is not inlined for some reason?
>
> So, out of curiosity, let's see what happens if not inlined in [2] (see the
> -O2 -DNOT_INLINE compiler window):
>
> - if a[3]: it looks like gcc is smart enough to create an optimized version
> for that size using constant propagation
> - if a[63]: Same as above
> - if a[256]: Same as above
> - if a[8192]: Same as above
>
> I did a quick check with clang and it looks like it is not as smart as gcc
> for the non inline case.
>
> Anyway it's not like we have the choice: we need (at least) one len check
> for
> safety reason (to not crash or read invalid data).
>
> So, I'd vote for pg_mem

Re: Converting contrib SQL functions to new style

2024-11-13 Thread Peter Eisentraut

On 13.11.24 09:15, Ronan Dunklau wrote:

On Tuesday, 12 November 2024 at 09:30:30 CET, Michael Paquier wrote:

On Thu, Nov 07, 2024 at 10:06:37AM +0900, Michael Paquier wrote:

Good point.  Checking all these contrib updates one-by-one is an ant's
work, but I'll see if I can get at least some of them done on HEAD.


I've begun looking at that a bit, and there are a couple of things
that we could do better with xml2 in 0005 at least in the context of
this patch: xpath_nodeset() and xpath_list() don't have any test
coverage.  That's not an issue directly related to this patch, but
perhaps we should add something for the functions that we are
manipulating after this upgrade path at least?  That's one way to
automatically make sure that these changes work the same way as the
original.

The same argument comes up with lo_oid() in 0006.


Ok, please find attached a new complete patch series including tests for the
uncovered functions. Tests pass both before and after the move to SQL-body
functions.


By the way, if we're going to touch all these extension script files to 
make them more modern SQL-like, we could also use named parameters more. 
 For example,


+CREATE OR REPLACE FUNCTION regexp_match(string citext, pattern citext) RETURNS TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_match( $1::pg_catalog.text, $2::pg_catalog.text, 'i' );

could be

+CREATE OR REPLACE FUNCTION regexp_match(string citext, pattern citext) RETURNS TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_match( string::pg_catalog.text, pattern::pg_catalog.text, 'i' );


etc.





Re: Enable data checksums by default

2024-11-13 Thread Peter Eisentraut

On 07.11.24 19:38, Alvaro Herrera wrote:

I have just noticed that since this patch was committed as 04bec894a04c,
pg_upgrade's "make check" action is unusable when given the
"olddump/oldinstall" options.  We now need to inject '-k' to the initdb
line for old servers, and we don't, so all upgrade tests fail.  I think
this patch should be enough to fix it.


Yes, this fix looks correct.  (Or the other way around: Disable 
checksums on the new node.)


Is this not being covered by the build farm?  Are the upgrade tests 
there not using this?






Re: SQL:2011 application time

2024-11-13 Thread Peter Eisentraut

I have committed the documentation patches

v43-0001-Add-WITHOUT-OVERLAPS-and-PERIOD-to-ALTER-TABLE-r.patch
v43-0002-Update-conexclop-docs-for-WITHOUT-OVERLAPS.patch

For the logical replication fixes

v43-0003-Fix-logical-replication-for-temporal-tables.patch

can you summarize what the issues currently are?  Is it currently 
broken, or just not working as well as it could?


AFAICT, there might be two separate issues.  One is that you can't use a 
temporal index as replica identity, because ALTER TABLE rejects it.  The 
other is that a subscriber fails to make use of a replica identity 
index, because it uses the wrong strategy numbers.


This conditional is really hard to understand:

+   /*
+    * The AM must support uniqueness, and the index must in fact be unique.
+    * If we have a WITHOUT OVERLAPS constraint (identified by uniqueness +
+    * exclusion), we can use that too.
+    */
+   if ((!indexRel->rd_indam->amcanunique ||
+        !indexRel->rd_index->indisunique) &&
+       !(indexRel->rd_index->indisunique && indexRel->rd_index->indisexclusion))


Why can we have an indisunique index when the AM is not amcanunique?  Are 
we using the fields wrong?


-   return get_equal_strategy_number_for_am(am);
+   /* For GiST indexes we need to ask the opclass what strategy number to use. */
+   if (am == GIST_AM_OID)
+       return GistTranslateStratnum(opclass, RTEqualStrategyNumber);
+   else
+       return get_equal_strategy_number_for_am(am);

This code should probably be pushed into 
get_equal_strategy_number_for_am().  That function already has a switch 
based on index AM OIDs.  Also, there are other callers of 
get_equal_strategy_number_for_am(), which also might want to become 
aware of GiST support.
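
Something along these lines, perhaps (purely a sketch; passing the opclass
down is an assumed signature change):

static StrategyNumber
get_equal_strategy_number_for_am(Oid am, Oid opclass)
{
    switch (am)
    {
        case BTREE_AM_OID:
            return BTEqualStrategyNumber;
        case HASH_AM_OID:
            return HTEqualStrategyNumber;
        case GIST_AM_OID:
            /* GiST needs the opclass to translate the strategy number. */
            return GistTranslateStratnum(opclass, RTEqualStrategyNumber);
        default:
            /* other AMs are not supported here */
            return InvalidStrategy;
    }
}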


For the new test file, remember to add it to 
src/test/subscription/meson.build.


Also, maybe add a introductory comment in the test file to describe 
generally what it's trying to test.






Writing out WAL buffers that are still in flux

2024-11-13 Thread Thomas Munro
Hi,

As we learned in the build farm[1], at least one checksumming
filesystem doesn't like it if you write data from memory that is still
changing, when direct I/O is enabled.  That was BTRFS, but the next
version of OpenZFS seems to have a similar allergy.  In Andres's
thread about fixing that (among other problems) for hint bits, he
pointed out that WAL must have the same problem[2].  I am splitting
that case off to a new thread given that the problem and potential
solutions seem pretty independent.

Concretely, on BTRFS you could write WAL apparently successfully, but
later if you tried to read it you'd get EIO because the checksum was
computed badly over moving data.  On ZFS 2.3 (out soon) I think a
backend would crash/panic during the write, so different but also bad.
Some other obvious candidates in that general COW + integrity family
would be (1) APFS but no, it doesn't seem to have user data checksums,
(2) ReFS, which has user data checksums but doesn't enable them by
default and I have no idea what it would do in the same circumstances
(but Windows already explicitly tells us never to touch the buffer
during an I/O, so we're already breaking the rules regardless), (3)
bcachefs, which has user data checksums by default but it looks like
it has its own bounce buffer to defend against this type of
problem[3].  (Anyone got clues about other exotic systems?)

Here's an early experimental patch to try to do something about that,
by making a temporary copy of the moving trailing buffer.  It's not
free, but the behaviour is only activated when you turn on
debug_io_direct=wal.  No significant change to default behaviour
hopefully, but direct I/O WAL writes look like this:

postgres=# insert into t values (42);
pwrite(14,"\^X\M-Q\^E\0\^A\0\0\0\0\M-`\M^P"...,8192,0x90e000) = 8192

postgres=# insert into t select generate_series(1, 1000);
pwritev(14,[{"\^X\M-Q\^E\0\^A\0\0\0\0`\M^O\r\0"...,57344},
{"\^X\M-Q\^E\0\^A\0\0\0\0@\M^P\r\0"...,8192}],2,0x8f6000) = 65536

postgres=# begin;
postgres=*# insert into t select generate_series(1, 1000);
[walwriter]
pwrite(5,"\^X\M-Q\^E\0\^A\0\0\0\0@`\^Z\0\0"...,2080768,0x604000) = 2080768

In the first case it had to take a copy as it was only writing the
tail, but otherwise it looks like unpatched.  In the second case, it
wrote as much as it could safely directly from WAL buffers, but then
had to write the tail page out with a temporary copy.  In the third
case, bulk writing activity doesn't do partial pages so gets the
traditional behaviour.

It might also be an idea to copy only the part we really care
about and zero the rest, and in either case perhaps only out to the
nearest PG_IO_ALIGN_SIZE (step size at which it is OK to end a direct
I/O write), though that last bit might not be worth the hassle if we
plan to just use smaller blocks anyway.
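
To make the shape of the change concrete, here is a rough sketch of the idea
(names and signature are mine, not the attached patch, and the lengths are
assumed to be already suitably aligned for direct I/O):

/*
 * Write the stable, fully-filled WAL pages directly from the shared WAL
 * buffers, but copy the trailing partially-filled page into a local
 * buffer first, so concurrent backends appending to that page cannot
 * change the bytes while the direct I/O write is in flight.
 */
static ssize_t
write_wal_with_stable_tail(int fd, char *buffers, size_t stable_len,
                           const char *tail_page, size_t tail_len,
                           off_t offset)
{
    PGAlignedXLogBlock tail_copy;   /* I/O-aligned scratch page */
    struct iovec iov[2];
    int         iovcnt = 0;

    if (stable_len > 0)
    {
        iov[iovcnt].iov_base = buffers;
        iov[iovcnt].iov_len = stable_len;
        iovcnt++;
    }
    if (tail_len > 0)
    {
        memcpy(tail_copy.data, tail_page, tail_len);
        iov[iovcnt].iov_base = tail_copy.data;
        iov[iovcnt].iov_len = tail_len;
        iovcnt++;
    }

    return pg_pwritev(fd, iov, iovcnt, offset);
}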

Interaction with other potential future changes:  Obviously AIO might
require something more... bouncy than a stack buffer.  4K WAL buffers
would reduce the size of the copied region (along with other
benefits[4]).  A no-overwrite WAL buffer mode (to match the behaviour
of other databases) would avoid the cost completely (along with other
benefits).

What else could we do, if not something like this?

[1] 
https://www.postgresql.org/message-id/ca+hukgksbaz78fw3wtf3q8arqkcz1ggstfrfidpbu-j9ofz...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/jo5p5nthb3hxwfj7qifiu2rxk5y3hexbwxs5d6x2jotsvj3bq5%40jhtrriubncjb
[3] https://bcachefs-docs.readthedocs.io/en/latest/feat-checksumming.html
[4] 
https://www.postgresql.org/message-id/flat/20231009230805.funj5ipoggjyzjz6%40awork3.anarazel.de
From bdbcaf7f8ba970e947590d86e6a19d4cb0087d70 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 1 Oct 2024 06:06:57 +1300
Subject: [PATCH] Don't write changing WAL buffers with direct I/O.

Filesystems that have checksums for user data sometimes don't like it if
you modify the data in memory while it is being written to disk.
XLogWrite() allowed that with trailing partially filled WAL pages,
because it always writes the whole WAL buffer, and yet other backends
are allowed to fill in the rest of the buffer at the same time.

Defend against corruption by taking a temporary copy of any trailing
partial page.  Use a vectored write if necessary to avoid an extra
system call.

Unfortunately on Windows pg_pwritev() is implemented with a loop and the
default wal_sync_method will therefore sometimes wait for synchronous
I/O twice, but that is mitigated by the fact that we don't activate the
new behavior for buffered WAL writes, still the default.  (It'd be
possible to WriteFileGather() but that'd take more infrastructural
changes to deal with overlapped I/O.)

XXX Zeroes in unstable region instead?

Per observation from Andres Freund that the WAL must have the same
problem that we'd already diagnosed for relation data with concurrently
modified hint bits.

Discussion: https://postgr.es/m/CA%2BhUKGL-sZrfwcdme8jERPbn%2BsGbY13sRCvq8b

Re: Add reject_limit option to file_fdw

2024-11-13 Thread Yugo NAGATA
On Wed, 13 Nov 2024 21:48:10 +0900
torikoshia  wrote:

> On 2024-11-12 14:51, Yugo Nagata wrote:
> 
> Thanks for your review!
> 
> > On Tue, 12 Nov 2024 10:16:50 +0900
> > torikoshia  wrote:
> > 
> >> On 2024-11-12 01:49, Fujii Masao wrote:
> >> > On 2024/11/11 21:45, torikoshia wrote:
> >> >>> Thanks for adding the comment. It clearly states that REJECT_LIMIT
> >> >>> can be
> >> >>> a single-quoted string. However, it might also be helpful to mention
> >> >>> that
> >> >>> it can be provided as an int64 in the COPY command option. How about
> >> >>> updating it like this?
> >> >>>
> >> >>> 
> >> >>> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY
> >> >>> command
> >> >>> option or as a single-quoted string for the foreign table option
> >> >>> using
> >> >>> file_fdw. Therefore this function needs to handle both formats.
> >> >>> 
> >> >>
> >> >> Thanks! it seems better.
> >> >>
> >> >>
> >> >> Attached v3 patch.
> >> >
> >> > Thanks for updating the patch! It looks like you forgot to attach it,
> >> > though.
> >> 
> >> Oops, thanks for pointing it out.
> >> Here it is.
> > 
> > I have one minor comment.
> > 
> > +   ereport(ERROR,
> > +   
> > (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> > +errmsg("skipped more than 
> > REJECT_LIMIT (%lld) rows due to data
> > type incompatibility",
> > +   (long long) 
> > cstate->opts.reject_limit)));
> > 
> > Do we have to keep this message consistent with ones of COPY command?
> > With foreign data wrappers, it seems common that the option name is 
> > passed in
> > lowercase, so is it better to use "skipped more than reject_limit ..." 
> > rather
> > than using uppercase?
> 
> Agreed.
> Attached v4 patch.


Thank you for updating the patch.

However, I noticed that file_fdw could output the following error using 
uppercase
when an invalid value is set to the reject_limit option,

 ALTER FOREIGN TABLE agg_bad OPTIONS (ADD reject_limit '-1');
 ERROR:  REJECT_LIMIT (-1) must be greater than zero

as well as if reject_limit is set when on_error != 'ignore'.

 ALTER FOREIGN TABLE agg_bad OPTIONS (ADD reject_limit '1'); -- ERROR
 ERROR:  COPY REJECT_LIMIT requires ON_ERROR to be set to IGNORE

Also, in the regression tests, I found similar behaviours on existing options,
for example;

 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'xml');  -- 
ERROR
 ERROR:  COPY format "xml" not recognized
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', quote 
':');  -- ERROR
 ERROR:  COPY QUOTE requires CSV mode
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', escape 
':'); -- ERROR
 ...

These messages may be unexpected for some users because the documentation of
file_fdw does not explicitly describe that file_fdw uses COPY internally. 
(I can find several wordings like "as COPY", though.) 
However, since the current file_fdw already has such messages, I came to think
making the messages on "reject_limit" consistent with COPY is reasonable.
I mean, the previous version of the message seems fine.


As another comment, do we need to add validator tests for on_error and
reject_limit, similar to the other options?
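
For example, something along these lines (just a sketch; the expected errors
would follow the COPY wording discussed above):

 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'csv', on_error 'unsupported');  -- ERROR
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'csv', reject_limit '1');  -- ERROR, requires on_error 'ignore'
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'csv', on_error 'ignore', reject_limit '-1');  -- ERROR, must be positive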


Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: POC: make mxidoff 64 bits

2024-11-13 Thread Maxim Orlov
On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas  wrote:

> The wraparound logic is still not correct.

Yep, my fault. I forgot to reset the segment counter when a wraparound happens.
Fixed.

When I try to select from a table after upgrade that contains
> post-wraparound multixids:
>
> TRAP: failed Assert("offset != 0"), File:
> "../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386
>
The problem was in converting offset segments. The new_entry index should
also bypass the invalid offset (0) value. Fixed.


>
> On a different note, I'm surprised you're rewriting member segments from
> scratch, parsing all the individual member groups and writing them out
> again. There's no change to the members file format, except for the
> numbering of the files, so you could just copy the files under the new
> names without paying attention to the contents. It's not wrong to parse
> them in detail, but I'd assume that it would be simpler not to.
>
Yes, at first I also thought it would be possible to get by with simple
copying. But in the case of wraparound, we must "bypass" the invalid zero
offset value. The old 32-bit offsets wrapped at 2^32, so 0 values appear in
multixact.c and must be handled -- bypassed, in fact. When we switch to
64-bit offsets, we have two options:
1) Bypass every ((uint32) offset == 0) value in multixact.c;
2) Convert the members and bypass the invalid value once.

The first option seems too weird to me. So, we have to repack the members and
bypass the invalid value.

All patches are for master@38c18710b37a2d

-- 
Best regards,
Maxim Orlov.


v7-0001-Use-64-bit-format-output-for-multixact-offsets.patch
Description: Binary data


v7-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patch
Description: Binary data


v7-0005-TEST-bump-catver.patch
Description: Binary data


v7-0003-Make-pg_upgrade-convert-multixact-offsets.patch
Description: Binary data


v7-0002-Use-64-bit-multixact-offsets.patch
Description: Binary data


Re: FW: Building Postgres 17.0 with meson

2024-11-13 Thread Nazir Bilal Yavuz
Hi,

On Wed, 13 Nov 2024 at 18:17, Mark Hill  wrote:
>
> The error above says meson can't find openssl/ssl.h but it's there:
> D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Source\buildPostgreSQL_wx6_17.0\postgresql-17.0>dir
>  
> D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6\include\openssl\ssl.h
>  Volume in drive D is Data
>  Volume Serial Number is 58B5-7193
>
>  Directory of 
> D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6\include\openssl
>
> 08/12/2024  04:22 PM   127,595 ssl.h
>1 File(s)127,595 bytes
>0 Dir(s)  210,974,670,848 bytes free
>
> I don't know why meson cannot find the OpenSSL installation I've specified 
> via the options:
> extra_lib_dirs
> extra_include_dirs

I think that the problem is that you are setting '.../include/openssl'
as an include_dir not '.../include'. Could you please try:

set 
openssl_include_dir=D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6\include

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Fwd: pgsql-hack...@postgresql.org VS pgsql-hackers@lists.postgresql.org

2024-11-13 Thread Tony Wayne
-- Forwarded message -
From: Tony Wayne 
Date: Wed, Nov 13, 2024 at 9:07 PM
Subject: Re: pgsql-hack...@postgresql.org VS
pgsql-hackers@lists.postgresql.org
To: Kai Wagner 




On Wed, Nov 13, 2024 at 8:55 PM Kai Wagner  wrote:

> Hi Tony,
>
> pgsql-hackers@lists.postgresql.org is the correct mailing list to send
> patches.
>
> Kai
>
> On 13/11/2024 16:18, Tony Wayne wrote:
> > Hi Hackers,
> >
> > Can you please tell which mail should we use to send Patches or POCs
> > or start a discussion?
> >
> > Regards,
> > Tony Wayne.
>
> --
> Kai Wagner
> Sr. Engineering Manager, Percona
>
> e: kai.wag...@percona.com
> w: www.percona.com
> Databases Run Better with Percona.
>
>
Thanks for the quick reply, Kai.

Actually, I sent a patch to pgsql-hackers@lists.postgresql.org and didn't get
any response/feedback for almost a week, so I was wondering whether
pgsql-hack...@postgresql.org is the active one.


Re: proposal: schema variables

2024-11-13 Thread Dmitry Dolgov
> On Sun, Nov 10, 2024 at 06:51:40PM GMT, Pavel Stehule wrote:
> ne 10. 11. 2024 v 17:19 odesílatel Pavel Stehule 
> napsal:
> I thought a lot of time about better solutions for identifier collisions
> and I really don't think so there is some consistent user friendly syntax.
> Personally I think there is an easy already implemented solution -
> convention - just use a dedicated schema for variables and this schema
> should not be in the search path. Or use secondary convention - like using
> prefix "__" for session variables. Common convention is using "_" for
> PLpgSQL variables. I searched how this issue is solved in other databases,
> or in standard, and I found nothing special. The Oracle and SQL/PSM has a
> concept of visibility - the variables are not visible outside packages or
> modules, but Postgres has nothing similar. It can be emulated by a
> dedicated schema without inserting a search path, but it is less strong.
>
> I think we can introduce an alternative syntax, that will not be user
> friendly or readable friendly, but it can be without collisions - or can
> decrease possible risks.
>
> It is nothing new - SQL does it with old, "new" syntax of inner joins, or
> in Postgres we can
>
> where salary < 4
>
> or
>
> where pg_catalog.int4lt(salary, 4);
>
>
> or some like we use for operators OPERATOR(*schema*.*operatorname*)
>
> So introducing VARIABLE(schema.variablename) syntax as an alternative
> syntax for accessing variables I really like. I strongly prefer to use this
> as only alternative (secondary) syntax, because I don't think it is
> friendly syntax or writing friendly, but it is safe, and I can imagine
> tools that can replace generic syntax to this special, or that detects
> generic syntax and shows some warning. Then users can choose what they
> prefer. Two syntaxes - generic and special can be good enough for all - and
> this can be perfectly consistent with current Postgres.

As far as I recall, last time this topic was discussed in hackers, two
options were proposed: the one with VARIABLE(name), which you mention
here, and another one with adding variables to the FROM clause. The
VARIABLE(...) syntax didn't get much negative feedback, so I guess why
not -- if you find it fitting, it would be interesting to see the
implementation.

I'm afraid it should not be just an alternative syntax, but the only one
allowed, because otherwise I don't see how scenarios like "drop a
column with the same name" could be avoided. As in the previous thread:

-- we've got a variable b at the same time
SELECT a, b FROM table1;

Then the column b is dropped, but everything still works because the
variable b gets silently picked up. But if it were required to say
VARIABLE(b), then all is fine.

And to make sure we're on the same page, could you post a couple of
examples from the currently existing tests in the patch, showing how they
would look with this proposal?

About adding variables to the FROM clause. Looks like this option was
quite popular, and you've mentioned some technical challenges
implementing that. If you'd like to go with another approach, it would
be great to elaborate on that -- maybe even with a PoC, to make a
convincing point here.




Re: pgsql-hack...@postgresql.org VS pgsql-hackers@lists.postgresql.org

2024-11-13 Thread Kai Wagner

Hi Tony,

pgsql-hackers@lists.postgresql.org is the correct mailing list to send 
patches.


Kai

On 13/11/2024 16:18, Tony Wayne wrote:

Hi Hackers,

Can you please tell which mail should we use to send Patches or POCs 
or start a discussion?


Regards,
Tony Wayne.


--
Kai Wagner
Sr. Engineering Manager, Percona

e: kai.wag...@percona.com
w: www.percona.com
Databases Run Better with Percona.





Re: Allow default \watch interval in psql to be configured

2024-11-13 Thread Daniel Gustafsson
> On 4 Nov 2024, at 14:55, Kirill Reshke  wrote:

> I'm mostly fine with this patch,

Thanks for review!

> but maybe we need to handle `errno ==
> ERANGE` inside ParseVariableDouble and provide a better error msg in
> this case?
> Something like:
> ```
> reshke=# \set WATCH_INTERVAL -1e-309
> underflow while parsing parameter
> ```

Fair point, I've added ERANGE handling in the attached v3.
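
(For context, the kind of check involved is roughly the sketch below: a
standalone toy showing strtod()/ERANGE handling, not the code from the
attached patch; the function name and error messages are invented for
illustration.)

#include <errno.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a psql-style double parser with explicit ERANGE handling. */
static bool
parse_double(const char *value, double *result)
{
	char	   *end;

	errno = 0;
	*result = strtod(value, &end);

	if (end == value || *end != '\0')
	{
		fprintf(stderr, "invalid value \"%s\"\n", value);
		return false;
	}
	if (errno == ERANGE)
	{
		/* HUGE_VAL means overflow; anything else here is underflow. */
		if (fabs(*result) == HUGE_VAL)
			fprintf(stderr, "\"%s\" is out of range\n", value);
		else
			fprintf(stderr, "underflow while parsing \"%s\"\n", value);
		return false;
	}
	return true;
}

int
main(void)
{
	double		d;

	parse_double("-1e-309", &d);	/* underflow, as in the example above */
	parse_double("1e400", &d);		/* overflow */
	return parse_double("2.5", &d) ? 0 : 1;
}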

> Also, maybe we should provide `double max` arg to the
> ParseVariableDouble function, because this is a general-use function?

It's general use, but not generally used.  Since it's not an exposed API, it
seems premature to add handling of a max value when there is no need; once
there is a caller (if one ever comes) that needs it, we can deal with it at
that point.

--
Daniel Gustafsson



v3-0001-Make-default-watch-interval-configurable.patch
Description: Binary data


Re: New GUC autovacuum_max_threshold ?

2024-11-13 Thread wenhui qiu
Hi,
> In any case, we should do the tests that Robert suggested and/or come up
> with a good mathematical model, because we are in the dark at the moment.
I think SQL Server has given us great inspiration.
> I think we should indeed provide a retro-compatible behaviour (so maybe
> another GUC after all).
I am ready to implement a new GUC parameter to let database administrators
configure an appropriate calculation method (the default value being the
original calculation formula).


Frédéric Yhuel  wrote on Wed, 13 Nov 2024 at 18:03:

>
>
> On 11/9/24 16:59, Nathan Bossart wrote:
> > AFAICT the main advantage of these formulas is that you don't need
> another
> > GUC, but they also makes the existing ones more difficult to configure.
>
> I wouldn't say that's the main advantage. It doesn't seem very clean to
> me to cap to a fixed value. Because you could take Robert's
> demonstration with a bigger table, and come to the same conclusion:
>
> Let's compare the current situation to the situation post-Nathan's-patch
> with a cap of 100M. Consider a table 100 times larger than the one of
> Robert's previous example, so pgbench scale factor 2_560_000, size on
> disk 32TB.
> Currently, that table will be vacuumed for bloat when the number of
> dead tuples exceeds 20% of the table size, because that's the default
> value of autovacuum_vacuum_scale_factor. The table has 256 billion
> tuples, so that means that we're going to vacuum it when there are
> more than 51 billion dead tuples. Post-patch, we will vacuum when we
> have 100 million dead tuples. Suppose a uniform workload that slowly
> updates rows in the table. If we were previously autovacuuming the
> table once per day (1440 minutes) we're now going to try to vacuum it
> almost every minute (1440 minutes / 512 = 168 seconds).
>
> (compare with every 55 min with my formula)
>
> Of course, this is a theoretical example that is probably unrealistic. I
> don't know, really. I don't know if Robert's example was realistic in
> the first place.
>
> In any case, we should do the tests that Robert suggested and/or come up
> with a good mathematical model, because we are in the dark at the moment.
>
> > Plus, there's no way to go back to the existing behavior.
>
> I think we should indeed provide a retro-compatible behaviour (so maybe
> another GUC after all).
>
>


Re: Parametrization minimum password lenght

2024-11-13 Thread Emanuele Musella
Hi Tomas,

we have made the fixes that you suggested in your last mail.

1) Modified passwordcheck.sql with a_nice_long_password and also
passwordcheck.out

2) We added the documentation in doc/src/sgml/passwordcheck.sgml.

3) We ran make check successfully.

If you have other suggestions to share with us, they are welcome.

Thank you

Best regards



On Tue, 12 Nov 2024 at 16:49, Tomas Vondra  wrote:

> On 11/12/24 16:36, Emanuele Musella wrote:
> > Hi Tomas,
> >
> > thank you for you feedback.
> >
> > We have implemented you suggestion in the patch. I attached you the
> > latest version.
> >
>
> Please always reply to the mailing list, not just directly to individual
> people. Otherwise others won't see your new messages, it won't get into
> mailing list archives, etc.
>
> > We are waiting for your last feedback to upload the patch on Commitfest.
> >
>
> Not sure I follow. You can add the patch to the commitfest on your own,
> and there's no need to wait for my feedback before doing so.
>
> A couple quick comments for the patch, though:
>
> 1) You modified just the SQL script for the test, not the expected
> output in expected/passwordcheck.out. This means running "make check"
> for this extension will fail.
>
> 2) It's not clear to me why you removed  the first ALTER TABLE with
> a_nice_long_password. Seems wrong.
>
> 3) The test doesn't change the new GUC at all, so how would this show
> the setting actually affects anything? And you only test the "positive"
> case, maybe it'd be good to change the GUC and then test that it rejects
> a password.
>
> 4) The new GUC needs to be added to doc/src/sgml/passwordcheck.sgml.
>
>
> regards
>
> --
> Tomas Vondra
>
>


min_password_length_v4.patch
Description: Binary data


Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Peter Geoghegan
On Wed, Nov 13, 2024 at 11:07 AM Tomas Vondra  wrote:
> My plan was to apply the patch to both 17 and HEAD, and then maybe do
> something smarter in HEAD in a separate commit. But then Michael pointed
> out other pageinspect functions just error out in this version-mismatch
> cases, so I think it's better to just do it the same way.

FWIW I didn't actually backpatch commit 691e8b2e18. I decided that it
was better to just paper-over the issue on backbranches instead -- see
commit c788115b.

The problem that I fixed back in 2020 was a problem with the data
types used -- not a failure to consider older versions of the
extension at all. It was just convenient to use the number of columns
to detect the version of the extension, and thereby a problematic
(incorrectly typed) function.

-- 
Peter Geoghegan




Re: Reordering DISTINCT keys to match input path's pathkeys

2024-11-13 Thread Andrei Lepikhov

On 11/13/24 13:49, Richard Guo wrote:

On Mon, Oct 28, 2024 at 6:15 PM Andrei Lepikhov  wrote:

On 6/7/24 16:46, Richard Guo wrote:

This patch does not apply any more, so here is a new rebase, with some
tweaks to the comments.



This patch needs a minor rebase again.
After skimming the code, I want to say that it looks good. But maybe, to
avoid one more *_reordering GUC, it would be better to cover all path-key
reorderings under a single GUC.


Thanks for reviewing this patch.  After some consideration, I think
it's not too complex to also apply this optimization to DISTINCT ON.
The parser already ensures that the DISTINCT ON expressions match the
initial ORDER BY expressions; we just need to ensure that the
resulting pathkey list from the reordering matches the original
distinctClause pathkeys, while leaving the remaining pathkeys
unchanged in order.  Please see attached.

Thanks, I'll look into it later.
BTW, have you ever thought about one more, cost-based reordering strategy?
For now, we can reorder GROUP-BY and DISTINCT clauses according to two
orderings: 1) the ORDER-BY order and 2) the order derived from the underlying
query tree.
In thread [1], I try to add one more strategy that minimises the number
of comparison operator calls. It seems that it would work the same way
with the DISTINCT statement. Do you think it makes sense in general and
could be a possible direction of improvement for the current patch?




I'm not sure about merging these two 'reordering' GUCs into one.
While they may look similar, they apply to very different scenarios.
However, I'm open to other suggestions.
Sure, they enable different optimisations. But, they enable highly 
specialised optimisations. Having two GUCs looks too expensive.
Moreover, this stuff is cost-based and should work automatically. So, I 
treat these GUCs as mostly debugging or last-chance stuff used to 
disable features during severe slowdowns or bugs. It might make sense to 
group them into a single 'Clause Reordering' parameter.


[1] 
https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com


--
regards, Andrei Lepikhov




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Amit Kapila
On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
 wrote:
>
> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada  
> wrote:
> >
> > On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra  wrote:
> > >
> > >
> > > But neither of those fixes prevents backwards move for confirmed_flush
> > > LSN, as enforced by asserts in the 0005 patch. I don't know if this
> > > assert is incorrect or now. It seems natural that once we get a
> > > confirmation for some LSN, we can't move before that position, but I'm
> > > not sure about that. Maybe it's too strict.
> >
> > Hmm, I'm concerned that it might be another problem. I think there are
> > some cases where a subscriber sends a flush position older than slot's
> > confirmed_flush as a feedback message.
> >

Right, it can happen in cases where the subscriber doesn't have to do
anything (for example DDLs), as I have explained in one of my emails
[1].

> > But it seems to be dangerous if
> > we always accept it as a new confirmed_flush value. It could happen
> > that confirm_flush could be set to a LSN older than restart_lsn.
> >

Possible, though I haven't tried to reproduce such a case. But will
it create any issues? I don't know if there is any benefit in allowing
confirmed_flush LSN to move backward. AFAIR, we don't allow such
backward values to persist; they will only temporarily be in memory. I
think as a separate patch we should prevent it from moving backward.

>
> If confirmed_flush LSN moves backwards, it means the transactions
> which were thought to be replicated earlier are no longer considered
> to be replicated. This means that the restart_lsn of the slot needs to
> be at least far back as the oldest of starting points of those
> transactions. Thus restart_lsn of slot has to be pushed further back.
>

I don't see a reason to move restart_lsn backward. Why do you think so?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BzWQwOe5G8zCYGvErnaXh5%2BDbyg_A1Z3uywSf_4%3DT0UA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: SQL:2011 application time

2024-11-13 Thread Peter Eisentraut
I committed a few fixes in this area today.  Has everything here been 
addressed?



On 16.08.24 04:12, jian he wrote:

On Thu, Aug 8, 2024 at 4:54 AM Paul Jungwirth
 wrote:


Rebased to e56ccc8e42.


I only applied patches 0001-0003.
in create_table.sgml, I saw the WITHOUT OVERLAPS change is mainly in
table_constraint.
but we didn't touch alter_table.sgml.
Do we also need to change alter_table.sgml correspondingly?


+ if (constraint->without_overlaps)
+ {
+ /*
+ * This enforces that there is at least one equality column
+ * besides the WITHOUT OVERLAPS columns.  This is per SQL
+ * standard.  XXX Do we need this?
+ */
+ if (list_length(constraint->keys) < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("constraint using WITHOUT OVERLAPS needs at least two columns"));
+
+ /* WITHOUT OVERLAPS requires a GiST index */
+ index->accessMethod = "gist";
+ }
if Constraint->conname is not NULL, we can
+ errmsg("constraint \"%s\" using WITHOUT OVERLAPS needs at least two
columns"));

"XXX Do we need this?"
I think currently we need this, otherwise the following create_table
synopsis will not be correct.
UNIQUE [ NULLS [ NOT ] DISTINCT ] ( column_name [, ... ] [,
column_name WITHOUT OVERLAPS ] )
PRIMARY KEY ( column_name [, ... ] [, column_name WITHOUT OVERLAPS ] )


we add a column in catalog-pg-constraint.
do we need to change the column conexclop
("If an exclusion constraint, list of the per-column exclusion operators")?
but currently, primary key and unique constraints both have a valid conexclop.


+static void
+ExecWithoutOverlapsNotEmpty(Relation rel, NameData attname, Datum
attval, char typtype, Oid atttypid)
+{
+ bool isempty;
+ RangeType *r;
+ MultirangeType *mr;
+
+ switch (typtype)
+ {
+ case TYPTYPE_RANGE:
+ r = DatumGetRangeTypeP(attval);
+ isempty = RangeIsEmpty(r);
+ break;
+ case TYPTYPE_MULTIRANGE:
+ mr = DatumGetMultirangeTypeP(attval);
+ isempty = MultirangeIsEmpty(mr);
+ break;
+ default:
+ elog(ERROR, "WITHOUT OVERLAPS column \"%s\" is not a range or multirange",
+ NameStr(attname));
+ }
+
+ /* Report a CHECK_VIOLATION */
+ if (isempty)
+ ereport(ERROR,
+ (errcode(ERRCODE_CHECK_VIOLATION),
+ errmsg("empty WITHOUT OVERLAPS value found in column \"%s\" in
relation \"%s\"",
+ NameStr(attname), RelationGetRelationName(rel;
+}
I think in the default branch you need to at least set the isempty
value; otherwise there may be a compiler warning,
because isempty is used later, but via the default branch its value is undefined.


+ /*
+ * If this is a WITHOUT OVERLAPS constraint,
+ * we must also forbid empty ranges/multiranges.
+ * This must happen before we look for NULLs below,
+ * or a UNIQUE constraint could insert an empty
+ * range along with a NULL scalar part.
+ */
+ if (indexInfo->ii_WithoutOverlaps)
+ {
+ ExecWithoutOverlapsNotEmpty(heap, att->attname,
+ }
Previously we found out that if this happens later, then it won't work,
but this comment doesn't explain why it must happen earlier.
I didn't dig deep enough to find out why,
but explaining it would be very helpful.


I think some tests are duplicated, so I did the refactoring.






Re: SQL:2011 application time

2024-11-13 Thread Peter Eisentraut

A quick comment on the patch

v43-0005-Add-UPDATE-DELETE-FOR-PORTION-OF.patch

regarding the code in transformForPortionOfClause() and the additions 
you made to lsyscache.c:


What you are doing is taking a type OID and a function OID and then 
converting them back to name and namespace and then building a node and 
then feeding that node back through the parse analysis transformation. 
This all seems quite fishy and cumbersome.  I think instead of building 
a FuncCall and transforming it, try to build a FuncExpr directly.  Then 
you wouldn't need these new helper functions, which would also reduce 
the surface area of your patch.


Additional mini-comment:

#include "utils/rangetypes.h"

in src/include/nodes/execnodes.h appears to be unnecessary (but it is 
then required in src/backend/commands/trigger.c).






Re: Enable data checksums by default

2024-11-13 Thread Alvaro Herrera
On 2024-Nov-13, Peter Eisentraut wrote:

> On 07.11.24 19:38, Alvaro Herrera wrote:
> > I have just noticed that since this patch was committed as 04bec894a04c,
> > pg_upgrade's "make check" action is unusable when given the
> > "olddump/oldinstall" options.  We now need to inject '-k' to the initdb
> > line for old servers, and we don't, so all upgrade tests fail.  I think
> > this patch should be enough to fix it.
> 
> Yes, this fix looks correct.

Thanks, pushed.

> (Or the other way around: Disable checksums on the new node.)

Yeah, I thought about that too, but I think it'd be less realistic,
because the world is becoming one where checksums are enabled, not the
other way around.

> Is this not being covered by the build farm?  Are the upgrade tests there
> not using this?

Nope, the buildfarm has separate code to test cross-version upgrades,
https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/TestUpgradeXversion.pm
This predates our in-core test support.  Maybe buildfarm's could be
simplified, but I'm not sure if it's worth it.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hernández-Novich)




RE: Parallel heap vacuum

2024-11-13 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> TidStoreBeginIterateShared() is designed for multiple parallel workers
> to iterate a shared TidStore. During an iteration, parallel workers
> share the iteration state and iterate the underlying radix tree while
> taking appropriate locks. Therefore, it's available only for a shared
> TidStore. This is required to implement the parallel heap vacuum,
> where multiple parallel workers do the iteration on the shared
> TidStore.
> 
> On the other hand, TidStoreBeginIterate() is designed for a single
> process to iterate a TidStore. It accepts even a shared TidStore as
> you mentioned, but during an iteration there is no inter-process
> coordination such as locking. When it comes to parallel vacuum,
> supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> cover the case where we use only parallel index vacuum but not
> parallel heap scan/vacuum. In this case, we need to store dead tuple
> TIDs on the shared TidStore during heap scan so parallel workers can
> use it during index vacuum. But it's not necessary to use
> TidStoreBeginIterateShared() because only one (leader) process does
> heap vacuum.

Okay, thanks for the description. I felt it is OK to keep.

I read 0001 again and here are comments.

01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED 0x0001
+#define LV_PARALLEL_SCAN_DESC   0x0002
+#define LV_PARALLEL_SCAN_DESC_WORKER0x0003
```

I checked other DSM keys used for parallel work, and they seem to have names
like PARALLEL_KEY_XXX. Can we follow that convention?

02. LVRelState
```
+BlockNumber next_fsm_block_to_vacuum;
```

Only this attribute does not have a comment. Can we add something like:
"Next freespace map page to be checked"?

03. parallel_heap_vacuum_gather_scan_stats
```
+vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```

Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined
in 0004. Can you move it?

04. heap_parallel_vacuum_estimate
```
+
+heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+  &shared_len, &pscanwork_len);
+
+/* space for PHVShared */
+shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+/* space for ParallelBlockTableScanDesc */
+pscan_len = table_block_parallelscan_estimate(rel);
+shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+/* space for per-worker scan state, PHVScanWorkerState */
+pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+shm_toc_estimate_keys(&pcxt->estimator, 1);
```

I feel pscan_len and pscanwork_len are already calculated in
heap_parallel_estimate_shared_memory_size().
Can we remove table_block_parallelscan_estimate() and mul_size() from here?

05. Idea

Can you update the documentation?

06. Idea

AFAICS pg_stat_progress_vacuum does not contain information related to
parallel execution. What do you think about adding an attribute which shows a
list of PIDs?
I'm not sure it is helpful for users, but it can show the parallelism.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: New GUC autovacuum_max_threshold ?

2024-11-13 Thread Frédéric Yhuel




On 11/9/24 16:59, Nathan Bossart wrote:

AFAICT the main advantage of these formulas is that you don't need another
GUC, but they also makes the existing ones more difficult to configure.


I wouldn't say that's the main advantage. It doesn't seem very clean to 
me to cap to a fixed value. Because you could take Robert's 
demonstration with a bigger table, and come to the same conclusion:


Let's compare the current situation to the situation post-Nathan's-patch 
with a cap of 100M. Consider a table 100 times larger than the one of 
Robert's previous example, so pgbench scale factor 2_560_000, size on 
disk 32TB.

Currently, that table will be vacuumed for bloat when the number of
dead tuples exceeds 20% of the table size, because that's the default
value of autovacuum_vacuum_scale_factor. The table has 256 billion
tuples, so that means that we're going to vacuum it when there are
more than 51 billion dead tuples. Post-patch, we will vacuum when we
have 100 million dead tuples. Suppose a uniform workload that slowly
updates rows in the table. If we were previously autovacuuming the
table once per day (1440 minutes) we're now going to try to vacuum it
almost every minute (1440 minutes / 512 = 168 seconds).

(compare with every 55 min with my formula)
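
(To spell out where the factor of 512 comes from, this is just my arithmetic
from the numbers already given: 0.2 * 256 billion = 51.2 billion dead tuples
under the current default, versus the 100 million cap, i.e. 51.2e9 / 1e8 = 512
times more frequent vacuuming, hence 1440 min / 512 ~ 168 s.)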

Of course, this is a theoretical example that is probably unrealistic. I
don't know, really. I don't know if Robert's example was realistic in 
the first place.


In any case, we should do the tests that Robert suggested and/or come up 
with a good mathematical model, because we are in the dark at the moment.



Plus, there's no way to go back to the existing behavior.


I think we should indeed provide a retro-compatible behaviour (so maybe 
another GUC after all).






Re: doc: pgevent.dll location

2024-11-13 Thread Peter Eisentraut

On 12.11.24 17:52, Peter Eisentraut wrote:
So the most straightforward way to "make it work like it used to" would 
be to change src/bin/pgevent/meson.build to use shared_module() instead 
of shared_library().


I developed the attached patch for this.

I haven't tested it locally, but I checked in the CI logs that the build
commands for pgevent don't change with this.  CI doesn't seem to test meson
install, so I can't actually tell whether this changes the
installation directory successfully, but I think it should work.
Someone with Windows handy, please give this a test.
From 064013af8020112b5ee96538e9bf928c623b74e0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 13 Nov 2024 09:49:25 +0100
Subject: [PATCH] meson: Build pgevent as shared_module rather than
 shared_library

This matches the behavior of the makefiles and the old MSVC build
system.  The main effect is that the build result gets installed into
pkglibdir rather than bindir.  The documentation says to locate the
library in pkglibdir, so this makes the code match the documentation
again.

Reported-by: Ryohei Takahashi (Fujitsu) 
Discussion: 
https://www.postgresql.org/message-id/flat/TY3PR01MB118912125614599641CA881B782522%40TY3PR01MB11891.jpnprd01.prod.outlook.com

ci-os-only: windows
---
 src/bin/pgevent/meson.build | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/bin/pgevent/meson.build b/src/bin/pgevent/meson.build
index 5c2d296fef9..f7422438664 100644
--- a/src/bin/pgevent/meson.build
+++ b/src/bin/pgevent/meson.build
@@ -21,13 +21,11 @@ if cc.get_id() == 'msvc'
   pgevent_link_args += '/ignore:4104'
 endif
 
-pgevent = shared_library('pgevent',
+pgevent = shared_module('pgevent',
   pgevent_sources,
   dependencies: [frontend_code],
   link_args: pgevent_link_args,
   vs_module_defs: 'pgevent.def',
-  kwargs: default_lib_args + {
-'name_prefix': '',
-  },
+  kwargs: default_mod_args,
 )
 bin_targets += pgevent
-- 
2.47.0



Re: Commit Timestamp and LSN Inversion issue

2024-11-13 Thread Amit Kapila
On Tue, Nov 12, 2024 at 5:30 PM Tomas Vondra  wrote:
>
> On 11/12/24 04:05, Amit Kapila wrote:
> >
> > Right, and the patch sent by Hou-San [1] (based on the original patch
> > by Jan) is somewhat on these lines. The idea you have shared or
> > implemented by the patch is a logical clock stored in shared memory.
> > So, what the patch does is that if the time recorded by the current
> > commit record is lesser than or equal to the logical clock (which
> > means after we record time in the commit code path and before we
> > reserve the LSN, there happens a concurrent transaction), we shall
> > increment the value of logical clock by one and use that as commit
> > time.
> >
> > So, in most cases, we need to perform one additional "if check" and
> > "an assignment" under spinlock, and that too only for the commit
> > record. In some cases, when inversion happens, there will be "one
> > increment" and "two assignments."
> >
>
> In a way, yes. Except that each backend first sets the commit timestamp
> the usual way too. And then it also has to recalculate the checksum of
> the record - which happens under lwlock and not the main spinlock, but
> it still seems it might be quite expensive.
>
> I wonder how often we'd need to do this - it seems to me that for high
> throughput systems we might easily get into a situation where we have to
> "correct" the timestamp and recalculate the checksum (so calculating it
> twice) for most of the commits.
>
> Would be good to see some numbers from experiments - how often this can
> happen, what's the impact on throughput etc.
>

The initial numbers shared by Kuroda-San [1] didn't show any
significant impact due to this. It will likely happen in some cases
under stress but shouldn't be often because normally the backend that
acquired the timestamp first should reserve the LSN first as well. We
can do more tests as well to see the impact but the key point Andres
is raising is that we won't be able to convert the operation in
ReserveXLogInsertLocation() to use atomics after such a solution. Now,
even, if the change is only in the *commit* code path, we may or may
not be able to maintain two code paths for reserving LSN, one using
atomics and the other (for commit record) using spinlock.

Even if we consider your solution of directly incrementing the
commit_timestamp under the spinlock to make it monotonic and then later
rectifying the commit_ts, will that make it any easier to turn the
entire operation into atomics?
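
(For readers skimming the thread, the monotonic-clock check being discussed
boils down to something like the toy sketch below. The names are invented and
there is no real locking here; the actual proposal applies this to WAL commit
records under the existing insert-position spinlock. It just illustrates why
the common path costs only one comparison and one assignment.)

#include <stdint.h>
#include <stdio.h>

static uint64_t last_commit_ts = 0;	/* shared "logical clock" */

static uint64_t
reserve_commit_timestamp(uint64_t wallclock_ts)
{
	uint64_t	ts = wallclock_ts;

	/* ... insert-position spinlock assumed to be held here ... */
	if (ts <= last_commit_ts)
		ts = last_commit_ts + 1;	/* rare path: clock ran ahead, bump past it */
	last_commit_ts = ts;			/* common path: one check, one assignment */
	/* ... spinlock released ... */

	return ts;
}

int
main(void)
{
	printf("%llu\n", (unsigned long long) reserve_commit_timestamp(100));	/* 100 */
	printf("%llu\n", (unsigned long long) reserve_commit_timestamp(100));	/* 101 */
	printf("%llu\n", (unsigned long long) reserve_commit_timestamp(200));	/* 200 */
	return 0;
}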

> Ofc, you may argue that this would only affect people with the hook, and
> those people may be OK with the impact. And in a way I agree. But I also
> understand the concerns expressed by Andres, that adding a hook here may
> be problematic / easy to get wrong, and then we'd be the ones holding
> the brown bag ...
>

BTW, the latest patch [2] doesn't have any hook.

> >> Of course, this timestamp would be completely
> >> detached from "normal" timestamps, it'd just be a sequence. But we could
> >> also once in a while "set" the timestamp to GetCurrentTimestamp(), or
> >> something like that. For example, there could be a "ticker" process,
> >> doing that every 1000ms, or 100ms, or whatever we think is enough.
> >>
> >> Or maybe it could be driven by the commit itself, say when some timeout
> >> expires, we assign too many items since the last commit, ...
> >>
> >
> > If we follow the patch, we don't need this additional stuff. Isn't
> > what is proposed [1] quite similar to your idea? If so, we can save
> > this extra maintenance of commit timestamps.
> >
>
> Not sure. But I think we already have some of the necessary parts in
> commit_ts. It'd need some improvements, but it might also be a good
> place to adjust the timestamps. The inversion can only happens within a
> small group of recent transactions (due to NUM_XLOGINSERT_LOCKS), so it
> shouldn't be hard to keep a small buffer for those XIDs. Ofc, subxacts
> would complicate stuff, and it'd mean the timestamps written to WAL and
> to commit_ts would differ. Not great.
>

Won't it be difficult to predict the number of transactions that could
be impacted in this model?

In general, I understand the concern here related to the difficulty in
converting the operation in ReserveXLogInsertLocation() to atomics, but
leaving the LSN<->Timestamp inversion issue open forever doesn't feel
good either. I feel any solution to fix up commit_timestamps
afterward, as we discussed above [3], would be much more complex and
difficult to maintain.

[1] - 
https://www.postgresql.org/message-id/TYAPR01MB569222C1312E7A6C3C63539DF5582%40TYAPR01MB5692.jpnprd01.prod.outlook.com
[2] - 
https://www.postgresql.org/message-id/OS0PR01MB57162A227EC357482FEB4470945D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[3] - 
https://www.postgresql.org/message-id/CAA4eK1%2BOZLh7vA1CQkoq0ba4J_P-8JFHnt0a_YC2xfB0t3%2BakA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Skip collecting decoded changes of already-aborted transactions

2024-11-13 Thread Masahiko Sawada
On Mon, Nov 11, 2024 at 5:40 PM Peter Smith  wrote:
>
> On Tue, Nov 12, 2024 at 5:00 AM Masahiko Sawada  wrote:
> >
> > I've attached the updated patch.
> >
>
> Hi, here are some review comments for the latest v6-0001.
>
> ==
> contrib/test_decoding/sql/stats.sql
>
> 1.
> +INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM
> generate_series(1, 5000) g(i);
>
> I didn't understand the meaning of "serialize-topbig--1". My guess is
> it is a typo that was supposed to say "toobig".

Fixed. We have another place using 'topbig', but I think we can leave it.

>
> Perhaps there should also be some comment to explain that this
> "toobig" stuff was done deliberately like this to exceed
> 'logical_decoding_work_mem' because that would normally (if it was not
> aborted) cause a spill to disk.

I think we already mention that the transaction would have been spilled
but actually isn't:

+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes.

>
> ~~~
>
> 2.
> +-- Check stats. We should not spill anything as the transaction is already
> +-- aborted.
> +SELECT pg_stat_force_next_flush();
> +SELECT slot_name, spill_txns AS spill_txn, spill_count AS spill_count
> FROM pg_stat_replication_slots WHERE slot_name =
> 'regression_slot_stats4_twophase';
> +
>
> Those aliases seem unnecessary: "spill_txns AS spill_txn" and
> "spill_count AS spill_count"

Fixed.

>
> ==
> .../replication/logical/reorderbuffer.c
>
> ReorderBufferCheckTXNAbort:
>
> 3.
> Other static functions are also declared at the top of this module.
> For consistency, shouldn't this be the same?

Agreed, added.

>
> ~~~
>
> 4.
> + * We don't mark the transaction as streamed since this function can be
> + * called for non-streamed transactions too.
> + */
> + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
> + ReorderBufferToastReset(rb, txn);
>
> Given the comment says "since this function can be called for
> non-streamed transactions too", would it be easier to pass
> rbtxn_is_streamed(txn) here instead of 'false', and then just remove
> the comment?

Agreed.

During more testing, I found some bugs in the previous version patch,
so the latest patch incorporates some changes in addition to the
review comments I got so far.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v6-0001-Skip-logical-decoding-of-already-aborted-transact.patch
Description: Binary data


Re: proposal: schema variables

2024-11-13 Thread Pavel Stehule
On Wed, 13 Nov 2024 at 17:35, Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> > On Sun, Nov 10, 2024 at 06:51:40PM GMT, Pavel Stehule wrote:
> > ne 10. 11. 2024 v 17:19 odesílatel Pavel Stehule <
> pavel.steh...@gmail.com>
> > napsal:
> > I thought a lot of time about better solutions for identifier collisions
> > and I really don't think so there is some consistent user friendly
> syntax.
> > Personally I think there is an easy already implemented solution -
> > convention - just use a dedicated schema for variables and this schema
> > should not be in the search path. Or use secondary convention - like
> using
> > prefix "__" for session variables. Common convention is using "_" for
> > PLpgSQL variables. I searched how this issue is solved in other
> databases,
> > or in standard, and I found nothing special. The Oracle and SQL/PSM has a
> > concept of visibility - the variables are not visible outside packages or
> > modules, but Postgres has nothing similar. It can be emulated by a
> > dedicated schema without inserting a search path, but it is less strong.
> >
> > I think we can introduce an alternative syntax, that will not be user
> > friendly or readable friendly, but it can be without collisions - or can
> > decrease possible risks.
> >
> > It is nothing new - SQL does it with old, "new" syntax of inner joins, or
> > in Postgres we can
> >
> > where salary < 4
> >
> > or
> >
> > where pg_catalog.int4lt(salary, 4);
> >
> >
> > or some like we use for operators OPERATOR(*schema*.*operatorname*)
> >
> > So introducing VARIABLE(schema.variablename) syntax as an alternative
> > syntax for accessing variables I really like. I strongly prefer to use
> this
> > as only alternative (secondary) syntax, because I don't think it is
> > friendly syntax or writing friendly, but it is safe, and I can imagine
> > tools that can replace generic syntax to this special, or that detects
> > generic syntax and shows some warning. Then users can choose what they
> > prefer. Two syntaxes - generic and special can be good enough for all -
> and
> > this can be perfectly consistent with current Postgres.
>
> As far as I recall, last time this topic was discussed in hackers, two
> options were proposed: the one with VARIABLE(name), what you mention
> here; and another one with adding variables to the FROM clause. The
> VARIABLE(...) syntax didn't get much negative feedback, so I guess why
> not -- if you find it fitting, it would be interesting to see the
> implementation.
>
> I'm afraid it should not be just an alternative syntax, but the only one
> allowed, because otherwise I don't see how scenarious like "drop a
> column with the same name" could be avoided. As in the previous thread:
>
> -- we've got a variable b at the same time
> SELECT a, b FROM table1;
>

I am sorry, but I am in very strong opposition to this idea. Nobody has
replied to the questions that could change my opinion.

1. This introduces a possible inconsistency between LET syntax and SELECT
syntax.
What will be the syntax of LET?

LET var = var FROM var

PLpgSQL does something like this, and it is really strange and a source of
some unwanted bugs. See https://commitfest.postgresql.org/50/5044/

With the current design I can support

LET var = expr with vars

or

LET var = (QUERY with vars)

It is perfectly consistent. The expressions will be expressions.

2. I don't know of any product in the world that introduced the same
requirement. So this syntax would be proprietary (SQL/PSM really doesn't
require it) and a shock for any users who know other databases. Proprietary
syntax in this area, far from the syntax of other products, is hell. Try to
explain to users how to work with OUT variables of Postgres's procedures
and functions. And there is some deeper logic there.

3. There is a simple solution - convention. Use your own schema like vars,
and keep session variables in this schema. When this schema is not on
the search path, then there is no collision.
Variables live in a schema. Nobody without the CREATE right can create them
there, so it is safe. Or use a __ prefix for variables in the search path.

4. This requirement introduces a syntax inconsistency between plpgsql
variables and session variables - which breaks one goal of the patch -
introducing some global variables for plpgsql (and for all PLs).

5. Using many variables in the FROM clause decreases the readability of the FROM clause

SELECT v1, v2, a, b, c FROM t1, v1, v2, t2, ...

6. Usually you don't want to unpack composite variables. When you have to use
the FROM clause, the composite variables get unpacked. Then all their fields
can possibly collide with every other column name.

Example

CREATE TYPE t1 AS (id int, name varchar)
CREATE TABLE tab(id int, name varchar)
CREATE VARIABLE var1 AS t1;

SELECT id, name, foo(var1) FROM tab, var1;

Now I have a collision on the columns id and name, and everywhere I need to
use aliases. Without the necessity to use the variable in the FROM clause, I
can just write

SELECT id, name, foo(var) FROM t

Re: Skip collecting decoded changes of already-aborted transactions

2024-11-13 Thread Masahiko Sawada
On Tue, Nov 12, 2024 at 7:29 PM vignesh C  wrote:
>
> On Mon, 11 Nov 2024 at 23:30, Masahiko Sawada  wrote:
> >
> > On Sun, Nov 10, 2024 at 11:24 PM Peter Smith  wrote:
> > >
> > > Hi Sawada-San, here are some review comments for the patch v5-0001.
> > >
> >
> > Thank you for reviewing the patch!
> >
> > > ==
> > > Commit message.
> > >
> > > 1.
> > > This commit introduces an additional check to determine if a
> > > transaction is already aborted by a CLOG lookup, so the logical
> > > decoding skips further change also when it doesn't touch system
> > > catalogs.
> > >
> > > ~
> > >
> > > Is that wording backwards? Is it meant to say:
> > >
> > > This commit introduces an additional CLOG lookup check to determine if
> > > a transaction is already aborted, so the ...
> >
> > Fixed.
> >
> > >
> > > ==
> > > contrib/test_decoding/sql/stats.sql
> > >
> > > 2
> > > +SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS
> > > spill_count FROM pg_stat_replication_slots WHERE slot_name =
> > > 'regression_slot_stats4_twophase';
> > >
> > > Why do the SELECT "= 0" like this, instead of just having zeros in the
> > > "expected" results?
> >
> > Indeed. I used "=0" like other queries in the same file do, but it
> > makes sense to me just to have zeros in the expected file. That way,
> > it would make it a bit easier to investigate in case of failures.
> >
> > >
> > > ==
> > > .../replication/logical/reorderbuffer.c
> > >
> > > 3.
> > >  static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN 
> > > *txn,
> > > - bool txn_prepared);
> > > + bool txn_prepared, bool mark_streamed);
> > >
> > > That last parameter name ('mark_streamed') does not match the same
> > > parameter name in this function's definition.
> >
> > Fixed.
> >
> > >
> > > ~~~
> > >
> > > ReorderBufferTruncateTXN:
> > >
> > > 4.
> > > if (txn_streaming && (!txn_prepared) &&
> > >   (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
> > >   txn->txn_flags |= RBTXN_IS_STREAMED;
> > >
> > > if (txn_prepared)
> > > {
> > > ~
> > >
> > > Since the following condition was already "if (txn_prepared)" would it
> > > be better remove the "(!txn_prepared)" here and instead just refactor
> > > the code like:
> > >
> > > if (txn_prepared)
> > > {
> > >   ...
> > > }
> > > else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 
> > > 0)))
> > > {
> > >   ...
> > > }
> >
> > Good idea.
> >
> > >
> > > ~~~
> > >
> > > ReorderBufferProcessTXN:
> > >
> > > 5.
> > > +
> > > + /* Remember the transaction is aborted */
> > > + Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
> > > + curtxn->txn_flags |= RBTXN_IS_ABORTED;
> > >
> > > Missing period on comment.
> >
> > Fixed.
> >
> > >
> > > ~~~
> > >
> > > ReorderBufferCheckTXNAbort:
> > >
> > > 6.
> > > + * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
> > > + * check the transaction status, so the caller always processes this
> > > + * transaction. This is to disable this check for regression tests.
> > > + */
> > > +static bool
> > > +ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > +{
> > > + /*
> > > + * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
> > > + * check the transaction status, so the caller always processes this
> > > + * transaction.
> > > + */
> > > + if (unlikely(debug_logical_replication_streaming ==
> > > DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
> > > + return false;
> > > +
> > >
> > > The wording of the sentence "This is to disable..." seemed a bit
> > > confusing. Maybe this area can be simplified by doing the following.
> > >
> > > 6a.
> > > Change the function comment to say more like below:
> > >
> > > When the GUC 'debug_logical_replication_streaming' is set to
> > > "immediate", we don't check the transaction status, meaning the caller
> > > will always process this transaction. This mode is used by regression
> > > tests to avoid unnecessary transaction status checking.
> > >
> > > ~
> > >
> > > 6b.
> > > It is not necessary for this 2nd comment to repeat everything that was
> > > already said in the function comment. A simpler comment here might be
> > > all you need:
> > >
> > > SUGGESTION:
> > > Quick return for regression tests.
> >
> > Agreed with the above two comments. Fixed.
> >
> > >
> > > ~~~
> > >
> > > 7.
> > > Is it worth mentioning about this skipping of the transaction status
> > > check in the docs for this GUC? [1]
> >
> > If we want to mention this optimization in the docs, we have to
> > explain how the optimization works too. I think it's too detailed.
> >
> > I've attached the updated patch.
>
> Few minor suggestions:
> 1) Can we use rbtxn_is_committed here?
> +   /* Remember the transaction is aborted. */
> +   Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
> +   curtxn->txn_flags |= RBTXN_IS_ABORTED;
>
> 2) Similarly here too:
> +   /*
> +* Mark the tran

Graceful way to handle too many locks

2024-11-13 Thread Chris Cleveland
In my extension I got a mystery error:

TRAP: failed Assert("InterruptHoldoffCount > 0"), File: "lwlock.c", Line:
1869, PID: 62663
0 postgres 0x00010135adb4 ExceptionalCondition + 108
1 postgres 0x0001012235ec LWLockRelease + 1456
2 postgres 0x0001011faebc UnlockReleaseBuffer + 24

Turns out there was a bug in my extension where I was getting a share lock
on a particular index page over and over. Oddly, the error showed up not
when I was getting the locks, but when I released them. Any time I locked
the index page more than ~200 times, this error would show up on release.

Questions:

1. Why is the limit on the number of locks so low? I thought that when
getting a share lock, all it did was bump a reference count.

2. Is there a way to get this to fail gracefully, that is, with an error
message that makes sense, and kicks in at the moment you go over the limit,
instead of later?

-- 
Chris Cleveland
312-339-2677 mobile


Re: no library dependency in Makefile?

2024-11-13 Thread Noah Misch
This burned me recently.

On Sun, Nov 26, 2017 at 02:18:16PM +0800, 高增琦 wrote:
> Update version:
> 1. Re-base with head of master
> 2. Add some basic support for PGXS
> 
> After replacing submake-libpgport/submake-libpgfeutils with
> $(stlib_pgport)/$(stlib_pgfeutils) in
> Makefiles of client programs, I think, may be we should add static lib
> dependency for PGXS.

Maybe.  Naming an installed file as a Make prerequisite can break if installed
to a directory containing a space.  If that works now, we shouldn't break it
for this.  Otherwise, naming the installed prerequisites sounds fine.

> I can think two ways to do that: first, add  a new PG_STLIBS variable, user
> need to
> add static libs to it; second, we generate static lib dependency
> automatically
> from PG_LIBS variable. Which one is better?

I prefer the second one.  With the first one, it's easy to miss omissions.  I
think that concept is useful beyond PG_LIBS and beyond PGXS.  For example:

> -initdb: $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
> +initdb: $(OBJS) $(stlib_pgport) $(stlib_pgfeutils) | submake-libpq
>   $(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

This could look like

  initdb: $(OBJS) $(call lib_prereq,$(LDFLAGS) $(LDFLAGS_EX) $(LIBS))

where lib_prereq is a GNU make function that translates -lpgcommon to
SOMEDIR/libpgcommon.a, translates unrecognized arguments to nothing, etc.
This avoids the need to edit the "initdb" rule every time you edit LIBS.  It's
easy to miss doing that, due to distance between the places that edit LIBS and
the places that read it.  For example, Makefile.global is one of the editors
of LIBS.  How do you see it?

Thanks,
nm




Re: Graceful way to handle too many locks

2024-11-13 Thread Tomas Vondra



On 11/13/24 20:05, Chris Cleveland wrote:
> In my extension I got a mystery error:
> 
> TRAP: failed Assert("InterruptHoldoffCount > 0"), File: "lwlock.c",
> Line: 1869, PID: 62663
> 0postgres 0x00010135adb4ExceptionalCondition + 108
> 1postgres 0x0001012235ecLWLockRelease + 1456
> 2postgres 0x0001011faebcUnlockReleaseBuffer + 24
> 
> Turns out there was a bug in my extension where I was getting a share
> lock on a particular index page over and over. Oddly, the error showed
> up not when I was getting the locks, but when I released them. Any time
> I locked the index page more than ~200 times, this error would show up
> on release. 
> 
> Questions:
> 
> 1. Why is the limit on the number of locks so low? I thought that when
> getting a share lock, all it did was bump a reference count.
> 

Because good code shouldn't really need more than 200 LWLocks. Note this
limit does not apply to row locks, relation locks, and so on.

> 2. Is there a way to get this to fail gracefully, that is, with an error
> message that makes sense, and kicks in at the moment you go over the
> limit, instead of later?
> 

Not really, the limit of 200 lwlocks is hard-coded, so the only solution
is to not acquire that many of them (in a single backend). But I wonder
if you're actually hitting that limit, because that should trigger

/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
elog(ERROR, "too many LWLocks taken");

and not the assert. That suggests your extension does something wrong
with HOLD_INTERRUPTS() or something like that.


regards

-- 
Tomas Vondra





Re: .ready and .done files considered harmful

2024-11-13 Thread Robert Haas
On Wed, Nov 13, 2024 at 11:05 AM Alvaro Herrera  wrote:
> So, my question now is, would there be much opposition to backpatching
> beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?

It seems like it's been long enough now that if the new logic had
major problems we probably would have found them by now; so I feel
like it's probably pretty safe. Perhaps it's questionable how many
people we'll help by back-patching this into one additional release,
but if you feel it's worth it I wouldn't be inclined to argue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_combinebackup --incremental

2024-11-13 Thread Jakub Wartak
 On Mon, Nov 4, 2024 at 6:53 PM Robert Haas  wrote:

Hi Robert,

[..]

1. Well, I also hit the same bug as Bertrand, which seems to be because
macOS was used for development rather than Linux (and thus macOS doesn't have
copy_file_range(2)/HAVE_COPY_FILE_RANGE) --> I've simply fallen back to
undef HAVE_COPY_FILE_RANGE in my case, but the patch needs to be fixed. I
haven't run any longer or more data-intensive tests as the
copy_file_range() support seems to be missing, and from my point of view that
is crucial.

2. While interleaving several incremental backups with pgbench, I've
noticed something strange by accident:

This will work:

$ pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1 incr2
incr3 incr4

This will also work (new):

$ pg_combinebackup -i incr1 incr2 incr3 incr4 -o good_incr1_4
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o
/var/lib/postgresql/18/main fulldbbackup good_incr1_4

This will also work (new):

$ pg_combinebackup -i incr1 incr2 -o incr_12 #ok
$ pg_combinebackup -i incr_12 incr3 -o incr_13 #ok
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o
/var/lib/postgresql/18/main fulldbbackup incr_13

BUT, if I make an intentional user error and merge the same incr2 into
two sets of incremental backups, it won't reconstruct that:

$ pg_combinebackup -i incr1 incr2 -o incr1_2 # contains 1 + 2
$ pg_combinebackup -i incr2 incr3 -o incr2_3 # contains 2(!) + 3
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o
/var/lib/postgresql/18/main fulldbbackup incr1_2 # ofc works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o
/var/lib/postgresql/18/main fulldbbackup incr1_2 incr2_3 # fails?
pg_combinebackup: error: backup at "incr1_2" starts at LSN 0/B28, but
expected 0/928

It seems to be taking the LSN from incr1_2 and ignoring incr2_3?

$ find incr1 incr2 incr3 incr1_2 incr2_3 fulldbbackup -name backup_manifest
-exec grep -H LSN {} \;
incr1/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/928", "End-LSN":
"0/9000120" }
incr2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B28", "End-LSN":
"0/B000120" }
incr3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D28", "End-LSN":
"0/D000120" }
incr1_2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B28",
"End-LSN": "0/B000120" }
incr2_3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D28",
"End-LSN": "0/D000120" }
fulldbbackup/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/7D8",
"End-LSN": "0/70001D0" }

So I'm not sure whether we should cover that scenario or not?

$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o
/var/lib/postgresql/18/main fulldbbackup incr1_2 incr3 # works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o
/var/lib/postgresql/18/main fulldbbackup incr1_2 incr3_4 # works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o
/var/lib/postgresql/18/main fulldbbackup incr1_2 incr3_4 # two combined
sets - also work

4. The space-saving feature seems to be there (I've tried merging up to ~40
backups, always rolling the merged incremental backup forward after each new
incremental backup), which seems to be the primary objective of the patch:

$ du -sm incr? incr?? incr1_38
38  incr1
25  incr2
[..]
24  incr37
24  incr38
[..above would be ~38*~25M = ~950MB]
87  incr1_38 # instead of ~950MB just 87MB

5. I accidentally ran into an independent problem when using
"pg_combinebackup -i `ls -1vd incr*` -o incr_ALL" (when dealing with dozens
of incrementals to be merged, I bet this is going to be a pretty common
pattern): pg_combinebackup was failing with
$ pg_combinebackup -i `ls -1vd incr*` -o incr_ALL
pg_combinebackup: error: incr26/global/pg_control: expected system
identifier 7436752340350991865, but found 7436753510269539237

The issue for me is that the check that the output directory does not exist
should come first: the ls(1) glob here also picks up incr_ALL, so
pg_combinebackup starts inspecting System-Identifiers and blows up with the
error above before checking that the -o directory doesn't exist:

$ grep System-Id ./incr_ALL/backup_manifest
"System-Identifier": 7436752340350991865,

So the issue is sequencing: shouldn't it first check that incr_ALL does not
exist, and only later start inspecting the backups to be combined?

6. Not sure if that's critical, but it seems to require the incremental
backups to be passed in order to merge correctly; is that a limitation by
design or not? (note --reverse in ls(1)):

$ rm -rf incr_ALL && pg_combinebackup -i `ls -1vd incr*` -o incr_ALL
$ rm -rf incr_ALL && pg_combinebackup -i `ls -1rvd incr*` -o incr_ALL
pg_combinebackup: error: backup at "incr2" starts at LSN 0/B28, but
expected 0/7D8

simpler:
$ rm -rf incr_ALL && pg_combinebackup -i incr1 incr2 incr3 -o incr_ALL
$ rm -rf incr_ALL && pg_combinebackup -i incr3 incr2 incr1 -o incr_ALL
pg_combinebackup: error: backup at "incr2" starts at LSN 0/B28, but
expected 0/7D8
$ find incr1 incr2 incr3 -name backup_manifest -exec grep -H LSN {} \;   |
sort -nk 1
incr1/backup_manifest:{ 

Re: doc fail about ALTER TABLE ATTACH re. NO INHERIT

2024-11-13 Thread Alvaro Herrera
On 2024-Nov-13, Amit Langote wrote:

> I rebased my patch over 14e87ffa5c54 and noticed that NOT NULL
> constraints can also (or not) be marked NO INHERIT.  I think we should
> allow NO INHERIT NOT NULL constraints on leaf partitions just like
> CHECK constraints, so changed AddRelationNotNullConstraints() that way
> and added some tests.

OK, looks good.

> However, I noticed that there is no clean way to do that for the following 
> case:
> 
> ALTER TABLE leaf_partition ADD CONSTRAINT col_nn_ni NOT NULL col NO INHERIT;

Sorry, I didn't understand what the initial state is.  Does the
constraint already exist here or not?

> If I change the new function AdjustNotNullInheritance() added by your
> commit to not throw an error for leaf partitions, the above constraint
> (col_nn_ni) is not stored at all, because the function returns true in
> that case, which means the caller does nothing with the constraint.  I
> am not sure what the right thing to do would be.  If we return false,
> the caller will store the constraint the first time around, but
> repeating the command will again do nothing, not even prevent the
> addition of a duplicate constraint.

Do you mean if we return false, it allows two not-null constraints for
the same column?  That would absolutely be a bug.

I think:
* if a leaf partition already has an inherited not-null constraint
  from its parent and we want to add another one, we should:
  - if the one being added is NO INHERIT, then throw an error, because
we cannot merge them
  - if the one being added is not NO INHERIT, then both constraints
would have the same state and so we silently do nothing.
* if a leaf partition has a locally defined not-null marked NO INHERIT
  - if we add another NO INHERIT, silently do nothing
  - if we add an INHERIT one, throw an error because cannot merge.


If you want, you could leave the not-null constraint case alone and I
can have a look later.  It wasn't my intention to burden you with that.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
 It does it in a really, really complicated way
 why does it need to be complicated?
 Because it's MakeMaker.




Re: proposal: schema variables

2024-11-13 Thread Pavel Stehule
st 13. 11. 2024 v 15:24 odesílatel Laurenz Albe 
napsal:

> Thanks for the updated patch set.
>
> Here is my review of patch 0005:
>
> > --- a/src/backend/access/transam/xact.c
> > +++ b/src/backend/access/transam/xact.c
> > +#include "commands/session_variable.h"
>
> You probably forgot to move that to the patch for temporary variables.
> I did that.
>

+1


> > --- a/src/backend/commands/session_variable.c
> > +++ b/src/backend/commands/session_variable.c
> > @@ -83,6 +92,19 @@ static HTAB *sessionvars = NULL; /* hash table for
> session variables */
> >
> >  static MemoryContext SVariableMemoryContext = NULL;
> >
> > +/* true after accepted sinval message */
> > +static bool needs_validation = false;
> > +
> > +/*
> > + * The content of session variables is not removed immediately. When it
> > + * is possible we do this at the transaction end. But when the
> transaction failed,
> > + * we cannot do it, because we lost access to the system catalog. So we
> > + * try to do it in the next transaction before any get or set of any
> session
> > + * variable. We don't want to repeat this opening cleaning in
> transaction,
> > + * So we store the id of the transaction where opening validation was
> done.
> > + */
> > +static LocalTransactionId validated_lxid = InvalidLocalTransactionId;
>
> I have moved the reference to the transaction end to the temporary variable
> patch.
>

+1


> I have gone over the comments in patch 0005 and 0006.
> I hope I got everything right.  Attached is an updated patch set.
>

Thank you

Pavel


>
> Yours,
> Laurenz Albe
>


RE: FW: Building Postgres 17.0 with meson

2024-11-13 Thread Mark Hill

On Wed, Nov 6, 2024 at 4:21 PM Robert Haas  wrote:
> I am not an expert on this topic, but do use these options on macOS and they 
> do work for me. A typical 'meson setup' invocation for me is:

> meson setup $HOME/pgsql $HOME/pgsql-meson -Dcassert=true -Ddebug=true 
> -Dextra_include_dirs=/opt/local/include
> -Dextra_lib_dirs=/opt/local/lib -Dprefix=$HOME/install/dev

I'm including more information this time, hoping it will help in debugging the 
problem.  I'm only building with openssl at the moment to keep things simple.
I've included the output of the meson setup command and also attached the log 
file that's generated.

I have built and installed OpenSSL 3.1.6 in the following location:
D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6

I'm setting the following env variables:
set 
openssl_include_dir=D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6\include\openssl
set 
openssl_lib_dir=D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6\lib
set 
prefix=D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Install\pg_17.0_wx6

meson setup command and output: 
D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Source\buildPostgreSQL_wx6_17.0\postgresql-17.0>meson
 setup build --prefix="%prefix%" --buildtype=release -Dssl=openssl 
-Dextra_lib_dirs=%openssl_lib_dir% -Dextra_include_dirs=%openssl_include_dir% 
-Dnls=disabled -Dplperl=disabled -Dplpython=disabled -Dpltcl=disabled 
-Dllvm=disabled -Dlz4=disabled -Dzstd=disabled -Dgssapi=disabled 
-Dldap=disabled -Dpam=disabled -Dbsd_auth=disabled -Dsystemd=disabled 
-Dbonjour=disabled -Duuid=none -Dlibxml=disabled -Dlibxslt=disabled 
-Dselinux=disabled -Dicu=disabled -Dzlib=disabled
The Meson build system
Version: 1.6.0
Source dir: 
D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Source\buildPostgreSQL_wx6_17.0\postgresql-17.0
Build dir: 
D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Source\buildPostgreSQL_wx6_17.0\postgresql-17.0\build
Build type: native build
Project name: postgresql
Project version: 17.0
C compiler for the host machine: cl (msvc 19.29.30154 "Microsoft (R) C/C++ 
Optimizing Compiler Version 19.29.30154 for x64")
C linker for the host machine: link link 14.29.30154.0
Host machine cpu family: x86_64
Host machine cpu: x86_64
Run-time dependency threads found: YES
Library ws2_32 found: YES
Library secur32 found: YES
Program perl found: YES (C:\Strawberry\perl\bin\perl.EXE)
Program python3 found: YES (C:\Program Files\Python38\python3.EXE)
Program flex found: YES 2.5.35 2.5.35 (C:\GnuWin32\bin\flex.EXE)
Program bison found: YES 2.4.1 2.4.1 (C:\GnuWin32\bin\bison.EXE)
Program sed found: YES (C:\GnuWin32\bin\sed.EXE)
Program prove found: YES (C:\Strawberry\perl\bin\prove.BAT)
Program tar found: YES (C:\Windows\system32\tar.EXE)
Program gzip found: NO
Program lz4 found: NO
Program openssl found: NO
Program zstd found: NO
Program dtrace skipped: feature dtrace disabled
Program config/missing found: YES (sh 
D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Source\buildPostgreSQL_wx6_17.0\postgresql-17.0\config/missing)
Program cp found: NO
Program xmllint found: NO
Program xsltproc found: NO
Program wget found: NO
Program C:\Python\Python310\python.exe found: YES 
(C:\Python\Python310\python.exe)
Check usable header "bsd_auth.h" skipped: feature bsd_auth disabled
Check usable header "dns_sd.h" skipped: feature bonjour disabled
Program fop found: NO
Compiler for language cpp skipped: feature llvm disabled
Found pkg-config 'C:\\Strawberry\\perl\\bin\\pkg-config.BAT' but it is 
Strawberry Perl and thus broken. Ignoring...
Found pkg-config: NO
Found CMake: C:\Program Files\CMake\bin\cmake.EXE (3.29.0)
Run-time dependency readline found: NO (tried pkgconfig and cmake)
Library readline found: NO
Run-time dependency libedit found: NO (tried pkgconfig and cmake)
Library libedit found: NO
Dependency libselinux skipped: feature selinux disabled
Dependency libsystemd skipped: feature systemd disabled
Run-time dependency openssl found: NO (tried pkgconfig, system and cmake)

meson.build:1336:17: ERROR: C header 'openssl/ssl.h' not found

A full log can be found at 
D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Source\buildPostgreSQL_wx6_17.0\postgresql-17.0\build\meson-logs\meson-log.txt


The error above says meson can't find openssl/ssl.h but it's there:
D:\Jenkins\workspace\workspace\Postgres-9.4\PostgreSQL\postgres-Source\buildPostgreSQL_wx6_17.0\postgresql-17.0>dir
 
D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6\include\openssl\ssl.h
 Volume in drive D is Data
 Volume Serial Number is 58B5-7193

 Directory of 
D:\Jenkins\workspace\workspace\Postgres-9.4\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6\include\openssl

08/12/2024  04:22 PM   127,595 ssl.h
   1 File(s)127,595 bytes
   0 

pgsql-hack...@postgresql.org VS pgsql-hackers@lists.postgresql.org

2024-11-13 Thread Tony Wayne
Hi Hackers,

Can you please tell which mail should we use to send Patches or POCs or
start a discussion?

Regards,
Tony Wayne.


Re: pgsql-hack...@postgresql.org VS pgsql-hackers@lists.postgresql.org

2024-11-13 Thread David G. Johnston
On Wed, Nov 13, 2024 at 8:18 AM Tony Wayne 
wrote:

> Can you please tell which mail should we use to send Patches or POCs or
> start a discussion?
>
>
The "lists" subdomain is the real one; the other is legacy and now just an
alias.

David J.


Re: Infinite loop in XLogPageRead() on standby

2024-11-13 Thread Alexander Kukushkin
Hi Michael,

Now that v17 is released and before v18 feature freeze we have a few
months, I hope you will find some time to look at it.

On Wed, 5 Jun 2024 at 07:09, Michael Paquier  wrote:

> On Tue, Jun 04, 2024 at 04:16:43PM +0200, Alexander Kukushkin wrote:
> > Now that beta1 was released I hope you are not so busy and hence would
> like
> > to follow up on this problem.
>
> I am still working on something for the v18 cycle that I'd like to
> present before the beginning of the next commit fest, so I am a bit
> busy to get that out first.  Fingers crossed to not have open items to
> look at..  This thread is one of the things I have marked as an item
> to look at, yes.
> --
> Michael
>


Regards,
--
Alexander Kukushkin


Re: Avoiding superfluous buffer locking during nbtree backwards scans

2024-11-13 Thread Peter Geoghegan
On Tue, Nov 12, 2024 at 10:14 PM Masahiro Ikeda
 wrote:
> On 2024-11-13 00:55, Peter Geoghegan wrote:
> > The wrapper function has extra assertions, compared to what we do
> > already. It's slightly more defensive, and IMV slightly clearer.
>
> Thanks, I agree with adding the function for refactoring and including
> assertions for moreLeft or moreRight.

Pushed.

Thanks again
-- 
Peter Geoghegan




Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Tomas Vondra


On 11/12/24 08:39, Michael Paquier wrote:
> On Mon, Nov 11, 2024 at 07:32:10PM +0100, Tomas Vondra wrote:
>> This adds an out argument to brin_page_items, but I failed to consider
>> the user may still run with an older version of the extension - either
>> after pg_upgrade (as in the report), or when the CREATE EXTENSION
>> command specifies VERSION.
> 
> Perhaps it would be worth adding a test where the function is called
> in the context of a past extension version?  (I know, I'm noisy when
> it comes to that.)
> 
>> The only fix I can think of is explicitly looking at TupleDesc->natts,
>> as in the attached fix.
> 
> How about some consistency instead as this is the same error as
> 691e8b2e1889?  btreefuncs.c is deciding to issue an error if we are
> trying to call one of its functions in the context of a past version
> of the extension.  pageinspect is a superuser module, it does not seem
> that bad to me to tell users that they should force an upgrade when
> using it.

Yeah, a test with an older version is a good idea. And pageinspect
already has such tests in oldextversions.sql, so it was simple to just
add it there.

And I think you're right it's probably better to just error out. Yes, we
could check the natts value and build the right tuple, but if btree
functions already error out, it'd be a bit futile to do it differently
(although I'm not sure why bt_meta() says it can't be done reliably).


regards

-- 
Tomas Vondra
diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index 22621d584fa..525c82600f1 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -117,6 +117,8 @@ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype)
 	return page;
 }
 
+/* Number of output arguments (columns) for brin_page_items() */
+#define BRIN_PAGE_ITEMS_COLS_V1_12		8
 
 /*
  * Extract all item values from a BRIN index page
@@ -153,6 +155,20 @@ brin_page_items(PG_FUNCTION_ARGS)
  errmsg("\"%s\" is not a %s index",
 		RelationGetRelationName(indexRel), "BRIN")));
 
+	/*
+	 * We need a kluge here to detect API versions prior to 1.12.  Earlier
+	 * versions don't have the "empty" flag as o
+	 *
+	 * In principle we could skip the 'empty' flag and build the tuple with
+	 * the old tuple descriptor, but functions like bt_meta() already error
+	 * out in this case, so follow that precedent.
+	 */
+	if (rsinfo->setDesc->natts < BRIN_PAGE_ITEMS_COLS_V1_12)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
+ errmsg("function has wrong number of declared columns"),
+ errhint("To resolve the problem, update the \"pageinspect\" extension to the latest version.")));
+
 	bdesc = brin_build_desc(indexRel);
 
 	/* minimally verify the page we got */
diff --git a/contrib/pageinspect/expected/oldextversions.out b/contrib/pageinspect/expected/oldextversions.out
index f5c4b61bd79..c0abd7ab3ca 100644
--- a/contrib/pageinspect/expected/oldextversions.out
+++ b/contrib/pageinspect/expected/oldextversions.out
@@ -52,5 +52,20 @@ SELECT pagesize, version FROM page_header(get_raw_page('test1', 0));
  8192 |   4
 (1 row)
 
+DROP TABLE test1;
+-- brin_page_items got a new OUT parameter in 1.12
+CREATE TABLE test1 (a int, b text);
+INSERT INTO test1 VALUES (1, 'one');
+CREATE INDEX test1_a_idx ON test1 USING brin (a);
+SELECT * FROM brin_page_items(get_raw_page('test1_a_idx', 2), 'test1_a_idx');
+ERROR:  function has wrong number of declared columns
+HINT:  To resolve the problem, update the "pageinspect" extension to the latest version.
+ALTER EXTENSION pageinspect UPDATE TO '1.12';
+SELECT * FROM brin_page_items(get_raw_page('test1_a_idx', 2), 'test1_a_idx');
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder | empty |  value   
++++--+--+-+---+--
+  1 |  0 |  1 | f| f| f   | f | {1 .. 1}
+(1 row)
+
 DROP TABLE test1;
 DROP EXTENSION pageinspect;
diff --git a/contrib/pageinspect/sql/oldextversions.sql b/contrib/pageinspect/sql/oldextversions.sql
index 9f953492c23..b5342d3397d 100644
--- a/contrib/pageinspect/sql/oldextversions.sql
+++ b/contrib/pageinspect/sql/oldextversions.sql
@@ -22,5 +22,18 @@ ALTER EXTENSION pageinspect UPDATE TO '1.9';
 \df page_header
 SELECT pagesize, version FROM page_header(get_raw_page('test1', 0));
 
+DROP TABLE test1;
+
+-- brin_page_items got a new OUT parameter in 1.12
+CREATE TABLE test1 (a int, b text);
+INSERT INTO test1 VALUES (1, 'one');
+CREATE INDEX test1_a_idx ON test1 USING brin (a);
+
+SELECT * FROM brin_page_items(get_raw_page('test1_a_idx', 2), 'test1_a_idx');
+
+ALTER EXTENSION pageinspect UPDATE TO '1.12';
+
+SELECT * FROM brin_page_items(get_raw_page('test1_a_idx', 2), 'test1_a_idx');
+
 DROP TABLE test1;
 DROP EXTENSION pageinspect;


Re: .ready and .done files considered harmful

2024-11-13 Thread Alvaro Herrera
Hello, sorry for necro-posting here:

On 2021-May-03, Robert Haas wrote:

> I and various colleagues of mine have from time to time encountered
> systems that got a bit behind on WAL archiving, because the
> archive_command started failing and nobody noticed right away.

We've recently had a couple of cases where the archiver hasn't been able
to keep up on systems running 13 and 14 because of this problem, causing
serious production outages.  Obviously that's not a great experience.  I
understand that this has been significantly improved in branch 15 by
commit beb4e9ba1652, the fix in this thread; we hypothesize that both
these production problems wouldn't have occurred, if the users had been
running the optimized pgarch.c code.

However, that commit was not backpatched.  I think that was the correct
decision at the time, because it wasn't a trivial fix.  It was
significantly modified by 1fb17b190341 a month later, both to fix a
critical bug as well as to make some efficiency improvements.

Now that the code has been battle-tested, I think we can consider
putting it into the older branches.  I did a quick cherry-pick
experiment, and I found that it backpatches cleanly to 14.  It doesn't
to 13, for lack of d75288fb27b8, which is far too invasive to backpatch,
and I don't think we should rewrite the code so that it works on the
previous state.  Fortunately 13 only has one more year to live, so I
don't feel too bad about leaving it as is.

So, my question now is, would there be much opposition to backpatching
beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?


(On the other hand, we can always blame users for failing to implement
WAL archiving "correctly" ...  but from my perspective, this is an
embarrassing Postgres problem, and one that's relatively easy to solve
with very low risk.)

Thanks,

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Para tener más hay que desear menos"




Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Tomas Vondra
On 11/12/24 09:04, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas,
> 
>> Here's a fix for pageinspect bug in PG17, reported in [1]. The bug turns
>> out to be introduced by my commit
> 
> I could not see the link, but I think it is [1], right?
> 

Right. Apologies, I forgot to include the link.

>>
>> commit dae761a87edae444d11a411f711f1d679bed5941
>> Author: Tomas Vondra 
>> Date:   Fri Dec 8 17:07:30 2023 +0100
>>
>> Add empty BRIN ranges during CREATE INDEX
>>
>> ...
>>
>> This adds an out argument to brin_page_items, but I failed to consider
>> the user may still run with an older version of the extension - either
>> after pg_upgrade (as in the report), or when the CREATE EXTENSION
>> command specifies VERSION.
> 
> I confirmed that 428c0ca added new output entries (version is bumped 11->12),
> and the same C function is used for both 11 and 12.
> 
>> The new "empty" field is added in the middle of the output tuple, which
>> shifts the values, causing segfault when accessing a text field at the
>> end of the array.
> 
> Agreed, and I could reproduce it with these steps:
> 
> ```
> postgres=# CREATE EXTENSION pageinspect VERSION "1.11" ;
> CREATE EXTENSION
> postgres=# CREATE TABLE foo (id int);
> CREATE TABLE
> postgres=# CREATE INDEX ON foo USING brin ( id );
> CREATE INDEX
> postgres=# SELECT * FROM brin_page_items(get_raw_page('foo_id_idx', 2), 
> 'foo_id_idx');
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> ```
> 
>>
>> Of course, we add arguments to existing functions pretty often, and we
>> know how to do that in backwards-compatible way - pg_stat_statements is
>> a good example of how to do that nicely. Unfortunately, it's too late to
>> do that for brin_page_items :-( There may already be upgraded systems or
>> with installed pageinspect, etc.
>>
>> The only fix I can think of is explicitly looking at TupleDesc->natts,
>> as in the attached fix.
> 
> I've tested your patch and it worked well. Also, I tried to upgrade from 11 
> and 12
> and ran again, it could execute well.
> 
> Few points:
> 
> 1.
> Do you think we should follow the approach like pg_stat_statements for master?
> Or, do you want to apply the patch to both PG17 and HEAD? I think the latter
> is OK
> because we should avoid code branches as much as possible.
> 

My plan was to apply the patch to both 17 and HEAD, and then maybe do
something smarter in HEAD in a separate commit. But then Michael pointed
out other pageinspect functions just error out in this version-mismatch
cases, so I think it's better to just do it the same way.

Maybe we could be smarter, but it seems pointless to do it only for one
pageinspect function, while still requiring an upgrade for other more
widely used functions in the same situation (albeit for a much older
version of the extension).

> 2.
> Later readers may be surprised by the part checking `rsinfo->setDesc->natts`.
> Can you use a boolean and add a comment, something like
> 
> ```
> +/* Support older API version */
> +booloutput_empty_attr = (rsinfo->setDesc->natts >= 8);
> ```
> 

Doesn't matter, if we error out.

> 3.
> Do we have to add the test for the fix?
> 

Yes. See the patch I shared a couple minutes ago.


regards

-- 
Tomas Vondra





Re: Speed up TransactionIdIsCurrentTransactionId() with lots of subxacts

2024-11-13 Thread Bertrand Drouvot
Hi,

On Fri, Oct 11, 2024 at 12:21:12AM +0300, Heikki Linnakangas wrote:
> > +   98.52%98.45%  postgres  postgres   [.]
> > TransactionIdIsCurrentTransactionId
> In this scenario with lots of active subtransactions,
> TransactionIdIsCurrentTransactionId effectively degenerates into a linear
> search over a linked list, as it traverses each level in the
> CurrentTransactionState stack.

Agree.

> The patch

Thanks for the patch!

> Instead of having a separate childXids array on each transaction level, we
> can maintain one large array of XIDs that includes the XIDs of all committed
> and in-progress subtransactions. On each nesting level, remember the offset
> within that array, so that all XIDs belonging to that nesting level or its
> parents are above that offset.
> When a subtransaction commits, you don't need
> to copy XIDs to the parent, you just adjust the parent's offset into the
> array, to include the child's XIDs.

Yeah, makes sense. And in case of subtransaction rollback, that's fine because
all the subtransactions down to this one can be/are "discarded".
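
To illustrate the idea in code (a hedged sketch with invented names, not the
attached patch -- just one possible reading of the design described above):

#include "postgres.h"

/* One entry per open (sub)transaction nesting level. */
typedef struct SubXidLevel
{
	int			totalXids;		/* entries in sharedSubXids[] that belong to
								 * this level or its parents */
} SubXidLevel;

/* Single flat array of subxact XIDs for the whole top-level transaction. */
static TransactionId *sharedSubXids = NULL;	/* (re)sized elsewhere */
static int	nSharedSubXids = 0;

/* Called when a subtransaction gets an XID assigned. */
static void
sketch_record_subxid(TransactionId xid)
{
	sharedSubXids[nSharedSubXids++] = xid;
}

/* Subcommit: the parent absorbs the child's XIDs without copying anything. */
static void
sketch_at_subcommit(SubXidLevel *parent, const SubXidLevel *child)
{
	parent->totalXids = child->totalXids;
}

/* Subabort: everything the child (and its children) added is simply forgotten. */
static void
sketch_at_subabort(const SubXidLevel *parent)
{
	nSharedSubXids = parent->totalXids;
}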

> Patch attached,

A few random comments:

=== 1

+  parentOffset--;
+  parents[parentOffset]->totalXids = totalXids;

what about "parents[--parentOffset]->totalXids = totalXids;" instead?

That would remove the ability for someone to insert code between the decrement
and the array access.

=== 2

+  newsize = 32;
+  CurrentTransactionIds = MemoryContextAlloc(TopTransactionContext,
+ 32 * sizeof(TransactionId));

what about defining a macro for "32" (as it is used in multiple places in
xact.c)?

=== 3

-   /* Copy data into output area. */
+   /*
+* In CurrentTransactoinIds,

s/CurrentTransactoinIds/CurrentTransactionIds/

4 ===

 int
 xactGetCommittedChildren(TransactionId **ptr)
 {
-   TransactionState s = CurrentTransactionState;
+   return GetCommittedChildren(CurrentTransactionState, ptr);

Worth to move this part of the comment on top of xactGetCommittedChildren(),

"
If there are no subxacts, *ptr is set to NULL.
"

on top of GetCommittedChildren() instead?

5 ===

 static void
 AtSubAbort_childXids(void)
 {
-   TransactionState s = CurrentTransactionState;
-
-   /*
-* We keep the child-XID arrays in TopTransactionContext (see
-* AtSubCommit_childXids).  This means we'd better free the array
-* explicitly at abort to avoid leakage.
-*/
-   if (s->childXids != NULL)
-   pfree(s->childXids);
-   s->childXids = NULL;
-   s->nChildXids = 0;
-   s->maxChildXids = 0;
+   /* nothing to do */

Yeah that works but then some CurrentTransactionIds[] elements do not reflect
the "reality" anymore (until the top level transaction finishes, or until a
new subtransaction is created). 

Couldn't that be an issue from a code maintainability point of view?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-11-13 Thread Dmitry Dolgov
> On Tue, Aug 13, 2024 at 02:39:10PM GMT, Peter Geoghegan wrote:
> On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
>  wrote:
> > Attached is version 16 now.
>
> I ran this with my old land registry benchmark, used for validating
> the space utilization impact of nbtree deduplication (among other
> things). This isn't obviously the best benchmark for this sort of
> thing, but I seem to recall that you'd used it yourself at some point.
> To validate work in this area, likely including this patch. So I
> decided to start there.
>
> To be clear, this test involves bulk loading of an unlogged table (the
> land registry table). The following composite index is created on the
> table before we insert any rows, so most of the cycles here are in
> index maintenance including _bt_search descents:
>
> CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
> COLLATE "C", locality COLLATE "C");
>
> I wasn't able to see much of an improvement with this patch applied.
> It went from  ~00:51.598 to ~00:51.053. That's a little disappointing,
> given that this is supposed to be a sympathetic case for the patch.
> Can you suggest something else? (Granted, I understand that this patch
> has some complicated relationship with other patches of yours, which I
> don't understand currently.)

Under the danger of showing my ignorance, what is the definition of land
registry benchmark? I think it would be useful if others could reproduce
the results as well, especially if they're somewhat surprising.

At the same time I would like to note, that dynamic prefix truncation
can be useful not only on realistic data, but in some weird-but-useful
cases, e.g. for partitioned B-Tree with an artificial leading key
separating partitions. It's a hypothetical case, which can't justify
a feature of course, but makes it worth investigating.




Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Tomas Vondra



On 11/13/24 18:20, Peter Geoghegan wrote:
> On Wed, Nov 13, 2024 at 11:07 AM Tomas Vondra  wrote:
>> My plan was to apply the patch to both 17 and HEAD, and then maybe do
>> something smarter in HEAD in a separate commit. But then Michael pointed
>> out other pageinspect functions just error out in this version-mismatch
>> cases, so I think it's better to just do it the same way.
> 
> FWIW I didn't actually backpatch commit 691e8b2e18. I decided that it
> was better to just paper-over the issue on backbranches instead -- see
> commit c788115b.
> 
> The problem that I fixed back in 2020 was a problem with the data
> types used -- not a failure to consider older versions of the
> extension at all. It was just convenient to use the number of columns
> to detect the version of the extension to detect a problematic
> (incorrectly typed) function.
> 

Does that mean you think we should fix the issue at hand differently?
Say, by looking at number of columns and building the correct tuple,
like I did in my initial patch?

regards

-- 
Tomas Vondra





Re: Graceful way to handle too many locks

2024-11-13 Thread Robert Haas
On Wed, Nov 13, 2024 at 2:05 PM Chris Cleveland
 wrote:
> In my extension I got a mystery error:
>
> TRAP: failed Assert("InterruptHoldoffCount > 0"), File: "lwlock.c", Line: 
> 1869, PID: 62663
> 0 postgres 0x00010135adb4 ExceptionalCondition + 108
> 1 postgres 0x0001012235ec LWLockRelease + 1456
> 2 postgres 0x0001011faebc UnlockReleaseBuffer + 24
>
> Turns out there was a bug in my extension where I was getting a share lock on 
> a particular index page over and over. Oddly, the error showed up not when I 
> was getting the locks, but when I released them. Any time I locked the index 
> page more than ~200 times, this error would show up on release.

I wonder how you managed to avoid hitting this check in LWLockAcquire:

/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
elog(ERROR, "too many LWLocks taken");

> 1. Why is the limit on the number of locks so low? I thought that when 
> getting a share lock, all it did was bump a reference count.

200 LWLocks is an ENORMOUS number of LWLocks to be holding at once.
Except in very specific circumstances such as the one mentioned in the
comment for MAX_SIMUL_LWLOCKS, holding more than 1 or 2 or MAYBE 3
LWLocks simultaneously is a recipe for disaster. One issue is that
there's no deadlock checking, and it's easy to bring down an entire
system. Another issue is that other code will be expecting you to
release the lock quickly and you may cause the entire system to pile
up behind whichever lock you're holding. Details aside, you're only
supposed to hold an LWLock while you're actively looking at the
in-memory data structure it protects. If you need to keep a buffer
around for a longer time, you can hold a buffer pin for a longer time,
but the time for which you actually hold the lock needs to be minimal.
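
To make that concrete, here is a minimal, hypothetical sketch using the
standard buffer-manager calls (not code from the extension being discussed):
the content lock is held only across the page inspection, while the pin can
be kept a little longer if the buffer is still needed.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

static void
inspect_one_page(Relation rel, BlockNumber blkno)
{
	Buffer		buf = ReadBuffer(rel, blkno);	/* takes a pin */
	Page		page;

	LockBuffer(buf, BUFFER_LOCK_SHARE);			/* short-lived content lock */
	page = BufferGetPage(buf);
	if (PageIsNew(page))						/* examine while locked */
		elog(DEBUG1, "block %u is new", blkno);
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);		/* release the lock promptly */

	/*
	 * The pin alone keeps the buffer from being evicted if we still need it
	 * for a while; drop it once we are completely done.
	 */
	ReleaseBuffer(buf);
}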

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: split func.sgml to separated individual sgml files

2024-11-13 Thread Corey Huinker
>
> The following is step-by-step logic.
>
>
The end result (one file per section) seems good to me.

I suspect that reviewer burden may be the biggest barrier to going forward.
Perhaps break up the changes so that each new sect1 file gets its own
commit, allowing the reviewer to more easily (if not programmatically)
verify that the text that moved out of func.sgml ended up in
func-sect-foo.sgml.

Granted, the committer will likely squash all of those commits down into
one big one, but by then the hard work of reviewing is already done.


Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-11-13 Thread Peter Geoghegan
On Wed, Nov 13, 2024 at 3:30 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Tue, Aug 13, 2024 at 02:39:10PM GMT, Peter Geoghegan wrote:
> > On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
> > To be clear, this test involves bulk loading of an unlogged table (the
> > land registry table). The following composite index is created on the
> > table before we insert any rows, so most of the cycles here are in
> > index maintenance including _bt_search descents:
> >
> > CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
> > COLLATE "C", locality COLLATE "C");

> Under the danger of showing my ignorance, what is the definition of land
> registry benchmark? I think it would be useful if others could reproduce
> the results as well, especially if they're somewhat surprising.

It's a sample dataset that I've found useful from time to time,
particularly when testing nbtree features. Usually using a composite
index like the one I described.

One slightly useful (though far from unique) property of such an index
is that it contains data that's low cardinality (sometimes extremely
low cardinality) across multiple index columns. With variable-width
(text) index columns. That specific combination made the index a
decent test of certain issues affecting the nbtsplitloc.c split point
choice logic during work on Postgres 12 and 13. I believe that
Matthias independently found it useful on a number of other occasions,
too.

See https://wiki.postgresql.org/wiki/Sample_Databases for instructions
on how to set it up for yourself. You could probably come up with a
way of generating a similar dataset, without needing to download
anything, though. The fact that I found it useful in the past is at
least somewhat arbitrary.

-- 
Peter Geoghegan




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Masahiko Sawada
On Wed, Nov 13, 2024 at 3:53 AM Tomas Vondra  wrote:
>
>
>
> On 11/13/24 10:38, Amit Kapila wrote:
> > On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra  wrote:
> >>
> >> Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
> >> fix this particular case. But I'd be happier if we could also add
> >> asserts checking the LSN advances, to detect similar issues that we may
> >> be unaware of yet.
> >>
> >
> > As most of us lean towards fixing
> > LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
> > and back branches. Separately we can consider other asserts just for
> > HEAD that you think will make the code robust and help avoid such bugs
> > in the future.
> >
>
> +1 to that
>

+1.

Tomas, are you going to update the patch you shared before[1] or shall I?

Regards,

[1] 
https://www.postgresql.org/message-id/abf794c4-c459-4fed-84d9-968c4f0e2052%40vondra.me,
0004-fix-LogicalIncreaseRestartDecodingForSlot.patch

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Tomas Vondra



On 11/14/24 00:33, Masahiko Sawada wrote:
> On Wed, Nov 13, 2024 at 3:53 AM Tomas Vondra  wrote:
>>
>>
>>
>> On 11/13/24 10:38, Amit Kapila wrote:
>>> On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra  wrote:

 Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
 fix this particular case. But I'd be happier if we could also add
 asserts checking the LSN advances, to detect similar issues that we may
 be unaware of yet.

>>>
>>> As most of us lean towards fixing
>>> LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
>>> and back branches. Separately we can consider other asserts just for
>>> HEAD that you think will make the code robust and help avoid such bugs
>>> in the future.
>>>
>>
>> +1 to that
>>
> 
> +1.
> 
> Tomas, are you going to update the patch you shared before[1] or shall I?
> 

Please feel free to take over. I'm busy with some other stuff and the
initial analysis was done by you anyway.


regards

-- 
Tomas Vondra





Re: Converting contrib SQL functions to new style

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 09:39:06AM +0100, Peter Eisentraut wrote:
> By the way, if we're going to touch all these extension script files to make
> them more modern SQL-like, we could also use named parameters more.

Sounds like a good idea to do.  Thanks for the suggestion.
--
Michael


signature.asc
Description: PGP signature


Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Masahiko Sawada
On Wed, Nov 13, 2024 at 3:47 PM Tomas Vondra  wrote:
>
>
>
> On 11/14/24 00:33, Masahiko Sawada wrote:
> > On Wed, Nov 13, 2024 at 3:53 AM Tomas Vondra  wrote:
> >>
> >>
> >>
> >> On 11/13/24 10:38, Amit Kapila wrote:
> >>> On Tue, Nov 12, 2024 at 6:29 PM Tomas Vondra  wrote:
> 
>  Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
>  fix this particular case. But I'd be happier if we could also add
>  asserts checking the LSN advances, to detect similar issues that we may
>  be unaware of yet.
> 
> >>>
> >>> As most of us lean towards fixing
> >>> LogicalIncreaseRestartDecodingForSlot(), let's fix that in the HEAD
> >>> and back branches. Separately we can consider other asserts just for
> >>> HEAD that you think will make the code robust and help avoid such bugs
> >>> in the future.
> >>>
> >>
> >> +1 to that
> >>
> >
> > +1.
> >
> > Tomas, are you going to update the patch you shared before[1] or shall I?
> >
>
> Please feel free to take over. I'm busy with some other stuff and the
> initial analysis was done by you anyway.

Sure. I've attached the updated patch. I just added the commit message.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v1-0001-Fix-a-possibility-of-logical-replication-slot-s-r.patch
Description: Binary data


Re: On non-Windows, hard depend on uselocale(3)

2024-11-13 Thread Thomas Munro
On Wed, Oct 2, 2024 at 1:04 AM Peter Eisentraut  wrote:
> On 28.08.24 20:50, Peter Eisentraut wrote:
> > I suggest that the simplification of the xlocale.h configure tests could
> > be committed separately.  This would also be useful independent of this,
> > and it's a sizeable chunk of this patch.
>
> To keep this moving along a bit, I have extracted this part and
> committed it separately.  I had to make a few small tweaks, e.g., there
> was no check for xlocale.h in configure.ac, and the old
> xlocale.h-including stanza could be removed from chklocale.h.  Let's see
> how this goes.

Thanks, Peter.  That seemed to work fine, and we shouldn't have to
think about xlocale.h as a special case too much again.

I am back to working on all these cleanup patches.  Here is a rebase
of this one, with a few minor style adjustments and a new attempt to
explain the overall approach in the commit message.
From a5eebfdd0b946f64310f60fab68e67b84bb474ae Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 10 Aug 2024 11:01:21 +1200
Subject: [PATCH v5] Tidy up locale thread safety in ECPG library.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Remove setlocale() and _configthreadlocal() as fallback strategy on
systems that don't have uselocale(), where ECPG tries to control
LC_NUMERIC formatting on input and output of floating point numbers.  It
was probably broken on some systems (NetBSD), and the code was also
quite messy and complicated, with obsolete configure tests (Windows).
It was also arguably broken, or at least had unstated environmental
requirements, if pgtypeslib code was called directly.

Instead, introduce PG_C_LOCALE to refer to the "C" locale as a locale_t
value.  It maps to the special constant LC_C_LOCALE when defined by libc
(macOS, NetBSD), or otherwise uses a process-lifetime locale_t that is
allocated on first use, just as ECPG previously did itself.  The new
replacement might be more widely useful.  Then change the float parsing
and printing code to pass that to _l() functions where appropriate.

Unfortunately the portability of those functions is a bit complicated.
First, many obvious and useful _l() functions are missing from POSIX,
though most standard libraries define some of them anyway.  Second,
although the thread-safe save/restore technique can be used to replace
the missing ones, Windows and NetBSD refused to implement standard
uselocale().  They might have a point: "wide scope" uselocale() is hard
to combine with other code and error-prone, especially in library code.
Luckily they have the  _l() functions we want so far anyway.  So we have
to be prepared for both ways of doing things:

1.  In ECPG, use strtod_l() for parsing, and supply a port.h replacement
using uselocale() over a limited scope if missing.

2.  Inside our own snprintf.c, use three different approaches to format
floats.  For frontend code, call libc's snprintf_l(), or wrap libc's
snprintf() in uselocale() if it's missing.  For backend code, snprintf.c
can keep assuming that the global locale's LC_NUMERIC is "C" and call
libc's snprintf() without change, for now.

(It might eventually be possible to call our in-tree Ryū routines to
display floats in snprintf.c, given the C-locale-always remit of our
in-tree snprintf(), but this patch doesn't risk changing anything that
complicated.)

Reviewed-by: Peter Eisentraut 
Reviewed-by: Tristan Partin 
Discussion: https://postgr.es/m/CWZBBRR6YA8D.8EHMDRGLCKCD%40neon.tech
---
 configure| 13 +---
 configure.ac |  3 +-
 meson.build  |  3 +-
 src/include/pg_config.h.in   |  9 ++-
 src/include/port.h   | 36 +
 src/interfaces/ecpg/ecpglib/connect.c| 39 +-
 src/interfaces/ecpg/ecpglib/data.c   |  2 +-
 src/interfaces/ecpg/ecpglib/descriptor.c | 37 -
 src/interfaces/ecpg/ecpglib/ecpglib_extern.h | 12 ---
 src/interfaces/ecpg/ecpglib/execute.c| 55 --
 src/interfaces/ecpg/pgtypeslib/dt_common.c   |  6 +-
 src/interfaces/ecpg/pgtypeslib/interval.c|  4 +-
 src/interfaces/ecpg/pgtypeslib/numeric.c |  2 +-
 src/port/Makefile|  1 +
 src/port/locale.c| 80 
 src/port/meson.build |  1 +
 src/port/snprintf.c  | 55 ++
 17 files changed, 192 insertions(+), 166 deletions(-)
 create mode 100644 src/port/locale.c

diff --git a/configure b/configure
index f58eae1baa8..6a133c09d3b 100755
--- a/configure
+++ b/configure
@@ -15039,7 +15039,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle set

Converting SetOp to read its two inputs separately

2024-11-13 Thread Tom Lane
Here are the beginnings of a patchset to do what was discussed at [1],
namely change the SetOp node type to read its inputs as outerPlan and
innerPlan, rather than appending them together with a flag column to
show which rows came from where.

The previous thread wondered why manually DISTINCT'ing inputs with a
lot of duplicates would make an INTERSECT faster than not doing so.
I speculated that the overhead of attaching the flag column (which
typically requires an additional projection node) plus the overhead of
an Append node might have a lot to do with that, and it seems I was
right.  With this patchset on yesterday's HEAD, I get results like
these for a simple test case:

regression=# create table t1 as
regression-# select (random()*100)::int as a from generate_series(1,100);
SELECT 100
regression=# vacuum analyze t1;
VACUUM
regression=# set max_parallel_workers_per_gather TO 0;
SET
regression=# explain analyze select a from t1 intersect select a from t1;
   QUERY PLAN   


 HashSetOp Intersect  (cost=0.00..33960.00 rows=101 width=4) (actual 
time=246.352..246.355 rows=101 loops=1)
   ->  Seq Scan on t1  (cost=0.00..14480.00 rows=100 width=4) (actual 
time=0.025..35.646 rows=100 loops=1)
   ->  Seq Scan on t1 t1_1  (cost=0.00..14480.00 rows=100 width=4) (actual 
time=0.011..36.163 rows=100 loops=1)
 Planning Time: 0.045 ms
 Execution Time: 246.372 ms
(5 rows)

regression=# explain analyze select distinct a from t1 intersect select 
distinct a from t1;
  QUERY PLAN
  
--
 HashSetOp Intersect  (cost=33960.00..33962.52 rows=101 width=4) (actual 
time=238.888..238.891 rows=101 loops=1)
   ->  HashAggregate  (cost=16980.00..16981.01 rows=101 width=4) (actual 
time=120.749..120.756 rows=101 loops=1)
 Group Key: t1.a
 Batches: 1  Memory Usage: 24kB
 ->  Seq Scan on t1  (cost=0.00..14480.00 rows=100 width=4) (actual 
time=0.027..35.391 rows=100 loops=1)
   ->  HashAggregate  (cost=16980.00..16981.01 rows=101 width=4) (actual 
time=118.101..118.107 rows=101 loops=1)
 Group Key: t1_1.a
 Batches: 1  Memory Usage: 24kB
 ->  Seq Scan on t1 t1_1  (cost=0.00..14480.00 rows=100 width=4) 
(actual time=0.014..35.468 rows=100 loops=1)
 Planning Time: 0.043 ms
 Execution Time: 238.916 ms
(11 rows)

It's still a little slower without DISTINCT, but not 50% slower like
before.  I have hopes that it'll be nearly on par once I figure out
how to avoid the extra per-row slot type conversion that's being done
in the 0001 patch.

Aside from that minor TODO, the main thing that's left undone in this
patch series is to persuade the thing to exploit presorted input
paths.  Right now it fails to do so, as can be seen in some of the
regression test cases, eg

regression=# set enable_hashagg = 0;
SET
regression=# explain (costs off) select unique1 from tenk1 except select 
unique2 from tenk1 where unique2 != 10;
QUERY PLAN
--
 SetOp Except
   ->  Sort
 Sort Key: tenk1.unique1
 ->  Index Only Scan using tenk1_unique1 on tenk1
   ->  Sort
 Sort Key: tenk1_1.unique2
 ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
   Filter: (unique2 <> 10)
(8 rows)

Obviously the sorts are unnecessary, but it's not seeing that.
I suppose this requires integrating generate_nonunion_paths with
the logic from commit 66c0185a3.  I tried to make sense of that,
but failed --- either I don't understand it, or there are a
number of things wrong with it.  I'd welcome some help with
getting that done.

regards, tom lane

[1] https://www.postgresql.org/message-id/2631313.1730733484%40sss.pgh.pa.us

From 05d6ad4530dddaceb771dee66699c2f0beae61f0 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Wed, 13 Nov 2024 15:34:29 -0500
Subject: [PATCH v1 1/3] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On t

Re: pgsql-hack...@postgresql.org VS pgsql-hackers@lists.postgresql.org

2024-11-13 Thread * Neustradamus *
Dear all,

Can you remove all aliases to *@postgresql.org, etc., because they are a source of 
spam...

And please inform everyone that the correct MLs are *@lists.postgresql.org.

Thanks in advance.

Regards,

Neustradamus

From: Kai Wagner 
Sent: Wednesday, November 13, 2024 16:25
To: Tony Wayne; pgsql-hackers
Subject: Re: pgsql-hack...@postgresql.org VS pgsql-hackers@lists.postgresql.org

Hi Tony,

pgsql-hackers@lists.postgresql.org is the correct mailing list to send
patches.

Kai

On 13/11/2024 16:18, Tony Wayne wrote:
> Hi Hackers,
>
> Can you please tell which mail should we use to send Patches or POCs
> or start a discussion?
>
> Regards,
> Tony Wayne.

--
Kai Wagner
Sr. Engineering Manager, Percona

e: kai.wag...@percona.com
w: www.percona.com
Databases Run Better with Percona.







Re: define pg_structiszero(addr, s, r)

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 07:50:50AM +, Bertrand Drouvot wrote:
> I did a quick check with clang and it looks like it is not as smart as gcc
> for the non inline case.

Not as much, still smart enough to skip the > 64B part when dealing
with a structure that does not require it.  So it's actually still
good considering where we are at now on HEAD for the 8kB all-zero page
case.

> Anyway it's not like we have a choice: we need (at least) one len check for
> safety reasons (to not crash or read invalid data).
> 
> So, I'd vote for pg_memory_is_all_zeros_v12() then, thoughts?

Makes sense to me to just do that, with a first < 8 loop, and a second
for the 8~63 range.  And I can understand the code I read here as it
self-documents what it does.
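
For illustration only, roughly the shape being discussed (a hedged sketch,
not the committed function): the two length checks up front guarantee the
64-byte loop never reads past the end of the region.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static inline bool
memory_is_all_zeros_sketch(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;
	const unsigned char *end = p + len;

	/* First check: fewer than 8 bytes, plain byte loop. */
	if (len < sizeof(size_t))
	{
		while (p < end)
			if (*p++ != 0)
				return false;
		return true;
	}

	/* Second check: the 8..63 range, one size_t word at a time plus a byte tail. */
	if (len < 64)
	{
		for (; p + sizeof(size_t) <= end; p += sizeof(size_t))
		{
			size_t		w;

			memcpy(&w, p, sizeof(w));
			if (w != 0)
				return false;
		}
		while (p < end)
			if (*p++ != 0)
				return false;
		return true;
	}

	/*
	 * len >= 64: thanks to the two checks above there is no risk of reading
	 * beyond the memory area while consuming 64 bytes per iteration.
	 */
	for (; (size_t) (end - p) >= 64; p += 64)
	{
		size_t		ws[64 / sizeof(size_t)];
		int			i;

		memcpy(ws, p, 64);
		for (i = 0; i < (int) (64 / sizeof(size_t)); i++)
			if (ws[i] != 0)
				return false;
	}
	/* remaining (< 64 byte) tail, byte by byte for simplicity */
	while (p < end)
		if (*p++ != 0)
			return false;
	return true;
}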

 * There is no risk to read beyond the memory area thanks to the len < 64
 * check done below. 

This comment should do a s/below/above/ and a s/check/checks/, as it
refers to the two checks done before the trick with the 64B-per-loop
check.

There is also a "cant'" in the last size_t check.  Simple typo.
--
Michael


signature.asc
Description: PGP signature


Re: altering a column's collation leaves an invalid foreign key

2024-11-13 Thread jian he
On Wed, Nov 13, 2024 at 8:56 PM jian he  wrote:
>
> https://stackoverflow.com/questions/14921668/difference-between-restrict-and-no-action
> mentioned about the difference between "no action" and "restrict".
>
> RI_FKey_restrict_upd comments also says:
>
>  * The SQL standard intends that this referential action occur exactly when
>  * the update is performed, rather than after.  This appears to be
>  * the only difference between "NO ACTION" and "RESTRICT".  In Postgres
>  * we still implement this as an AFTER trigger, but it's non-deferrable.
>
> DROP TABLE IF EXISTS fktable, pktable;
> CREATE TABLE pktable (x text COLLATE case_insensitive PRIMARY KEY);
> CREATE TABLE fktable (x text collate case_insensitive REFERENCES
> pktable on update restrict on delete restrict);
> INSERT INTO pktable VALUES ('A'), ('Å');
> INSERT INTO fktable VALUES ('a');
> update pktable set x = 'a' where x = 'A';

My previous email was too verbose.
My point is this: given that both the PK and FK columns are case_insensitive,
and all PK-side values are being referenced,

in a case like `update pktable set x = 'a' where x = 'A';`
do we consider the PK-side value to have changed?
It doesn't seem to be mentioned anywhere.


> I added a "on update cascade, on delete cascade" tests on collate.icu.utf8.sql
> for both foreign key and primary key are nondeterministic collation.

I found out that we don't have tests for
ALTER TABLE ADD FOREIGN KEY
ALTER TABLE ALTER COLUMN SET DATA TYPE.
so I added these. Please ignore previous email attachments.


minor issue on commit message:
"colllations", typo?


v8-0001-add-more-tests-for-pk-fk-tie-with-nondetermin.no-cfbot1
Description: Binary data


Re: Skip collecting decoded changes of already-aborted transactions

2024-11-13 Thread Peter Smith
Hi Sawada-San,

Here are some more review comments for the latest (accidentally called
v6 again?) v6-0001 patch.

==
contrib/test_decoding/sql/stats.sql

1.
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes.

You had replied (referring to the above comment):
I think we already mentioned the transaction is going to be spilled
but actually not.

~

Yes, spilling was already mentioned in the current comment, but I felt
it assumes the reader already knows the details of why the transaction was
going to be spilled in the first place.

In other words, I thought the comment could include a bit more
explanatory background info:
(Also, it's not really "we detect" the abort -- it's the new postgres
code of this patch that detects it.)

SUGGESTION:
Execute a transaction that is prepared but then aborted. The INSERT
data exceeds the 'logical_decoding_work_mem limit' limit which
normally would result in the transaction being spilled to disk, but
now when Postgres detects the abort it skips the spilling and also
skips collecting further changes.

~~~

2.
+-- Check if the transaction is not spilled as it's already aborted.
+SELECT count(*) FROM
pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM
pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';
+

/Check if the transaction is not spilled/Verify that the transaction
was not spilled/

==
.../replication/logical/reorderbuffer.c

ReorderBufferResetTXN:

3.
  /* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);

Looking at the calling code for ReorderBufferResetTXN it seems this
function can called for streaming OR prepared. So is it OK here to be
passing hardwired 'true' as the txn_streaming parameter, or should
that be passing rbtxn_is_streamed(txn)?

~~~

ReorderBufferLargestStreamableTopTXN:

4.
  if ((largest == NULL || txn->total_size > largest_size) &&
  (txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
- rbtxn_has_streamable_change(txn))
+ rbtxn_has_streamable_change(txn) && !(rbtxn_is_aborted(txn)))
  {
  largest = txn;
  largest_size = txn->total_size;

I felt that this increasingly complicated code would be a lot easier
to understand if you just separate the conditions into: (a) the ones
that filter out transaction you don't care about; (b) the ones that
check for the largest size. For example,

SUGGESTION:
dlist_foreach(...)
{
  ...

  /* Don't consider these kinds of transactions for eviction. */
  if (rbtxn_has_partial_change(txn) ||
!rbtxn_has_streamable_change(txn) || rbtxn_is_aborted(txn))
continue;

  /* Find the largest of the eviction candidates. */
  if ((largest == NULL || txn->total_size > largest_size) &&
(txn->total_size > 0))
  {
largest = txn;
largest_size = txn->total_size;
  }
}

~~~

ReorderBufferCheckMemoryLimit:

5.
+ /* skip the transaction if already aborted */
+ if (ReorderBufferCheckTXNAbort(rb, txn))
+ {
+ /* All changes should be truncated */
+ Assert(txn->size == 0 && txn->total_size == 0);
+ continue;
+ }

The "discard all changes accumulated so far" side-effect happening
here is not very apparent from the function name. Maybe a better name
for ReorderBufferCheckTXNAbort() would be something like
'ReorderBufferCleanupIfAbortedTXN()'.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Enhancing Memory Context Statistics Reporting

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 01:00:52PM +0530, Rahila Syed wrote:
> I used one DSA area per process to share statistics. Currently,
> the size limit for each DSA is 16 MB, which can accommodate
> approximately 6,700 MemoryContextInfo structs. Any additional
> statistics will spill over to a file. I opted for DSAs over DSMs to
> enable memory reuse by freeing segments for subsequent
> statistics copies of the same backend, without needing to
> recreate DSMs for each request.

Already mentioned previously at [1] and echoing with some surrounding
arguments, but I'd suggest keeping it simple and just removing entirely
the part of the patch where the stats information gets spilled to
disk.  With information on more than 6000-ish contexts available and a
hard limit in place, there should be plenty enough to know what's
going on anyway.

[1]: https://postgr.es/m/zxdkx0dywutav...@paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: Converting contrib SQL functions to new style

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 09:15:08AM +0100, Ronan Dunklau wrote:
> Ok, please find attached a new complete patch series including tests for the 
> uncovered functions. Tests pass both before and after the move to SQL-body 
> functions.

In 0001 for the tests of contrib/xml2/, you have forgotten to update
the alternate output used when compiling with xml2 and without
libxslt.  Fixed that, then applied.

0002 was OK as-is, so applied.

0006 to switch contrib/lo/ is so short that using named arguments does
not change much IMO, so I've left it as you proposed, then applied it.
pg_freespace with two functions is small enough that it does not seem
worth bothering more, as well.

Remains the cases of citext, pageinspect, xml2 and earthdistance.

The argument of using named parameters is appealing to self-document
the functions, but it seems to me that it gets much better once we
also do the same for all the functions of a given extension, not only
these whose body is changed.  This information does not show up in a
\dx+, it does in a \df. 

Doing that step-by-step is better than nothing, hence limiting the use
of named parameters for only the functions whose body is rewritten is
fine by me, as a first step, as long as the names are used rather than the
dollar parameter numbers.  I'd suggest taking the bonus step of
applying the same rule to all the other functions, so that everything
lands with the same extension update in a single major release.
Perhaps on top of the patches already proposed?  There is no need for
an extra version bump if all that is done in the same development
cycle.

Thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: Statistics Import and Export

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 01:04:32AM -0500, Corey Huinker wrote:
> Makes sense, and the fix is changing a single character (unless we think it
> warrants a test case).

I'd suggest to add a small test case, actually.  Like all the other
tests of stats_import.sql, the error happens quickly, meaning that such a
negative test is cheap to run while being useful to keep track of.
--
Michael


signature.asc
Description: PGP signature


Re: .ready and .done files considered harmful

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 02:52:31PM -0500, Robert Haas wrote:
> On Wed, Nov 13, 2024 at 11:05 AM Alvaro Herrera  
> wrote:
>> So, my question now is, would there be much opposition to backpatching
>> beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?
> 
> It seems like it's been long enough now that if the new logic had
> major problems we probably would have found them by now; so I feel
> like it's probably pretty safe. Perhaps it's questionable how many
> people we'll help by back-patching this into one additional release,
> but if you feel it's worth it I wouldn't be inclined to argue.

+1 for v14 as this version is still around for two years.

Even getting that down to v13 would be OK for me at this stage, as,
assuming that something is messed up for one reason or another, we would
still have one year to address any issue that could pop up on
REL_13_STABLE.

Now, the conflicts that exist between v14 and v13 in pgarch.[c|h] are
really risky due to the design changes done in d75288fb27b8, so it is
much better to leave v13 out of scope.
--
Michael


signature.asc
Description: PGP signature


Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 09:00:30PM +0100, Tomas Vondra wrote:
> Does that mean you think we should fix the issue at hand differently?
> Say, by looking at number of columns and building the correct tuple,
> like I did in my initial patch?

691e8b2e18 is not something I would have done when it comes to
pageinspect, FWIW.  There is the superuser argument for this module,
so I'd vote for an error and apply the same policy across all branches
as a matter of consistency.

A second argument in favor of raising an error is that it makes future
changes a bit easier to implement, since you can reuse the existing
routine as an "internal" workhorse called by the SQL functions
depending on the version of the extension involved at runtime, as we
do in pgss.  Data type changes are easier to track, for example.  If
you add a solution based on the number of arguments, you may have to
mix it with a per-version approach, at least on HEAD.  As long as
there is a test to check one behavior or another, we're going to be
fine because we'll know if a previous assumption suddenly breaks.
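
Purely as an illustration of that pg_stat_statements-style pattern (the
function and type names below are hypothetical, not the actual pageinspect
code): one SQL-callable wrapper per extension version, all dispatching to a
single internal workhorse that knows which columns to emit.

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

typedef enum
{
	MY_API_V1_11,				/* no "empty" column */
	MY_API_V1_12				/* adds the "empty" column */
} MyApiVersion;

static Datum my_page_items_internal(FunctionCallInfo fcinfo,
									MyApiVersion api_version);

PG_FUNCTION_INFO_V1(my_page_items_1_11);
PG_FUNCTION_INFO_V1(my_page_items_1_12);

Datum
my_page_items_1_11(PG_FUNCTION_ARGS)
{
	return my_page_items_internal(fcinfo, MY_API_V1_11);
}

Datum
my_page_items_1_12(PG_FUNCTION_ARGS)
{
	return my_page_items_internal(fcinfo, MY_API_V1_12);
}

static Datum
my_page_items_internal(FunctionCallInfo fcinfo, MyApiVersion api_version)
{
	/*
	 * Build and return the result set here, emitting the extra column only
	 * when api_version >= MY_API_V1_12.  Omitted in this sketch.
	 */
	(void) api_version;
	PG_RETURN_NULL();
}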

Anyway, your solution to use the number of columns to determine what
should be done at runtime works as well, so feel free to just ignore
me.  That's just an approach I'd prefer taking here because that's 
easier to work with IMO, but I can also see why people don't like the
versioning logic a-la-PGSS, as well.
--
Michael


signature.asc
Description: PGP signature


Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Peter Geoghegan
On Wed, Nov 13, 2024 at 7:48 PM Michael Paquier  wrote:
> On Wed, Nov 13, 2024 at 09:00:30PM +0100, Tomas Vondra wrote:
> > Does that mean you think we should fix the issue at hand differently?
> > Say, by looking at number of columns and building the correct tuple,
> > like I did in my initial patch?
>
> 691e8b2e18 is not something I would have done when it comes to
> pageinspect, FWIW.  There is the superuser argument for this module,
> so I'd vote for an error and apply the same policy across all branches
> as a matter of consistency.

691e8b2e18 was the one that threw the error?

-- 
Peter Geoghegan




Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Peter Geoghegan
On Wed, Nov 13, 2024 at 3:00 PM Tomas Vondra  wrote:
> Does that mean you think we should fix the issue at hand differently?
> Say, by looking at number of columns and building the correct tuple,
> like I did in my initial patch?

I don't feel strongly about it either way. But if it was my fix I'd
probably make it work on both versions. There's plenty of precedent
for that.

-- 
Peter Geoghegan




Re: Reordering DISTINCT keys to match input path's pathkeys

2024-11-13 Thread Richard Guo
On Wed, Nov 13, 2024 at 7:36 PM Andrei Lepikhov  wrote:
> On 11/13/24 13:49, Richard Guo wrote:
> > Thanks for reviewing this patch.  After some consideration, I think
> > it's not too complex to also apply this optimization to DISTINCT ON.
> > The parser already ensures that the DISTINCT ON expressions match the
> > initial ORDER BY expressions; we just need to ensure that the
> > resulting pathkey list from the reordering matches the original
> > distinctClause pathkeys, while leaving the remaining pathkeys
> > unchanged in order.  Please see attached.

> BTW have you ever thought about one more, cost-based reordering strategy?
> For now, we can reorder GROUP-BY and distinct clauses according to two
> orderings: 1) ORDER-BY order and 2) order derived from the underlying
> query tree.
> In thread [1], I try to add one more strategy that minimises the number
> of comparison operator calls. It seems that it would work the same way
> with the DISTINCT statement. Do you think it make sense in general and
> can be a possible direction of improvement for the current patch?

I haven’t had a chance to follow that thread.  From a quick look at
that patch, it seems to improve the general costing logic for sorting.
If that’s the case, I think it would be beneficial in the areas where
we use cost_sort(), including in this patch.

> > I'm not sure about merging these two 'reordering' GUCs into one.
> > While they may look similar, they apply to very different scenarios.
> > However, I'm open to other suggestions.
> Sure, they enable different optimisations. But, they enable highly
> specialised optimisations. Having two GUCs looks too expensive.
> Moreover, this stuff is cost-based and should work automatically. So, I
> treat these GUCs as mostly debugging or last-chance stuff used to
> disable features during severe slowdowns or bugs. It might make sense to
> group them into a single 'Clause Reordering' parameter.

I don't want to argue too much about this, but the two types of
optimizations are quite different in terms of implementation details.
For example, with DISTINCT ON, one key requirement is ensuring that
the resulting pathkey list matches the original distinctClause
pathkeys, and this logic doesn’t exist in the GROUP BY reordering
code.  If these GUCs are mainly for debugging, I think it's better to
keep them separate so that we can debug each optimization individually.

Thanks
Richard




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-11-13 Thread Peter Smith
Hi Nisha.

Thanks for the recent patch updates. Here are my review comments for
the latest patch v48-0001.

==
Commit message

1.
Till now, postgres has the ability to invalidate inactive
replication slots based on the amount of WAL (set via
max_slot_wal_keep_size GUC) that will be needed for the slots in
case they become active. However, choosing a default value for
this GUC is a bit tricky. Because the amount of WAL a database
generates, and the allocated storage for instance will vary
greatly in production, making it difficult to pin down a
one-size-fits-all value.

~

What do the words "for instance" mean here? Did it mean "per instance"
or "(for example)" or something else?

==
doc/src/sgml/system-views.sgml

2.
   
 The time since the slot has become inactive.
-NULL if the slot is currently being used.
-Note that for slots on the standby that are being synced from a
+NULL if the slot is currently being used. Once the
+slot is invalidated, this value will remain unchanged until we shutdown
+the server. Note that for slots on the standby that are being
synced from a
 primary server (whose synced field is
 true), the

Is this change related to the new inactivity timeout feature, or are
you just clarifying the existing behaviour of the 'inactive_since'
field?

Note there is already another thread [1] created to patch/clarify this
same field. So if you are just clarifying existing behavior then IMO
it would be better if you can try to get your desired changes
included there quickly before that other patch gets pushed.

~~~

3.
+ 
+  inactive_timeout means that the slot has been
+  inactive for longer than the amount of time specified by the
+   parameter.
+ 

Maybe there is a slightly shorter/simpler way to express this. For example,

BEFORE
inactive_timeout means that the slot has been inactive for longer than
the amount of time specified by the replication_slot_inactive_timeout
parameter.

SUGGESTION
inactive_timeout means that the slot has remained inactive beyond the
duration specified by the replication_slot_inactive_timeout parameter.

==
src/backend/replication/slot.c

4.
+int replication_slot_inactive_timeout = 0;

IMO it would be more informative to give the units in the variable
name (but not in the GUC name). e.g.
'replication_slot_inactive_timeout_secs'.

~~~

ReplicationSlotAcquire:

5.
+ *
+ * An error is raised if error_if_invalid is true and the slot has been
+ * invalidated previously.
  */
 void
-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid)

This function comment makes it seem like "invalidated previously"
might mean *any* kind of invalidation, but later in the body of the
function we find the logic is really only used for inactive timeout.

+ /*
+ * An error is raised if error_if_invalid is true and the slot has been
+ * previously invalidated due to inactive timeout.
+ */

So, I think a better name for that parameter might be
'error_if_inactive_timeout'

OTOH, if it really is supposed to error for *any* kind of invalidation
then there needs to be more ereports.

~~~

6.
+ errdetail("This slot has been invalidated because it was inactive
for longer than the amount of time specified by \"%s\".",

This errdetail message seems quite long. I think it can be shortened
like below and still retain exactly the same meaning:

BEFORE:
This slot has been invalidated because it was inactive for longer than
the amount of time specified by \"%s\".

SUGGESTION:
This slot has been invalidated due to inactivity exceeding the time
limit set by "%s".

~~~

ReportSlotInvalidation:

7.
+ case RS_INVAL_INACTIVE_TIMEOUT:
+ Assert(inactive_since > 0);
+ appendStringInfo(&err_detail,
+ _("The slot has been inactive since %s for longer than the amount of
time specified by \"%s\"."),
+ timestamptz_to_str(inactive_since),
+ "replication_slot_inactive_timeout");
+ break;

Here also, as in review comment #6 above, I think the message can be
shorter and still say the same thing:

BEFORE:
_("The slot has been inactive since %s for longer than the amount of
time specified by \"%s\"."),

SUGGESTION:
_("The slot has been inactive since %s, exceeding the time limit set
by \"%s\"."),

~~~

SlotInactiveTimeoutCheckAllowed:

8.
+/*
+ * Is this replication slot allowed for inactive timeout invalidation check?
+ *
+ * Inactive timeout invalidation is allowed only when:
+ *
+ * 1. Inactive timeout is set
+ * 2. Slot is inactive
+ * 3. Server is in recovery and slot is not being synced from the primary
+ *
+ * Note that the inactive timeout invalidation mechanism is not
+ * applicable for slots on the standby server that are being synced
+ * from the primary server (i.e., standby slots having 'synced' field 'true').
+ * Synced slots are always considered to be inactive because they don't
+ * perform logical decoding to produce changes.

Re: [PATCH] Refactor SLRU to always use long file names

2024-11-13 Thread Michael Paquier
On Tue, Nov 12, 2024 at 05:37:02PM +0300, Aleksander Alekseev wrote:
> Also it occured to me that as a 4th option we could just get rid of
> this check. Users however will pay the price every time they execute
> pg_upgrade so I doubt we are going to do this.

We cannot remove the check, or Nathan will come after us as he's
working hard on reducing the time pg_upgrade takes.  We should not
make it longer if there is no need to.

The scans may be quite long as well, actually, which could be a
bottleneck.  Did you measure the runtime with a maximized (still
realistic) pool of files for these SLRUs during the upgrade?  For
upgrades, data would be the bottleneck.

# equals SLRU_SEG_FILENAMES_CHANGE_CAT_VER in pg_upgrade.h
my $slru_seg_filenames_change_cat_ver = 202411121;
[...]
open my $fh, "+<", $pg_control_fname or die $!;
binmode($fh);
sysseek($fh, 12, 0);
my $binval = pack("L!", $slru_seg_filenames_change_cat_ver - 1);
syswrite($fh, $binval, 4);
close($fh);

Control file manipulation may be useful as a routine in Cluster.pm,
based on an offset in the file and a format to pack as arguments?  Note
that this also depends on the system endianness, see 039_end_of_wal.pl.
It's one of these things I could see myself reusing to force a state in
the cluster and make a test cheaper, for example.  You don't really
need the lookup part, actually?  You would just need the part where
the control file is rewritten, which should be OK as long as the
cluster is freshly initdb'd, meaning that there should be nothing that
interacts with the new value set.  pg_upgrade only has CAT_VER flags
for some multixact changes and the jsonb check from 9.4.
--
Michael




Re: Fix for pageinspect bug in PG 17

2024-11-13 Thread Michael Paquier
On Wed, Nov 13, 2024 at 07:54:03PM -0500, Peter Geoghegan wrote:
> On Wed, Nov 13, 2024 at 7:48 PM Michael Paquier  wrote:
>> 691e8b2e18 is not something I would have done when it comes to
>> pageinspect, FWIW.  There is the superuser argument for this module,
>> so I'd vote for an error and apply the same policy across all branches
>> as a matter of consistency.
> 
> 691e8b2e18 was the one that threw the error?

Oops.  Sorry about that.  I meant c788115b5eb5 in this sentence, not
691e8b2e18.
--
Michael




RE: Fix for pageinspect bug in PG 17

2024-11-13 Thread Hayato Kuroda (Fujitsu)
Dear Tomas,

Thanks for updating the patch.

I've tested the new patch and confirmed that brin_page_items() could error out:

```
postgres=# SELECT * FROM brin_page_items(get_raw_page('foo_id_idx', 2), 
'foo_id_idx');
ERROR:  function has wrong number of declared columns
HINT:  To resolve the problem, update the "pageinspect" extension to the latest 
version.
```

I also do not have strong opinions on either approach.
If I had to say... it's OK to error out. The source becomes simpler and I
cannot find a use case where a user strongly wants to use the older
pageinspect extension.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Support for NO INHERIT to INHERIT state change with named NOT NULL constraints

2024-11-13 Thread Suraj Kharage
Hi,

Upstream commit 14e87ffa5c543b5f30ead7413084c25f7735039f added support
for named NOT NULL constraints, which are INHERIT by default.
Such a constraint can be declared as NO INHERIT, which means it will not
be inherited by child tables; once in that state, there is currently no
way to change it back to INHERIT.

This patch adds that support: a named NOT NULL constraint defined as NO
INHERIT can be changed to INHERIT.
For this, a new syntax is introduced, along the lines of:

ALTER TABLE table_name ALTER CONSTRAINT constraint_name INHERIT;


Once a not-null constraint is altered from NO INHERIT to INHERIT, we
recurse to all children and propagate the constraint where it doesn't
already exist.
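
As an illustration, here is a minimal sketch of the intended behaviour,
using hypothetical table and constraint names and assuming the named
NOT NULL syntax added by commit 14e87ffa (not taken from the patch's
regression tests):

CREATE TABLE parent (
    a int CONSTRAINT a_not_null NOT NULL NO INHERIT
);
CREATE TABLE child () INHERITS (parent);
-- child.a has no NOT NULL constraint at this point

ALTER TABLE parent ALTER CONSTRAINT a_not_null INHERIT;
-- the constraint is now propagated to child, and existing rows in
-- child are validated against it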


Alvaro stated that allowing a not null constraint state to be modified from
INHERIT to NO INHERIT is going to be quite problematic because of the
number of weird cases to avoid, so for now that support is not added.

Please share your thoughts on the same.

--

Thanks & Regards,
Suraj kharage,



enterprisedb.com 
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 6098ebe..348ac38 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -58,7 +58,7 @@ ALTER TABLE [ IF EXISTS ] name
 ALTER [ COLUMN ] column_name SET COMPRESSION compression_method
 ADD table_constraint [ NOT VALID ]
 ADD table_constraint_using_index
-ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ] [INHERIT]
 VALIDATE CONSTRAINT constraint_name
 DROP CONSTRAINT [ IF EXISTS ]  constraint_name [ RESTRICT | CASCADE ]
 DISABLE TRIGGER [ trigger_name | ALL | USER ]
@@ -108,7 +108,7 @@ WITH ( MODULUS numeric_literal, REM
   PRIMARY KEY index_parameters |
   REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
 [ ON DELETE referential_action ] [ ON UPDATE referential_action ] }
-[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ] [INHERIT]
 
 and table_constraint is:
 
@@ -553,7 +553,10 @@ WITH ( MODULUS numeric_literal, REM
 
  
   This form alters the attributes of a constraint that was previously
-  created. Currently only foreign key constraints may be altered.
+  created. Currently only foreign key and not null constraints may be
+  altered. Only Not null constraints can be modified to INHERIT from
+  NO INHERIT which creates constraints in all inherited childrens and
+  validates the same.
  
 

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ccd9645..360fcdb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -389,7 +389,7 @@ static void AlterIndexNamespaces(Relation classRel, Relation rel,
 static void AlterSeqNamespaces(Relation classRel, Relation rel,
 			   Oid oldNspOid, Oid newNspOid, ObjectAddresses *objsMoved,
 			   LOCKMODE lockmode);
-static ObjectAddress ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
+static ObjectAddress ATExecAlterConstraint(List **wqueue, Relation rel, AlterTableCmd *cmd,
 		   bool recurse, bool recursing, LOCKMODE lockmode);
 static bool ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel,
 	 Relation rel, HeapTuple contuple, List **otherrelids,
@@ -5400,7 +5400,8 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			   lockmode);
 			break;
 		case AT_AlterConstraint:	/* ALTER CONSTRAINT */
-			address = ATExecAlterConstraint(rel, cmd, false, false, lockmode);
+			address = ATExecAlterConstraint(wqueue, rel, cmd, false, false,
+			lockmode);
 			break;
 		case AT_ValidateConstraint: /* VALIDATE CONSTRAINT */
 			address = ATExecValidateConstraint(wqueue, rel, cmd->name, cmd->recurse,
@@ -11622,8 +11623,8 @@ GetForeignKeyCheckTriggers(Relation trigrel,
  * InvalidObjectAddress.
  */
 static ObjectAddress
-ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd, bool recurse,
-	  bool recursing, LOCKMODE lockmode)
+ATExecAlterConstraint(List **wqueue, Relation rel, AlterTableCmd *cmd,
+	  bool recurse, bool recursing, LOCKMODE lockmode)
 {
 	Constraint *cmdcon;
 	Relation	conrel;
@@ -11666,7 +11667,74 @@ ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd, bool recurse,
  errmsg("constraint \"%s\" of relation \"%s\" does not exist",
 		cmdcon->conname, RelationGetRelationName(rel;
 
+	contuple = heap_copytuple(contuple);
+
 	currcon = (Form_pg_constraint) GETSTRUCT(contuple);
+
+	/* Not null constraint */
+	if (cmdcon->contype == CONSTR_NOTNULL)
+	{
+		AttrNumber	colNum;
+		char	   *colName;
+		List	  

Re: RFC: Extension Packaging & Lookup

2024-11-13 Thread David E. Wheeler
On Nov 7, 2024, at 10:30, David E. Wheeler  wrote:

> Last week I tried to integrate all the ideas into this thread and the 
> previous[1] into a single proposal that attempts to work through all the 
> implications and issues. I’ve drafted it as a blog post[2] and plan to 
> publish it next week, following some more feedback. Would appreciate 
> comments, corrections, and any other general feedback:
> 
>  https://github.com/theory/justatheory/pull/7

Got some good feedback on this, in particular about how I might be overthinking 
separate destinations for core vs. package-installed vs. user-installed 
extensions. The RFC proposes separate directories and variables for 
core/vendor/site extensions, borrowing the idea from various dynamic language 
systems.

I came to this thinking that it was important to keep core (contrib, PL) 
extensions separate from non-core extensions, and if so, it’d be useful to have 
other defaults so that `make install` would go to the right one (site by 
default).

But maybe it’s not necessary? If there are no extensions by default, perhaps it 
doesn’t matter?

But of course there are some by default. I just ran `make all && make install`, 
and `share/extension` contains files for plpgsql (and plperl, but I presume it 
could be separated). Meanwhile, `lib` is full of a _ton_ of files.

Would it not be beneficial to have by-default-empty directories into which 
extensions and modules that don’t come with core are installed? Or should they 
*all* be considered one thing? If the latter, perhaps truly core modules should 
be stored separately?

Best,

David





Re: Speed up TransactionIdIsCurrentTransactionId() with lots of subxacts

2024-11-13 Thread Heikki Linnakangas

On 13/11/2024 18:08, Bertrand Drouvot wrote:

=== 1

+  parentOffset--;
+  parents[parentOffset]->totalXids = totalXids;

what about "parents[--parentOffset]->totalXids = totalXids;" instead?

That would remove the ability for someone to insert code between the decrement
and the array access.


IMHO it's good as it is. There's no particular harm if someone inserts 
code there, is there?


I wonder if we should add a "child" pointer to TransactionStateData 
though. That would make it unnecessary to build the temporary array 
here, and it could be used to avoid the recursion in 
ShowTransactionStateRec().



=== 2

+  newsize = 32;
+  CurrentTransactionIds = MemoryContextAlloc(TopTransactionContext,
+ 32 * sizeof(TransactionId));

what about defining a macro for "32" (as it is used in multiple places in
xact.c)?


The only other place is in AtStart_Memory(), but there it's used for a 
different thing. I think a constant is fine here. It'd probably be good 
to change "32 * sizeof(TransactionId)" to "newsize * 
sizeof(TransactionId)" though.



=== 3

-   /* Copy data into output area. */
+   /*
+* In CurrentTransactoinIds,

s/CurrentTransactoinIds/CurrentTransactionIds/


thanks!


4 ===

  int
  xactGetCommittedChildren(TransactionId **ptr)
  {
-   TransactionState s = CurrentTransactionState;
+   return GetCommittedChildren(CurrentTransactionState, ptr);

Worth to move this part of the comment on top of xactGetCommittedChildren(),

"
If there are no subxacts, *ptr is set to NULL.
"

on top of GetCommittedChildren() instead?


I don't see why. The interface is described in 
xactGetCommittedChildren() now, and GetCommittedChildren() just notes 
that it's an internal version of xactGetCommittedChildren(). It seems 
clear to me that you should look at xactGetCommittedChildren() for more 
information.



5 ===

  static void
  AtSubAbort_childXids(void)
  {
-   TransactionState s = CurrentTransactionState;
-
-   /*
-* We keep the child-XID arrays in TopTransactionContext (see
-* AtSubCommit_childXids).  This means we'd better free the array
-* explicitly at abort to avoid leakage.
-*/
-   if (s->childXids != NULL)
-   pfree(s->childXids);
-   s->childXids = NULL;
-   s->nChildXids = 0;
-   s->maxChildXids = 0;
+   /* nothing to do */

Yeah that works but then some CurrentTransactionIds[] elements do not reflect
the "reality" anymore (until the top level transaction finishes, or until a
new subtransaction is created).

Couldn't that be an issue from a code maintainability point of view?


Hmm, I don't know. The subtransaction is in the process of being 
aborted. Whether you consider its XID to still belong to the 
subtransaction until it's fully aborted is a matter of taste. If you 
consider that they still belong to the subtransaction until the 
subtransaction's TransactionState is free'd, it makes sense to leave 
them alone here.


Before this patch, it made sense to reset all those fields when the 
childXids array was free'd, but with this patch, the elements in 
CurrentTransactionIds[] are still valid until the TransactionState is 
free'd.


What would be the alternative? I don't think it makes sense to go out of 
our way to clear the elements in CurrentTransactionIds[]. We could set 
them to InvalidXid to make it clear they're not valid anymore, but no 
one is supposed to look at elements beyond totalXids anyway. We don't do 
such clearing at the end of top transaction either.


Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Some dead code in get_param_path_clause_serials()

2024-11-13 Thread Richard Guo
On Wed, Nov 13, 2024 at 9:59 PM Andrei Lepikhov  wrote:
> On 11/13/24 16:34, Richard Guo wrote:
> > The function get_param_path_clause_serials() is used to get the set of
> > pushed-down clauses enforced within a parameterized Path.  Since we
> > don't currently support parameterized MergeAppend paths, and it
> > doesn't look like that is going to change anytime soon (as explained
> > in the comments for generate_orderedappend_paths), we don't need to
> > consider MergeAppendPath in this function.  Is it worth removing the
> > related code, as attached?
> >
> > This change won't make any measurable difference in performance; it's
> > just for clarity's sake.

> I've passed through the logic of
> get_param_path_clause_serials/reparameterize_path_by_child/reparameterize_path.
> Agree, it seems not useful to parameterise ordered appends in the near
> future.

Thanks for the review.

> Even So, it would be better to insert changes, induced by new
> feature by one commit.

I'm not sure I understand what you mean by this sentence.

> By the way, Do you know if anyone gave a speech on the current state of
> internal plan parameterisation and its possible development directions?
> It would be helpful to have such an explanation.

Not that I'm aware of, but I found the "Parameterized Paths" section
in optimizer/README to be very helpful.

Thanks
Richard




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-11-13 Thread vignesh C
On Wed, 13 Nov 2024 at 15:00, Nisha Moond  wrote:
>
> Please find the v48 patch attached.
>
> On Thu, Sep 19, 2024 at 9:40 AM shveta malik  wrote:
> >
> > When we promote hot standby with synced logical slots to become new
> > primary, the logical slots are never invalidated with
> > 'inactive_timeout' on new primary.  It seems the check in
> > SlotInactiveTimeoutCheckAllowed() is wrong. We should allow
> > invalidation of slots on primary even if they are marked as 'synced'.
>
> fixed.
>
> > I have raised 4 issues so far on v46, the first 3 are in [1],[2],[3].
> > Once all these are addressed, I can continue reviewing further.
> >
>
> Fixed issues reported in [1], [2].

Few comments:
1) Since we don't change the value of now in
ReplicationSlotSetInactiveSince, the function parameter can be passed
by value:
+/*
+ * Set slot's inactive_since property unless it was previously invalidated.
+ */
+static inline void
+ReplicationSlotSetInactiveSince(ReplicationSlot *s, TimestampTz *now,
+   bool
acquire_lock)
+{
+   if (s->data.invalidated != RS_INVAL_NONE)
+   return;
+
+   if (acquire_lock)
+   SpinLockAcquire(&s->mutex);
+
+   s->inactive_since = *now;

2) Currently it allows a minimum value of less than 1 second, such as
values in milliseconds; I feel we should have some minimum value, at
least something like checkpoint_timeout:
diff --git a/src/backend/utils/misc/guc_tables.c
b/src/backend/utils/misc/guc_tables.c
index 8a67f01200..367f510118 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3028,6 +3028,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},

+   {
+   {"replication_slot_inactive_timeout", PGC_SIGHUP,
REPLICATION_SENDING,
+   gettext_noop("Sets the amount of time a
replication slot can remain inactive before "
+"it will be invalidated."),
+   NULL,
+   GUC_UNIT_S
+   },
+   &replication_slot_inactive_timeout,
+   0, 0, INT_MAX,
+   NULL, NULL, NULL
+   },

3) Since the SlotInactiveTimeoutCheckAllowed check is done just above
and the current time has been retrieved, can we use the "now" variable
instead of calling SlotInactiveTimeoutCheckAllowed a second time:
@@ -1651,6 +1713,26 @@
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
if (SlotIsLogical(s))
invalidation_cause = cause;
break;
+   case RS_INVAL_INACTIVE_TIMEOUT:
+
+   /*
+* Check if the slot needs to
be invalidated due to
+*
replication_slot_inactive_timeout GUC.
+*/
+   if
(SlotInactiveTimeoutCheckAllowed(s) &&
+
TimestampDifferenceExceeds(s->inactive_since, now,
+
replication_slot_inactive_timeout * 1000))
+   {
+   invalidation_cause = cause;
+   inactive_since =
s->inactive_since;

4) I'm not sure if this change is required by this patch or is a
general optimization; if it is required for this patch, we can add more
detail to the comments:
@@ -2208,6 +2328,7 @@ RestoreSlotFromDisk(const char *name)
boolrestored = false;
int readBytes;
pg_crc32c   checksum;
+   TimestampTz now;

/* no need to lock here, no concurrent access allowed yet */

@@ -2368,6 +2489,9 @@ RestoreSlotFromDisk(const char *name)
NameStr(cp.slotdata.name)),
 errhint("Change \"wal_level\" to be
\"replica\" or higher.")));

+   /* Use same inactive_since time for all slots */
+   now = GetCurrentTimestamp();
+
/* nothing can be active yet, don't lock anything */
for (i = 0; i < max_replication_slots; i++)
{
@@ -2400,7 +2524,7 @@ RestoreSlotFromDisk(const char *name)
 * slot from the disk into memory. Whoever acquires
the slot i.e.
 * makes the slot active will reset it.
 */
-   slot->inactive_since = GetCurrentTimestamp();
+   slot->inactive_since = now;

5) Why should the slot invalidation be updated during shutdown?
Shouldn't the inactive_since value remain intact during shutdown?
-NULL if the slot is currently being used.
-Note that for slots on the standby that are being synced from a
+NULL if the slot is currently being used. Once the
+slot is invalidated, this value will remain unchanged until we shutdown
+the server.

Improve the error message for logical replication of regular column to generated column.

2024-11-13 Thread Shubham Khanna
Hi all,

Recently there was an issue reported by Kuroda-san on a different
thread [1]. I have created this thread to discuss the issue
separately.

Currently, the ERROR message for the replication of a regular column
on the publisher node to a generated column on the subscriber node
is:-
ERROR: logical replication target relation \"%s.%s\" is missing
replicated columns: %s
For example:-
test_pub=# CREATE TABLE t1(c1 int, c2 int GENERATED ALWAYS AS (c1 * 2) STORED);
test_pub=# CREATE PUBLICATION pub1 for table t1(c1, c2, c3);
test_pub=# INSERT INTO t1 VALUES (1);

test_sub=# CREATE TABLE t1(c1 int, c2 int GENERATED ALWAYS AS (c1 + 2)
STORED, c3 int GENERATED ALWAYS AS (c1 + 2) STORED);
test_sub=# CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr'
PUBLICATION pub1;
-> ERROR: logical replication target relation "t1" is missing
replicated column: "c2","c3"

The error message was misleading, as it failed to clarify that the
replication of a regular column on the publisher to the corresponding
generated column on the subscriber is not supported.
To address the issue, we can update the ERROR message to state that
replicating into a generated column on the subscriber is not
supported. I have attached a patch for the same.

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB5693AF061D62E55189490D2DF5562%40TYCPR01MB5693.jpnprd01.prod.outlook.com

Thanks and Regards,
Shubham Khanna.


v1-0001-Error-message-improvement.patch
Description: Binary data


Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Michael Paquier
On Thu, Nov 14, 2024 at 11:45:56AM +0530, Ashutosh Bapat wrote:
> Here's a quick and dirty patch which describes the idea. I didn't get
> time to implement code to move SnapBuild::restart_lsn if
> SnapBuild::start_decoding_at moves forward while building initial
> snapshot. I am not sure whether that's necessary either.
> 
> I have added three elogs to see if the logic is working as expected. I
> see two of the elogs in patch in the server log when I run tests from
> tests/subscription and tests/recovery. But I do not see the third one.
> That either means that the situation causing the bug is not covered by
> those tests or the fix is not triggered. If you run your reproduction
> and still see the crashes please provide the output of those elog
> messages along with the rest of the elogs you have added.

Forgot the attachment, perhaps?
--
Michael




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Ashutosh Bapat
Thanks for noticing it. Sorry. Attached.

On Thu, Nov 14, 2024 at 12:04 PM Michael Paquier  wrote:
>
> On Thu, Nov 14, 2024 at 11:45:56AM +0530, Ashutosh Bapat wrote:
> > Here's a quick and dirty patch which describes the idea. I didn't get
> > time to implement code to move SnapBuild::restart_lsn if
> > SnapBuild::start_decoding_at moves forward while building initial
> > snapshot. I am not sure whether that's necessary either.
> >
> > I have added three elogs to see if the logic is working as expected. I
> > see two of the elogs in patch in the server log when I run tests from
> > tests/subscription and tests/recovery. But I do not see the third one.
> > That either means that the situation causing the bug is not covered by
> > those tests or the fix is not triggered. If you run your reproduction
> > and still see the crashes please provide the output of those elog
> > messages along with the rest of the elogs you have added.
>
> Forgot the attachment, perhaps?
> --
> Michael



-- 
Best Wishes,
Ashutosh Bapat
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e9..e42fe52de39 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -147,7 +147,7 @@ CheckLogicalDecodingRequirements(void)
  * CreateDecodingContext() performing common tasks.
  */
 static LogicalDecodingContext *
-StartupDecodingContext(List *output_plugin_options,
+StartupDecodingContext(List *output_plugin_options, XLogRecPtr restart_lsn,
 	   XLogRecPtr start_lsn,
 	   TransactionId xmin_horizon,
 	   bool need_full_snapshot,
@@ -212,7 +212,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
-		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
+		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, restart_lsn, start_lsn,
 need_full_snapshot, in_create, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
@@ -438,7 +438,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
 
-	ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+	ctx = StartupDecodingContext(NIL, restart_lsn, restart_lsn, xmin_horizon,
  need_full_snapshot, false, true,
  xl_routine, prepare_write, do_write,
  update_progress);
@@ -591,7 +591,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
-	ctx = StartupDecodingContext(output_plugin_options,
+	ctx = StartupDecodingContext(output_plugin_options, slot->data.restart_lsn,
  start_lsn, InvalidTransactionId, false,
  fast_forward, false, xl_routine, prepare_write,
  do_write, update_progress);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6a4da32668..62f92b51917 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -184,6 +184,7 @@ static void SnapBuildRestoreContents(int fd, char *dest, Size size, const char *
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
 		TransactionId xmin_horizon,
+		XLogRecPtr restart_lsn,
 		XLogRecPtr start_lsn,
 		bool need_full_snapshot,
 		bool in_slot_creation,
@@ -217,6 +218,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
+	builder->restart_lsn = restart_lsn;
 	builder->in_slot_creation = in_slot_creation;
 	builder->building_full_snapshot = need_full_snapshot;
 	builder->two_phase_at = two_phase_at;
@@ -1163,7 +1165,17 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * anything because we hadn't reached a consistent state yet.
 	 */
 	if (txn != NULL && txn->restart_decoding_lsn != InvalidXLogRecPtr)
-		LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+	{
+		elog(LOG, "running xact record at %X/%X, builder::restart_lsn = %X/%X, restart_decoding_lsn of txn (%d): %X/%X",
+			LSN_FORMAT_ARGS(lsn), LSN_FORMAT_ARGS(builder->restart_lsn), txn->xid, LSN_FORMAT_ARGS(txn->restart_decoding_lsn));
+
+		if (txn->restart_decoding_lsn >= builder->restart_lsn)
+			LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
+		else
+			elog(LOG, "skipped candidates generated by running transaction record at %X/%X",
+LSN_FORMAT_ARGS(lsn));
+
+	}
 
 	/*
 	 * No in-progress transaction, can reuse the last serialized snapshot if
@@ -1172,8 +1184,18 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	else if (txn == NULL &&
 			 builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
 			 builder->last_serialized_snapshot != InvalidXLogRecPtr)
-		LogicalIncreaseRestartDecodingForSlot(lsn,
+	{
+
+		elog(LOG, "running xact record at %X/%X, last serialized snapshot (L

Re: per backend I/O statistics

2024-11-13 Thread Michael Paquier
On Fri, Nov 08, 2024 at 02:09:30PM +, Bertrand Drouvot wrote:
>  0001 (the largest one)
> 
> Introduces a new statistics KIND. It is named PGSTAT_KIND_PER_BACKEND as it 
> could
> be used in the future to store other statistics (than the I/O ones) per 
> backend.
> The new KIND is a variable-numbered one and has an automatic cap on the
> maximum number of entries (as its hash key contains the proc number).
> 
> There is no need to serialize the per backend I/O stats to disk (no point to
> see stats for backends that do not exist anymore after a re-start), so a
> new "to_serialize" field is added in the PgStat_KindInfo struct.
> 
> It adds a new pg_my_stat_io view to display "my" backend I/O statistics.
> 
> It also adds a new function pg_stat_reset_single_backend_io_counters() to be
> able to reset the I/O stats for a given backend pid.
> 
> It contains doc updates and dedicated tests.
> 
> A few remarks:
> 
> 1. one assertion in pgstat_drop_entry_internal() is not necessary true 
> anymore 
> with this new stat kind. So, adding an extra bool as parameter to take care 
> of it.

Why?  I have to admit that the addition of this argument at this level
specific to a new stats kind feels really weird and it looks like a
layer violation, while this happens only when resetting the whole
pgstats state because we have loaded a portion of the stats but the
file loaded was in such a state that we don't have a consistent
picture.

> 2. a new struct "PgStat_Backend" is created and does contain the backend type.
> The backend type is used for filtering purpose when the stats are displayed.

Is that necessary?  We can guess that from the PID with a join in
pg_stat_activity.  pg_stat_get_backend_io() from 0004 returns a set of
tuples for a specific PID, as well..
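
As a rough sketch of the kind of join that would provide the backend
type (the pg_stat_get_backend_io() signature and output columns below
are assumptions based on the 0004 patch, not a committed API):

SELECT a.pid, a.backend_type, b.*
  FROM pg_stat_activity AS a,
       LATERAL pg_stat_get_backend_io(a.pid) AS b
 WHERE a.backend_type = 'client backend';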

+   /* save the bktype */
+   if (kind == PGSTAT_KIND_PER_BACKEND)
+   bktype = ((PgStatShared_Backend *) header)->stats.bktype;
[...]
+   /* restore the bktype */
+   if (kind == PGSTAT_KIND_PER_BACKEND)
+   ((PgStatShared_Backend *) header)->stats.bktype = bktype;

Including the backend type results in these blips in
shared_stat_reset_contents() which should not have anything related 
to stats kinds and should remain neutral, as well.

pgstat_prep_per_backend_pending() is used only in pgstat_io.c, so I
guess that it should be static in this file rather than declared in
pgstat.h?

+typedef struct PgStat_PendingIO

Perhaps this part should use a separate structure named
"BackendPendingIO"?  The definition of the structure has to be in
pgstat.h as this is the pending_size of the new stats kind.  It looks
like it would be cleaner to keep PgStat_PendingIO local to
pgstat_io.c, and define PgStat_PendingIO based on
PgStat_BackendPendingIO?

+   /*
+* Do serialize or not this kind of stats.
+*/
+   boolto_serialize:1;

Not sure that "serialize" is the best term that applies here.  For
pgstats entries, serialization refers to the matter of writing their
entries with a "serialized" name because they have an undefined number
when stored locally after a reload.  I'd suggest to split this concept
into its own patch, rename the flag as "write_to_file" (or what you
think is suited), and also apply the flag in the fixed-numbered loop
done in pgstat_write_statsfile() before going through the dshash.

+  
+   
+
+ pg_stat_reset_single_backend_io_counters
+
+pg_stat_reset_single_backend_io_counters ( 
int4 )
+void

This should document that the input argument is a PID.

Is pg_my_stat_io() the best name ever?  I'd suggest to just merge 0004
with 0001.

Structurally, it may be cleaner to keep all the callbacks and the
backend-I/O specific logic into a separate file, perhaps
pgstat_io_backend.c or pgstat_backend_io?

>  0002
> 
> Merge both IO stats flush callback. There is no need to keep both callbacks.
> 
> Merging both allows to save O(N^3) while looping on IOOBJECT_NUM_TYPES,
> IOCONTEXT_NUM_TYPES and IOCONTEXT_NUM_TYPES.

Not sure to be a fan of that, TBH, still I get the performance
argument of the matter.  Each stats kind has its own flush path, and
this assumes that we'll never going to diverge in terms of the
information maintained for each backend and the existing pg_stat_io.
Perhaps there could be a divergence at some point?

> === Remarks
> 
> R1. The key field is still "objid" even if it stores the proc number. I think
> that's fine and there is no need to rename it as it's related comment states:

Let's not change the object ID and the hash key.  The approach you are
using here with a variable-numbered stats kinds and a lookup with the
procnumber is sensible here.

> R2. pg_stat_get_io() and pg_stat_get_backend_io() could probably be merged (
> duplicated code). That's not mandatory to provide the new per backend I/O 
> stats
> feature. So let's keep it as future work to ease the review.

Not sure about that.

> R3. There is no use c

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-13 Thread Ashutosh Bapat
On Wed, Nov 13, 2024 at 5:25 PM Tomas Vondra  wrote:
>
> On 11/13/24 05:38, Ashutosh Bapat wrote:
> > ...
> >
> > Here's way we can fix SnapBuildProcessRunningXacts() similar to
> > DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide
> > whether a given transaction should be decoded or not.
> > /*
> >  * Should the contents of transaction ending at 'ptr' be decoded?
> >  */
> > bool
> > SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
> > {
> > return ptr < builder->start_decoding_at;
> > }
> >
> > Similar to SnapBuild::start_decoding_at we could maintain a field
> > SnapBuild::start_reading_at to the LSN from which the WAL sender would
> > start reading WAL. If candidate_restart_lsn produced by a running
> > transactions WAL record is less than SnapBuild::start_reading_at,
> > SnapBuildProcessRunningXacts() won't call
> > LogicalIncreaseRestartDecodingForSlot() with that candiate LSN. We
> > won't access the slot here and the solution will be inline with
> > DecodeCommit() which skips the transactions.
> >
>
> Could you maybe write a patch doing this? That would allow proper
> testing etc.

Here's a quick and dirty patch which describes the idea. I didn't get
time to implement code to move SnapBuild::restart_lsn if
SnapBuild::start_decoding_at moves forward while building initial
snapshot. I am not sure whether that's necessary either.

I have added three elogs to see if the logic is working as expected. I
see two of the elogs in patch in the server log when I run tests from
tests/subscription and tests/recovery. But I do not see the third one.
That either means that the situation causing the bug is not covered by
those tests or the fix is not triggered. If you run your reproduction
and still see the crashes please provide the output of those elog
messages along with the rest of the elogs you have added.

-- 
Best Wishes,
Ashutosh Bapat




Re: remaining sql/json patches

2024-11-13 Thread Nikita Malakhov
Hi!

Amit, ok, I'll start a new thread with this patch after I deal with an
issue on plan execution that I've found during testing.

On Mon, Nov 11, 2024 at 3:29 AM Amit Langote 
wrote:

> Hi Nikita,
>
> On Sat, Nov 9, 2024 at 5:22 PM Nikita Malakhov  wrote:
> >
> > Hi!
> >
> > We'd like to help to implement SQL/JSON in v18 and have adapted the
> JSON_TABLE PLAN clause code
> > from patch v45-0001-JSON_TABLE.patch.
>
> Nice, thanks.
>
> I think it might be better to start a new thread for your patch and
> please write a description of how it works.
>
> > Could you please review it? There are some places with questionable
> behavior - please check the JSON_TABLE
> > plan execution section in tests, and I'm not sure about the correctness
> of some tests.
>
> I will try to make some time to review it in the January fest.
>
> --
> Thanks, Amit Langote
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-11-13 Thread Nisha Moond
On Thu, Oct 31, 2024 at 11:05 PM Bruce Momjian  wrote:
>
>
> Yes, all good suggestions, updated patch attached.
>

Few comments for the changes under "inactive_since" description:

+The time when slot synchronization (see )
+was most recently stopped.  NULL if the slot
+has always been synchronized.  This is useful for slots on the
+standby that are being synced from a primary server (whose
+synced field is true)
+so they know when the slot stopped being synchronized.

1)
To me, the above lines give the impression that "inactive_since" is
only meaningful for the logical slots which are being synchronized
from primary to standby, which is not correct. On a primary node, this
column gives the timestamp when any slot becomes inactive. By removing
the line -
-The time since the slot has become inactive.
I feel we lost the description that explains this column’s purpose on
primary nodes. I suggest explicitly clarifying the inactive_since
column's meaning on primary nodes as well.
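
For example, on a primary node one would expect a query like the
following (just an illustration, not taken from the patch) to show when
each currently unused slot went idle:

SELECT slot_name, active, inactive_since
  FROM pg_replication_slots
 WHERE NOT active;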

2) Can we add more details about the column's behavior on restarting
a node, something like -
"For the inactive slots, restarting a node resets the "inactive_since"
time to the node's start time. "

--
Thanks,
Nisha




Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY

2024-11-13 Thread vignesh C
On Wed, 13 Nov 2024 at 11:15, Shlok Kyal  wrote:
>
> Thanks for providing the comments.
>
> On Tue, 12 Nov 2024 at 12:52, Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Friday, November 8, 2024 7:06 PM Shlok Kyal  
> > wrote:
> > >
> > > Hi Amit,
> > >
> > > On Thu, 7 Nov 2024 at 11:37, Amit Kapila  wrote:
> > > >
> > > > On Tue, Nov 5, 2024 at 12:53 PM Shlok Kyal 
> > > wrote:
> > > > >
> > > > > To avoid the issue, we can disallow UPDATE/DELETE on table with
> > > > > unpublished generated column as REPLICA IDENTITY. I have attached a
> > > > > patch for the same.
> > > > >
> > > >
> > > > +CREATE PUBLICATION pub_gencol FOR TABLE testpub_gencol; UPDATE
> > > > +testpub_gencol SET a = 100 WHERE a = 1;
> > > > +ERROR:  cannot update table "testpub_gencol"
> > > > +DETAIL:  Column list used by the publication does not cover the
> > > > replica identity.
> > > >
> > > > This is not a correct ERROR message as the publication doesn't have
> > > > any column list associated with it. You have added the code to detect
> > > > this in the column list code path which I think is not required. BTW,
> > > > you also need to consider the latest commit 7054186c4e for this. I
> > > > guess you need to keep another flag in PublicationDesc to detect this
> > > > and then give an appropriate ERROR.
> > >
> > > I have addressed the comments and provided an updated patch. Also, I am
> > > currently working to fix this issue in back branches.
> >
> > Thanks for the patch. I am reviewing it and have some initial comments:
> >
> >
> > 1.
> > +   char attgenerated = get_attgenerated(relid, attnum);
> > +
> >
> > I think it's unnecessary to initialize attgenerated here because the value 
> > will
> > be overwritten if pubviaroot is true anyway. Also, the get_attgenerated()
> > is not cheap.
> >
> Fixed
>
> > 2.
> >
> > I think the patch missed to check the case when table is marked REPLICA
> > IDENTITY FULL, and generated column is not published:
> >
> > CREATE TABLE testpub_gencol (a INT, b INT GENERATED ALWAYS AS (a + 1) 
> > STORED NOT NULL);
> > ALTER TABLE testpub_gencol REPLICA IDENTITY FULL;
> > CREATE PUBLICATION pub_gencol FOR TABLE testpub_gencol;
> > UPDATE testpub_gencol SET a = 2;
> >
> > I expected the UPDATE to fail in above case, but it can still pass after 
> > applying the patch.
> >
> Fixed
>
> > 3.
> >
> > +* If the publication is FOR ALL TABLES we can skip the 
> > validation.
> > +*/
> >
> > This comment seems not clear to me, could you elaborate a bit more on this ?
> >
> I missed to handle the case FOR ALL TABLES. Have removed the comment.
>
> > 4.
> >
> > Also, I think the patch does not handle the FOR ALL TABLE case correctly:
> >
> > CREATE TABLE testpub_gencol (a INT, b INT GENERATED ALWAYS AS (a + 1) 
> > STORED NOT NULL);
> > CREATE UNIQUE INDEX testpub_gencol_idx ON testpub_gencol (b);
> > ALTER TABLE testpub_gencol REPLICA IDENTITY USING index testpub_gencol_idx;
> > CREATE PUBLICATION pub_gencol FOR ALL TABLEs;
> > UPDATE testpub_gencol SET a = 2;
> >
> > I expected the UPDATE to fail in above case as well.
> >
> Fixed
>
> > 5.
> >
> > +   else if (cmd == CMD_UPDATE && !pubdesc.replident_has_valid_gen_cols)
> > +   ereport(ERROR,
> > +   (errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
> > +errmsg("cannot update table \"%s\"",
> > +   
> > RelationGetRelationName(rel)),
> > +errdetail("REPLICA IDENTITY consists of an 
> > unpublished generated column.")));
> >
> > I think it would be better to use lower case "replica identity" to 
> > consistent
> > with other existing messages.
> >
> Fixed
>
> I have attached the updated patch here.

Few comments:
1) In the first check, relation->rd_rel->relispartition is also checked,
whereas below it is not; shouldn't the same check be there below to
avoid a few function calls which are not required:
+   if (pubviaroot && relation->rd_rel->relispartition)
+   {
+   publish_as_relid =
GetTopMostAncestorInPublication(pubid, ancestors, NULL);
+
+   if (!OidIsValid(publish_as_relid))
+   publish_as_relid = relid;
+   }
+

+   if (pubviaroot)
+   {
+   /* attribute name in the child table */
+   char   *colname =
get_attname(relid, attnum, false);
+
+   /*
+* Determine the attnum for the
attribute name in parent (we
+* are using the column list defined
on the parent).
+*/
+   attnum = get_attnum(publish_as_relid, colname);
+   attgenerated =
get_attgenerated(publish_as_relid, attnum);
+   }
+  
