Re: Vacuum statistics

2025-01-02 Thread Sami Imseih
> While backwards compatibility is important, there’s definitely precedent for 
> changing
> what shows up in the catalog. IMHO it’s better to bite the bullet and move 
> those fields
> instead of having vacuum stats spread across two different views.

Right; the most recent example I can think of is pg_stat_checkpointer,
which pulled the checkpoint-related columns out of pg_stat_bgwriter.
In that case, though, the checkpointer and the bgwriter are distinct
background processes, so the split is a clear distinction.

In this case, I am not so sure, particularly because we would then have
the autoanalyze and autovacuum fields in different views. That could be
more confusing to users than saying that pg_stat_all_tables has
high-level metrics about vacuum and analyze, and that for more details
on vacuum they should refer to pg_stat_vacuum_tables (or whatever name
we settle on).

Regards,

Sami




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Larry Rosenman

On 01/02/2025 3:42 pm, Thomas Munro wrote:

On Fri, Jan 3, 2025 at 10:16 AM Larry Rosenman  wrote:

What about doing what Rick suggests?
do {
    dir = opendir("X");
    dp = readdir(dir);
    if (dp != NULL)
        unlink(dp->d_name);
    closedir(dir);
} while (dp != NULL);
?


That only works for unlink loops where we expect no concurrent
modifications.  DROP DATABASE qualifies, but we use readdir() on
directories that might be changing in various other places, including
backups.  They'd silently fail to do their job, and can't use that
technique as they aren't unlinking as they go and so they would never
make progress.  Which leads to the lipstick/pig conclusion: we'd have
other, worse, buggy behaviour still.

It might be possible to make our own readdir snapshot thing that
checks a shmem counter of all links/unlinks before and after and
retries.  But ugh...
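
A minimal sketch of that "snapshot readdir" idea, with the shared
link/unlink counter left as a hypothetical stub (no such counter exists
today) and error handling omitted:

#include <dirent.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a hypothetical shmem counter bumped on every link/unlink. */
static uint64_t
dir_change_counter(void)
{
    return 0;
}

/* Read the whole directory into memory, retrying if anything changed. */
static bool
snapshot_readdir(const char *path, char ***names_out, size_t *n_out)
{
    for (;;)
    {
        uint64_t    before = dir_change_counter();
        DIR        *dir = opendir(path);
        char      **names = NULL;
        size_t      n = 0;
        struct dirent *de;

        if (dir == NULL)
            return false;
        while ((de = readdir(dir)) != NULL)
        {
            names = realloc(names, (n + 1) * sizeof(char *));
            names[n++] = strdup(de->d_name);
        }
        closedir(dir);

        /* Nothing was linked/unlinked while we read: the snapshot is good. */
        if (dir_change_counter() == before)
        {
            *names_out = names;
            *n_out = n;
            return true;
        }

        /* The directory changed under us: discard the snapshot and retry. */
        while (n > 0)
            free(names[--n]);
        free(names);
    }
}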


Would it make sense for ONLY drop database to have the above loop?

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010




Re: WAL-logging facility for pgstats kinds

2025-01-02 Thread Andres Freund
Hi,

On 2024-12-27 12:32:25 +0900, Michael Paquier wrote:
> While brainstorming on the contents of the thread I have posted a
> couple of days ago, I have been looking at what could be done so as
> pgstats and WAL-logging could work together.  This was point 2) from
> this message:
> https://www.postgresql.org/message-id/z2tblemfuofzy...@paquier.xyz
>
> While considering the point that not all stats data is worth
> replicating, I have fallen down to the following properties that are
> nice to have across the board:
> - A pgstats kind should be able to WAL-log some data that it should be
> able to decide, including versioning of the data.

I don't really think that's right.  For cumulative stats to make sense on both
a primary and a replica, or a primary after a crash, they need to cover things
that *already* are WAL logged.

E.g. for pg_stat_all_tables.n_{tup_{ins,upd,del,hot_upd},live_tup,dead_tup,
...}, - which are important for autovacuum scheduling - we should use the WAL
records covering those changes to re-assemble the stats during WAL replay.

The only reason that that's not trivial is that we don't have access to the
relation oid during crash recovery. Which in turn is why I think we should
change relation level stats to be keyed by relfilenode, rather than relation
oid.
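
To make that concrete, here is a very rough sketch of what re-assembling
the stats during replay could look like if relation-level stats were keyed
by relfilenode.  pgstat_count_replay_insert() is entirely hypothetical
(nothing like it exists today); XLogRecGetBlockTag() is the only real API
used:

#include "postgres.h"
#include "access/xlogreader.h"
#include "storage/relfilelocator.h"

/* hypothetical pgstats entry point, keyed by relfilenode rather than oid */
extern void pgstat_count_replay_insert(RelFileLocator rlocator, int ntuples);

/* would be called from heap_redo() while replaying an XLOG_HEAP_INSERT */
static void
heap_insert_replay_stats(XLogReaderState *record)
{
	RelFileLocator rlocator;

	/* the WAL record already identifies the relation by relfilenode */
	XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);

	/* bump n_tup_ins in the stats entry for that relfilenode */
	pgstat_count_replay_insert(rlocator, 1);
}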


I can't think of a real case where we would want to WAL log the stats
themselves, rather than re-emitting stats during replay based on the WAL
record of the "underlying object".

Do you have counter-examples?

Greetings,

Andres Freund




Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size

2025-01-02 Thread Peter Smith
Hi Shubham,

Here are some review comments for patch v5-0001.

==
doc/src/sgml/ref/pg_createsubscriber.sgml

1.
+   
+The source server must have  to
+be set to -1 to prevent the automatic removal of WAL
+replication slots. Setting this parameter to files needed by a
specific size
+may lead to replication failures if required WAL files are
+prematurely deleted.
+   

IIUC, this problem is all about the removal of WAL *files*, not "WAL
replication slots".

Also, the "Setting this parameter to files needed by a specific size"
part sounds strange.

I think this can be simplified anyhow like below.

SUGGESTION:
Replication failures can occur if required WAL files are prematurely
deleted. To prevent this, the source server must set  to -1,
ensuring WAL files are not automatically removed.

==
src/bin/pg_basebackup/pg_createsubscriber.c

check_publisher:

2.
+ if (dry_run && max_slot_wal_keep_size != -1)
+ {
+ pg_log_warning("publisher requires WAL size must not be restricted");
+ pg_log_warning_detail("The max_slot_wal_keep_size parameter is
currently set to a non-default value, which may lead to replication
failures. "
+   "This parameter must be set to -1 to ensure that required WAL
files are not prematurely removed.");
+ }

2a.
Whenever you mention a GUC name in a PostgreSQL message then (by
convention) it must be double-quoted.

~

2b.
The detailed message seems verbose and repetitive to me.

~

2c.
The other nearby messages use the terminology "configuration
parameter", so this should be the same.

~

2d.
The other nearby messages use \"%s\" substitution for the GUC name, so
this should be the same.

~

2e.
IMO advising the user to change a configuration parameter should be
done using the pg_log_warning_hint function (e.g. _hint, not
_detail).

~~

Result:

CURRENT (pg_log_warning_detail)
The max_slot_wal_keep_size parameter is currently set to a non-default
value, which may lead to replication failures.  This parameter must be
set to -1 to ensure that required WAL files are not prematurely
removed.

SUGGESTION (pg_log_warning_hint)
Set the configuration parameter \"%s\" to -1 to ensure that required
WAL files are not prematurely removed.
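
Putting 2a-2e together, the check could end up looking roughly like the
sketch below (the exact wording is of course up to you; pg_log_warning
and pg_log_warning_hint are the existing frontend logging macros):

if (dry_run && max_slot_wal_keep_size != -1)
{
	pg_log_warning("publisher requires WAL size must not be restricted");
	pg_log_warning_hint("Set the configuration parameter \"%s\" to -1 to ensure that required WAL files are not prematurely removed.",
						"max_slot_wal_keep_size");
}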

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Strange issue with NFS mounted PGDATA on ugreen NAS

2025-01-02 Thread Tom Lane
I wrote:
> I forgot to report back, but yesterday I spent time unsuccessfully
> trying to reproduce the problem with macOS client and NFS server
> using btrfs (a Synology NAS running some no-name version of Linux).

Also, I *can* reproduce it using the same NFS server and a FreeBSD
14.2 client.  At least with this pair of machines, the behavior seems
deterministic: "createdb foo" followed by "dropdb foo" leaves the
same set of not-dropped files behind each time.  I added a bit of
debug logging to rmtree.c and verified that it's not seeing anything
odd happening, except that readdir never returns anything about the
missed files.

I will file a bug report, unless you already did?

regards, tom lane




Re: apply_scanjoin_target_to_paths and partitionwise join

2025-01-02 Thread Robert Haas
On Thu, Jan 2, 2025 at 4:41 AM Ashutosh Bapat
 wrote:
> A join between partitions is pushed down if only partitionwise join is
> chosen and a join between partitions won't be pushed down if
> partitionwise join is not chosen. Hence this bug affects pushdown as
> well.
>
> The CF entry shows as waiting for author. But that isn't the right
> status. Will change it to needs review. I think we need a consensus as
> to whether we want to fix this bug or not. Since this bug doesn't
> affect me anymore, I will just withdraw this CF entry if there is no
> interest.

I think it's unhelpful that you keep calling this a "bug" when the
behavior is clearly deliberate. Whether it's the *right* behavior is
debatable, but it's not *accidental* behavior.

I don't actually have a clear understanding of why we need this. In
https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com
Jakub says that an EDB customer experienced a case where the
partitionwise plan took 530+s and the non-partitionwise plan took 23s,
but unfortunately there's no public test case, and in the examples
shared publicly, either the partitionwise plan is actually slower but is
mistakenly estimated to be faster, or the two are extremely close to
the same speed so it doesn't really matter. So the customer scenario
(which is not public) is justification for a code-change, but the
publicly-posted examples, as far as I can see, are not.

And what confuses me is -- what could that test case possibly look
like? I mean, how can it be less efficient to perform N small joins
than 1 big join? For instance, suppose we're doing a merge join
between A and B (partitions A_1..A_N and B_1..B_N) and we have to sort
all the data. With a partitionwise join, we have to do 2N sorts of
partitions of some size, let's say K. The cost should be O(2N * K lg
K). If we instead do two really big sorts, the cost is now O(2 * (NK)
lg (NK)), which is more. If we do a hash join, the cost should be
about the same either way, because probing a hash table is roughly
constant time. However, if we do N small hash joins, the hash table is
a lot more likely to fit in memory -- and if the big hash table does
not fit in memory and the small hash tables do, we should win big.
Finally, let's say we're doing a nested loop. If the inner side of the
nested loop is unparameterized, then the cost of the non-partitionwise
nested loop is O(N^2 * K^2), while the cost of the partitionwise
nested loop is O(N * K^2), which is a huge win. If the inner side is
parameterized, then the partitionwise plan involves scanning one
partition for matching values for each outer row, whereas the
non-partitionwise plan involves scanning every partition for matching
values for each outer row, which is again clearly worse.
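
As a back-of-the-envelope check of the sort arithmetic, here is a
throwaway program with made-up numbers (N = 64 partitions of K = 1e6
rows each; the values are purely illustrative, not from the thread):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		N = 64;			/* number of partitions (made up) */
	double		K = 1e6;		/* rows per partition (made up) */
	double		partitionwise = 2 * N * K * log2(K);	/* 2N sorts of K rows */
	double		single = 2 * (N * K) * log2(N * K);		/* 2 sorts of N*K rows */

	printf("partitionwise:     %.3g comparisons\n", partitionwise);
	printf("non-partitionwise: %.3g comparisons\n", single);
	printf("ratio:             %.2f\n", single / partitionwise);
	return 0;
}

With those numbers the two big sorts need about 1.3x as many comparisons,
consistent with the argument that the partitionwise plan should not lose
on sort cost alone.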

I'm obviously missing something here, because I'm sure Jakub is quite
right when he says that this actually happened and actually hosed an
EDB customer. But I don't understand HOW it happened, and I think if
we're going to change the code we really ought to understand that and
write some code comments about it. In general, I think that it's very
reasonable to expect that a bunch of small joins will beat one big
join, which is why the code does what it currently does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2025-01-02 Thread Jim Nasby

> On Dec 31, 2024, at 5:41 PM, Tomas Vondra  wrote:
> 
> On 12/31/24 21:46, Jim Nasby wrote:
>> On Dec 30, 2024, at 7:05 PM, James Hunter  wrote:
>>> 
>>> On Sat, Dec 28, 2024 at 11:24 PM Jim Nasby  wrote:
 
 IMHO none of this will be very sane until we actually have cluster-level 
 limits. One sudden burst in active connections and you still OOM the 
 instance.
>>> 
>>> Fwiw, PG does support "max_connections" GUC, so a backend/connection -
>>> level limit, times "max_connections", yields a cluster-level limit.
>> 
>> max_connections is useless here, for two reasons:
>> 
>> 1. Changing it requires a restart. That’s at *best* a real PITA in 
>> production. [1]
>> 2. It still doesn’t solve the actual problem. Unless your workload *and* 
>> your data are extremely homogeneous you can’t simply limit the number of 
>> connections and call it a day. A slight change in incoming queries, OR in 
>> the data that the queries are looking at and you go from running fine to 
>> meltdown. You don’t even need a plan flip for this to happen, just the same 
>> plan run at the same rate but now accessing more data than before.
>> 
> 
> I really don't follow your argument ...
> 
> Yes, changing max_connections requires a restart - so what? AFAIK the
> point James was making is that if you multiply max_connections by the
> per-backend limit, you get a cluster-wide limit. And presumably the
> per-backend limit would be a GUC not requiring a restart.
> 
> Yes, high values of max_connections are problematic. I don't see how a
> global limit would fundamentally change that. In fact, it could
> introduce yet more weird failures because some unrelated backend did
> something weird.

That’s basically my argument for having workload management. If a system 
becomes loaded enough for the global limit to start kicking in, it’s likely that 
query response time is increasing, which means you will soon have more and more 
active backends trying to run queries. That’s just going to make the situation 
even worse. You’d either have to start trying to “take memory away” from 
already-running backends, or backends that are just starting would get such a 
low limit that they spill very quickly, creating further load on the system.

> FWIW I'm not opposed to having some global memory limit, but as I
> explained earlier, I don't see a way to do that sensibly without having
> a per-backend limit first. Because if you have a global limit, a single
> backend consuming memory could cause all kinds of weird failures in
> random other backends.

I agree, but I’m also not sure how much a per-backend limit would actually help 
on its own, especially in OLTP environments.

>> Most of what I’ve seen on this thread is discussing ways to *optimize* how 
>> much memory the set of running backends can consume. Adjusting how you slice 
>> the memory pie across backends, or even within a single backend, is 
>> optimization. While that’s a great goal that I do support, it will never 
>> fully fix the problem. At some point you need to either throw your hands in 
>> the air and start tossing memory errors, because you don’t have control over 
>> how much work is being thrown at the engine. The only way that the engine 
>> can exert control over that would be to hold new transactions from starting 
>> when the system is under duress (ie, workload management). While workload 
>> managers can be quite sophisticated (aka, complex), the nice thing about 
>> limiting this scope to work_mem, and only as a means to prevent complete 
>> overload, is that the problem becomes a lot simpler since you’re only 
>> looking at one metric and not trying to support any kind of priority system. 
>> The only fanciness I think an MVP would need is a GUC to control how long a 
>> transaction can sit waiting before it throws an error. Frankly, that sounds 
>> a lot less complex and much easier for DBAs to adjust than trying to teach 
>> the planner how to apportion out per-node work_mem limits.
>> 
>> As I said, I’m not opposed to optimizations, I just think they’re very much 
>> cart-before-the-horse.
>> 
> 
> What optimization? I didn't notice anything like that. I don't see how
> "adjusting how you slice the memory pie across backends" counts as an
> optimization. I mean, that's exactly what a memory limit is meant to do.
> 
> Similarly, there was a proposal to do planning with work_mem, and then
> go back and adjust the per-node limits to impose a global limit. That
> does not seem like an optimization either ... (more an opposite of it).

It’s optimization in that you’re trying to increase how many active backends 
you can have before getting memory errors. It’s an alternative to throwing more 
memory at the problem or limiting the rate of incoming workload.

>> 1: While it’d be a lot of work to make max_connections dynamic one thing we 
>> could do fairly easily would be to introduce another GUC (max_backends?) 
>> that actually controls the total numbe

Re: Logical Replication of sequences

2025-01-02 Thread Peter Smith
Hi Vignesh.

Here are some review comments for patch v20241230-0002

==
1. SYNTAX

The proposed syntax is currently:

CREATE PUBLICATION name
[ FOR ALL object_type [, ...]
  | FOR publication_object [, ... ] ]
[ WITH ( publication_parameter [= value] [, ... ] ) ]

where object_type is one of:

TABLES
SEQUENCES

where publication_object is one of:

TABLE [ ONLY ] table_name [ * ] [ ( column_name [, ... ] ) ] [
WHERE ( expression ) ] [, ... ]
TABLES IN SCHEMA { schema_name | CURRENT_SCHEMA } [, ... ]
~

But lately, I've been thinking it could be clearer if you removed the
object_type and instead fully spelled out FOR ALL TABLES and/or FOR
ALL SEQUENCES.

compare
CREATE PUBLICATION FOR ALL TABLES, SEQUENCES;
versus
CREATE PUBLICATION FOR ALL TABLES, ALL SEQUENCES;

~

Also AFAICT, the current syntax says it is impossible to mix FOR ALL
SEQUENCES with FOR TABLE etc but really that *should* be allowed,
right?

And it looks like you may come to similar grief in future if you try
things like:
"FOR ALL TABLES" mixed with "FOR SEQUENCE seq_name"
"FOR ALL TABLES" mixed with "FOR SEQUENCES IN SCHEMA schema_name"

~

So, maybe a revised syntax like below would end up being easier and
also more flexible:

CREATE PUBLICATION name
[ FOR publication_object [, ... ] ]
[ WITH ( publication_parameter [= value] [, ... ] ) ]

where publication_object is one of:

ALL TABLES
ALL SEQUENCES
TABLE [ ONLY ] table_name [ * ] [ ( column_name [, ... ] ) ] [
WHERE ( expression ) ] [, ... ]
TABLES IN SCHEMA { schema_name | CURRENT_SCHEMA } [, ... ]

==
src/backend/commands/publicationcmds.c

CreatePublication:

2.
- /* FOR ALL TABLES requires superuser */
- if (stmt->for_all_tables && !superuser())
+ /* FOR ALL TABLES or FOR ALL SEQUENCES requires superuser */
+ if ((stmt->for_all_tables || stmt->for_all_sequences) && !superuser())
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must be superuser to create FOR ALL TABLES publication")));
+ errmsg("must be superuser to create ALL TABLES and/or ALL SEQUENCES
publication")));


2a.
Typo.

/create ALL TABLES and/or ALL SEQUENCES publication/create a FOR ALL
TABLES and/or a FOR ALL SEQUENCES publication/

~

2b.
This message might be OK now, but I suspect it will become very messy
in future after you introduce another syntax like "FOR SEQUENCE
seq_name" etc (which would also be able to be used in combination with
a FOR ALL TABLES).

So, I think that for future-proofing against all the possible (future)
combinations, and for keeping the code cleaner, it will be far simpler
to just keep the errors for tables and sequences separated:

SUGGESTION:
if (!superuser())
{
  if (stmt->for_all_tables)
ereport(ERROR, ... FOR ALL TABLES ...);
  if (stmt->for_all_sequences)
ereport(ERROR, ... FOR ALL SEQUENCES ...);
}

~~~

AlterPublicationOwner_internal:

3.
- if (form->puballtables && !superuser_arg(newOwnerId))
+ /* FOR ALL TABLES or FOR ALL SEQUENCES requires superuser */
+ if ((form->puballtables || form->puballsequences) &&
+ !superuser_arg(newOwnerId))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied to change owner of publication \"%s\"",
  NameStr(form->pubname)),
- errhint("The owner of a FOR ALL TABLES publication must be a superuser.")));
+ errhint("The owner of ALL TABLES and/or ALL SEQUENCES publication
must be a superuser.")));

Ditto the above comment #2.

==
src/bin/psql/describe.c

4.
+ puboid_col = cols++;
+ pubname_col = cols++;
+ pubowner_col = cols++;
+ puballtables_col = cols++;
+
+ if (has_pubsequence)
+ {
+ appendPQExpBufferStr(&buf,
+ ", puballsequences");
+ puballsequences_col = cols++;
+ }
+
+ appendPQExpBufferStr(&buf,
+ ", pubinsert, pubupdate, pubdelete");
+ pubins_col = cols++;
+ pubupd_col = cols++;
+ pubdel_col = cols++;
+
  if (has_pubtruncate)
+ {
  appendPQExpBufferStr(&buf,
  ", pubtruncate");
+ pubtrunc_col = cols++;
+ }
+
  if (has_pubgencols)
+ {
  appendPQExpBufferStr(&buf,
  ", pubgencols");
+ pubgen_col = cols++;
+ }
+
  if (has_pubviaroot)
+ {
  appendPQExpBufferStr(&buf,
  ", pubviaroot");
+ pubviaroot_col = cols++;
+ }

There is some overlap/duplication of the new variable 'cols' and the
existing variable 'ncols'.

AFAICT you can just move/replace the declaration of 'ncols' to where
'cols' is declared, and then you can remove the duplicated code below
(because the above code is already doing the same thing).

if (has_pubtruncate)
  ncols++;
if (has_pubgencols)
  ncols++;
if (has_pubviaroot)
  ncols++;
if (has_pubsequence)
  ncols++;

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Conflict detection for update_deleted in logical replication

2025-01-02 Thread Amit Kapila
On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Attach the new version patch set which addressed all other comments.
>

Some more miscellaneous comments:
=
1.
@@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
  * modifying it.  This makes checkpoint's determination of which xacts
  * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
  */
- Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
  START_CRIT_SECTION();
- MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+ MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;

  /*
  * Insert the commit XLOG record.
@@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
  */
  if (markXidCommitted)
  {
- MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+ MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
  END_CRIT_SECTION();

The comments related to this change should be updated in EndPrepare()
and RecordTransactionCommitPrepared(). They still refer to the
DELAY_CHKPT_START flag. We should update the comments explaining why a
similar change is not required for prepare or commit_prepare, if there
is one.

2.
 static bool
 tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
- TypeCacheEntry **eq)
+ TypeCacheEntry **eq, Bitmapset *columns)
 {
  int attrnum;

@@ -337,6 +340,14 @@ tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
  if (att->attisdropped || att->attgenerated)
  continue;

+ /*
+ * Ignore columns that are not listed for checking.
+ */
+ if (columns &&
+ !bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber,
+columns))
+ continue;

Update the comment atop tuples_equal to reflect this change.

3.
+FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
+ TransactionId *delete_xid,
+ RepOriginId *delete_origin,
+ TimestampTz *delete_time)
...
...
+ /* Try to find the tuple */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
+ {
+ bool dead = false;
+ TransactionId xmax;
+ TimestampTz localts;
+ RepOriginId localorigin;
+
+ if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
+ continue;
+
+ tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
+ buf = hslot->buffer;
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) ==
HEAPTUPLE_RECENTLY_DEAD)
+ dead = true;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!dead)
+ continue;

Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not
HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find the live
tuple; now, whether the tuple is DEAD or RECENTLY_DEAD, why should it
matter for detecting the update_delete conflict?

4. In FindMostRecentlyDeletedTupleInfo(), add comments to state why we
need to use SnapshotAny.

5.
+
+  
+detect_update_deleted
(boolean)
+
+ 
+  Specifies whether the detection of 
+  is enabled. The default is false. If set to
+  true, the dead tuples on the subscriber that are still useful for
+  detecting 
+  are retained,

One of the purposes of retaining dead tuples is to detect
update_delete conflict. But, I also see the following in 0001's commit
message: "Since the mechanism relies on a single replication slot, it
not only assists in retaining dead tuples but also preserves commit
timestamps and origin data. These information will be displayed in the
additional logs generated for logical replication conflicts.
Furthermore, the preserved commit timestamps and origin data are
essential for consistently detecting update_origin_differs conflicts."
which indicates there are other cases where retaining dead tuples can
help. So, I was thinking about whether to name this new option as
retain_dead_tuples or something along those lines?

BTW, it is not clear how retaining dead tuples will help the detection
of update_origin_differs. Will it happen when the tuple is inserted or
updated on the subscriber and then, when we try to update the same
tuple due to a remote update, the commit_ts information of the xact is
not available because it has already been removed by vacuum? This
should happen in the update case for the new row generated by the
update operation, as that will be used in the comparison. Can you
please show it with a test case, even if it is manual?

Can't it happen for delete_origin_differs as well for the same reason?

6. I feel we should keep 0004 as a later patch. We can ideally
consider committing 0001, 0002, 0003, 0005, and 0006 (or part of 0006
to get some tests that are relevant) as one unit and then the patch to
detect and report update_delete conflict. What do you think?

-- 
With Regards,
Amit Kapila.




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Tom Lane
Thomas Munro  writes:
> For what little it's worth, I'm not quite convinced yet that FreeBSD's
> client isn't more broken than it needs to be.

I'm suspicious of that too.  The wireshark trace you described is hard
to read any other way than that FreeBSD went out of its way to deliver
incorrect information.  I'm prepared to believe that we can't work
correctly on NFS servers that don't do the stable-cookie thing, but
why isn't it succeeding on ones that do?

regards, tom lane




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> For what little it's worth, I'm not quite convinced yet that FreeBSD's
>> client isn't more broken than it needs to be.

> I'm suspicious of that too.

I poked at this a little further.  I made the attached stand-alone
test case (you don't need any more than "cc -o rmtree rmtree.c"
to build it, then point the script at some NFS-mounted directory).
This fails with my NAS at least as far back as FreeBSD 11.0.
I also tried it on NetBSD 9.2 which seems fine.

regards, tom lane

#! /bin/sh

set -e

TESTDIR="$1"

mkdir "$TESTDIR"

i=0
while [ $i -lt 1000 ]
do
  touch "$TESTDIR/$i"
  i=`expr $i + 1`
done

./rmtree "$TESTDIR"
/*-
 *
 * rmtree.c
 *
 * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/common/rmtree.c
 *
 *-
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>


typedef enum PGFileType
{
	PGFILETYPE_ERROR,
	PGFILETYPE_UNKNOWN,
	PGFILETYPE_REG,
	PGFILETYPE_DIR,
	PGFILETYPE_LNK,
} PGFileType;


static void *
palloc(size_t size)
{
	void	   *tmp;

	/* Avoid unportable behavior of malloc(0) */
	if (size == 0)
		size = 1;
	tmp = malloc(size);
	if (tmp == NULL)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return tmp;
}

static void *
repalloc(void *ptr, size_t size)
{
	void	   *tmp;

	/* Avoid unportable behavior of realloc(NULL, 0) */
	if (ptr == NULL && size == 0)
		size = 1;
	tmp = realloc(ptr, size);
	if (!tmp)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return tmp;
}

static char *
pstrdup(const char *in)
{
	char	   *tmp;

	if (!in)
	{
		fprintf(stderr,
"cannot duplicate null pointer (internal error)\n");
		exit(1);
	}
	tmp = strdup(in);
	if (!tmp)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return tmp;
}

/*
 * Return the type of a directory entry.
 */
static PGFileType
get_dirent_type(const char *path,
const struct dirent *de,
bool look_through_symlinks)
{
	PGFileType	result;

	/*
	 * Some systems tell us the type directly in the dirent struct, but that's
	 * a BSD and Linux extension not required by POSIX.  Even when the
	 * interface is present, sometimes the type is unknown, depending on the
	 * filesystem.
	 */
#if defined(DT_REG) && defined(DT_DIR) && defined(DT_LNK)
	if (de->d_type == DT_REG)
		result = PGFILETYPE_REG;
	else if (de->d_type == DT_DIR)
		result = PGFILETYPE_DIR;
	else if (de->d_type == DT_LNK && !look_through_symlinks)
		result = PGFILETYPE_LNK;
	else
		result = PGFILETYPE_UNKNOWN;
#else
	result = PGFILETYPE_UNKNOWN;
#endif

	if (result == PGFILETYPE_UNKNOWN)
	{
		struct stat fst;
		int			sret;


		if (look_through_symlinks)
			sret = stat(path, &fst);
		else
			sret = lstat(path, &fst);

		if (sret < 0)
		{
			result = PGFILETYPE_ERROR;
			fprintf(stderr, "could not stat file \"%s\": %m\n", path);
		}
		else if (S_ISREG(fst.st_mode))
			result = PGFILETYPE_REG;
		else if (S_ISDIR(fst.st_mode))
			result = PGFILETYPE_DIR;
		else if (S_ISLNK(fst.st_mode))
			result = PGFILETYPE_LNK;
	}

	return result;
}


/*
 *	rmtree
 *
 *	Delete a directory tree recursively.
 *	Assumes path points to a valid directory.
 *	Deletes everything under path.
 *	If rmtopdir is true deletes the directory too.
 *	Returns true if successful, false if there was any problem.
 *	(The details of the problem are reported already, so caller
 *	doesn't really have to say anything more, but most do.)
 */
static bool
rmtree(const char *path, bool rmtopdir)
{
	char		pathbuf[8192];
	DIR		   *dir;
	struct dirent *de;
	bool		result = true;
	size_t		dirnames_size = 0;
	size_t		dirnames_capacity = 8;
	char	  **dirnames;

	dir = opendir(path);
	if (dir == NULL)
	{
		fprintf(stderr, "could not open directory \"%s\": %m\n", path);
		return false;
	}

	dirnames = (char **) palloc(sizeof(char *) * dirnames_capacity);

	while (errno = 0, (de = readdir(dir)))
	{
		if (strcmp(de->d_name, ".") == 0 ||
			strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
		switch (get_dirent_type(pathbuf, de, false))
		{
			case PGFILETYPE_ERROR:
				/* already logged, press on */
				break;
			case PGFILETYPE_DIR:

				/*
				 * Defer recursion until after we've closed this directory, to
				 * avoid using more than one file descriptor at a time.
				 */
				if (dirnames_size == dirnames_capacity)
				{
					dirnames = repalloc(dirnames,
										sizeof(char *) * dirnames_capacity * 2);
					dirnames_capacity *= 2;
				}
				dirnames[dirnames_size++] = pstrdup(pathbuf);
				break;
			default:
				if (unlink(pathbuf) != 0 && errno != ENOENT)
				{
					fprintf(stderr, "could not remove file \"%s\": %m\n", pathbuf);
					result = false;
				}
				break;
		}
	}

	if (errno != 0)
	{

Re: Conflict detection for update_deleted in logical replication

2025-01-02 Thread Masahiko Sawada
On Tue, Dec 24, 2024 at 6:43 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 
>  wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >   Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still 
> > invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to 
> setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be 
> better
> if the apply worker can avoid reporting an ERROR if the publisher's clock
> lags behind that of the subscriber, so I implemented a new 0007 patch to allow
> the apply worker to wait for the clock skew to pass and then send a new 
> request
> to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that all
> the committing transactions have finished, it means any subsequent 
> transactions
> on the publisher should be assigned a commit timestamp later then reply_time.
> And the (candidate_xid_time) when it determines the oldest active xid. Any old
> transactions on the publisher that have finished should have a commit 
> timestamp
> earlier than the candidate_xid_time.
>
> The apply worker can compare the candidate_xid_time with reply_time. If
> candidate_xid_time is less than the reply_time, then it's OK to advance the 
> xid
> immediately. If candidate_xid_time is greater than reply_time, it means the
> clock of publisher is behind that of the subscriber, so the apply worker can
> wait for the skew to pass before advancing the xid.
>
> Since this is considered as an improvement, we can focus on this after
> pushing the main patches.
>

Thank you for updating the patches!

I have one comment on the 0001 patch:

+   /*
+    * The changes made by this and later transactions are still non-removable
+    * to allow for the detection of update_deleted conflicts when applying
+    * changes in this logical replication worker.
+    *
+    * Note that this info cannot directly protect dead tuples from being
+    * prematurely frozen or removed. The logical replication launcher
+    * asynchronously collects this info to determine whether to advance the
+    * xmin value of the replication slot.
+    *
+    * Therefore, FullTransactionId that includes both the transaction ID and
+    * its epoch is used here instead of a single Transaction ID. This is
+    * critical because without considering the epoch, the transaction ID
+    * alone may appear as if it is in the future due to transaction ID
+    * wraparound.
+    */
+   FullTransactionId oldest_nonremovable_xid;

The last paragraph of the comment mentions that we need to use
FullTransactionId to properly compare XIDs even after XID
wraparound happens. But once we set the oldest-nonremovable-xid, it
prevents XIDs from wrapping around, no? I mean that workers'
oldest-nonremovable-xid values and the slot's non-removal xid (i.e., its
xmin) are never more than 2^31 XIDs apart.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Tom Lane
Larry Rosenman  writes:
> @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on 
> my issue.

Thanks for reaching out to him.  So if I'm reading this correctly,
there's little point in filing a FreeBSD bug because it'll be
dismissed as unfixable.

This leaves us in rather a nasty position.  Sure, we could rewrite
rmtree() as Thomas suggested upthread, but I'm still of the opinion
that that's smearing lipstick on a pig.  rmtree() is the least of
our worries: it doesn't need to expect that anybody else will be
modifying the target directory, plus it can easily restart its scan
without complicated bookkeeping.  I doubt we can make such an
assumption for all our uses of readdir(), or that it's okay to
miss or double-process files in every one of them.

I'm still of the opinion that the best thing to do is disclaim
safety of storing a database on NFS.

regards, tom lane




Re: apply_scanjoin_target_to_paths and partitionwise join

2025-01-02 Thread Tom Lane
Robert Haas  writes:
> I'm obviously missing something here, because I'm sure Jakub is quite
> right when he says that this actually happened and actually hosed an
> EDB customer. But I don't understand HOW it happened, and I think if
> we're going to change the code we really ought to understand that and
> write some code comments about it. In general, I think that it's very
> reasonable to expect that a bunch of small joins will beat one big
> join, which is why the code does what it currently does.

I am wondering if the problem is not that the plan is slower, it's
that for some reason the planner took a lot longer to create it.
It's very plausible that partitionwise planning takes longer, and
maybe we have some corner cases where the time is O(N^2) or worse.

However, this is pure speculation without a test case, and any
proposed fix would be even more speculative.  I concur with your
bottom line: we should insist on a public test case before deciding
what to do about it.

regards, tom lane




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Robert Haas
On Thu, Jan 2, 2025 at 3:49 PM Tom Lane  wrote:
> I'm still of the opinion that the best thing to do is disclaim
> safety of storing a database on NFS.

If we're going to disclaim support for NFS, it would certainly be
better to do that clearly and with reasons than otherwise. However, I
suspect a substantial number of users will still use NFS and then
blame us when it doesn't work; or alternatively say that our software
sucks because it doesn't work on NFS. Neither of which are very nice
outcomes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: apply_scanjoin_target_to_paths and partitionwise join

2025-01-02 Thread Robert Haas
On Thu, Jan 2, 2025 at 3:58 PM Tom Lane  wrote:
> I am wondering if the problem is not that the plan is slower, it's
> that for some reason the planner took a lot longer to create it.
> It's very plausible that partitionwise planning takes longer, and
> maybe we have some corner cases where the time is O(N^2) or worse.

That doesn't seem like a totally unreasonable speculation, but it
seems a little surprising that retaining the non-partitionwise paths
would fix it. True, that might let us discard a bunch of partitionwise
paths more quickly than would otherwise be possible, but I wouldn't
expect that to have an impact as dramatic as what Jakub alleged. The
thing I thought about was whether there might be some weird effects
with lots of empty partitions; or maybe with some other property of
the path like say sort keys or parallelism. For example if we couldn't
generate a partitionwise path with sort keys as good as the
non-partitionwise path had, or if we couldn't generate a parallel
partitionwise path but we could generate a parallel non-partitionwise
path. As far as I know, neither of those things is a real problem, but
if they were then I believe they could pretty easily explain a large
regression.

> However, this is pure speculation without a test case, and any
> proposed fix would be even more speculative.  I concur with your
> bottom line: we should insist on a public test case before deciding
> what to do about it.

Yeah.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jan 2, 2025 at 3:49 PM Tom Lane  wrote:
>> I'm still of the opinion that the best thing to do is disclaim
>> safety of storing a database on NFS.

> If we're going to disclaim support for NFS, it would certainly be
> better to do that clearly and with reasons than otherwise. However, I
> suspect a substantial number of users will still use NFS and then
> blame us when it doesn't work; or alternatively say that our software
> sucks because it doesn't work on NFS. Neither of which are very nice
> outcomes.

Are you prepared to buy into "we will make every bit of code that uses
readdir() proof against arbitrary lies from readdir()"?  I'm not:
I cannot see how to do that in any but the simplest cases -- rmtree()
being about the simplest.  Even if we had a bulletproof design in
mind, it's pretty hard to believe that future patches would apply
it every time.  I think it's inevitable that we'd have bugs like
"sometimes not every file gets fsync'd", which would be impossible
to find until someone's data gets eaten, and maybe still impossible
to find afterwards.

(To be clear: if this is how FreeBSD acts, then I'm afraid we already
do have such bugs.  The rmtree case is just easier to observe than a
missed fsync.)

regards, tom lane




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Larry Rosenman

On 01/02/2025 4:56 pm, Tom Lane wrote:

Larry Rosenman  writes:

Would it make sense for ONLY drop database to have the above loop?


Not really.  We'd just be papering over the most-easily-observable
consequence of readdir's malfeasance.  There'd still be problems
like basebackups omitting files, missed fsyncs potentially leading
to data loss, etc.

I am sort of wondering though why we've not heard reports of this
many times before.  Did FreeBSD recently change something in this
area?  Also, if as they argue it's a fundamental problem in the NFS
protocol, why are we failing to reproduce it with other clients?

regards, tom lane

Pre version 16, we worked just fine.  There was a change in 16.

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Tom Lane
Larry Rosenman  writes:
> Would it make sense for ONLY drop database to have the above loop?

Not really.  We'd just be papering over the most-easily-observable
consequence of readdir's malfeasance.  There'd still be problems
like basebackups omitting files, missed fsyncs potentially leading
to data loss, etc.

I am sort of wondering though why we've not heard reports of this
many times before.  Did FreeBSD recently change something in this
area?  Also, if as they argue it's a fundamental problem in the NFS
protocol, why are we failing to reproduce it with other clients?

regards, tom lane




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Tom Lane
Larry Rosenman  writes:
> On 01/02/2025 4:56 pm, Tom Lane wrote:
>> I am sort of wondering though why we've not heard reports of this
>> many times before.  Did FreeBSD recently change something in this
>> area?  Also, if as they argue it's a fundamental problem in the NFS
>> protocol, why are we failing to reproduce it with other clients?

> Pre version 16, we worked just fine.  There was a change in 16.

Well, as Thomas already mentioned, rmtree() used to work by reading
the directory and making an in-memory copy of the whole file list
before it started to delete anything.  But reverting that wouldn't
fix all of the other problems that we now know exist.  (In other
words, "we worked just fine" only for small values of "fine".)

regards, tom lane




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Thomas Munro
On Fri, Jan 3, 2025 at 10:53 AM Tom Lane  wrote:
> (To be clear: if this is how FreeBSD acts, then I'm afraid we already
> do have such bugs.  The rmtree case is just easier to observe than a
> missed fsync.)

For what little it's worth, I'm not quite convinced yet that FreeBSD's
client isn't more broken than it needs to be.  Lots of systems I
looked at have stable cookies in practice (as NFS 4 recommends),
including the one used in this report, so it seems like a more basic
problem.  At the risk of being wrong on the internet, I don't see any
fundamental reason why a readdir() scan can't have no-skip,
no-duplicate, no-fail semantics for stable-cookie, no-verification
servers.  And this case works perfectly with a couple of other NFS
clients implementations that you and I tried.

As for systems that don't have stable cookies, well then they should
implement the cookie verification scheme and AFAICS the readdir() scan
should then fail if it can't recover, or it would expose isolation
anomalies offensive to database hacker sensibilities.  I think it
should be theoretically possible to recover in some happy cases
(maybe: when enough state is still around in cache to deduplicate).
But that shouldn't be necessary on filers using eg ZFS or BTRFS whose
cookies are intended to be stable.  A server could also do MVCC magic,
and from a quick google, I guess NetApp WAFL might do that as it has
the concept of "READDIR expired", which smells a bit like ORA-01555:
snapshot too old.




Re: Proposal: Progressive explain

2025-01-02 Thread jian he
hi.


All the newly added GUCs:
progressive_explain
progressive_explain_verbose
progressive_explain_settings
progressive_explain_interval
progressive_explain_output_size
progressive_explain_format
progressive_explain_sample_rate
also need to be added to postgresql.conf.sample?


In doc/src/sgml/monitoring.sgml, we also need to add the
view pg_stat_progress_explain
to the section

Dynamic Statistics Views
(Table 27.1. Dynamic Statistics Views)

pg_stat_progress_explain.explain will be truncated after 4096 bytes
(the default value of progressive_explain_output_size),
so if progressive_explain_format is json
and the plan is bigger (imagine two partitioned tables joined
together, each having many partitions),
the "explain" column text may not be valid JSON.
Should we be concerned about this?


I don't really understand the actual usage of
pg_stat_progress_explain.explain_count.
The usage of the other columns makes sense to me.
Can you share your idea of why we need this column?


select name, category from pg_settings
where category = 'Query Tuning / Planner Method Configuration';
shows that, for config_group QUERY_TUNING_METHOD,
all the GUC names generally begin with "enable".
Setting config_group to QUERY_TUNING_METHOD for all the GUCs whose
names begin with "progressive" may not be appropriate?
It is also not related to query tuning.


#include "utils/backend_status.h"
#include "storage/procarray.h"
#include "executor/spi.h"
#include "utils/guc.h"

src/backend/commands/explain.c
the header generally should be sorted in alphabetical ascending order.
apply the order to ipci.c, execMain.c, execProcnode.c

else
/* Node in progress */
if (es->progressive && planstate ==
planstate->state->progressive_explain_current_node)
appendStringInfo(es->str,
 " (actual rows=%.0f loops=%.0f) (current)",
 rows, nloops);
else
appendStringInfo(es->str,
 " (actual rows=%.0f loops=%.0f)",
 rows, nloops);

In the above part of ExplainNode in src/backend/commands/explain.c, the
indentation looks wrong to me.




Re: Avoid orphaned objects dependencies, take 3

2025-01-02 Thread Bertrand Drouvot
Hi,

On Mon, Oct 28, 2024 at 09:30:19AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Aug 19, 2024 at 03:35:14PM +, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Wed, Jul 10, 2024 at 07:31:06AM +, Bertrand Drouvot wrote:
> > > So, to sum up:
> > > 
> > > A. Locking is now done exclusively with LockNotPinnedObject(Oid classid, 
> > > Oid objid)
> > > so that it's now always clear what object we want to acquire a lock for. 
> > > It means
> > > we are not manipulating directly an object address or a list of objects 
> > > address
> > > as it was the case when the locking was done "directly" within the 
> > > dependency code.
> > > 
> > > B. A special case is done for objects that belong to the 
> > > RelationRelationId class.
> > > For those, we should be in one of the two following cases that would 
> > > already
> > > prevent the relation to be dropped:
> > > 
> > >  1. The relation is already locked (could be an existing relation or a 
> > > relation
> > >  that we are creating).
> > > 
> > >  2. The relation is protected indirectly (i.e an index protected by a 
> > > lock on
> > >  its table, a table protected by a lock on a function that depends the 
> > > table...)
> > > 
> > > To avoid any risks for the RelationRelationId class case, we acquire a 
> > > lock if
> > > there is none. That may add unnecessary lock for 2. but that seems worth 
> > > it. 
> > > 
> > 
> > Please find attached v16, mandatory rebase due to 80ffcb8427.
> 
> rebased (v17 attached).

rebased (v18 attached).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From b2b6810fabd4743fa18bf19d714805e6bb219da5 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v18] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch takes into account any type of objects except the ones that are pinned
(they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId class.
For those, we should be in one of the two following cases that would already
prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on the table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock if
there is none. That may add unnecessary lock for 2. but that's worth it.

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 212 ++
 src/backend/catalog/heap.c|   7 +
 src/backend/catalog/index.c   |  26 +++
 src/backend/catalog/objectaddress.c   |  57 +
 src/backend/catalog/pg_aggregate.c|   9 +
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  26 +++
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  39 +++-
 src/backend/catalog/pg_operator.c |  19 ++
 src/backend/catalog/pg_proc.c |  17 +-
 src/backend/catalog/pg_publication.c  |  11 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_type.c   

Re: Strange issue with NFS mounted PGDATA on ugreen NAS

2025-01-02 Thread Thomas Munro
On Wed, Jan 1, 2025 at 6:39 PM Tom Lane  wrote:
> ISTM we used to disclaim responsibility for data integrity if you
> try to put PGDATA on NFS.  I looked at the current wording about
> NFS in runtime.sgml and was frankly shocked at how optimistic it is.
> Shouldn't we be saying something closer to "if it breaks you
> get to keep both pieces"?

I now suspect this specific readdir() problem is in FreeBSD's NFS
client.  See below.  There have also been reports of missed files from
(IIRC) Linux clients without much analysis, but that doesn't seem too
actionable from here unless someone can come up with a repro or at
least some solid details to investigate; those involved unspecified
(possibly appliance/cloud) NFS and CIFS file servers.

The other issue I know of personally is NFS ENOSPC, which has some
exciting disappearing-committed-data failure modes caused by lazy
allocation on Linux's implementation (and possibly others), that I've
written about before.  But actually that one is not strictly an
NFS-only issue, it's just really easy to hit that way, and I have a
patch to fix it on our side, which I hope to re-post soon.
Independently of this, really as it's tangled up with quite a few
other things...

> > Anyway, I'll write a patch to change rmtree() to buffer the names in
> > memory.  In theory there could be hundreds of gigabytes of pathnames,
> > so perhaps I should do it in batches; I'll look into that.
>
> This feels a lot like putting lipstick on a pig.

Hehe.  Yeah.  Abandoned.

I see this issue here with a FreeBSD client talking to a Debian server
exporting BTRFS or XFS, even with dirreadsize set high so that
multi-request paging is not expected.  Looking at Wireshark and the
NFS spec (disclaimer: I have never studied NFS at this level before,
addito salis grano), what I see is a READDIR request with cookie=0
(good), and which receives a response containing the whole directory
listing and a final entry marker eof=1 (good), but then FreeBSD
unexpectedly (to me) sends *another* READDIR request with cookie=662,
which is a real cookie that was received somewhere in the middle of
the first response on the entry for "13816_fsm", and that entry was
followed by an entry for "13816_vm".  The second request gets a
response that begins at "13816_vm" (correct on the server's part).
Then the client sends REMOVE (unlink) requests for some but not all of
the files, including "13816_fsm" but not "13816_vm".  Then it sends
yet another READDIR request with cookie=0 (meaning go from the top),
and gets a non-empty directory listing, but immediately sends RMDIR,
which unsurprisingly fails NFS3ERR_NOTEMPTY.  So my best guess so far
is that FreeBSD's NFS client must be corrupting its directory cache
when files are unlinked, and it's not the server's fault.  I don't see
any obvious problem with the way the cookies work.  Seems like
material for a minimised bug report elsewhere, and not our issue.




Re: ERROR: corrupt MVNDistinct entry

2025-01-02 Thread Richard Guo
On Tue, Dec 31, 2024 at 5:40 PM Richard Guo  wrote:
> Regarding the back-patch, this patch is a bug fix, which suggests it
> should be back-patched.  However, it also changes some plans by fixing
> the cost estimation.  Does anyone know what our usual approach is in
> this situation?

Although this patch could result in plan changes, it fixes an actual
bug, so I've back-patched it to v16.

Thanks
Richard




Re: FileFallocate misbehaving on XFS

2025-01-02 Thread Andrea Gelmini
On Tue, Dec 31, 2024 at 4:31 PM Andres Freund  wrote:

2024-12-19 04:47:04 CET [2646363]:  ERROR:  could not extend file
> "pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" by 11 blocks, from
> 29850 to 29861, using FileFallocate(): No space left on device


Dunno if it helps, but today I read this reference in the latest patchset
on the XFS mailing list:
https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangsh...@kylinos.cn/


Could it be related and explain the effect?

For the record, I found it here:
https://lore.kernel.org/linux-xfs/canubcdxwhottw4pjje1qjajheg48ls7mfc065gcqwoh7s0y...@mail.gmail.com/

Ciao,
Gelma


RFC: Lock-free XLog Reservation from WAL

2025-01-02 Thread Zhou, Zhiguo
Hi all,

I am reaching out to solicit your insights and comments on a recent proposal 
regarding "Lock-free XLog Reservation from WAL." We have identified some 
challenges with the current WAL insertions, which require space reservations in 
the WAL buffer that involve updating two shared-memory statuses in 
XLogCtlInsert: CurrBytePos (the start position of the current XLog) and 
PrevBytePos (the prev-link to the previous XLog). Currently, the use of 
XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, 
hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a 
single uint64 field) to be implemented with an atomic operation (fetch_add); 
see the sketch after this list.
2. Updating the prev-link of the next XLog: Based on the fact that the prev-link 
of the next XLog always points to the head of the current XLog, we will slightly 
exceed the reserved memory range of the current XLog to update the prev-link of 
the next XLog, regardless of which backend acquires the next memory space. The 
next XLog inserter will wait until its prev-link is updated for CRC calculation 
before starting its own XLog copy into the WAL.
3. Breaking Sequential Write Convention: Each backend will update the prev-link 
of its next XLog first, then return to the header position for the current log 
insertion. This change will reduce the dependency of XLog writes on previous 
ones (compared with the sequential writes).
4. Revised GetXLogBuffer: To support #3, we need to update this function to 
separate the LSN it intends to access from the LSN it expects to update in the 
insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing 
NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the 
parallelism.
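
To make the reservation path concrete, here is a rough sketch of the idea
(illustrative only -- the field names and exact layout differ in the attached
patch, and CurrBytePos is assumed to become a pg_atomic_uint64):

static void
ReserveXLogInsertLocationLockFree(pg_atomic_uint64 *CurrBytePos,
                                  uint64 size,
                                  uint64 *StartBytePos,
                                  uint64 *EndBytePos)
{
    /* Atomically claim [start, start + size) with a single fetch_add. */
    *StartBytePos = pg_atomic_fetch_add_u64(CurrBytePos, size);
    *EndBytePos = *StartBytePos + size;

    /*
     * The backend that reserved the current record is also responsible for
     * writing the current record's start LSN (derived from *StartBytePos)
     * into the xl_prev field of the record that will begin at *EndBytePos,
     * no matter which backend later reserves that space.  That next
     * inserter waits until its prev-link has been filled in before
     * computing its CRC and copying its record into the WAL buffers.
     */
}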

The attached patch passes the regression tests (make check, make 
check-world), and a performance test of this POC on SPR (480 vCPUs) shows 
that this optimization helps the TPCC benchmark scale better with the core 
count; as a result, the performance with all cores enabled could be improved 
by 2.04x.

Before we proceed with further patch validation and refinement work, we are 
eager to hear the community's thoughts and comments on this optimization so 
that we can confirm our current work aligns with expectations.



0001-Lock-free-XLog-Reservation-from-WAL.patch
Description: 0001-Lock-free-XLog-Reservation-from-WAL.patch


Re: Modern SHA2- based password hashes for pgcrypto

2025-01-02 Thread Bernd Helmle
Am Donnerstag, dem 02.01.2025 um 15:57 +0100 schrieb Daniel Gustafsson:
> > I adapted the code from the publicly available reference
> > implementation
> > at [1]. It's based on our existing OpenSSL infrastructure in
> > pgcrypto
> > and produces compatible password hashes with crypt() and "openssl
> > passwd" with "-5" and "-6" switches.
> 
> Potentially daft question, but since we require OpenSSL to build
> pgcrypto, why
> do we need to include sha2 code instead of using the sha2
> implementation in
> libcrypto? How complicated would it be to use the OpenSSL API
> instead?

Not sure I got you, but I use OpenSSL and the SHA2 implementation
there. See the pgcrypto px_* API (px.h and openssl.c respectively) I am
using to create the digests.

Thanks,
Bernd





Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size

2025-01-02 Thread Shubham Khanna
On Fri, Jan 3, 2025 at 8:06 AM Peter Smith  wrote:
>
> Hi Shubham,
>
> Here are some review comments for patch v5-0001.
>
> ==
> doc/src/sgml/ref/pg_createsubscriber.sgml
>
> 1.
> +   
> +The source server must have  
> to
> +be set to -1 to prevent the automatic removal of WAL
> +replication slots. Setting this parameter to files needed by a
> specific size
> +may lead to replication failures if required WAL files are
> +prematurely deleted.
> +   
>
> IIUC, this problem is all about the removal of WAL *files*, not "WAL
> replication slots".
>
> Also, the "Setting this parameter to files needed by a specific size"
> part sounds strange.
>
> I think this can be simplified anyhow like below.
>
> SUGGESTION:
> Replication failures can occur if required WAL files are prematurely
> deleted. To prevent this, the source server must set  linkend="guc-max-slot-wal-keep-size"/> to -1,
> ensuring WAL files are not automatically removed.
>
> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> check_publisher:
>
> 2.
> + if (dry_run && max_slot_wal_keep_size != -1)
> + {
> + pg_log_warning("publisher requires WAL size must not be restricted");
> + pg_log_warning_detail("The max_slot_wal_keep_size parameter is
> currently set to a non-default value, which may lead to replication
> failures. "
> +   "This parameter must be set to -1 to ensure that required WAL
> files are not prematurely removed.");
> + }
>
> 2a.
> Whenever you mention a GUC name in a PostgreSQL message then (by
> convention) it must be double-quoted.
>
> ~
>
> 2b.
> The detailed message seems verbose and repetitive to me.
>
> ~
>
> 2c.
> The other nearby messages use the terminology "configuration
> parameter", so this should be the same.
>
> ~
>
> 2d.
> The other nearby messages use \"%s\" substitution for the GUC name, so
> this should be the same.
>
> ~
>
> 2e.
> IMO advising the user to change a configuration parameter should be
> done using the pg_log_warning_hint function (e.g. _hint, not
> _details).
>
> ~~
>
> Result:
>
> CURRENT (pg_log_warning_details)
> The max_slot_wal_keep_size parameter is currently set to a non-default
> value, which may lead to replication failures.  This parameter must be
> set to -1 to ensure that required WAL files are not prematurely
> removed.
>
> SUGGESTION (pg_log_warning_hint)
> Set the configuration parameter \"%s\" to -1 to ensure that required
> WAL files are not prematurely removed.
>

I have fixed the given comments using your suggestions. The attached
patch contains the suggested changes.
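
For reference, the dry-run check now looks roughly like this (simplified;
the surrounding variables come from check_publisher() in the patch):

    if (dry_run && max_slot_wal_keep_size != -1)
    {
        pg_log_warning("publisher requires WAL size must not be restricted");
        pg_log_warning_hint("Set the configuration parameter \"%s\" to -1 to ensure that required WAL files are not prematurely removed.",
                            "max_slot_wal_keep_size");
    }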

Thanks and Regards,
Shubham Khanna.


v6-0001-Validate-max_slot_wal_keep_size-in-pg_createsubsc.patch
Description: Binary data


Re: Re: proposal: schema variables

2025-01-02 Thread jian he
hi.

In the function svariableStartupReceiver, all these "ereport(ERROR" calls
cannot happen, since transformLetStmt already did all the heavy work.
Based on https://www.postgresql.org/docs/current/error-message-reporting.html,
all these "ereport(ERROR," calls in svariableStartupReceiver can be
simplified to "elog(ERROR," or Assert.
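
For example (illustrative only, not the exact messages in the patch),
something like

    ereport(ERROR,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("unexpected number of columns")));

can become simply

    elog(ERROR, "unexpected number of columns");

since these are internal "cannot happen" conditions rather than
user-facing errors.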


After standard_ExecutorStart->InitPlan, queryDesc.tupDesc will not include
any attribute where attr->attisdropped is true (the dropped-column scenario).
In standard_ExecutorStart, I added the following code and then ran the
regression tests again to prove my point.

standard_ExecutorStart
/*
 * Initialize the plan state tree
 */
InitPlan(queryDesc, eflags);
    for (int i = 0; i < queryDesc->tupDesc->natts; i++)
    {
        Form_pg_attribute attr = TupleDescAttr(queryDesc->tupDesc, i);

        if (attr->attisdropped)
        {
            elog(INFO, "some attribute is dropped queryDesc->operation is %d",
                 queryDesc->operation);
        }
    }
MemoryContextSwitchTo(oldcontext);
-
The svariableStartupReceiver parameter typeinfo comes from queryDesc->tupDesc,
so I think typeinfo->natts will always equal one in svariableStartupReceiver;
therefore SVariableState.slot_offset is not necessary.

Overall, I think svariableStartupReceiver can be simplified as follows:

static void
svariableStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
{
    SVariableState *myState = (SVariableState *) self;
    int         natts = typeinfo->natts;
    Form_pg_attribute attr;
    LOCKTAG     locktag PG_USED_FOR_ASSERTS_ONLY;

    Assert(myState->pub.mydest == DestVariable);
    Assert(OidIsValid(myState->varid));
    Assert(SearchSysCacheExists1(VARIABLEOID, myState->varid));

#ifdef USE_ASSERT_CHECKING
    SET_LOCKTAG_OBJECT(locktag,
                       MyDatabaseId,
                       VariableRelationId,
                       myState->varid,
                       0);
    Assert(LockHeldByMe(&locktag, AccessShareLock, false));
#endif

    Assert(natts == 1);

    attr = TupleDescAttr(typeinfo, 0);
    myState->need_detoast = attr->attlen == -1;
    myState->rows = 0;
}

I've attached the file containing the changes I mentioned earlier.

-<<>>>---
Overall, the docs in 0001 and 0002 look good to me now.
The following are only some minor issues I came up with.

In Table 5.1. ACL Privilege Abbreviations

ACL Privilege Abbreviations

VARIABLE (occurred 3 times)
should be
SESSION VARIABLE
?

doc/src/sgml/glossary.sgml
I want to make a minor tweak, from

 A persistent database object that holds a value in session memory.  This
 value is private to each session and is released when the session ends.
 Read or write access to session variables is controlled by privileges,
 similar to other database objects.

to

 A persistent database object that holds a value in session memory.  This
 value is private to each session and is reset to default value
(null) when the session ends.
 Read or write access to session variables is controlled by access
privileges,
 similar to other database objects.


in let.sgml.
  
   Example:

CREATE VARIABLE myvar AS integer;
LET myvar = 10;
LET myvar = (SELECT sum(val) FROM tab);

  

it should be
 
  Examples
...your example code
 


v1-0001-refactoring-svariableStartupReceiver.no-cfbot
Description: Binary data


Re: magical eref alias names

2025-01-02 Thread Tom Lane
Robert Haas  writes:
> Oh, I agree, but I don't see why anyone would care whether rel names
> are unique across different queries. When I mentioned global
> uniqueness, I meant unique within a query, like what
> set_rtable_names() does after the fact.

Okay, but then we still have the problem of how to ensure that in
a query that has inline'd some views.  I think solving the sort of
I-want-to-reference-this problem you describe would require
that we unique-ify the aliases in the rewriter, just after it
finishes incorporating any views.  We could do that, but it seems
like a lot of cycles to expend on something that would be pointless
in the typical case where nobody ever looks at the aliases later.

> Now, I'm firmly convinced that this is a real problem and worth
> solving, but let me be clear that I don't think the solution is
> anywhere on this thread, nor do I think that it is simple.

Agreed.

> My original
> proposal of getting rid of system-generated fake names isn't
> necessary, because you very helpfully pointed out that I can look at
> whether RTE->alias->aliasname exists to figure that out.

Actually, I noticed that we are failing to honor that in the places
where we inject "*SELECT*" and "*SELECT* %d" names, because that
code puts those names into RTE->alias not only RTE->eref.
I experimented with the attached patch to not do that anymore,
which is sort of a subset of what you did but just focused on
not lying about what's generated versus user-written.  We could
alternatively keep the current generated names by extending
addRangeTableEntryForSubquery's API so that alias and generated eref
are passed separately.  (I didn't look to see if anyplace else
is messing up this distinction similarly.)

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index bf322198a2..3d8a433c19 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -4961,13 +4961,13 @@ SELECT ft1.c1 FROM ft1 JOIN ft2 on ft1.c1 = ft2.c1 WHERE
 -- ===
 EXPLAIN (verbose, costs off)
 INSERT INTO ft2 (c1,c2,c3) SELECT c1+1000,c2+100, c3 || c3 FROM ft2 LIMIT 20;
-QUERY PLAN
---
+ QUERY PLAN 
+
  Insert on public.ft2
Remote SQL: INSERT INTO "S 1"."T 1"("C 1", c2, c3, c4, c5, c6, c7, c8) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
Batch Size: 1
-   ->  Subquery Scan on "*SELECT*"
- Output: "*SELECT*"."?column?", "*SELECT*"."?column?_1", NULL::integer, "*SELECT*"."?column?_2", NULL::timestamp with time zone, NULL::timestamp without time zone, NULL::character varying, 'ft2   '::character(10), NULL::user_enum
+   ->  Subquery Scan on unnamed_subquery
+ Output: unnamed_subquery."?column?", unnamed_subquery."?column?_1", NULL::integer, unnamed_subquery."?column?_2", NULL::timestamp with time zone, NULL::timestamp without time zone, NULL::character varying, 'ft2   '::character(10), NULL::user_enum
  ->  Foreign Scan on public.ft2 ft2_1
Output: (ft2_1.c1 + 1000), (ft2_1.c2 + 100), (ft2_1.c3 || ft2_1.c3)
Remote SQL: SELECT "C 1", c2, c3 FROM "S 1"."T 1" LIMIT 20::bigint
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 3a2d51c5ad..c58bbd5b78 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -1962,7 +1962,7 @@ tlist_coercion_finished:
 		rte = makeNode(RangeTblEntry);
 		rte->rtekind = RTE_SUBQUERY;
 		rte->subquery = parse;
-		rte->eref = rte->alias = makeAlias("*SELECT*", colnames);
+		rte->eref = makeAlias("unnamed_subquery", colnames);
 		rte->lateral = false;
 		rte->inh = false;
 		rte->inFromCl = true;
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 3864a675d2..359d3c390e 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -823,7 +823,7 @@ transformInsertStmt(Pa

Re: Vacuum statistics

2025-01-02 Thread Sami Imseih
Hi,

Thanks for the work you have done here. Exposing cumulative
metrics at this level of detail for vacuum is surely useful to find
vacuum bottlenecks and to determine the effectiveness of
vacuum tuning.

I have yet to look very closely, but I think some additional columns that
would be useful are the number of failsafe autovacuums that occurred, as
well as counters for the number of times index_cleanup was skipped, the
truncate phase was skipped, and toast vacuuming was skipped (the latter is
only relevant for the main relation).

I also wonder if it makes sense to break down timing by phase. I surely
would like to know how much of my vacuum time was spent in index
cleanup vs heap scan, etc.

A nit: I noticed in v14, the column is "schema". It should be "schemaname"
for consistency.

Also, instead of pg_stat_vacuum_tables, what about pg_stat_vacuum?

Now, I became aware of this discussion after starting a new thread
to track total time spent in vacuum/analyze in pg_stat_all_tables [1].
But this begs the question of what should be done with the current
counters in pg_stat_all_tables? I see it mentioned above that (auto)vacuum_count
should be added to this new view, but it's also already in pg_stat_all_tables.
I don't think we should be duplicating the same columns across views.

I think total_time should be removed from your current patch and added
as is being suggested in [1]. This way high level metrics such as counts
and total time spent remain in pg_stat_all_tables, while the new view
you are proposing will contain more details. I don't think we will have
consistency issues between the views because a reset using pg_stat_reset()
will act on all the stats and pg_stat_reset_single_table_counters() will act on
all the stats related to that table. There should be no way to reset the vacuum
stats independently, AFAICT.

Alternatively, we can remove the vacuum related stats from pg_stat_all_tables,
but that will break monitoring tools and will leave us with the (auto)analyze
metrics alone in pg_stat_all_tables. This sounds very ugly.

What do you think?

Regards,

Sami Imseih
Amazon Web Services (AWS)

[1] https://commitfest.postgresql.org/52/5485/




Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Bruce Momjian
On Thu, Jan  2, 2025 at 03:48:53PM -0500, Tom Lane wrote:
> Larry Rosenman  writes:
> > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on 
> > my issue.
> 
> Thanks for reaching out to him.  So if I'm reading this correctly,
> there's little point in filing a FreeBSD bug because it'll be
> dismissed as unfixable.
> 
> This leaves us in rather a nasty position.  Sure, we could rewrite
> rmtree() as Thomas suggested upthread, but I'm still of the opinion
> that that's smearing lipstick on a pig.  rmtree() is the least of
> our worries: it doesn't need to expect that anybody else will be
> modifying the target directory, plus it can easily restart its scan
> without complicated bookkeeping.  I doubt we can make such an
> assumption for all our uses of readdir(), or that it's okay to
> miss or double-process files in every one of them.
> 
> I'm still of the opinion that the best thing to do is disclaim
> safety of storing a database on NFS.

It would be nice to have some details on the docs about why NFS can
cause problems so we have something to point to when people ask.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.






Re: Alias of VALUES RTE in explain plan

2025-01-02 Thread Tom Lane
Robert Haas  writes:
> On Sat, Nov 2, 2024 at 1:09 PM Tom Lane  wrote:
>> It seems sufficient to avoid alias pushdown when there's an ORDER BY
>> inside the VALUES subquery.  We disallow a locking clause, and
>> while there can be LIMIT/OFFSET, those aren't allowed to reference the
>> VALUES output anyway.  I added some test cases to show that this is
>> enough to make view-dumping behave sanely.

> ... The correctness of that rejiggering depends crucially on what
> will happen at plan time and then at EXPLAIN/ruleutils time, but the
> rules for what will happen at those later times are pretty darn
> complicated, so I feel like this is creating an awful lot of action at
> a distance.

I'm not seeing where there's a correctness issue here?  The parser
is charged with assigning aliases to RTEs that lack one, and with
this patch that's still true.  It's just assigning a different alias
than it did before.  There is no question of whether the alias can
"leak" out of the implicit sub-select; it cannot, by SQL's scoping
rules.  We have to avoid changing anything if there are clauses
inside the implicit sub-select that could reference the old choice
of alias, but the patch does that.  (Or we could decide to simplify
things at the cost of breaking such SQL code, since there probably
is none in the field.  It's still not clear to me which choice is
better.)

Yes, changing the parser to get an effect in ruleutils.c is indeed
action-at-a-distance-y, but the same can be said of an awful lot
of the behavior in this area.  If we were to do this differently,
we'd be breaking existing conventions like "the parser is what
initially assigns aliases", and we'd be much more likely to create
new bugs that way (cf. my criticisms upthread of the v1 patch).
So I'm not really seeing another way.

We could just reject this patch series on the grounds of "it's
not a bug and we're not going to change the behavior".  But I do
think the proposed new output looks nicer.

> If were able (and I suspect we're not, but hypothetically) to in
> effect pull up the subquery at parse time, so that to the planner and
> executor it doesn't even exist, then I think that would be perfectly
> fine, because then we would have strong reasons for believing that no
> later decision can turn our parse-time decision into a problem. But to
> leave that subquery there and guess that it's going to disappear
> before we get to EXPLAIN doesn't seem nearly as safe. It seems pretty
> easy to either miss things (like the ORDER BY case) or even to have
> future planner changes break stuff.

I completely fail to understand what you think might break.  The
planner is not especially interested in aliases, and ruleutils already
has sufficient defenses against cases like duplicate aliases --- it
must, because such cases can arise anyway.

regards, tom lane




Re: apply_scanjoin_target_to_paths and partitionwise join

2025-01-02 Thread Ashutosh Bapat
On Tue, Oct 1, 2024 at 3:22 AM Andrei Lepikhov  wrote:
>
> On 24/7/2024 15:22, Ashutosh Bapat wrote:
> > On Wed, Jul 24, 2024 at 9:42 AM Richard Guo  wrote:
> >> Is there a specific query that demonstrates benefits from this change?
> >> I'm curious about scenarios where a partitionwise join runs slower
> >> than a non-partitionwise join.
> >
> > [1] provides a testcase where a nonpartitionwise join is better than
> > partitionwise join. This testcase is derived from a bug reported by an
> > EDB customer. [2] is another bug report on psql-bugs.
> I haven't passed through the patch yet, but can this issue affect the
> decision on what to push down to foreign servers: a whole join or just a
> scan of two partitions?
> If the patch is related to the pushdown decision, I'd say it is quite an
> annoying problem for me. From time to time, I see cases where JOIN
> produces more tuples than both partitions have in total - in this case,
> it would be better to transfer tables' tuples to the main instance
> before joining them.

Sorry for replying late. I somehow didn't notice this.

A join between partitions is pushed down if only partitionwise join is
chosen and a join between partitions won't be pushed down if
partitionwise join is not chosen. Hence this bug affects pushdown as
well.

The CF entry shows as waiting for author. But that isn't the right
status. Will change it to needs review. I think we need a consensus as
to whether we want to fix this bug or not. Since this bug doesn't
affect me anymore, I will just withdraw this CF entry if there is no
interest.

-- 
Best Wishes,
Ashutosh Bapat




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-01-02 Thread Peter Eisentraut

On 20.12.24 02:00, Jacob Champion wrote:

v40 also contains:
- explicit testing for connect_timeout compatibility
- support for require_auth=oauth, including compatibility with
require_auth=!scram-sha-256
- the ability for a validator to set authn_id even if the token is not
authorized, for auditability in the logs
- the use of pq_block_sigpipe() for additional safety in the face of
CURLOPT_NOSIGNAL

I have split out the require_auth changes temporarily (0002) for ease
of review, and I plan to ping the last thread where SASL support in
require_auth was discussed [1].


Some review of v40:

General:

There is a mix of using "URL" and "URI" throughout the patch.  I tried
to look up in the source material (RFCs) what the correct use would
be, but even they are mixing it in nonobvious ways.  Maybe this is
just hopelessly confused, or maybe there is a system that I don't
recognize.


* .cirrus.tasks.yml

Since libcurl is an "auto" meson option, it doesn't need to be enabled
explicitly.  At least that's how most of the other feature options are
handled.  So probably better to stick to that pattern.


* config/programs.m4

Useless whitespace change.


* configure.ac

+AC_MSG_CHECKING([whether to build with libcurl support for OAuth client 
flows])

etc.

Let's just write something like 'whether to build with libcurl
support' here.  So we don't have to keep updating it if the scope of
the option changes.

* meson_options.txt

+option('libcurl', type : 'feature', value: 'auto',
+  description: 'libcurl support for OAuth client flows')

Similarly, let's just write something like 'libcurl support' here.


* src/backend/libpq/auth-oauth.c

+typedef enum
+{
+   OAUTH_STATE_INIT = 0,
+   OAUTH_STATE_ERROR,
+   OAUTH_STATE_FINISHED,
+} oauth_state;
+
+/* Mechanism callback state. */
+struct oauth_ctx
+{
+   oauth_state state;

This is the only use of that type definition.  I think you can skip
the typedef and use the enum tag directly.
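
I.e. something like this (sketch):

enum oauth_state
{
    OAUTH_STATE_INIT = 0,
    OAUTH_STATE_ERROR,
    OAUTH_STATE_FINISHED,
};

/* Mechanism callback state. */
struct oauth_ctx
{
    enum oauth_state state;
    /* ... other fields unchanged ... */
};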


* src/interfaces/libpq/libpq-fe.h

+#ifdef WIN32
+#include   /* for SOCKET */
+#endif

Including a system header like that seems a bit unpleasant.  AFAICT,
you only need it for this:

+   PostgresPollingStatusType (*async) (PGconn *conn,
+   struct _PGoauthBearerRequest 
*request,

+   SOCKTYPE * altsock);

But that function could already get the altsock handle via
conn->altsock.  So maybe that is a way to avoid the whole socket type
dance in this header.


* src/test/authentication/t/001_password.pl

I suppose this file could be a separate commit?  It just separates the
SASL/SCRAM terminology for existing functionality.


* src/test/modules/oauth_validator/fail_validator.c

+{
+   elog(FATAL, "fail_validator: sentinel error");
+   pg_unreachable();
+}

This pg_unreachable() is probably not necessary after elog(FATAL).


* .../modules/oauth_validator/oauth_hook_client.c

+#include 
+#include 

These are generally not necessary, as they come in via c.h.

+#ifdef WIN32
+#include 
+#else
+#include 
+#endif

I don't think this special Windows handling is necessary, since there
is src/include/port/win32/sys/socket.h.

+static void
+usage(char *argv[])
+{
+   fprintf(stderr, "usage: %s [flags] CONNINFO\n\n", argv[0]);

Help output should go to stdout.


With the above changes, I think this patch set is structurally okay now. 
 Now it just needs to do the right things. ;-)





read stream on amcheck

2025-01-02 Thread Matheus Alcantara
Hi,

While reviewing some other patches implementing the read stream API for core
subsystems, I noticed that the amcheck extension could also benefit from it.

Notice the refactor when handling the "skip" parameter; the logic was moved to
the heapam_read_stream_next_block callback so that verify_heapam doesn't need to
touch any private field of the heapam_read_stream_next_block_private struct.

One other thing to mention is that the test cases for the "skip" parameter
that I've seen only test when the first page is corrupted, so I think that a
careful review of the callback logic would be good to ensure that we don't
accidentally skip a page when doing p->current_blocknum++;
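
To make that concrete, the callback I have in mind is shaped roughly like
this (a sketch, not the exact code in the patch; should_skip_block() is a
placeholder for the all-visible/all-frozen checks driven by the "skip"
option):

typedef struct heapam_read_stream_next_block_private
{
    BlockNumber current_blocknum;
    BlockNumber last_exclusive;
    /* ... whatever state the skip checks need ... */
} heapam_read_stream_next_block_private;

static BlockNumber
heapam_read_stream_next_block(ReadStream *stream,
                              void *callback_private_data,
                              void *per_buffer_data)
{
    heapam_read_stream_next_block_private *p = callback_private_data;

    while (p->current_blocknum < p->last_exclusive)
    {
        BlockNumber blkno = p->current_blocknum++;

        if (!should_skip_block(p, blkno))   /* hypothetical helper */
            return blkno;
    }

    return InvalidBlockNumber;  /* no more blocks; ends the stream */
}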

This patch doesn't show any performance improvements (or regressions),
but I think it would be good to replace the ReadBufferExtended
usage with the read stream API, so that in the future it could benefit
from the AIO project.

--
Matheus Alcantara


v1-0001-Use-read-stream-on-amcheck.patch
Description: Binary data


Re: Modern SHA2- based password hashes for pgcrypto

2025-01-02 Thread Daniel Gustafsson
> On 2 Jan 2025, at 16:17, Bernd Helmle  wrote:
> 
> Am Donnerstag, dem 02.01.2025 um 15:57 +0100 schrieb Daniel Gustafsson:
>>> I adapted the code from the publicly available reference
>>> implementation
>>> at [1]. It's based on our existing OpenSSL infrastructure in
>>> pgcrypto
>>> and produces compatible password hashes with crypt() and "openssl
>>> passwd" with "-5" and "-6" switches.
>> 
>> Potentially daft question, but since we require OpenSSL to build
>> pgcrypto, why
>> do we need to include sha2 code instead of using the sha2
>> implementation in
>> libcrypto? How complicated would it be to use the OpenSSL API
>> instead?
> 
> Not sure i got you, but i use OpenSSL and the SHA2 implementation
> there. See the pgcrypto px_* API (px.h and openssl.c respectively) i am
> using to create the digests.

Sorry, skimming the patch I misread it, nevermind.

--
Daniel Gustafsson





Re: Strange issue with NFS mounted PGDATA on ugreen NAS

2025-01-02 Thread Tom Lane
Thomas Munro  writes:
> I now suspect this specific readdir() problem is in FreeBSD's NFS
> client.  See below.  There have also been reports of missed files from
> (IIRC) Linux clients without much analysis, but that doesn't seem too
> actionable from here unless someone can come up with a repro or at
> least some solid details to investigate; those involved unspecified
> (possibly appliance/cloud) NFS and CIFS file servers.

I forgot to report back, but yesterday I spent time unsuccessfully
trying to reproduce the problem with macOS client and NFS server
using btrfs (a Synology NAS running some no-name version of Linux).
So that lends additional weight to your conclusion that it isn't
specifically a btrfs bug.

> I see this issue here with a FreeBSD client talking to a Debian server
> exporting BTRFS or XFS, even with dirreadsize set high so that
> multi-request paging is not expected.  Looking at Wireshark and the
> NFS spec (disclaimer: I have never studied NFS at this level before,
> addito salis grano), what I see is a READDIR request with cookie=0
> (good), and which receives a response containing the whole directory
> listing and a final entry marker eof=1 (good), but then FreeBSD
> unexpectedly (to me) sends *another* READDIR request with cookie=662,
> which is a real cookie that was received somewhere in the middle of
> the first response on the entry for "13816_fsm", and that entry was
> followed by an entry for "13816_vm".  The second request gets a
> response that begins at "13816_vm" (correct on the server's part).
> Then the client sends REMOVE (unlink) requests for some but not all of
> the files, including "13816_fsm" but not "13816_vm".  Then it sends
> yet another READDIR request with cookie=0 (meaning go from the top),
> and gets a non-empty directory listing, but immediately sends RMDIR,
> which unsurprisingly fails NFS3ERR_NOTEMPTY.  So my best guess so far
> is that FreeBSD's NFS client must be corrupting its directory cache
> when files are unlinked, and it's not the server's fault.  I don't see
> any obvious problem with the way the cookies work.  Seems like
> material for a minimised bug report elsewhere, and not our issue.

Yeah, that seems pretty damning.  Thanks for looking into it.

regards, tom lane




Re: magical eref alias names

2025-01-02 Thread Robert Haas
On Thu, Jan 2, 2025 at 3:27 PM Tom Lane  wrote:
> Global uniqueness across the database (not single queries) would be
> needed to prevent cases where different views use the same generated
> names.  The only way I can see to do that without nasty performance
> costs is to use something like an OID counter.  Which would mean
> that rather than nice names like "union_1", "union_2", etc, you'd
> soon be looking at "union_5846926".  I don't think anyone would
> find that to be an improvement on what we're doing now.

Oh, I agree, but I don't see why anyone would care whether rel names
are unique across different queries. When I mentioned global
uniqueness, I meant unique within a query, like what
set_rtable_names() does after the fact.

> > If we had global uniqueness, or even per-subquery-level uniqueness,
> > this would sort itself out somewhat nicely, I think. We would just end
> > up with select_1 and select_2 or union_1 and union_2 or something like
> > that, and I think that would be strictly better than the status quo
> > where we do sometimes generate *SELECT* %d, but also sometimes just
> > *SELECT* and other times unnamed_subquery, and also only ever *VALUES*
> > and not *VALUES* %d.
>
> I'll concede that it'd be nicer.  But I'm not convinced it'd be enough
> nicer to justify the costs of changing.  We've been doing it this way
> for a long time, and AFAIR you're the first to complain about it.

I'm not sure it's enough nicer to justify the cost of changing,
either. I do think that "you're the first to complain about it" is not
a convincing argument, here or in general. Every time somebody reports
a new problem, they are the first to complain about it, but that does
not make the problem any less real.

The reason that I got interested in this problem was because of the
thread about allowing extensions to control planner behavior. I wrote
some words about it over there, but it's been a while so those
thoughts might not be entirely up to date. What I've found is that
it's a lot easier to insert a hook or three to allow an extension to
control the planner behavior than it is to get those hooks to do
anything useful, and that's precisely because it is hard to find a
useful way to identify a particular part of the query plan. If the
query planner were a person sitting next to you at your computer, you
could point at the screen with your finger and say "hey, do you see
this part of the EXPLAIN plan right here? could you try making this a
sequential scan rather than an index scan?" and the planner could say
"sure, let me re-plan it that way and I'll show you how it turns out".
Since the planner is incorporeal and cannot see your finger, you want
to identify "this part of the EXPLAIN plan right here" in some other
way, like with a name, but the names known at parsing time and
planning time are not unique and need not match what shows up in the
EXPLAIN output.

Concretely, if the user creates a partitioned table called *VALUES*
with several partitions and then creates another such table, also
called *VALUES*, in a different schema, and then joins the two of them
together; and then does UNION ALL with a subquery that actually uses
VALUES (), and then somewhere includes a subquery that also queries
one of the tables actually called *VALUES*, you cannot meaningfully
use the label *VALUES* to identify one particular scan; and you can't
use anything that appears in the EXPLAIN output for that purpose
either because the next planning cycle won't have those labels
available *during planning* when it needs to honor whatever plan-shape
requests were made.

Now, I'm firmly convinced that this is a real problem and worth
solving, but let me be clear that I don't think the solution is
anywhere on this thread, nor do I think that it is simple. My original
proposal of getting rid of system-generated fake names isn't
necessary, because you very helpfully pointed out that I can look at
whether RTE->alias->aliasname exists to figure that out. Also,
enforcing global uniqueness across the query at parse time wouldn't
actually help, because when we apply rewriter rules we can introduce
new relations into the query, and then when we expand inheritance
hierarchies we can introduce even more new relations into the query;
and what the user will care about if they want to modify "that portion
of the plan right there" is what ultimately ends up in the plan, not
what was in the query at parse time. So as far as I am concerned, this
thread has served the useful purpose of making me smarter and I can't
really see a way for it to do anything more than that; but if it's
given you a clever idea for something that we could change in the
code, I'm certainly happy to hear about that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: magical eref alias names

2025-01-02 Thread Tom Lane
Robert Haas  writes:
> On Sun, Dec 22, 2024 at 12:45 PM Tom Lane  wrote:
>> In particular, in a situation where we're trying to show a plan for
>> a query with inlined views, EXPLAIN would probably have to have code
>> to unique-ify the names anyway --- there's no way we're going to make
>> these nonce names globally unique, so the view(s) might contain names
>> that conflict with each other or the outer query.

> When you say "there's no way we're going to make these nonce names
> globally unique," is that because you think it would be too costly
> from a performance perspective (which was my concern) or that you
> think it's flat-out impossible for some reason (which doesn't seem to
> me to be true)?

Global uniqueness across the database (not single queries) would be
needed to prevent cases where different views use the same generated
names.  The only way I can see to do that without nasty performance
costs is to use something like an OID counter.  Which would mean
that rather than nice names like "union_1", "union_2", etc, you'd
soon be looking at "union_5846926".  I don't think anyone would
find that to be an improvement on what we're doing now.

> If we had global uniqueness, or even per-subquery-level uniqueness,
> this would sort itself out somewhat nicely, I think. We would just end
> up with select_1 and select_2 or union_1 and union_2 or something like
> that, and I think that would be strictly better than the status quo
> where we do sometimes generate *SELECT* %d, but also sometimes just
> *SELECT* and other times unnamed_subquery, and also only ever *VALUES*
> and not *VALUES* %d.

I'll concede that it'd be nicer.  But I'm not convinced it'd be enough
nicer to justify the costs of changing.  We've been doing it this way
for a long time, and AFAIR you're the first to complain about it.

regards, tom lane




Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Larry Rosenman
@Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on 
my issue.



 Original Message 
Subject: Re: A new look at old NFS readdir() problems?
Date: 01/02/2025 10:08 am
From: Rick Macklem 
To: Thomas Munro 
Cc: Rick Macklem , Larry Rosenman 

On Thu, Jan 2, 2025 at 2:50 AM Thomas Munro  wrote:





Hi Rick
CC: ler

I hope you don't mind me reaching out directly, I just didn't really
want to spam existing bug reports without sufficient understanding to
actually help yet...  but I figured I should get in touch and see if
you have any clues or words of warning, since you've worked on so much
of the NFS code.  I'm a minor FBSD contributor and interested in file
systems, but not knowledgeable about NFS; I run into/debug/report a
lot of file system bugs on a lot of systems in my day job on
databases.  I'm interested to see if I can help with this problem.
Existing ancient report and interesting email:

https://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020155.html
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696

What we ran into is not the "bad cookie" state, which doesn't really
seem to be recoverable in general, from what I understand (though the
FreeBSD code apparently would try, huh).  It's a simple case where the
NFS client requests a whole directory with a large READDIR request,
and then tries to unlink all the files in a traditional
while-readdir()-unlink() loop that works on other systems.
In general, NFS is not a POSIX compliant file system, due to its
protocol design. The above is one example. The only "safe" way is to
opendir() or rewinddir() after every removal.

The above usually works (and always worked for UFS long ago) because
the directory offset cookies for subsequent entries in the directory
after the entry unlinked happened to "still be valid". That is no
longer true for FreeBSD's UFS nor for many other file systems that can
be exported.

If the client reads the entire directory in one READDIR, then it is
fine, since it has no need for the directory offset cookies. However,
there is a limit to how much a single READDIR can do (these days for
NFSv4.1/4.2, it could be raised to just over 1Mbyte, however FreeBSD
limits it to 8K at the moment).

Another way to work around the problem is to read the entire directory
into the client via READDIRs before starting to do the unlinks.
The opendir()/readdir() code in libc could be hacked to do that,
but I have never tried to push such a patch into FreeBSD.
(It would be limited by how much memory can be malloc()'d, which is
pretty generous compared to even large directories with tens of
thousands of entries.)

The above is true for all versions of NFS up to NFSv4.2, which is
the current one and unless some future version of NFS does READDIR
differently (I won't live long enough to see this;-), it will always
be the case.

If my comment above was not clear, the following encoding is the "safe"
way to remove all entries in a directory.

do {
    dir = opendir("X");
    dp = readdir(dir);
    if (dp != NULL)
        unlink(dp->d_name);
    closedir(dir);
} while (dp != NULL);
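
Spelled out a bit more completely (with full path names, and skipping "."
and ".." so the loop can terminate), the same idea looks roughly like this
-- just a sketch, not tested:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Remove every plain file in dirpath, re-reading the directory from the
 * top after each unlink so that stale directory offset cookies are never
 * used.
 */
static void
remove_all_entries(const char *dirpath)
{
    struct dirent *dp;

    do {
        DIR *dir = opendir(dirpath);

        if (dir == NULL)
            return;

        /* find the first entry that is not "." or ".." */
        while ((dp = readdir(dir)) != NULL &&
               (strcmp(dp->d_name, ".") == 0 ||
                strcmp(dp->d_name, "..") == 0))
            ;

        if (dp != NULL)
        {
            char path[4096];

            snprintf(path, sizeof(path), "%s/%s", dirpath, dp->d_name);
            unlink(path);
        }

        closedir(dir);
    } while (dp != NULL);
}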

In theory, the directory_offset_cookie was supposed to handle this,
but it has never worked correctly, for a couple of reasons.
1 - RFC1813 (the NFSv3 one) did not describe the cookie verifier
    correctly. It should only change when cookies for extant entries
    change. The description suggested it should change whenever an entry
    is deleted, since that cookie is no longer valid.
2 - #1 only works if directory offset cookies for other entries in the
    directory do not change when an entry is deleted. This used to be the
    case for UFS, but was broken in FreeBSD when a commit many years ago
    optimized ufs_readdir() to compress out invalid entries. Doing this
    changes the directory offset cookies every time an entry is deleted
    at the beginning of a directory block.

rick

 On FreeBSD
it seems to clobber its own directory cache, make extra unnecessary
READDIR requests, and skip some of the files.  Or maybe I have no idea
what's going on and this is a hopelessly naive question and mission
:-)

Here's what we learned so far starting from Larry's report:

https://www.postgresql.org/message-id/flat/04f95c3c13d4a9db87b3ac082a9f4877%40lerctr.org

Note that this issue has nothing to do with "bad cookie" errors (I
doubt the server I'm talking to even implements that -- instead it
tries to have cookies that are persistent/stable).

Also, while looking into this and initially suspecting cookie
stability bugs (incorrectly), I checked a bunch of local file systems
to understand how their cookies work, and I think I found a related
problem when FreeBSD exports UFS, too. 

Re: Alias of VALUES RTE in explain plan

2025-01-02 Thread Robert Haas
On Sat, Nov 2, 2024 at 1:09 PM Tom Lane  wrote:
> regression=# create view vv as SELECT * FROM (VALUES (4),(2),(3),(1) ORDER BY 
> t1.x LIMIT 2) AS t1(x);
> CREATE VIEW
> regression=# \d+ vv
>  View "public.vv"
>  Column |  Type   | Collation | Nullable | Default | Storage | Description
> +-+---+--+-+-+-
>  x  | integer |   |  | | plain   |
> View definition:
>  SELECT x
>FROM ( VALUES (4), (2), (3), (1)
>   ORDER BY t1_1.x
>  LIMIT 2) t1(x);
>
> ruleutils has decided that it needs to make the two "t1" table
> aliases distinct.  But of course that will fail on reload:
>
> regression=# SELECT x
> regression-#FROM ( VALUES (4), (2), (3), (1)
> regression(#   ORDER BY t1_1.x
> regression(#  LIMIT 2) t1(x);
> ERROR:  missing FROM-clause entry for table "t1_1"
> LINE 3:   ORDER BY t1_1.x
>^
> It seems sufficient to avoid alias pushdown when there's an ORDER BY
> inside the VALUES subquery.  We disallow a locking clause, and
> while there can be LIMIT/OFFSET, those aren't allowed to reference the
> VALUES output anyway.  I added some test cases to show that this is
> enough to make view-dumping behave sanely.

I'm concerned about taking things in this direction. There's two scans
here, really: a Values Scan for the VALUES construct, and then a
Subquery Scan sitting on top of it that will normally be optimized
away. It seems to me that you're basically guessing whether the
subquery scan will be optimized away to a sufficient degree that its
alias will not leak out anywhere. But that seems a bit fragile and
error-prone. Whether to elide the subquery scan is a planner decision;
what aliases to assign to the planner output is a ruleutils.c
decision; but here you're talking about rejiggering things at parse
time. The correctness of that rejiggering depends crucially on what
will happen at plan time and then at EXPLAIN/ruleutils time, but the
rules for what will happen at those later times are pretty darn
complicated, so I feel like this is creating an awful lot of action at
a distance.

If were able (and I suspect we're not, but hypothetically) to in
effect pull up the subquery at parse time, so that to the planner and
executor it doesn't even exist, then I think that would be perfectly
fine, because then we would have strong reasons for believing that no
later decision can turn our parse-time decision into a problem. But to
leave that subquery there and guess that it's going to disappear
before we get to EXPLAIN doesn't seem nearly as safe. It seems pretty
easy to either miss things (like the ORDER BY case) or even to have
future planner changes break stuff.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Vacuum statistics

2025-01-02 Thread Jim Nasby

> On Jan 2, 2025, at 2:12 PM, Sami Imseih  wrote:
> 
> Alternatively, we can remove the vacuum related stats from pg_stat_all_tables,
> but that will break monitoring tools and will leave us with the (auto)analyze
> metrics alone in pg_stat_all_tables. This sounds very ugly.

While backwards compatibility is important, there’s definitely precedent for 
changing what shows up in the catalog. IMHO it’s better to bite the bullet and 
move those fields instead of having vacuum stats spread across two different 
views.

Re: Fwd: Re: A new look at old NFS readdir() problems?

2025-01-02 Thread Larry Rosenman

On 01/02/2025 2:50 pm, Bruce Momjian wrote:

On Thu, Jan  2, 2025 at 03:48:53PM -0500, Tom Lane wrote:

Larry Rosenman  writes:
> @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on
> my issue.

Thanks for reaching out to him.  So if I'm reading this correctly,
there's little point in filing a FreeBSD bug because it'll be
dismissed as unfixable.

This leaves us in rather a nasty position.  Sure, we could rewrite
rmtree() as Thomas suggested upthread, but I'm still of the opinion
that that's smearing lipstick on a pig.  rmtree() is the least of
our worries: it doesn't need to expect that anybody else will be
modifying the target directory, plus it can easily restart its scan
without complicated bookkeeping.  I doubt we can make such an
assumption for all our uses of readdir(), or that it's okay to
miss or double-process files in every one of them.

I'm still of the opinion that the best thing to do is disclaim
safety of storing a database on NFS.


It would be nice to have some details on the docs about why NFS can
cause problems so we have something to point to when people ask.


What about doing what Rick suggests?
do {
    dir = opendir("X");
    dp = readdir(dir);
    if (dp != NULL)
        unlink(dp->d_name);
    closedir(dir);
} while (dp != NULL);
?
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2025-01-02 Thread Tomas Vondra
On 1/2/25 22:09, Jim Nasby wrote:
> 
>> On Dec 31, 2024, at 5:41 PM, Tomas Vondra  wrote:
>>
>> On 12/31/24 21:46, Jim Nasby wrote:
>>> On Dec 30, 2024, at 7:05 PM, James Hunter 
>>> wrote:

 On Sat, Dec 28, 2024 at 11:24 PM Jim Nasby  wrote:
>
> IMHO none of this will be very sane until we actually have cluster-
> level limits. One sudden burst in active connections and you still
> OOM the instance.

 Fwiw, PG does support "max_connections" GUC, so a backend/connection -
 level limit, times "max_connections", yields a cluster-level limit.
>>>
>>> max_connections is useless here, for two reasons:
>>>
>>> 1. Changing it requires a restart. That’s at *best* a real PITA in
>>> production. [1]
>>> 2. It still doesn’t solve the actual problem. Unless your workload
>>> *and* your data are extremely homogeneous you can’t simply limit the
>>> number of connections and call it a day. A slight change in incoming
>>> queries, OR in the data that the queries are looking at and you go
>>> from running fine to meltdown. You don’t even need a plan flip for
>>> this to happen, just the same plan run at the same rate but now
>>> accessing more data than before.
>>>
>>
>> I really don't follow your argument ...
>>
>> Yes, changing max_connections requires a restart - so what? AFAIK the
>> point James was making is that if you multiply max_connections by the
>> per-backend limit, you get a cluster-wide limit. And presumably the
>> per-backend limit would be a GUC not requiring a restart.
>>
>> Yes, high values of max_connections are problematic. I don't see how a
>> global limit would fundamentally change that. In fact, it could
>> introduce yet more weird failures because some unrelated backend did
>> something weird.
> 
> That’s basically my argument for having workload management. If a system
> becomes loaded enough for the global limit to start kicking in it’s
> likely that query response time is increasing, which means you will soon
> have more and more active backends trying to run queries. That’s just
> going to make the situation even worse. You’d either have to start
> trying to “take memory away” from already running backends or backends
> that are just starting would have such a low limit as to cause them to
> spill very quickly, creating further load on the system.
> 

I'm not opposed to having some sort of "workload management" (similar
to what's available in some databases), but my guess is that's (at
least) an order of magnitude more complex than introducing the memory
limit discussed here. I can only guess, because no one really explained
what it would include or how it would be implemented. Which makes it
easy to dream about a solution that would fix the problem ...

What I'm afraid will happen is that everyone mostly agrees a comprehensive
workload management system would be better than a memory limit (be it
per-backend or a global one). But without a workable proposal for how to
implement it, no one ends up working on it. And no one gets to work on a
memory limit because the imagined workload management would be better.
So we get nothing ...

FWIW I have a hard time imagining a workload management system without
some sort of a memory limit.

>> FWIW I'm not opposed to having some global memory limit, but as I
>> explained earlier, I don't see a way to do that sensibly without having
>> a per-backend limit first. Because if you have a global limit, a single
>> backend consuming memory could cause all kinds of weird failures in
>> random other backends.
> 
> I agree, but I’m also not sure how much a per-backend limit would
> actually help on its own, especially in OLTP environments.
> 

So what if it doesn't help in every possible case? It'd be valuable even
for OLAP/mixed workloads with sensible max_connections values, because
it'd allow setting work_mem more aggressively.

>>> Most of what I’ve seen on this thread is discussing ways to
>>> *optimize* how much memory the set of running backends can consume.
>>> Adjusting how you slice the memory pie across backends, or even
>>> within a single backend, is optimization. While that’s a great goal
>>> that I do support, it will never fully fix the problem. At some point
>>> you need to either throw your hands in the air and start tossing
>>> memory errors, because you don’t have control over how much work is
>>> being thrown at the engine. The only way that the engine can exert
>>> control over that would be to hold new transactions from starting
>>> when the system is under duress (ie, workload management). While
>>> workload managers can be quite sophisticated (aka, complex), the nice
>>> thing about limiting this scope to work_mem, and only as a means to
>>> prevent complete overload, is that the problem becomes a lot simpler
>>> since you’re only looking at one metric and not trying to support any
>>> kind of priority system. The only fanciness I think an MVP would need
>>> is a GUC to control how long a transaction can sit waitin

Re: Logical Replication of sequences

2025-01-02 Thread Peter Smith
Hi Vignesh,

Some minor review comments for the patch v20241230-0003.

==
src/backend/replication/logical/syncutils.c

1.
+ * syncutils.c
+ *   PostgreSQL logical replication: common synchronization code
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group

Happy New Year.

s/2024/2025/

~~~

2.
+/*
+ * Enum representing the overall state of subscription relations state.
+ *
+ * SYNC_RELATIONS_STATE_NEEDS_REBUILD indicates that the subscription relations
+ * state is no longer valid and the subscription relations should be rebuilt.
+ *
+ * SYNC_RELATIONS_STATE_REBUILD_STARTED indicates that the subscription
+ * relations state is being rebuilt.
+ *
+ * SYNC_RELATIONS_STATE_VALID indicates that subscription relation state is
+ * up-to-date and valid.
+ */

2a.
That first sentence saying "overall state of [...] state" is a bit strange.

Maybe it can be reworded something like:
Enum for phases of the subscription relations state.

~

2b.
/is no longer valid and/is no longer valid, and/

~

2c.
/that subscription relation state is up-to-date/that the subscription
relation state is up-to-date/

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size

2025-01-02 Thread Shubham Khanna
On Thu, Jan 2, 2025 at 4:20 AM Peter Smith  wrote:
>
> Hi Shubham.
>
> Here are some review comments for the patch v4-0001.
>
> ==
> Commit message.
>
> 1.
> The 'pg_createsubscriber' utility is updated to fetch and validate the
> 'max_slot_wal_keep_size' setting from the publisher. A warning is raised 
> during
> the '--dry-run' mode if the configuration is set to a non-default value.
>
>
> This says a "warning is raised", but patch v4 changed the warning to
> an error, so this description is incompatible with the code.
>

Since this patch now raises a warning instead of an error, I think
this description should remain the same as before.

> ==
> doc/src/sgml/ref/pg_createsubscriber.sgml
>
> 2.
> +   
> +The 'max_slot_wal_keep_size' must be set to -1 to prevent the automatic
> +removal of WAL files needed by replication slots. Setting this parameter 
> to
> +a specific size may lead to replication failures if required WAL files 
> are
> +prematurely deleted.
> +   
>
> The 'max_slot_wal_keep_size' should not be single-quoted like that. It
> should use the same markup as the other nearby GUC names and also have
> a link like those others do.
>

Added the link for 'max_slot_wal_keep_size' and updated the
'pg_createsubscriber' documentation.

> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> 3.
> +/*
> + * Validate max_slot_wal_keep_size
> + * Logical replication requires max_slot_wal_keep_size to be set to -1 on the
> + * publisher to prevent the deletion of WAL files that are still needed by
> + * replication slots. If this parameter is set to a non-default value, it may
> + * cause replication failures due to required WAL files being prematurely
> + * removed.
> + */
>
> 3a.
> Not fixed? For patch v3 I had already posted [1-#4a] that this entire
> comment is wrongly indented.
>

Fixed.

>
> 3b.
> That first sentence looks like it is missing a period and a following
> blank line. OTOH, maybe the first sentence is not even necessary.
>

Fixed.

>
> 4.
> + if (dry_run && max_slot_wal_keep_size != -1)
> + {
> + pg_log_error("publisher requires max_slot_wal_keep_size to be -1,
> but only %d remain",
> + max_slot_wal_keep_size);
> + pg_log_error_hint("Change the configuration parameter \"%s\" on the
> publisher to %d.",
> +   "max_slot_wal_keep_size", -1);
> + failed = true;
> + }
> +
>
> 4a.
> Not fixed? For patch v3 I had already posted [1-#4b] that it seemed
> strange you did not use the \"%s\" substitution for the GUC name of
> the first message (unlike the 2nd message).
>
> I also think it is strange that the 1st message uses a hardwired -1,
> but the 2nd message uses a parameter for the -1.
>

Updated the warning message and warning detail.

>
> 4b.
> I don't think "but only %d remain" is the appropriate wording for this
> GUC. It looks like a cut/paste mistake from a previous message.
> Anyway, maybe this message should be something quite different so it
> is not just duplicating the same information as the error_hint. e.g.
> see below for one idea. This could also resolve the previous review
> comment.
>
> SUGGESTION
> "publisher requires WAL size must not be restricted"
>

I have used your suggestion in the latest patch.

>
> 4c.
> I saw that this only emits the message in dry-mode, which is
> apparently based on the suggestion from Vignesh [2-#1]. Meanwhile, all
> the comments/docs say "it may cause replication failures", so I felt
> it might be better to always stop everything up-front if there is a
> wrong configuration, instead of waiting for potential failures to
> happen -- that's why I had suggested using error instead of a warning,
> but maybe I am in the minority.
>
> IIUC there are two ways to address this configuration problem:
>
> A. Give warnings, but only in dry-run mode. If the warnings are
> ignored (or if the dry-run was not even done), there may be
> replication failures later, but we HOPE it will not happen.
>
> B. Give errors in all modes. The early config error prevents any
> chance of later replication errors.
>
> So, I think the implementation needs to choose either A or B.
> Currently, it seems a mixture.
>
> ==
> [1] my v3 review -
> https://www.postgresql.org/message-id/CAHut%2BPsgEphCa-P-3Q7cORA%3D1TRv15UUsaxN9ZkX56M1-J_QRg%40mail.gmail.com
> [2] 
> https://www.postgresql.org/message-id/CALDaNm09cRzke52UN5zx33PT390whU92oXY4gfOSZEo17CLPjw%40mail.gmail.com

If 'max_slot_wal_keep_size' is not set to -1, there is no guarantee that
'pg_createsubscriber' will function as expected, since removal of required
WAL files may start to occur. Therefore, a warning is raised instead of
an error to alert users about the potential configuration issue.
If 'max_slot_wal_keep_size' is -1 (the default), replication slots may
retain an unlimited amount of WAL files.

The attached Patch contains the suggested changes.

Thanks and regards,
Shubham Khanna.


v5-0001-Validate-max_slot_wal_keep_size-in-pg_createsubsc.patch
Description: Binary data


Re: Conflict detection for update_deleted in logical replication

2025-01-02 Thread vignesh C
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 
>  wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >   Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still 
> > invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to 
> setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.

Few suggestions:
1) If we have a subscription with the detect_update_deleted option and we
try to upgrade it with default settings (in case the DBA forgot to set
track_commit_timestamp), the upgrade will fail only after completing a lot
of steps, as shown below:
Setting locale and encoding for new cluster   ok
Analyzing all rows in the new cluster ok
Freezing all rows in the new cluster  ok
Deleting files from new pg_xact   ok
Copying old pg_xact to new server ok
Setting oldest XID for new clusterok
Setting next transaction ID and epoch for new cluster ok
Deleting files from new pg_multixact/offsets  ok
Copying old pg_multixact/offsets to new serverok
Deleting files from new pg_multixact/members  ok
Copying old pg_multixact/members to new serverok
Setting next multixact ID and offset for new cluster  ok
Resetting WAL archivesok
Setting frozenxid and minmxid counters in new cluster ok
Restoring global objects in the new cluster   ok
Restoring database schemas in the new cluster
  postgres
*failure*

We should detect this at an earlier point somewhere like in
check_new_cluster_subscription_configuration and throw an error from
there.

2) Also, should we account for an additional pg_conflict_detection slot
while checking max_replication_slots? Though this error will only occur
after the upgrade is completed, it may be better to account for the slot
during the upgrade itself so that the DBA need not handle this error
separately afterwards.

3) We have reserved the pg_conflict_detection name in this version, so
if there was a replication slot named pg_conflict_detection in the
older version, the upgrade will fail at a very late stage, as in the
earlier upgrade shown. I feel we should check whether the old cluster
has any slot with the name pg_conflict_detection and throw an error
earlier:
+void
+ErrorOnReservedSlotName(const char *name)
+{
+   if (strcmp(name, CONFLICT_DETECTION_SLOT) == 0)
+   ereport(ERROR,
+   errcode(ERRCODE_RESERVED_NAME),
+   errmsg("replication slot name \"%s\"
is reserved",
+  name));
+}

4) We should also mention something like the below in the documentation
so the user can be aware of it:
A replication slot cannot be created with the name pg_conflict_detection,
as it is reserved for logical replication conflict detection.

Regards,
Vignesh




Re: Modern SHA2- based password hashes for pgcrypto

2025-01-02 Thread Daniel Gustafsson
> On 31 Dec 2024, at 17:06, Bernd Helmle  wrote:

> I adapted the code from the publicly available reference implementation
> at [1]. It's based on our existing OpenSSL infrastructure in pgcrypto
> and produces compatible password hashes with crypt() and "openssl
> passwd" with "-5" and "-6" switches.

Potentially daft question, but since we require OpenSSL to build pgcrypto, why
do we need to include sha2 code instead of using the sha2 implementation in
libcrypto? How complicated would it be to use the OpenSSL API instead?
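
For the raw digest part at least, not very: as a minimal, self-contained
sketch (hypothetical, not from the patch, and covering only a single SHA-512
digest rather than the iterated sha512crypt scheme), hashing through
libcrypto's EVP interface looks roughly like this with modern OpenSSL:

/* Sketch only; compile with: cc evp_sha512.c -lcrypto */
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	const char *msg = "password";
	unsigned char digest[EVP_MAX_MD_SIZE];
	unsigned int len = 0;
	EVP_MD_CTX *ctx = EVP_MD_CTX_new();

	if (ctx == NULL ||
		EVP_DigestInit_ex(ctx, EVP_sha512(), NULL) != 1 ||
		EVP_DigestUpdate(ctx, msg, strlen(msg)) != 1 ||
		EVP_DigestFinal_ex(ctx, digest, &len) != 1)
	{
		fprintf(stderr, "digest failed\n");
		return 1;
	}
	EVP_MD_CTX_free(ctx);

	/* print the hex digest */
	for (unsigned int i = 0; i < len; i++)
		printf("%02x", digest[i]);
	printf("\n");
	return 0;
}

The iteration and salting logic of crypt()-style hashes would of course still
have to live in pgcrypto; only the digest primitive would come from libcrypto.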

--
Daniel Gustafsson





Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-01-02 Thread Nisha Moond
On Thu, Jan 2, 2025 at 5:44 AM Peter Smith  wrote:
>
> Hi Nisha.
>
> My review comments for patch v58-0001.
>
> ==
> src/backend/replication/slot.c
>
> InvalidatePossiblyObsoleteSlot:
>
> 1.
>   /*
> - * If the slot can be acquired, do so and mark it invalidated
> - * immediately.  Otherwise we'll signal the owning process, below, and
> - * retry.
> + * If the slot can be acquired, do so and mark it as invalidated. If
> + * the slot is already ours, mark it as invalidated. Otherwise, we'll
> + * signal the owning process below and retry.
>   */
> - if (active_pid == 0)
> + if (active_pid == 0 ||
> + (MyReplicationSlot == s && active_pid == MyProcPid))
>   {
>
> As you previously explained [1] "This change applies to all types of
> invalidation, not just inactive_timeout case [...] It's a general
> optimization for the case when the current process is the active PID
> for the slot."
>
> In that case, should this be in a separate patch that can be pushed to
> master by itself, i.e. independent of anything else in this thread
> that is being done for the purpose of implementing the timeout
> feature?

Patch 0001 has additional general optimizations similar to the one
you mentioned, which are not strictly required for this feature.
Let’s wait for input from others on splitting the patches or
addressing it in a separate thread.

--
Thanks,
Nisha




Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-01-02 Thread Nisha Moond
On Thu, Jan 2, 2025 at 8:16 AM Peter Smith  wrote:
>
> Hi Nisha,
>
> Here are some minor review comments for patch v58-0002.
>

Thank you for your feedback! Please find the v59 patch set addressing
all the comments.
Note: There are no new changes in patch-0001.

--
Thanks,
Nisha
From 8154e2baee6fcf348524899c1f8b643a1e3564fc Mon Sep 17 00:00:00 2001
From: Nisha Moond 
Date: Mon, 18 Nov 2024 16:13:26 +0530
Subject: [PATCH v59 1/2] Enhance replication slot error handling, slot
 invalidation, and inactive_since setting logic

In ReplicationSlotAcquire(), raise an error for invalid slots if the
caller specifies error_if_invalid=true.

Add check if the slot is already acquired, then mark it invalidated directly.

Ensure same inactive_since time for all slots in update_synced_slots_inactive_since()
and RestoreSlotFromDisk().
---
 .../replication/logical/logicalfuncs.c|  2 +-
 src/backend/replication/logical/slotsync.c| 13 ++--
 src/backend/replication/slot.c| 69 ---
 src/backend/replication/slotfuncs.c   |  2 +-
 src/backend/replication/walsender.c   |  4 +-
 src/backend/utils/adt/pg_upgrade_support.c|  2 +-
 src/include/replication/slot.h|  3 +-
 src/test/recovery/t/019_replslot_limit.pl |  2 +-
 8 files changed, 75 insertions(+), 22 deletions(-)

diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 0148ec3678..ca53caac2f 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -197,7 +197,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	else
 		end_of_wal = GetXLogReplayRecPtr(NULL);
 
-	ReplicationSlotAcquire(NameStr(*name), true);
+	ReplicationSlotAcquire(NameStr(*name), true, true);
 
 	PG_TRY();
 	{
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index f6945af1d4..692527b984 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -446,7 +446,7 @@ drop_local_obsolete_slots(List *remote_slot_list)
 
 			if (synced_slot)
 			{
-ReplicationSlotAcquire(NameStr(local_slot->data.name), true);
+ReplicationSlotAcquire(NameStr(local_slot->data.name), true, false);
 ReplicationSlotDropAcquired();
 			}
 
@@ -665,7 +665,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		 * pre-check to ensure that at least one of the slot properties is
 		 * changed before acquiring the slot.
 		 */
-		ReplicationSlotAcquire(remote_slot->name, true);
+		ReplicationSlotAcquire(remote_slot->name, true, false);
 
 		Assert(slot == MyReplicationSlot);
 
@@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 static void
 update_synced_slots_inactive_since(void)
 {
-	TimestampTz now = 0;
+	TimestampTz now;
 
 	/*
 	 * We need to update inactive_since only when we are promoting standby to
@@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void)
 	/* The slot sync worker or SQL function mustn't be running by now */
 	Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
 
+	/* Use same inactive_since time for all slots */
+	now = GetCurrentTimestamp();
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (int i = 0; i < max_replication_slots; i++)
@@ -1537,10 +1540,6 @@ update_synced_slots_inactive_since(void)
 			/* The slot must not be acquired by any process */
 			Assert(s->active_pid == 0);
 
-			/* Use the same inactive_since time for all the slots. */
-			if (now == 0)
-now = GetCurrentTimestamp();
-
 			SpinLockAcquire(&s->mutex);
 			s->inactive_since = now;
 			SpinLockRelease(&s->mutex);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b30e0473e1..47e474a548 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -163,6 +163,7 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
+static void RaiseSlotInvalidationError(ReplicationSlot *slot);
 
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
@@ -535,9 +536,12 @@ ReplicationSlotName(int index, Name name)
  *
  * An error is raised if nowait is true and the slot is currently in use. If
  * nowait is false, we sleep until the slot is released by the owning process.
+ *
+ * An error is raised if error_if_invalid is true and the slot is found to
+ * be invalid.
  */
 void
-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid)
 {
 	ReplicationSlot *s;
 	int			active_pid;
@@ -615,6 +619,13 @@ retry:
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = s;
 
+	/*
+	 * An err

Re: Conflict detection for update_deleted in logical replication

2025-01-02 Thread vignesh C
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 
>  wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >   Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still 
> > invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to 
> setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be 
> better
> if the apply worker can avoid reporting an ERROR if the publisher's clock's
> lags behind that of the subscriber, so I implemented a new 0007 patch to allow
> the apply worker to wait for the clock skew to pass and then send a new 
> request
> to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that all
> the committing transactions have finished, it means any subsequent 
> transactions
> on the publisher should be assigned a commit timestamp later then reply_time.
> And the (candidate_xid_time) when it determines the oldest active xid. Any old
> transactions on the publisher that have finished should have a commit 
> timestamp
> earlier than the candidate_xid_time.
>
> The apply worker can compare the candidate_xid_time with reply_time. If
> candidate_xid_time is less than the reply_time, then it's OK to advance the 
> xid
> immdidately. If candidate_xid_time is greater than reply_time, it means the
> clock of publisher is behind that of the subscriber, so the apply worker can
> wait for the skew to pass before advancing the xid.
>
> Since this is considered as an improvement, we can focus on this after
> pushing the main patches.

An update of a row removed by TRUNCATE is detected as update_missing,
while an update of a deleted row is detected as update_deleted. I was
not sure whether updates of truncated rows should also be detected as
update_deleted, since the documentation at [1] says the truncate
operation "has the same effect as an unqualified DELETE on each table".

I tried with the following three-node (N1, N2 & N3) setup, with a
subscriber on N3 subscribing to publisher pub1 on N1 and publisher
pub2 on N2:
N1 - pub1
N2 - pub2
N3 - sub1 -> pub1(N1) and sub2 -> pub2(N2)

-- Insert a record in N1
insert into t1 values(1);

-- Insert a record in N2
insert into t1 values(1);

-- Now N3 has the above inserts from N1 and N2
N3=# select * from t1;
 c1

  1
  1
(2 rows)

-- Truncate t1 from N2
N2=# truncate t1;
TRUNCATE TABLE

-- Now N3 has no records:
N3=# select * from t1;
 c1

(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 2;
UPDATE 1
N1=# select * from t1;
 c1

  2
(1 row)

--- N3 logs the conflict as update_missing
2025-01-02 12:21:37.388 IST [24803] LOG:  conflict detected on
relation "public.t1": conflict=update_missing
2025-01-02 12:21:37.388 IST [24803] DETAIL:  Could not find the row to
be updated.
Remote tuple (2); replica identity full (1).
2025-01-02 12:21:37.388 IST [24803] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 757, finished
at 0/17478D0

-- Insert a record with value 2 in N2
N2=# insert into t1 values(2);
INSERT 0 1

-- Now N3 has the above inserted records:
N3=# select * from t1;
 c1

  2
(1 row)

-- Delete this record from N2:
N2=# delete from t1;
DELETE 1

-- Now N3 has no records:
N3=# select * from t1;
 c1

(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 3;
UPDATE 1

--

Re: FileFallocate misbehaving on XFS

2025-01-02 Thread Andres Freund
Hi,

On 2025-01-02 11:41:56 +0100, Andrea Gelmini wrote:
> Il giorno mar 31 dic 2024 alle ore 16:31 Andres Freund 
> ha scritto:
> 
> 2024-12-19 04:47:04 CET [2646363]:  ERROR:  could not extend file
> > "pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" by 11 blocks, from
> > 29850 to 29861, using FileFallocate(): No space left on device
> 
> 
> Dunno it it helps, but today I read this reference in latest patchset on
> XFS mailing list:
> https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangsh...@kylinos.cn/
> 
> 
> It could be related and explain the effect?

I doubt it - there was a lot more free space in the various AGs and they, with
the exception of 1-2 AGs, weren't that fragmented.

Greetings,

Andres Freund




Re: FileFallocate misbehaving on XFS

2025-01-02 Thread Andres Freund
Hi,

On 2024-12-20 11:39:42 -0500, Andres Freund wrote:
> On 2024-12-19 17:47:13 +1100, Michael Harris wrote:
> > This is a different system to those I previously provided logs from.
> > It is also running RHEL8 with a similar configuration to the other
> > system.
>
> Given it's a RHEL system, have you raised this as an issue with RH? They
> probably have somebody with actual XFS hacking experience on staff.
>
> RH's kernels are *heavily* patched, so it's possible the issue is actually RH
> specific.

FWIW, I raised this on the #xfs irc channel. One ask they had was:

│15:56:40  dchinner | andres: can you get a metadump of a filesystem that is 
displaying these symptoms for us to analyse?
│15:57:54  dchinner | metadumps don't contain data, and metadata is obfuscated 
so no filenames or attributes are exposed, either.

Greetings,

Andres




POC: track vacuum/analyze cumulative time per relation

2025-01-02 Thread Sami Imseih
Hi,

After a recent question regarding tracking vacuum start_time in
pg_stat_all_tables [1], it dawned on me that this view is missing
an important cumulative metric, which is how much time is spent
performing vacuum per table.

Currently, the only way a user can get this
information is if they enable autovacuum logging or track timing
for manual vacuums. Even then, if a user wants to trend
the time spent vacuuming over time, they must store the
timing data somewhere and perform the calculations.

Also, unless autovacuum logging is enabled for all a/v
operations, they could have gaps in their analysis.

Having the total (auto)vacuum elapsed time
alongside the existing (auto)vacuum_count
allows a user to track the average time of an
operation over time and to find vacuum tuning
opportunities.

The same can also be said for (auto)analyze.

Attached is a patch (without doc changes)
that adds 4 new columns:

total_autovacuum_time
total_vacuum_time
total_autoanalyze_time
total_analyze_time

Below is an example of output and how it
can be used to derive the average vacuum
operation time.

postgres=# select
relname,
autovacuum_count,
total_autovacuum_time,
total_autovacuum_time/NULLIF(autovacuum_count,0) average_autovac_time,
vacuum_count,
total_vacuum_time,
total_vacuum_time/NULLIF(vacuum_count,0) average_vac_time
from pg_catalog.pg_stat_all_tables
where relname = 'pgbench_history';
-[ RECORD 1 ]-+-
relname   | pgbench_history
autovacuum_count  | 3
total_autovacuum_time | 1689
average_autovac_time  | 563
vacuum_count  | 1
total_vacuum_time | 1
average_vac_time  | 1

It should be noted that the timing is only tracked
when the vacuum or analyze operation completes,
as is the case with the other metrics.
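
As a rough sketch of the kind of measurement involved (illustrative only,
not the attached patch; the function name is hypothetical), the elapsed
time would be captured with the backend's instr_time macros and reported
once the operation completes:

/* Illustrative sketch only -- not the attached patch. */
#include "postgres.h"
#include "portability/instr_time.h"

static void
vacuum_one_relation_timed(Oid relid)
{
	instr_time	starttime;
	instr_time	elapsed;

	INSTR_TIME_SET_CURRENT(starttime);

	/* ... perform the actual vacuum/analyze work here ... */

	INSTR_TIME_SET_CURRENT(elapsed);
	INSTR_TIME_SUBTRACT(elapsed, starttime);

	/*
	 * The patch would accumulate this into the per-table cumulative
	 * counter behind pg_stat_all_tables; here we just log it.
	 */
	elog(LOG, "vacuum of relation %u took %.3f ms",
		 relid, INSTR_TIME_GET_MILLISEC(elapsed));
}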

Also, there is another in-flight discussion [2] regarding
tracking vacuum run history in a view, but that serves a
different purpose, as it will provide via SQL all the metrics
that are otherwise only exposed in vacuum logging.
That history will also be required to drop
entries using some criteria to keep the cache from
growing indefinitely.

Feedback for the attached patch is appreciated!

Regards,

Sami Imseih
Amazon Web Services (AWS)

[1] 
https://www.postgresql.org/message-id/flat/CAGjGUAKQ4UBNdkjunH2qLsdUVG-3F9gCuG0Kb0hToo%2BuMmSteQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/b68ab452-c41f-4d04-893f-eaab84f1855b%40vondra.me


POC-track-vacuum-analyze-cumulative-time-per-table.patch
Description: Binary data


Re: IWYU annotations

2025-01-02 Thread Peter Eisentraut

On 09.12.24 17:37, Tom Lane wrote:

Peter Eisentraut  writes:

I have done a pass over much of the source code with
include-what-you-use (IWYU) to remove superfluous includes (commits
dbbca2cf299, 9be4e5d293b, ecb5af77987).  Along the way I have collected
some pragma annotations to deal with exceptions and special cases and
peculiarities of the PostgreSQL source code header structures (see [0]
for description).  Here I'm proposing a set of patches to add such
annotations in commonly useful cases that should deal with most of the
noise.


This seems to be going in the direction that there will be Yet Another
tool that committers have to know everything about in order to not
commit bad code.  I'm feeling resistant to that, mainly because I'm
far from convinced that IWYU brings us enough value to justify
everybody having to learn about it.


It's not realistic at the moment for it to be a tool that everyone needs 
to use and everyone needs to keep 100% clean.  We're certainly very far 
from that being feasible.


I see it useful in two areas:

First, when you add new files or move lots of code around, you can have 
it provide an informed opinion about what includes to keep and add. 
Because in practice that's mostly a crapshoot nowadays.  An example from 
a current patch under review:


$ iwyu_tool.py -p build src/backend/libpq/auth-oauth.c
...
../src/backend/libpq/auth-oauth.c should remove these lines:
- #include   // lines 19-19
- #include   // lines 18-18
- #include "storage/fd.h"  // lines 28-28

Second, clangd (the language server) has support for this also, and so 
depending on local configuration and preferences, it can highlight 
missing or redundant includes or even add some automatically as you 
edit.  (The latter obviously needs some manual supervision, but it is 
arguably kind of neat that you don't need to jump to the top and 
manually add includes as you type in new code that needs an additional 
header.)


But in order for either of this to work, it needs to have some 
information about basic PostgreSQL code conventions.  Otherwise, it will 
also do this:


../src/backend/libpq/auth-oauth.c should add these lines:
...
#include <stdbool.h>  // for false, bool, true
#include <stddef.h>   // for NULL, size_t
#include "c.h"        // for Assert, explicit_bzero, pg_strncasecmp

...

because it doesn't know that the convention is that you are supposed to 
include "postgres.h" (or one of the other always-first headers) and that 
everything it brings in should usually not be included again 
directly.  (Or conversely in some cases it will suggest to remove the 
include of "postgres.h" because it doesn't provide anything of use.)


So right now you get a bunch of noise and misleading information for 
each file and the whole clangd support is of limited use.  That's what 
my patch set is mainly trying to address.




In particular, this patchset introduces what seem like very
error-prone setups, such as in rmgrdesc.c where there's now one
group of #include's with "pragma: begin_keep/pragma: end_keep"
around it and another group without.  Most of us are likely
to just blindly stick a new #include into alphabetical order
somewhere in there and not notice that there's now an additional
concern.  The fact that that you've added precisely zero
documentation about what these pragmas are doesn't help.


It's a fair point that some documentation could be provided.  I suppose 
we don't want to verbosely explain each pragma individually.  Should 
there be some central explanation, maybe in src/tools/pginclude/README?


Note that if you google something like "IWYU pragma: export" it will take 
you to the upstream documentation as the first hit, so the full 
documentation is pretty easy to find.
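
For concreteness, the annotations under discussion are just comments next 
to the includes; an illustrative fragment (not the actual rmgrdesc.c hunk) 
might look like:

/* keep a single include that IWYU would otherwise flag as unused */
#include "utils/builtins.h"		/* IWYU pragma: keep */

/* keep a whole group of includes together */
/* IWYU pragma: begin_keep */
#include "access/brin_xlog.h"
#include "access/clog.h"
#include "access/commit_ts.h"
/* IWYU pragma: end_keep */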






RE: [RFC] Lock-free XLog Reservation from WAL

2025-01-02 Thread Zhou, Zhiguo
This message is a duplicate of 
ph7pr11mb5796659f654f9be983f3ad97ef...@ph7pr11mb5796.namprd11.prod.outlook.com. 
Please consider dropping this thread and reviewing the original one instead.

Sorry for the inconvenience.

-Original Message-
From: Zhou, Zhiguo  
Sent: Thursday, January 2, 2025 3:20 PM
To: pgsql-hackers@lists.postgresql.org
Subject: [RFC] Lock-free XLog Reservation from WAL

Hi all,

I am reaching out to solicit your insights and comments on a recent proposal 
regarding "Lock-free XLog Reservation from WAL." We have identified some 
challenges with the current WAL insertions, which require space reservations in 
the WAL buffer that involve updating two shared-memory fields in 
XLogCtlInsert: CurrBytePos (the start position of the current XLog record) and 
PrevBytePos (the prev-link to the previous XLog record). Currently, the use of 
XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, 
hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of CurrBytePos (a 
single uint64 field) to be implemented with an atomic operation (fetch_add); 
a rough illustrative sketch follows this list.
2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the 
next XLog always points to the head of the current XLog, we will slightly exceed 
the reserved memory range of the current XLog to update the prev-link of the 
next XLog, regardless of which backend acquires the next memory space. The next 
XLog inserter will wait until its prev-link is updated for CRC calculation 
before starting its own XLog copy into the WAL.
3. Breaking Sequential Write Convention: Each backend will update the prev-link 
of its next XLog first, then return to the header position for the current log 
insertion. This change will reduce the dependency of XLog writes on previous 
ones (compared with sequential writes).
4. Revised GetXLogBuffer: To support #3, we need to update this function to 
separate the LSN it intends to access from the LSN it expects to update in the 
insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing 
NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance 
parallelism.
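
As a rough, hypothetical illustration of point 1 above (not taken from the 
attached patch), the reservation of insert space reduces to a single atomic 
fetch-add once PrevBytePos is gone:

/* Illustrative sketch only -- not the attached patch. */
#include "postgres.h"
#include "port/atomics.h"

typedef struct XLogCtlInsertSketch
{
	/* initialized once at startup with pg_atomic_init_u64() */
	pg_atomic_uint64 CurrBytePos;	/* usable-byte position of the next record */
} XLogCtlInsertSketch;

/*
 * Reserve 'size' usable bytes and return the start position of the new
 * record.  No spinlock is needed; pg_atomic_fetch_add_u64() returns the
 * value the counter held before the addition.
 */
static uint64
ReserveXLogInsertSpace(XLogCtlInsertSketch *Insert, uint64 size)
{
	return pg_atomic_fetch_add_u64(&Insert->CurrBytePos, (int64) size);
}

The prev-link bookkeeping that the spinlock used to provide is what points 
2-4 take care of instead.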

The attached patch passes the regression tests (make check, make 
check-world), and the performance test of this POC on SPR (480 vCPUs) shows 
that this optimization helps the TPCC benchmark scale better with the core 
count; as a result, the performance with all cores enabled could be improved 
by 2.04x.

Before we proceed with further patch validation and refinement work, we are 
eager to hear the community's thoughts and comments on this optimization so 
that we can confirm our current work aligns with expectations.




RE: RFC: Lock-free XLog Reservation from WAL

2025-01-02 Thread Zhou, Zhiguo
This message is a duplicate of 
ph7pr11mb5796659f654f9be983f3ad97ef...@ph7pr11mb5796.namprd11.prod.outlook.com. 
Please consider dropping this thread and reviewing the original one instead.

Sorry for the inconvenience.

-Original Message-
From: Zhou, Zhiguo  
Sent: Thursday, January 2, 2025 5:15 PM
To: pgsql-hackers@lists.postgresql.org
Subject: RFC: Lock-free XLog Reservation from WAL

Hi all,

I am reaching out to solicit your insights and comments on a recent proposal 
regarding "Lock-free XLog Reservation from WAL." We have identified some 
challenges with the current WAL insertions, which require space reservations in 
the WAL buffer that involve updating two shared-memory fields in 
XLogCtlInsert: CurrBytePos (the start position of the current XLog record) and 
PrevBytePos (the prev-link to the previous XLog record). Currently, the use of 
XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, 
hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of CurrBytePos (a 
single uint64 field) to be implemented with an atomic operation (fetch_add).
2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the 
next XLog always points to the head of the current XLog, we will slightly exceed 
the reserved memory range of the current XLog to update the prev-link of the 
next XLog, regardless of which backend acquires the next memory space. The next 
XLog inserter will wait until its prev-link is updated for CRC calculation 
before starting its own XLog copy into the WAL.
3. Breaking Sequential Write Convention: Each backend will update the prev-link 
of its next XLog first, then return to the header position for the current log 
insertion. This change will reduce the dependency of XLog writes on previous 
ones (compared with sequential writes).
4. Revised GetXLogBuffer: To support #3, we need to update this function to 
separate the LSN it intends to access from the LSN it expects to update in the 
insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing 
NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance 
parallelism.

The attached patch passes the regression tests (make check, make 
check-world), and the performance test of this POC on SPR (480 vCPUs) shows 
that this optimization helps the TPCC benchmark scale better with the core 
count; as a result, the performance with all cores enabled could be improved 
by 2.04x.

Before we proceed with further patch validation and refinement work, we are 
eager to hear the community's thoughts and comments on this optimization so 
that we can confirm our current work aligns with expectations.





Re: read stream on amcheck

2025-01-02 Thread Matheus Alcantara
Thanks for reviewing!

On Thu, Jan 2, 2025 at 1:16 PM, Kirill Reshke wrote:
>
> However, this:
>
> >- if (skip_option == SKIP_PAGES_ALL_FROZEN)
> >- {
> >- if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> >- continue;
> >- }
> >-
> >- if (skip_option == SKIP_PAGES_ALL_VISIBLE)
> >- {
> >- if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> >- continue;
> >- }
>
> changes to this
> (in heapam_read_stream_next_block)
> >+
> >+ if ((p->skip_option == SKIP_PAGES_ALL_FROZEN && (mapbts & 
> >VISIBILITYMAP_ALL_FROZEN) != 0) ||
> >+ (p->skip_option == SKIP_PAGES_ALL_VISIBLE && (mapbts & 
> >VISIBILITYMAP_ALL_VISIBLE) != 0))
> > + continue;
>
> I don't understand this change. The patch aims to be purely applying
> streaming API, not refactoring. And if we refactor this code, this is
> less readable than it was.

Yeap, I agree. Attached is a fixed v2.

-- 
Matheus Alcantara


v2-0001-Use-read-stream-on-amcheck.patch
Description: Binary data


Re: Switching XLog source from archive to streaming when primary available

2025-01-02 Thread Andrey M. Borodin



> On 23 Mar 2024, at 14:22, Bharath Rupireddy 
>  wrote:
> 
> IMHO, it makes sense to have something like replay_source_order if
> there's any use case that arises in future requiring the standby to
> intentionally switch to pg_wal or archive. But not as part of this
> feature.

IMO, it's a vital part of the feature.

In my observation, restore from archive is orders of magnitude faster than 
streaming replication. Advanced archive tools employ compression (roughly a 6x 
speedup), download parallelism (4x), are not constrained by the primary's 
network limits (3x) or disk limits, do not depend on the complicated FEBE 
protocol, etc.
When I have to cope with a lagging replica, I almost always kill the 
walreceiver and tweak server readahead.

But there might be cases where you still have to attach a replica ASAP. I can 
think of releasing a replication slot, or a transient failure of the archive 
network or storage.

Finally, one might want to have many primary connections: a cascading replica 
might want to stream from any available host in the group of HA hosts.


Best regards, Andrey Borodin.



Re: IWYU annotations

2025-01-02 Thread Tom Lane
Peter Eisentraut  writes:
> On 09.12.24 17:37, Tom Lane wrote:
>> In particular, this patchset introduces what seem like very
>> error-prone setups, such as in rmgrdesc.c where there's now one
>> group of #include's with "pragma: begin_keep/pragma: end_keep"
>> around it and another group without.  Most of us are likely
>> to just blindly stick a new #include into alphabetical order
>> somewhere in there and not notice that there's now an additional
>> concern.  The fact that that you've added precisely zero
>> documentation about what these pragmas are doesn't help.

> It's a fair point that some documentation could be provided.  I suppose 
> we don't want to verbosely explain each pragma individually.  Should 
> there be some central explanation, maybe in src/tools/pginclude/README?

That might do, but perhaps instead in the "PostgreSQL Coding
Conventions" chapter of the SGML docs?  Actually, I think we could do
with a centralized explanation of our inclusion conventions --- I'm
not sure that the whole business of "postgres.h or a sibling must be
first" is explained in any easy-to-find place.  This topic would
likely fit well with such an explanation.

But really, the point I was trying to make above is that I don't
want this to break our very long-standing convention that headers
should be #include'd alphabetically and there is never a need to
impose some other order (at least not without lots of commentary
about it at the scene of the crime).  The way you've done it here
is just asking for trouble, IMO.  If that means redundant pragma
commands, so be it.

regards, tom lane




Re: read stream on amcheck

2025-01-02 Thread Kirill Reshke
On Thu, 2 Jan 2025 at 20:29, Matheus Alcantara  wrote:
>
> Hi,
>
> While reviewing some other patches implementing stream API for core 
> subsystems,
> I noticed that the amcheck extension could also benefit from that.
>
> Notice the refactor when handling the "skip" parameter; The logic was moved to
> the heapam_read_stream_next_block callback so that verify_heapam don't need to
> touch any private field of heapam_read_stream_next_block_private struct.
>
> One other think to mention is that the test cases of "skip" parameter
> that I've seen just test when the first page is corrupted, so I think
> that a carefully review on callback logic would be good to ensure that
> we don't accidentally skip a page when doing p->current_blocknum++;
>
> This patch doesn't show any performance improvements (or regression)
> but I think that it would be good to replace the ReadBufferExtended
> usage with the read stream API, so in the future it could be benefit
> from the AIO project.
>
> --
> Matheus Alcantara

Hi!
+1 on the idea

However, this:

>- if (skip_option == SKIP_PAGES_ALL_FROZEN)
>- {
>- if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
>- continue;
>- }
>-
>- if (skip_option == SKIP_PAGES_ALL_VISIBLE)
>- {
>- if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
>- continue;
>- }

changes to this
(in heapam_read_stream_next_block)
>+
>+ if ((p->skip_option == SKIP_PAGES_ALL_FROZEN && (mapbts & 
>VISIBILITYMAP_ALL_FROZEN) != 0) ||
>+ (p->skip_option == SKIP_PAGES_ALL_VISIBLE && (mapbts & 
>VISIBILITYMAP_ALL_VISIBLE) != 0))
> + continue;

I don't understand this change. The patch aims to be purely applying
streaming API, not refactoring. And if we refactor this code, this is
less readable than it was.

Other than that, LGTM.

-- 
Best regards,
Kirill Reshke




Re: Doc: fix the rewrite condition when executing ALTER TABLE ADD COLUMN

2025-01-02 Thread Robert Treat
On Tue, Dec 3, 2024 at 3:13 AM Masahiro Ikeda  wrote:
>
> Hi,
>
> The documentation seems to overlook the rewrite condition
> when executing ALTER TABLE ADD COLUMN.
>
> The current document states that a volatile DEFAULT will
> trigger a rewrite of the table and its indexes. However, the
> table and its indexes will also be rewritten when an IDENTITY
> column is added, or when a column with a domain data type that
> has constraints is added.
>
> What do you think?
>

We still see a number of people asking (or confused) about table
rewrites when adding columns, so I think the initial tip should
remain, though I think it can be cleaned up a little.

In the second section (alter_table.sgml) I liked the idea of adding
these additional examples, though I tweaked the wording a bit to
(hopefully) make it a little easier to read.

Modified patch attached.

Robert Treat
https://xzilla.net


v2-0001-Doc-fix-the-rewrite-condition-when-executing-ALTE.patch
Description: Binary data


Further _bt_first simplifications for parallel index scans

2025-01-02 Thread Peter Geoghegan
Attached patch goes a bit further with simplifying _bt_first's
handling of seizing the parallel scan. This continues recent work from
commits 4e6e375b and b5ee4e52.

Aside from requiring less code, the new structure relieves _bt_first
from having separate calls to _bt_start_array_keys for the serial and
parallel cases. It also places emphasis on the idea that it's expected
that calls to _bt_first will go through _bt_readfirstpage (not
_bt_readnextpage). While it is of course possible for a parallel
scan's _bt_first to call _bt_readnextpage instead, that is the
exception, not the rule.

-- 
Peter Geoghegan


v1-0001-Simplify-_bt_first.patch
Description: Binary data


Allow NOT VALID foreign key constraints on partitioned tables.

2025-01-02 Thread Amul Sul
Hi,

While working on NOT ENFORCED constraints[1], which are by default marked as NOT
VALID, I encountered an error when adding a NOT ENFORCED foreign key (FK)
constraint to a partitioned table [2]. Alvaro also confirmed off-list that NOT
VALID FK constraints have not yet been implemented. This patch addresses that
gap.

When adding a new FK constraint or attaching a partitioned table, where
matching FK constraints are merged, we allow the parent constraint to be NOT
VALID while the child constraint remains VALID, which is harmless. However, the
reverse scenario -- where the parent constraint is VALID and the child is NOT
VALID -- is incorrect. To address this, when merging a NOT VALID FK constraint
from the child with a VALID parent constraint, it implicitly validates the
child constraint against its existing data and marks it as VALID. This behavior
aligns with adding a new FK constraint directly to the child table, which would
also validate the existing data.

The 0001 patch focuses on code refactoring and does not modify or introduce new
behaviors. It splits ATExecValidateConstraint() into two separate functions for
handling FK and CHECK constraints. For this feature, I wanted to reuse the FK
validation logic and make it recursive for partitioned tables, necessitating
its separation. Although CHECK constraint validation isn't required for this
work, separating it simplifies ATExecValidateConstraint() and prepares the
codebase for adding support for other constraint types (e.g., NOT NULL) in the
future. Additional changes in the refactoring include renaming the variable
tuple to contuple within these functions, duplicating a few lines of code that
update pg_constraint.convalidated, and running pgindent, which rearranged the
code and comments. I hope the duplication is not a significant concern.

Please review the attached patches. Any comments or suggestions would
be greatly appreciated. Thank you!

1] 
https://postgr.es/m/caaj_b962c5acyw9kut_r_er5qs3fugbe4az-sp-vuwps-w-...@mail.gmail.com
2] 
https://postgr.es/m/caaj_b94+0-yfj4lopvqz_+c7ckkuya77g_5rgtjvnuyepuh...@mail.gmail.com

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com
From 7b03ff68ae24ded5547a9e268be8692b196cf509 Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Thu, 2 Jan 2025 17:15:30 +0530
Subject: [PATCH v1 1/2] Refactor: Split ATExecValidateConstraint()

Splitting ATExecValidateConstraint() into two separate functions to
handle FOREIGN KEY and CHECK constraints respectively. The next patch
requires the FOREIGN KEY validation code in a separate function so it
can be called directly. Additionally, the function needs to be
recursive to handle NOT VALID constraints on declarative partitions.

Although moving the CHECK constraint-related code is not strictly
necessary for this task, maintaining separate code for FOREIGN KEY
constraints makes it logical to do the same for CHECK constraints.
This separation will also simplify adding support for other
constraints (e.g., NOT NULL) in ATExecValidateConstraint() in the
future.
---
 src/backend/commands/tablecmds.c | 282 +++
 1 file changed, 171 insertions(+), 111 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 33ea619224b..afe71fb7e8e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -394,6 +394,11 @@ static ObjectAddress ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd,
 static bool ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel,
 	 Relation rel, HeapTuple contuple, List **otherrelids,
 	 LOCKMODE lockmode);
+static void QueueFKConstraintValidation(List **wqueue, Relation conrel, Relation rel,
+		HeapTuple contuple, LOCKMODE lockmode);
+static void QueueCheckConstraintValidation(List **wqueue, Relation conrel, Relation rel,
+		   char *constrName, HeapTuple contuple,
+		   bool recurse, bool recursing, LOCKMODE lockmode);
 static ObjectAddress ATExecValidateConstraint(List **wqueue,
 			  Relation rel, char *constrName,
 			  bool recurse, bool recursing, LOCKMODE lockmode);
@@ -11951,6 +11956,169 @@ ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel,
 	return changed;
 }
 
+/*
+ * QueueFKConstraintValidation
+ *
+ * Add an entry to the wqueue to validate the given foreign key constraint in
+ * Phase 3 and update the convalidated field in the pg_constraint catalog
+ * for the specified relation.
+ */
+static void
+QueueFKConstraintValidation(List **wqueue, Relation conrel, Relation rel,
+			HeapTuple contuple, LOCKMODE lockmode)
+{
+	Form_pg_constraint con;
+	AlteredTableInfo *tab;
+	HeapTuple	copyTuple;
+	Form_pg_constraint copy_con;
+
+	con = (Form_pg_constraint) GETSTRUCT(contuple);
+	Assert(con->contype == CONSTRAINT_FOREIGN);
+
+	if (rel->rd_rel->relkind == RELKIND_RELATION)
+	{
+		NewConstraint *newcon;
+		Constraint *fkconstraint;
+
+		/* Queue validatio

Re: Allow NOT VALID foreign key constraints on partitioned tables.

2025-01-02 Thread Álvaro Herrera


On Thu, Jan 2, 2025, at 5:49 PM, Amul Sul wrote:
> When adding a new FK constraint or attaching a partitioned table, where
> matching FK constraints are merged, we allow the parent constraint to be NOT
> VALID while the child constraint remains VALID, which is harmless. However, 
> the
> reverse scenario -- where the parent constraint is VALID and the child is NOT
> VALID -- is incorrect. To address this, when merging a NOT VALID FK constraint
> from the child with a VALID parent constraint, it implicitly validates the
> child constraint against its existing data and marks it as VALID. This 
> behavior
> aligns with adding a new FK constraint directly to the child table, which 
> would
> also validate the existing data.

Hmm, I'm not sure about this; it may cause surprising delays. Maybe it would 
be better for the operation to fail with an error, so that the user can do 
VALIDATE CONSTRAINT explicitly and retry the ATTACH once all the partitions 
have been so processed.

Re: magical eref alias names

2025-01-02 Thread Robert Haas
On Sun, Dec 22, 2024 at 12:45 PM Tom Lane  wrote:
> > Hmm, OK, that's useful. But I guess I'm still puzzled about the theory
> > here. A name like *VALUES* doesn't seem like it was created with the
> > idea of referring to it from some other part of your query. I do take
> > your point that it works and somebody's probably relying on it, but I
> > don't think you'd pick a name that requires quoting if you were trying
> > to make it easy to use that name in the query.
>
> As you say, some of this is lost in the mists of time; but I think the
> idea was specifically that these names should *not* be easy to type,
> because we don't want them to conflict with any relation alias that
> the user is likely to pick.  I'm fairly sure that the SQL spec
> actually has verbiage to the effect of "the implementation shall
> select a name that does not conflict with any other name in the query".

OK, but picking names that the user probably won't use is neither
necessary nor sufficient to guarantee uniqueness.

> > You might possibly also
> > try to generate names that are easy for users to guess, and distinct.
>
> Yeah, per spec they should be distinct; but somebody didn't bother
> with that early on, and now we've ended up in a situation where
> ruleutils.c does it instead.  I'm not sure that that's terribly evil.
> In particular, in a situation where we're trying to show a plan for
> a query with inlined views, EXPLAIN would probably have to have code
> to unique-ify the names anyway --- there's no way we're going to make
> these nonce names globally unique, so the view(s) might contain names
> that conflict with each other or the outer query.

When you say "there's no way we're going to make these nonce names
globally unique," is that because you think it would be too costly
from a performance perspective (which was my concern) or that you
think it's flat-out impossible for some reason (which doesn't seem to
me to be true)?

> > Yeah, I don't really want to be the one to break somebody's query that
> > explicitly references "*VALUES*" or whatever. At least not without a
> > better reason than I currently have. If this were just a display
> > artifact I think finding some way to clean it up would be pretty
> > worthwhile, but I would need a better reason to break working SQL.
>
> So it seems like we're coming to the conclusion that we don't want
> to change things here?
>
> The reason I want to push for a conclusion is that the discussion
> about "*VALUES*" over at [1] is on hold pending some decision here.
> The v3 patch in that thread is set up in such a way that it improves
> EXPLAIN output without breaking any existing SQL, and that's what
> I'd use if we refrain from changing things here.  But it's a tad
> inconsistent, so if we did decide it was okay to break some edge
> cases, I'd reconsider.

I think that if we can solve multiple problems all at once by breaking
some edge cases, that's worth considering. There probably aren't that
many people who have queries that reference the "*VALUES*" alias, so
if we wanted to change that to "values" I don't see that as completely
out of the question. Somebody will probably be inconvenienced but if
it solves other problems maybe it's worth it. But I don't think that
change by itself really helps anything here.

> Perhaps a similar idea could be applied to the other cases where
> we invent aliases, but it's less obvious how.  For instance in
>
> SELECT ... UNION SELECT ...
>
> there's no obvious place where we could get names for the
> two sub-SELECTs.

If we had global uniqueness, or even per-subquery-level uniqueness,
this would sort itself out somewhat nicely, I think. We would just end
up with select_1 and select_2 or union_1 and union_2 or something like
that, and I think that would be strictly better than the status quo
where we do sometimes generate *SELECT* %d, but also sometimes just
*SELECT* and other times unnamed_subquery, and also only ever *VALUES*
and not *VALUES* %d.

-- 
Robert Haas
EDB: http://www.enterprisedb.com