Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Feb 14, 2025 at 12:11 PM Melanie Plageman wrote:
> I've done some clean-up including incorporating a few off-list pieces
> of minor feedback from Andres.

I've been poking, reading, and trying out these patches. They look good to me.

Tiny nit, maybe this comment could say something less obvious, cf the similar comment near the other stream:

+	/* Set up the read stream */
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,

I don't really love the cumbersome casting required around per_buffer_data, but that's not your patches' fault (hmm, wonder what we can do to improve that).
Re: Parallel heap vacuum
On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada wrote:
> During eager vacuum scan, we reset the eager_scan_remaining_fails
> counter when we start to scan the new region. So if we want to make
> parallel heap vacuum behaves exactly the same way as the
> single-progress vacuum in terms of the eager vacuum scan, we would
> need to have the eager_scan_remaining_fails counters for each region
> so that the workers can decrement it corresponding to the region of
> the block that the worker is scanning. But I'm concerned that it makes
> the logic very complex. I'd like to avoid making newly introduced
> codes more complex by adding yet another new code on top of that.

Would it be simpler to make only phase III parallel? In other words, how much of the infrastructure and complexity needed for parallel phase I is also needed for phase III?

--
John Naylor
Amazon Web Services
Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding
Hi Ajin,

Some review comments for the patch v13-0002.

==
src/backend/replication/logical/reorderbuffer.c

1. GENERAL

I felt that a summary/overview of how all this filter/cache logic works should be given in the large file header comment at the top of this file. There may be some overlap with comments that are already in the "RelFileLocator filtering" section.

~~~

ReorderBufferRelFileLocatorEnt:

2.
+/* entry for hash table we use to track if the relation can be filtered */
+typedef struct ReorderBufferRelFileLocatorEnt

/* Hash table entry used to determine if the relation can be filtered. */

~~~

ReorderBufferQueueChange:

3.
+	/*
+	 * If we're not filtering and we've crossed the change threshold,
+	 * attempt to filter again
+	 */

SUGGESTION
If filtering was suspended, and we've crossed the change threshold then reenable filtering.

~~~

ReorderBufferGetRelation:

4.
+static Relation
+ReorderBufferGetRelation(ReorderBuffer *rb, RelFileLocator *rlocator,
+						 bool has_tuple)

Would a better name be ReorderBufferGetRelationForDecoding(). Yeah, it's a bit long but perhaps it explains the context/purpose better.

~~~

5.
+	if (filterable)
+	{
+		RelationClose(relation);
+		return NULL;
+	}

I wonder if some small descriptive comment would be helpful here just to say we are returning NULL to indicate that this relation is not needed and yada yada...

~~~

RelFileLocatorCacheInvalidateCallback:

6.
+	/* slightly inefficient algorithm but not performance critical path */
+	while ((entry = (ReorderBufferRelFileLocatorEnt *) hash_seq_search(&status)) != NULL)
+	{
+		/*
+		 * If relid is InvalidOid, signaling a complete reset, we must remove
+		 * all entries, otherwise just remove the specific relation's entry.
+		 * Always remove negative cache entries.
+		 */
+		if (relid == InvalidOid ||			/* complete reset */
+			entry->relid == InvalidOid ||	/* negative cache entry */
+			entry->relid == relid)			/* individual flushed relation */

6a. Maybe uppercase that 1st comment.

~

6b. It seems a bit unusual to be referring to "negative cache entries". I thought it should be described in terms of InvalidOid since that is what it is using in the condition.

~

6c. If the relid parameter can take special values like "If relid is InvalidOid, signaling a complete reset" that sounds like the kind of thing that should be documented in the function comment.

~~~

ReorderBufferFilterByRelFileLocator

7. Despite the extra indenting required, I wondered if the logic may be easier to read (e.g. it shows the association of the rb->can_filter_change and entry->filterable more clearly) if this is refactored slightly by sharing a single common return like below:

BEFORE
...
+	key.reltablespace = rlocator->spcOid;
+	key.relfilenumber = rlocator->relNumber;
+	entry = hash_search(RelFileLocatorFilterCache, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		rb->can_filter_change = entry->filterable;
+		return entry->filterable;
+	}
...
+	rb->can_filter_change = entry->filterable;
...
+	return entry->filterable;
+}

AFTER
...
+	key.reltablespace = rlocator->spcOid;
+	key.relfilenumber = rlocator->relNumber;
+	entry = hash_search(RelFileLocatorFilterCache, &key, HASH_ENTER, &found);
+
+	if (!found)
+	{
...
+	}
+
+	rb->can_filter_change = entry->filterable;
+	return entry->filterable;
+}

==
src/include/replication/reorderbuffer.h

8.
+	/* should we try to filter the change? */
+	bool		can_filter_change;
+

I think most of my difficulty reading this patch was due to this field name 'can_filter_change'. 'can_filter_change' somehow sounded to me like it is past tense, e.g. like as if we already found some change and yes, we CAN filter it. But AFAICT the real meaning is simply that (when the flag is true) we are ALLOWED to check to see if there is anything filterable. In fact, the change may or may not be filterable. Can this be renamed to something more "correct"? e.g.
- 'allow_change_filtering'
- 'enable_change_filtering'
- etc.

~~

9.
+	/* number of changes after a failed attempt at filtering */
+	int8		processed_changes;

Maybe 'unfiltered_changes_count' is a better name for this field?

~~~

10.
+extern bool ReorderBufferCanFilterChanges(ReorderBuffer *rb);

Should match the 'can_filter_change' field name, so if you change that (see comment #8), then change this too.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)
On Tue, Feb 11, 2025 at 02:05:18PM +, Bertrand Drouvot wrote:
>> 2. I have a small suggestion for pg_stat_statements: would it make sense to
>> move wal_buffers_full next to wal_records, wal_fpi and wal_bytes? This way,
>> all WAL-related information would be grouped together.
>
> I think I prefer to add it in "append" order. That way, that does not break
> queries that rely on ordinal numbers. Not sure.

FWIW, it makes sense to me to group them by "family" in this case, as they would belong to the same instrument structure.
--
Michael
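For illustration of the ordering question: wal_records, wal_fpi and wal_bytes already exist in pg_stat_statements, while wal_buffers_full is the column this patch would add, so a hypothetical monitoring query under the "grouped by family" layout might read:

    SELECT queryid,
           wal_records,
           wal_fpi,
           wal_bytes,
           wal_buffers_full     -- proposed new column, not in released versions
    FROM pg_stat_statements
    ORDER BY wal_bytes DESC
    LIMIT 10;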
Re: DOCS - inactive_since field readability
On Fri, Feb 14, 2025 at 4:04 AM Peter Smith wrote:
>
> On Tue, Feb 11, 2025 at 10:10 PM Amit Kapila wrote:
> > ...
> > The change in 0001 looks odd after seeing it in HTML format. We should
> > either add one empty line between two paragraphs otherwise it doesn't
> > appear good. Did you see multi-paragraphs in any other column
> > definitions?
> >
>
> Patch 0001
>
> The pg_replication_slots system view is unusual in that there can be
> entirely different descriptions of the same field depending on the
> context, such as:
> a- for logical slots
> b- for physical slots
> c- for primary servers versus standby servers
>
> IIUC your 0001 feedback says that a blank line might be ok, but just
> doing it for 'active_since' and nothing else makes it look odd.
>

No, I meant to say that the description didn't look any better to me even after your 0001 patch. The second paragraph started immediately on the next line, which doesn't make it look any better. If we really want to make it look better then one more additional line is required. However, I don't want to go in that direction unless we have some history of writing the docs similarly. I suggest we go with your 0002 patch as that makes the description concise and clear.

--
With Regards,
Amit Kapila.
Re: Change GUC hashtable to use simplehash?
Hi Anton, could you please test if the attached passes for you? This seems the simplest way.

--
John Naylor
Amazon Web Services

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index e07c0226c1..bb09f87abe 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -290,13 +290,7 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 		str += FH_SIZEOF_ACCUM;
 	}
 
-	/*
-	 * The byte corresponding to the NUL will be 0x80, so the rightmost bit
-	 * position will be in the range 7, 15, ..., 63. Turn this into byte
-	 * position by dividing by 8.
-	 */
-	remainder = pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
-	fasthash_accum(hs, str, remainder);
+	remainder = fasthash_accum_cstring_unaligned(hs, str);
 	str += remainder;
 
 	return str - start;
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Thu, Feb 13, 2025 at 5:36 PM Shlok Kyal wrote: > > On Thu, 13 Feb 2025 at 15:20, Shubham Khanna > wrote: > > > > On Tue, Feb 11, 2025 at 9:56 PM Shlok Kyal wrote: > > > > > > On Tue, 11 Feb 2025 at 09:51, Shubham Khanna > > > wrote: > > > > > > > > On Fri, Feb 7, 2025 at 7:46 AM Hayato Kuroda (Fujitsu) > > > > wrote: > > > > > > > > > > Dear Shubham, > > > > > > > > > > Thanks for updating the patch. > > > > > > > > > > Previously you told that you had a plan to extend the patch to drop > > > > > other replication > > > > > objects [1], but I think it is not needed. pg_createsubscriber has > > > > > already been > > > > > able to drop the existing subscrisubscriptions in > > > > > check_and_drop_existing_subscriptions(). > > > > > As for the replication slot, I have told in [2], it would be created > > > > > intentionally > > > > > thus I feel it should not be dropped. > > > > > Thus I regard the patch does not have concrete extending plan. > > > > > > > > > > Below part contains my review comment. > > > > > > > > > > 01. Option name > > > > > > > > > > Based on the above discussion, "--cleanup-publisher-objects" is not > > > > > suitable because > > > > > it won't drop replication slots. How about "--cleanup-publications"? > > > > > > > > > > > > > I have changed the name of the option to > > > > "--cleanup-existing-publications" > > > > > > > > > 02. usage() > > > > > ``` > > > > > + printf(_(" -C --cleanup-publisher-objects drop all > > > > > publications on the logical replica\n")); > > > > > ``` > > > > > > > > Fixed. > > > > > > > > > s/logical replica/subscriber > > > > > > > > > > 03. drop_all_publications > > > > > ``` > > > > > +/* Drops all existing logical replication publications from all > > > > > subscriber > > > > > + * databases > > > > > + */ > > > > > +static void > > > > > ``` > > > > > > > > > > Initial line of the comment must be blank [3]. > > > > > > > > > > > > > Removed this function. > > > > > > > > > 04. main > > > > > ``` > > > > > + {"cleanup-publisher-objects", no_argument, NULL, 'C'}, > > > > > ``` > > > > > > > > > > Is there a reason why upper case is used? I feel lower one is enough. > > > > > > > > > > > > > Fixed. > > > > > > > > > 05. main > > > > > ``` > > > > > + /* Drop publications from the subscriber if requested */ > > > > > + if (opt.cleanup_publisher_objects) > > > > > + drop_all_publications(dbinfo); > > > > > ``` > > > > > > > > > > After considering more, I noticed that we have already called > > > > > drop_publication() > > > > > in the setup_subscriber(). Can we call drop_all_publications() there > > > > > instead when > > > > > -C is specified? > > > > > > > > > > > > > I agree with you on this. I have removed the drop_all_publication() > > > > and added the "--cleanup-existing-publications" option to the > > > > drop_publication() > > > > > > > > > 06. 
040_pg_createsubscriber.pl > > > > > > > > > > ``` > > > > > +$node_s->start; > > > > > +# Create publications to test it's removal > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > > TABLES;"); > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > > TABLES;"); > > > > > + > > > > > +# Verify the existing publications > > > > > +my $pub_count_before = > > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > > +is($pub_count_before, '2', > > > > > + 'two publications created before --cleanup-publisher-objects > > > > > is run'); > > > > > + > > > > > +$node_s->stop; > > > > > ``` > > > > > > > > > > I feel it requires unnecessary startup and shutdown. IIUC, creating > > > > > publications and check > > > > > counts can be before stopping the node_s, around line 331. > > > > > > > > > > > > > Fixed. > > > > > > > > > 07. 040_pg_createsubscriber.pl > > > > > > > > > > ``` > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > > TABLES;"); > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > > TABLES;"); > > > > > + > > > > > +# Verify the existing publications > > > > > +my $pub_count_before = > > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > > +is($pub_count_before, '2', > > > > > + 'two publications created before --cleanup-publisher-objects > > > > > is run'); > > > > > + > > > > > ``` > > > > > > > > > > Also, there is a possibility that CREATE PUBLICATION on node_p is not > > > > > replicated yet > > > > > when SELECT COUNT(*) is executed. Please wait for the replay. > > > > > > > > > > [1]: > > > > > https://www.postgresql.org/message-id/CAHv8RjL4OvoYafofTb_U_JD5HuyoNowBoGpMfnEbhDSENA74Kg%40mail.gmail.com > > > > > [2]: > > > > > https://www.postgresql.org/message-id/OSCPR01MB1496664FDC38DA40A441F449FF5EE2%40OSCPR01MB14966.jpnprd01.prod.outlook.com > > > > > [3]: https://www.postgresql.org/docs/devel/source-format.html > > > > > > > > > >
Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)
On Tue, Feb 11, 2025 at 09:37:37AM +, Bertrand Drouvot wrote:
> While at it, adding 0004 to report wal_buffers_full in VACUUM/ANALYZE
> (VERBOSE).

Thanks for summarizing the history behind WalUsage and wal_buffers_full.

FWIW, 0001 that moves wal_buffers_full from PgStat_PendingWalStats to WalUsage is going to make our lives much easier for your pending patch that adds backend statistics for WAL. WAL write/sync numbers/times will be covered in the backend stats by pg_stat_io, allowing us to remove entirely the dependency on PgStat_PendingWalStats. I have been wondering for a bit if the comment at the top of WalUsage should be updated, but the current description fits as well with the follow-up patch series. Anyway, if there are no objections, I am planning to apply this one to lift this barrier and help with the follow-up work for the backend stats. That's the mandatory piece to make the backend WAL stats integration easier.

0002, 0003 and 0004 are straight-forward follow-ups. It's IMO one of these things where extra data is cheap to have access to, and can be useful to be aware of when distributed across multiple contexts like queries, plans or even EXPLAIN. So no real objections here. If others have comments, feel free.

show_wal_usage() should have its check on (usage->wal_buffers_full) in explain.c, as Ilia has mentioned. It's true that you would not get a (wal_buffers_full > 0) if at least the condition on wal_bytes is not satisfied, but the addition makes sense on consistency grounds, at least.

Agreed about the attribute ordering in pgss with everything associated to WalUsage grouped together, btw.
--
Michael
Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
On Thu, Feb 13, 2025 at 09:53:52AM -0800, Jacob Champion wrote: > I guess I'm going to zero in on your definition of "may know nothing > about". If that is true, something is very wrong IMO. > > My understanding of the backend code was that port->peer is only set > after OpenSSL has verified that the certificate has been issued by an > explicitly trusted CA. (Our verify_cb() doesn't override the internal > checks to allow failed certificates through.) That may not be enough > to authorize entry into the server, but it also shouldn't be untrusted > data. If a CA is issuing Subject data that is somehow dangerous to the > operation of the server, I think that's a security problem in and of > itself: there are clientcert HBA modes that don't validate the > Subject, but they're still going to push that data into the catalogs, > aren't they? Is that the case before we finish authentication now? I was not sure how much of this data is set before and after finishing authentication, tracking that this was part of the init phase of the connection, something we do earlier than the actual authentication. It feels like this should be documented more clearly in the patch if pgstat_bestart_security() is allowed to be called multiple times (I think we should not allow that). That's quite unclear now in the patch; on HEAD everything is set after authentication completes. > So if we're concerned that Subject data is dangerous at this point in > the code, I agree that my patch makes it even more dangerous and I'll > modify it -- but then I'm going to split off another thread to try to > fix that underlying issue. A user should not have to be authorized to > access the server in order for signed authentication data from the > transport layer to be considered trustworthy. Being able to monitor > those separately is important for auditability. Making this information visible in the catalogs for already logged in users increases the potential of problems, and this is a sensitive area of the code, so.. >> As a whole we still have a gap between what could be OK, what could >> not be OK, and the fact that pgstat_bestart_security() is called twice >> makes that confusing. > > My end goal is that all of this _should_ always be OK, so calling it > once or twice or thirty times should be safe. (But again, if that's > not actually true today, I'll remove it for now.) The concept of making pgstat_bestart_security() callable multiple times relates also to the issue pointed upthread by Andres, somewhat: allowing this pattern may lead to errors in the future regarding this that should or should not be set this early in the authentication process. Sounds just saner to me to do pgstat_bestart_security() once for now once we're OK with authentication, and it does not prevent to address your initial point of being able to know if backends with a given PID are stuck at a certain point. At least that's a step towards more debuggability as the backend entries are available earlier than they are now. Getting ready for tomatoes now, please aim freely. > I agree with Robert's goal in general. I just think that following > that rule to *this* extent is counterproductive. But I won't die on > that hill; in the end, I just want to be able to see when LDAP calls > hang. Understood. -- Michael signature.asc Description: PGP signature
Re: DROP CONSTRAINT, printing a notice and/or skipping when no action is taken
Oof, the subject line was meant to be DROP DEFAULT, not constraint

On Thu, Feb 13, 2025 at 11:13 AM Andrew Atkinson wrote:
> Hello. I noticed a small opportunity for a possible enhancement to DROP
> DEFAULT, and wanted to share the idea. Apologies if this idea was suggested
> before, I tried a basic search of pgsql-hackers for similar things but didn't
> find a hit.
>
> I noticed when running an ALTER TABLE with DROP DEFAULT, whether the
> column default exists or not, ALTER TABLE is always printed as the result.
> This is arguably slightly confusing, because it's unclear if anything was
> done. In the scenario where there is no column default, there isn't a
> message saying "skipped" or something equivalent, indicating that there was
> no default that was dropped. Some of the commands in Postgres do have this
> kind of feedback, so it seems like an opportunity for greater consistency.
>
> For example: if I create a column default, or repeatedly run the following
> ALTER TABLE statements for the "id_new" column, I always get ALTER TABLE
> back.
>
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT;
> ALTER TABLE
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT;
> ALTER TABLE
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT;
> ALTER TABLE
>
> An opportunity would be to add a NOTICE type of message when ALTER TABLE
> ALTER COLUMN DROP DEFAULT is issued, at least when no column default
> exists, and no action was taken. In that scenario, the operation could
> possibly be skipped altogether, which might have some additional benefits.
>
> As a refresher on a "Notice" type of message, here's an example when
> adding an index and using the "if not exists" clause (an equivalent "if not
> exists" clause does not exist for DROP DEFAULT to my knowledge):
>
> -- an index called "foo" already exists
> psql> create index if not exists foo on organizations (id);
> NOTICE: relation "foo" already exists, skipping
> CREATE INDEX
>
> The message being "NOTICE: relation "foo" already exists, skipping"
>
> A similar message for DROP DEFAULT might look like:
>
> "NOTICE: default does not exist, skipping"
>
> Or an alternative that includes the column name might look like:
>
> "NOTICE: default does not exist for column id_new, skipping"
>
> Or another alternative might be a new (non-standard?) "if exists" clause
> for DROP DEFAULT. Example:
>
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT IF EXISTS;
>
> -- Or an alternative placement of the "if exists" clause, because I don't
> really know where it would go:
> ALTER TABLE my_table ALTER COLUMN id_new DROP IF EXISTS DEFAULT;
>
> Thanks!
> - Andrew
Re: Restrict publishing of partitioned table with a foreign table as partition
On Thu, 13 Feb 2025 at 20:12, vignesh C wrote: > > On Thu, 13 Feb 2025 at 15:50, Shlok Kyal wrote: > > > > > > I have fixed the issue. Attached the updated v6 patch. > > There is another concurrency issue: > In case of create publication for all tables with > publish_via_partition_root we will call check_foreign_tables: > @@ -876,6 +876,10 @@ CreatePublication(ParseState *pstate, > CreatePublicationStmt *stmt) > /* Associate objects with the publication. */ > if (stmt->for_all_tables) > { > + /* Check if any foreign table is a part of partitioned table > */ > + if (publish_via_partition_root) > + check_foreign_tables(stmt->pubname); > > At the time of check in check_foreign_tables, there are no foreign > tables so this check will be successful: > +check_foreign_tables_in_schema(Oid schemaid, char *pubname) > +{ > + RelationclassRel; > + ScanKeyData key[2]; > + TableScanDesc scan; > + HeapTuple tuple; > + > + classRel = table_open(RelationRelationId, AccessShareLock); > + > + ScanKeyInit(&key[0], > + Anum_pg_class_relnamespace, > + BTEqualStrategyNumber, F_OIDEQ, > + schemaid); > + ScanKeyInit(&key[1], > + Anum_pg_class_relkind, > + BTEqualStrategyNumber, F_CHAREQ, > + CharGetDatum(RELKIND_PARTITIONED_TABLE)); > + > + scan = table_beginscan_catalog(classRel, 2, key); > + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) > > Now immediately after execution of this, create a foreign table: > postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES > FROM (10) TO (15) SERVER fdw; > CREATE FOREIGN TABLE > > And then continue execution of create publication, it will also be successful: > postgres=# create publication pub1 for all tables with ( > publish_via_partition_root =true); > CREATE PUBLICATION > > One probable way to fix this is to do the search similar to > check_foreign_tables_in_schema where we can skip including schemaid > key for all tables. > > How about something like the attached patch. > Hi Vignesh, I have used the changes suggested by you. Also I have updated the comments and the function name. Thanks and Regards, Shlok Kyal v7-0001-Restrict-publishing-of-partitioned-table-with-for.patch Description: Binary data
Re: AIO v2.3
On Tue, Feb 11, 2025 at 12:10 AM Andres Freund wrote: >> TLDR; in terms of SELECTs the master vs aioworkers looks very solid! > Phew! Weee! Yay. Another good news: I've completed a full 24h pgbench run on the same machine and it did not fail or report anything suspicious. FYI, patchset didn't not apply anymore (seems patches 1..6 are already applied on master due to checkpoint shutdown sequence), but there was a failed hunk in patch #12 yesterday too: [..] patching file src/backend/postmaster/postmaster.c Hunk #10 succeeded at 2960 (offset 14 lines). Hunk #11 FAILED at 3047. [..] 1 out of 15 hunks FAILED -- saving rejects to file src/backend/postmaster/postmaster.c.rej anyway, so on master @ a5579a90af05814eb5dc2fd5f68ce803899d2504 (~ Jan 24) to have clean apply I've used the below asserted build: meson setup build --reconfigure --prefix=/usr/pgsql18.aio --debug -Dsegsize_blocks=13 -Dcassert=true /usr/pgsql18.aio/bin/pgbench -i -s 500 --partitions=100 # ~8GB /usr/pgsql18.aio/bin/pgbench -R 1500 -c 100 -j 4 -P 1 -T 86400 with some add-on functionalities: effective_io_concurrency = '4' shared_buffers = '2GB' max_connections = '1000' archive_command = 'cp %p /dev/null' archive_mode = 'on' summarize_wal = 'on' wal_summary_keep_time = '1h' wal_compression = 'on' wal_log_hints = 'on' max_wal_size = '1GB' shared_preload_libraries = 'pg_stat_statements' huge_pages = 'off' wal_receiver_status_interval = '1s' so the above got perfect run: [..] duration: 86400 s number of transactions actually processed: 129615534 number of failed transactions: 0 (0.000%) latency average = 5.332 ms latency stddev = 24.107 ms rate limit schedule lag: avg 0.748 (max 1992.517) ms initial connection time = 124.472 ms tps = 1500.179231 (without initial connection time) > > I was kind of afraid that additional IPC to separate processes would put > > workers at a disadvantage a little bit , but that's amazingly not true. > > It's a measurable disadvantage, it's just more than counteracted by being able > to do IO asynchronously :). > > It's possible to make it more visible, by setting io_combine_limit = 1. If you > have a small shared buffers with everything in the kernel cache, the dispatch > overhead starts to be noticeable above several GB/s. But that's ok, I think. Sure it is. > > 2. my very limited in terms of time data analysis thoughts > > - most of the time perf with aioworkers is identical (+/- 3%) as of > > the master, in most cases it is much BETTER > > I assume s/most/some/ for the second most? Right, pardon for my excited moment ;) > > - on parallel seqscans "sata" with datasets bigger than VFS-cache > > ("big") and high e_io_c with high client counts(sigh!), it looks like > > it would user noticeable big regression but to me it's not regression > > itself, probably we are issuing way too many posix_fadvise() > > readaheads with diminishing returns. Just letting you know. Not sure > > it is worth introducing some global (shared aioworkers e_io_c > > limiter), I think not. I think it could also be some maintenance noise > > on that I/O device, but I have no isolated SATA RAID10 with like 8x > > HDDs in home to launch such a test to be absolutely sure. > > I think this is basically a configuration issue - configuring a high e_io_c > for a device that can't handle that and then load it up with a lot of clients, > well, that'll not work out great. Sure, btw i'm going to also an idea about autotuning that e_io_c in that related thread where everybody is complaining about it > > 3. 
with aioworkers in documentation it would worth pointing out that > > `iotop` won't be good enough to show which PID is doing I/O anymore . > > I've often get question like this: who is taking the most of I/O right > > now because storage is fully saturated on multi-use system. Not sure > > it would require new view or not (pg_aios output seems to be not more > > like in-memory debug view that would be have to be sampled > > aggressively, and pg_statio_all_tables shows well table, but not PID > > -- same for pg_stat_io). IMHO if docs would be simple like > > "In order to understand which processes (PIDs) are issuing lots of > > IOs, please check pg_stat_activty for *IO/AioCompletion* waits events" > > it should be good enough for a start. > > pg_stat_get_backend_io() should allow to answer that, albeit with the usual > weakness of our stats system, namely that the user has to diff two snapshots > themselves. It probably also has the weakness of not showing results for > queries before they've finished, although I think that's something we should > be able to improve without too much trouble (not in this release though, I > suspect). > > I guess we could easily reference pg_stat_get_backend_io(), but a more > complete recipe isn't entirely trivial... I was trying to come out with something that could be added to docs, but the below thing is too ugly and as you stated the primary weakness is that query needs to finish before it is reflected:
Re: Improve verification of recovery_target_timeline GUC.
On Fri, Jan 24, 2025 at 01:36:45PM +, David Steele wrote:
> I attached the wrong patch. Oops!

Thanks for the new patch.

> +			timeline = strtoull(*newval, &endp, 0);
> +
> +			if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
> 			{
> 				GUC_check_errdetail("\"recovery_target_timeline\" is not a valid number.");
> 				return false;
> 			}

Why not using strtou64() here? That's more portable depending on SIZEOF_LONG (aka the 4 bytes on WIN32 vs 8 bytes for the rest of the world).

> +
> +			if (timeline < 2 || timeline > UINT_MAX)
> +			{

A recovery_target_timeline of 1 is a valid value, isn't it?

> +				GUC_check_errdetail(
> +					"\"recovery_target_timeline\" must be between 2 and 4,294,967,295.");
> +				return false;
> +			}

I would suggest to rewrite that as \"%s\" must be between %u and %u, which would be more generic for translations in case there is an overlap with something else.

> +$logfile = slurp_file($node_standby->logfile());
> +ok($logfile =~ qr/must be between 2 and 4,294,967,295/,
> +	'target timelime out of max range');

These sequences of tests could be improved:
- Let's use start locations for the logs slurped, reducing the scope of the logs scanned.
- Perhaps it would be better to rely on wait_for_log() to make sure that the expected log contents are generated without any kind of race condition?
--
Michael
Re: Confine vacuum skip logic to lazy_scan_skip
On Thu, Feb 13, 2025 at 8:30 PM Masahiko Sawada wrote: > > On Thu, Feb 13, 2025 at 4:55 PM Melanie Plageman > wrote: > > > > We don't want to hang onto that pin for a > > long time. But I can't move them to the bottom of the loop after we > > release the buffer because some of the code paths don't make it that > > far. I don't see a good way other than how I did it or special-casing > > block 0. What do you think? > > How about adding 'vacrel->scanned_pages > 0' to the if statement? > Which seems not odd to me. Cool. I've done this in attached v19. > Looking at the 0002 patch, it seems you reverted the change to the > following comment: > > /* >* Vacuum the Free Space Map to make newly-freed space visible on > - * upper-level FSM pages. Note we have not yet processed blkno. > + * upper-level FSM pages. Note we have not yet processed blkno+1. >*/ > > I feel that the previous change I saw in v17 is clearer: I've reverted to the old comment. Thanks > The rest looks good to me. Cool! I'll plan to push this tomorrow barring any objections. - Melanie From 25140d2966d2f87a4a410fc92147f2d88e587920 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 13 Feb 2025 16:49:10 -0500 Subject: [PATCH v19 1/3] Convert heap_vac_scan_next_block() boolean parameters to flags The read stream API only allows one piece of extra per block state to be passed back to the API user. lazy_scan_heap() needs to know whether or not a given block was all-visible in the visibility map and whether or not it was eagerly scanned. Convert these two pieces of information to flags so that they can be passed as a single argument to heap_vac_scan_next_block() (which will become the read stream API callback for heap phase I vacuuming). Reviewed-by: Masahiko Sawada Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/CAAKRu_bmx33jTqATP5GKNFYwAg02a9dDtk4U_ciEjgBHZSVkOQ%40mail.gmail.com --- src/backend/access/heap/vacuumlazy.c | 47 1 file changed, 27 insertions(+), 20 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 3df5b92afb8..c4d0f77ee2f 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -248,6 +248,13 @@ typedef enum */ #define EAGER_SCAN_REGION_SIZE 4096 +/* + * heap_vac_scan_next_block() sets these flags to communicate information + * about the block it read to the caller. 
+ */ +#define VAC_BLK_WAS_EAGER_SCANNED (1 << 0) +#define VAC_BLK_ALL_VISIBLE_ACCORDING_TO_VM (1 << 1) + typedef struct LVRelState { /* Target heap relation and its indexes */ @@ -417,8 +424,7 @@ static void lazy_scan_heap(LVRelState *vacrel); static void heap_vacuum_eager_scan_setup(LVRelState *vacrel, VacuumParams *params); static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno, - bool *all_visible_according_to_vm, - bool *was_eager_scanned); + uint8 *blk_info); static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis); static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, Page page, @@ -1171,8 +1177,7 @@ lazy_scan_heap(LVRelState *vacrel) BlockNumber rel_pages = vacrel->rel_pages, blkno, next_fsm_block_to_vacuum = 0; - bool all_visible_according_to_vm, -was_eager_scanned = false; + uint8 blk_info = 0; BlockNumber orig_eager_scan_success_limit = vacrel->eager_scan_remaining_successes; /* for logging */ Buffer vmbuffer = InvalidBuffer; @@ -1196,8 +1201,7 @@ lazy_scan_heap(LVRelState *vacrel) vacrel->next_unskippable_eager_scanned = false; vacrel->next_unskippable_vmbuffer = InvalidBuffer; - while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm, - &was_eager_scanned)) + while (heap_vac_scan_next_block(vacrel, &blkno, &blk_info)) { Buffer buf; Page page; @@ -1206,7 +1210,7 @@ lazy_scan_heap(LVRelState *vacrel) bool got_cleanup_lock = false; vacrel->scanned_pages++; - if (was_eager_scanned) + if (blk_info & VAC_BLK_WAS_EAGER_SCANNED) vacrel->eager_scanned_pages++; /* Report as block scanned, update error traceback information */ @@ -1331,7 +1335,8 @@ lazy_scan_heap(LVRelState *vacrel) */ if (got_cleanup_lock) lazy_scan_prune(vacrel, buf, blkno, page, - vmbuffer, all_visible_according_to_vm, + vmbuffer, + blk_info & VAC_BLK_ALL_VISIBLE_ACCORDING_TO_VM, &has_lpdead_items, &vm_page_frozen); /* @@ -1348,7 +1353,8 @@ lazy_scan_heap(LVRelState *vacrel) * exclude pages skipped due to cleanup lock contention from eager * freeze algorithm caps. */ - if (got_cleanup_lock && was_eager_scanned) + if (got_cleanup_lock && + (blk_info & VAC_BLK_WAS_EAGER_SCANNED)) { /* Aggressive vacuums do not eager scan. */ Assert(!vacrel->aggressive); @@ -1479,11 +1485,11 @@ lazy_scan_heap(LVRelState *vacrel) * and various thresholds to skip
Re: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
Thank you for your answer TOM. I'd like to have this option only for initial loading of huge amounts of data, where atomicity consistency is needed at all. Maybe an option on startup command just for initial import mode, would be nice. I had huge problems on server 3 weeks after a 6 TB migration from other DB. I think it's sad to rewrite all data twice. Le jeu. 13 févr. 2025 à 16:08, Tom Lane a écrit : > =?UTF-8?Q?S=C3=A9bastien?= writes: > > Implementation details: > > >- A new INSERT FROZEN option could be introduced, similar to COPY > FREEZE, > >allowing direct insertion of tuples in a frozen state. > >- This would likely require changes in heap storage logic to ensure > >tuples are written with a frozen XID at insert time. > >- Consideration should be given to transaction semantics and WAL > logging > >to ensure consistency and crash recovery integrity. > > That last is exactly why this won't happen. A frozen tuple would be > considered committed and visible the instant it appears in the table, > thus completely breaking both atomicity and integrity of the > transaction. > > There has been work going on recently to reduce the impact of freezing > massive amounts of data by spreading the work more effectively [1]. > I don't say that that particular commit has completely solved the > problem, but I think that continued effort in that direction is more > likely to yield usable results than what you're suggesting. > > BTW, this might or might not be usable in your particular workflow, > but: there have long been some optimizations for data load into a > table created in the same transaction. The idea there is that if the > transaction rolls back, the table will never have been visible to any > other transaction at all, so that maintaining atomicity/integrity of > its contents is moot. > > regards, tom lane > > [1] > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=052026c9b > -- Sébastien Caunes +33 6 7 229 229 7
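For reference, the existing way to get frozen rows at load time is COPY ... FREEZE, which is only accepted when the table was created or truncated in the same transaction, i.e. exactly the case Tom describes where atomicity is not at risk. A minimal sketch of that workflow (table and file names are hypothetical):

    BEGIN;
    -- The table must be created (or truncated) in this same transaction,
    -- otherwise COPY ... FREEZE is rejected.
    CREATE TABLE bulk_import (id bigint, payload text);
    COPY bulk_import FROM '/path/to/data.csv' WITH (FORMAT csv, FREEZE true);
    COMMIT;
    -- If the transaction aborts instead, the table was never visible to
    -- any other transaction, so writing the rows pre-frozen cannot break
    -- atomicity.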
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
Hello Aleksander! Hello Álvaro! Thank you for your attention to the patch. Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to me that it merely wastes space in SlruCtlData. On top of that I'm not 100% sure if all the supported platforms have C99 compilers with designated initializers support. Wouldn't it be simpler to pass the callback straight to BootStrapSlruPage()? I agree with your proposals. Moreover, similar solution is also implemented and exploited at Tantor's versions of the xid64 patch (https://www.postgresql.org/message-id/5ca56aed-dc69-4c3e-968f-a822ae0937b5%40gmail.com). The corrected patch version implementing these proposals is attached. Best regards, Evgeny Voropaev, Tantor Labs LLC. From a5a29cdbb37834bfcd23cd22db90a78a7db43d18 Mon Sep 17 00:00:00 2001 From: Evgeny Voropaev Date: Fri, 14 Feb 2025 14:03:01 +0800 Subject: [PATCH v2] Elimination of the repetitive code at the SLRU bootstrap functions. The functions bootstrapping SLRU pages used to have a lot of repetitive code. The new realized function BootStrapSlruPage has moved duplicating code into the single place and eliminated code repetitions. Author: Evgeny Voropaev --- src/backend/access/transam/clog.c | 27 + src/backend/access/transam/commit_ts.c | 24 +--- src/backend/access/transam/multixact.c | 54 ++ src/backend/access/transam/slru.c | 35 + src/backend/access/transam/subtrans.c | 28 ++--- src/include/access/slru.h | 1 + 6 files changed, 55 insertions(+), 114 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 0d556c00b8c..5ae684da6af 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -832,19 +832,7 @@ check_transaction_buffers(int *newval, void **extra, GucSource source) void BootStrapCLOG(void) { - int slotno; - LWLock *lock = SimpleLruGetBankLock(XactCtl, 0); - - LWLockAcquire(lock, LW_EXCLUSIVE); - - /* Create and zero the first page of the commit log */ - slotno = ZeroCLOGPage(0, false); - - /* Make sure it's written out */ - SimpleLruWritePage(XactCtl, slotno); - Assert(!XactCtl->shared->page_dirty[slotno]); - - LWLockRelease(lock); + BootStrapSlruPage(XactCtl, 0, ZeroCLOGPage); } /* @@ -1114,19 +1102,8 @@ clog_redo(XLogReaderState *record) if (info == CLOG_ZEROPAGE) { int64 pageno; - int slotno; - LWLock *lock; - memcpy(&pageno, XLogRecGetData(record), sizeof(pageno)); - - lock = SimpleLruGetBankLock(XactCtl, pageno); - LWLockAcquire(lock, LW_EXCLUSIVE); - - slotno = ZeroCLOGPage(pageno, false); - SimpleLruWritePage(XactCtl, slotno); - Assert(!XactCtl->shared->page_dirty[slotno]); - - LWLockRelease(lock); + BootStrapSlruPage(XactCtl, pageno, ZeroCLOGPage); } else if (info == CLOG_TRUNCATE) { diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 95049acd0b5..54600f1b69c 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -747,16 +747,7 @@ ActivateCommitTs(void) /* Create the current segment file, if necessary */ if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno)) - { - LWLock *lock = SimpleLruGetBankLock(CommitTsCtl, pageno); - int slotno; - - LWLockAcquire(lock, LW_EXCLUSIVE); - slotno = ZeroCommitTsPage(pageno, false); - SimpleLruWritePage(CommitTsCtl, slotno); - Assert(!CommitTsCtl->shared->page_dirty[slotno]); - LWLockRelease(lock); - } + BootStrapSlruPage(CommitTsCtl, pageno, ZeroCommitTsPage); /* Change the activation status in shared memory. 
*/ LWLockAcquire(CommitTsLock, LW_EXCLUSIVE); @@ -1023,19 +1014,8 @@ commit_ts_redo(XLogReaderState *record) if (info == COMMIT_TS_ZEROPAGE) { int64 pageno; - int slotno; - LWLock *lock; - memcpy(&pageno, XLogRecGetData(record), sizeof(pageno)); - - lock = SimpleLruGetBankLock(CommitTsCtl, pageno); - LWLockAcquire(lock, LW_EXCLUSIVE); - - slotno = ZeroCommitTsPage(pageno, false); - SimpleLruWritePage(CommitTsCtl, slotno); - Assert(!CommitTsCtl->shared->page_dirty[slotno]); - - LWLockRelease(lock); + BootStrapSlruPage(CommitTsCtl, pageno, ZeroCommitTsPage); } else if (info == COMMIT_TS_TRUNCATE) { diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 27ccdf9500f..30cc47021f7 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2033,32 +2033,8 @@ check_multixact_member_buffers(int *newval, void **extra, GucSource source) void BootStrapMultiXact(void) { - int slotno; - LWLock *lock; - - lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0); - LWLockAcquire(lock, LW_EXCLUSIVE); - - /* Create and zero the first page of the offsets log */ - slotno = ZeroMultiXactOffsetPage(0, false); - - /* Make sure it's written out */ - SimpleLruWritePage(MultiXactOffsetCtl, slotno); - As
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Thu, 13 Feb 2025 at 15:20, Shubham Khanna wrote: > > On Tue, Feb 11, 2025 at 9:56 PM Shlok Kyal wrote: > > > > On Tue, 11 Feb 2025 at 09:51, Shubham Khanna > > wrote: > > > > > > On Fri, Feb 7, 2025 at 7:46 AM Hayato Kuroda (Fujitsu) > > > wrote: > > > > > > > > Dear Shubham, > > > > > > > > Thanks for updating the patch. > > > > > > > > Previously you told that you had a plan to extend the patch to drop > > > > other replication > > > > objects [1], but I think it is not needed. pg_createsubscriber has > > > > already been > > > > able to drop the existing subscrisubscriptions in > > > > check_and_drop_existing_subscriptions(). > > > > As for the replication slot, I have told in [2], it would be created > > > > intentionally > > > > thus I feel it should not be dropped. > > > > Thus I regard the patch does not have concrete extending plan. > > > > > > > > Below part contains my review comment. > > > > > > > > 01. Option name > > > > > > > > Based on the above discussion, "--cleanup-publisher-objects" is not > > > > suitable because > > > > it won't drop replication slots. How about "--cleanup-publications"? > > > > > > > > > > I have changed the name of the option to > > > "--cleanup-existing-publications" > > > > > > > 02. usage() > > > > ``` > > > > + printf(_(" -C --cleanup-publisher-objects drop all > > > > publications on the logical replica\n")); > > > > ``` > > > > > > Fixed. > > > > > > > s/logical replica/subscriber > > > > > > > > 03. drop_all_publications > > > > ``` > > > > +/* Drops all existing logical replication publications from all > > > > subscriber > > > > + * databases > > > > + */ > > > > +static void > > > > ``` > > > > > > > > Initial line of the comment must be blank [3]. > > > > > > > > > > Removed this function. > > > > > > > 04. main > > > > ``` > > > > + {"cleanup-publisher-objects", no_argument, NULL, 'C'}, > > > > ``` > > > > > > > > Is there a reason why upper case is used? I feel lower one is enough. > > > > > > > > > > Fixed. > > > > > > > 05. main > > > > ``` > > > > + /* Drop publications from the subscriber if requested */ > > > > + if (opt.cleanup_publisher_objects) > > > > + drop_all_publications(dbinfo); > > > > ``` > > > > > > > > After considering more, I noticed that we have already called > > > > drop_publication() > > > > in the setup_subscriber(). Can we call drop_all_publications() there > > > > instead when > > > > -C is specified? > > > > > > > > > > I agree with you on this. I have removed the drop_all_publication() > > > and added the "--cleanup-existing-publications" option to the > > > drop_publication() > > > > > > > 06. 040_pg_createsubscriber.pl > > > > > > > > ``` > > > > +$node_s->start; > > > > +# Create publications to test it's removal > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > TABLES;"); > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > TABLES;"); > > > > + > > > > +# Verify the existing publications > > > > +my $pub_count_before = > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > +is($pub_count_before, '2', > > > > + 'two publications created before --cleanup-publisher-objects is > > > > run'); > > > > + > > > > +$node_s->stop; > > > > ``` > > > > > > > > I feel it requires unnecessary startup and shutdown. IIUC, creating > > > > publications and check > > > > counts can be before stopping the node_s, around line 331. > > > > > > > > > > Fixed. > > > > > > > 07. 
040_pg_createsubscriber.pl > > > > > > > > ``` > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > TABLES;"); > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > TABLES;"); > > > > + > > > > +# Verify the existing publications > > > > +my $pub_count_before = > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > +is($pub_count_before, '2', > > > > + 'two publications created before --cleanup-publisher-objects is > > > > run'); > > > > + > > > > ``` > > > > > > > > Also, there is a possibility that CREATE PUBLICATION on node_p is not > > > > replicated yet > > > > when SELECT COUNT(*) is executed. Please wait for the replay. > > > > > > > > [1]: > > > > https://www.postgresql.org/message-id/CAHv8RjL4OvoYafofTb_U_JD5HuyoNowBoGpMfnEbhDSENA74Kg%40mail.gmail.com > > > > [2]: > > > > https://www.postgresql.org/message-id/OSCPR01MB1496664FDC38DA40A441F449FF5EE2%40OSCPR01MB14966.jpnprd01.prod.outlook.com > > > > [3]: https://www.postgresql.org/docs/devel/source-format.html > > > > > > > > > > Fixed. > > > > > > The attached Patch contains the suggested changes. > > > > > > > Hi Shubham, > > > > I have some comments for v4 patch: > > 1. I think we should update the comment for the function > > 'drop_publication'. As its usage is changed with this patch > > Currently it states: > > /* > > * Remove publication if it couldn't f
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
Hi Evgeny,

> The functions, bootstrapping SLRU pages, such as BootStrapMultiXact,
> BootStrapCLOG, ActivateCommitTs, multixact_redo and others, have a lot
> of repetitive code.
>
> A new proposed function BootStrapSlruPage moves a duplicating code into
> the single place. Additionally, a new member ZeroFunc is implemented in
> the SlruCtlData structure. The ZeroFunc keeps a pointer to a function
> proper for nullifying SLRU pages of a type defined by a corresponding
> SlruCtlData object.
>
> Applying proposed modifications can simplify maintenance and future
> development of Postgres code in the realm of bootstrapping SLRU.

Thanks for the patch.

```
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -102,14 +102,6 @@ TransactionIdToPage(TransactionId xid)
  */
 #define THRESHOLD_SUBTRANS_CLOG_OPT	5
 
-/*
- * Link to shared-memory data structures for CLOG control
- */
-static SlruCtlData XactCtlData;
-
-#define XactCtl (&XactCtlData)
-
-
 static int	ZeroCLOGPage(int64 pageno, bool writeXlog);
 static bool CLOGPagePrecedes(int64 page1, int64 page2);
 static void WriteZeroPageXlogRec(int64 pageno);
@@ -130,6 +122,14 @@ static void TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
 											   XLogRecPtr lsn, int64 pageno);
 
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData XactCtlData = { .ZeroPage = ZeroCLOGPage };
+
+#define XactCtl (&XactCtlData)
+
+
```

Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to me that it merely wastes space in SlruCtlData. On top of that I'm not 100% sure if all the supported platforms have C99 compilers with designated initializers support. Wouldn't it be simpler to pass the callback straight to BootStrapSlruPage()? After changing the patch accordingly it shouldn't move XactCtl and others. This is just an irrelevant change.

--
Best regards,
Aleksander Alekseev
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/13/25 01:40, Melanie Plageman wrote: > On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra wrote: >> >> For the nvme RAID (device: raid-nvme), it's looks almost exactly the >> same, except that with parallel query (page 27) there's a clear area of >> regression with eic=1 (look for "column" of red cells). That's a bit >> unfortunate, because eic=1 is the default value. > > So, I feel pretty confident after even more analysis today with Thomas > Munro that all of the parallel cases with effective_io_concurrency == > 1 are because master cheats effective_io_concurrency. Therefore, I'm > not too worried about those for now. We probably will have to increase > effective_io_concurrency before merging this patch, though. > > Thomas mentioned this to me off-list, and I think he's right. We > likely need to rethink the way parallel bitmap heap scan workers get > block assignments for reading and prefetching to make it more similar > to parallel sequential scan. The workers should probably get > assignments of a range of blocks. On master, each worker does end up > issuing reads/fadvises for a bunch of blocks in a row -- even though > that isn't the intent of the parallel bitmap table scan > implementation. We are losing some of that with the patch -- but only > because it is behaving as you would expect given the implementation > and design. I don't consider that a blocker, though. > Agreed. If this is due to master doing something wrong (and either prefetching more, or failing to prefetch), that's not the fault of this patch. People might perceive this as a regression, but increasing the default eic value would fix that. BTW I recall we ran into issues with prefetching and parallel workers about a year ago [1]. Is this the same issue, or something new? [1] https://www.postgresql.org/message-id/20240315211449.en2jcmdqxv5o6tlz%40alap3.anarazel.de >> For the SATA SSD RAID (device: raid-sata), it's similar - a couple >> regressions for eic=0, just like for NVMe. But then there's also a >> couple regressions for higher eic values, usually for queries with >> selective conditions. > > I started trying to reproduce the regressions you saw on your SATA > devices for higher eic values. I was focusing just on serial bitmap > heap scan -- since that doesn't have the problems mentioned above. I > don't have a SATA drive, but I used dmsetup to add a 10ms delay to my > nvme drive. Unfortunately, that did the opposite of reproduce the > issue. With dm delay 10ms, the patch is 5 times faster than master. > I'm not quite sure at which point does dm-setup add the delay. Does it affect the fadvise call too, or just the I/O if the page is not already in page cache? regards -- Tomas Vondra
Re: NOT ENFORCED constraint feature
On 2025-Feb-13, Ashutosh Bapat wrote:

> > So considering that, I think a three-state system makes more sense.
> > Something like:
> >
> > 1) NOT ENFORCED -- no data is checked
> > 2) NOT VALID -- existing data is unchecked, new data is checked
> > 3) ENFORCED -- all data is checked
> >
> > Transitions:
> >
> > (1) - [ ALTER TABLE ... ALTER CONSTRAINT ... NOT VALID ] -> (2)
>
> Per your notation, this means the the constraint is not enforced but
> new data is checked - that seems a contradiction, how would we check
> the data when the constraint is not being enforced. Or do you suggest
> that we convert a NOT ENFORCED constraint to ENFORCED as a result of
> converting it to NOT VALID?

I agree this one is a little weird. For this I would have the command be

  ALTER TABLE ... ALTER CONSTRAINT ... ENFORCED NOT VALID

this way it's explicit that what we want is to flip the ENFORCED bit while leaving NOT VALID as-is.

> > (2) - [ ALTER TABLE ... VALIDATE CONSTRAINT ... ] -> (3)
>
> As a result of this a not enforced constraint would turn into an
> enforced constraint. The user might have intended to just validate the
> data but not enforce it to avoid paying price for the checks on new
> data.

I'm not sure there's a use case for validating existing data without starting to enforce the constraint. The data can become invalid immediately after you've run the command, so why bother?

> I think, what you intend to say is clearer with 4 state system {NE, E}
> * {NV, V} = {(NE, NV), (NE, V), (E, NV), (E, V)} where (NE, V) is
> unreachable. Let's name them S1, S2, S3, S4 respectively.
[...]
> Notice that there are no edges to and from S2.

So why list it as a possible state?

--
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Las mujeres son como hondas: mientras más resistencia tienen,
más lejos puedes llegar con ellas" (Jonas Nightingale, Leap of Faith)
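For readers following along, the transitions being debated map onto commands like the following. This is illustrative only: the table, column and constraint names are made up, and the ENFORCED / NOT ENFORCED spelling for ALTER CONSTRAINT is still a proposal on this thread, not released behavior.

    -- Declare a constraint that checks nothing at all (state 1).
    ALTER TABLE orders ADD CONSTRAINT amount_positive
        CHECK (amount > 0) NOT ENFORCED;

    -- Álvaro's suggested spelling for (1) -> (2): start checking new rows,
    -- leave the existing data unvalidated.
    ALTER TABLE orders ALTER CONSTRAINT amount_positive ENFORCED NOT VALID;

    -- (2) -> (3): scan the existing rows and mark the constraint valid.
    ALTER TABLE orders VALIDATE CONSTRAINT amount_positive;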
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/13/25 05:15, Thomas Munro wrote: > On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman > wrote: >> On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra wrote: >>> For the nvme RAID (device: raid-nvme), it's looks almost exactly the >>> same, except that with parallel query (page 27) there's a clear area of >>> regression with eic=1 (look for "column" of red cells). That's a bit >>> unfortunate, because eic=1 is the default value. >> >> So, I feel pretty confident after even more analysis today with Thomas >> Munro that all of the parallel cases with effective_io_concurrency == >> 1 are because master cheats effective_io_concurrency. Therefore, I'm >> not too worried about those for now. We probably will have to increase >> effective_io_concurrency before merging this patch, though. > > Yeah. Not only did we see it issuing up to 20 or so unauthorised > overlapping POSIX_FADV_WILLNEED calls, but in the case we studied they > didn't really do anything at all because it was *postfetching* (!), ie > issuing advice for blocks it had already read. The I/O was sequential > (reading all or most blocks in order), so the misdirected advice at > least kept out of the kernel's way, and it did some really effective > readahead. Meanwhile the stream version played by the rules and > respected eic=1, staying only one block ahead. We know that advice for > sequential blocks blocks readahead, so that seems to explain the red > for that parameter/test permutation. (Hypothesis based on some > quick reading: I guess kernel pages faulted in that way don't get > marked with the internal "readahead was here" flag, which would > normally cause the following pread() to trigger more asynchronous > readahead with ever larger physical size). But that code > in master is also *trying* to do exactly what we're doing, it's just > failing because of some combination of bugs in the parallel scan code > (?). It would be absurd to call it the winner: we're setting out > with the same goal, but for that particular set of parameters and test > setup, apparently two (?) bugs make a right. Melanie mentioned that > some other losing cases also involved unauthorised amounts of > overlapping advice, except there it was random and beneficial and the > iterators didn't get out of sync with each other, so again it beat us > in a few red patches, seemingly with a different accidental behaviour. > Meanwhile we obediently trundled along with just 1 concurrent I/O or > whatever, and lost. > Makes sense, likely explains the differences. And I agree we should not hold this against this patch series, it's master that's doing something wrong. And in most of these cases we'd not even do bitmap scan anyway, I think, it only happens because we forced the plan. BTW how are you investigating those I/O requests? strace? iosnoop? perf? Something else? > On top of that, read_stream.c is also *trying* to avoid issuing advice > for access patterns that look sequential, and *trying* to do I/O > combining, which all works OK in serial BHS, but it breaks down in > parallel because: > >> Thomas mentioned this to me off-list, and I think he's right. We >> likely need to rethink the way parallel bitmap heap scan workers get >> block assignments for reading and prefetching to make it more similar >> to parallel sequential scan. The workers should probably get >> assignments of a range of blocks. 
On master, each worker does end up >> issuing reads/fadvises for a bunch of blocks in a row -- even though >> that isn't the intent of the parallel bitmap table scan >> implementation. We are losing some of that with the patch -- but only >> because it is behaving as you would expect given the implementation >> and design. I don't consider that a blocker, though. > > +/* > + * The maximum number of tuples per page is not large (typically 256 with > + * 8K pages, or 1024 with 32K pages). So there's not much point in > making > + * the per-page bitmaps variable size. We just legislate that the size > is > + * this: > + */ > +OffsetNumber offsets[MaxHeapTuplesPerPage]; > } TBMIterateResult; > > Seems to be 291? So sizeof(TBMIterateResult) must be somewhere > around 588 bytes? There's one of those for every buffer, and we'll > support queues of up to effective_io_concurrency (1-1000) * > io_combine_limit (1-128?), which would require ~75MB of > per-buffer-data with maximum settings That's a lot of memory to have > to touch as you whizz around the circular queue even when you don't > really need it. With more typical numbers like 10 * 32 it's ~90kB, > which I guess is OK. I wonder if it would be worth trying to put a > small object in there instead that could be expanded to the results > later, cf f6bef362 for TIDStore, but I'm not sure if it's a blocker, > just an observation. Not sure I follow. Doesn't f6bef362 do quite the opposite, i.e. removing the repalloc() calls and replacing that with a local on-stack buffer sized for MaxOffsetNum
Re: pg_stat_statements and "IN" conditions
On 2025-Feb-13, Dmitry Dolgov wrote: > Here is how it looks like (posting only the first patch, since we > concentrate on it). This version handles just a little more to cover > simpe cases like the implicit convertion above. The GUC is also moved > out from pgss and renamed to query_id_merge_values. On top I've added > more tests showing the impact, as well as sometimes awkward looking > normalized query I was talking about. I'm going to experiment how to > iron out the latter. Thanks! It's looking better. Some small comments -- please add the new GUC to postgresql.conf.sample. Also, how wed are you to "query_id_merge_values" as a name? It's not in any way obvious that this is about values in arrays. How about query_id_squash_arrays? Or are you thinking in having values in other types of structures squashed as well, and that this first patch does it for arrays only but you want the GUC to also control some future feature? (I think I prefer "squash" here as a verb to "merge"). > +static bool > +IsMergeableConst(Node *element) > +{ > + if (IsA(element, RelabelType)) > + element = (Node *) ((RelabelType *) element)->arg; > + > + if (IsA(element, CoerceViaIO)) > + element = (Node *) ((CoerceViaIO *) element)->arg; > + > + if(IsA(element, FuncExpr)) > + { > + FuncExpr *func = (FuncExpr *) element; > + char provolatile = func_volatile(func->funcid); I think calling func_volatile potentially once per array element is not good; this might cause dozens/thousands of identical syscache lookups. Maybe we can pass an initially NIL list from IsMergeableConstList (as List **), which IsMergeableConst fills with OIDs of functions that have been checked and found acceptable. Then the second time around we search the list first and only do func_volatile() after not finding a match. Another thing I didn't quite understand is why you did this rather baroque-looking list scan: > +static bool > +IsMergeableConstList(List *elements, Node **firstExpr, Node **lastExpr) > +{ > + ListCell *temp; > + Node *firstElem = NULL; > + > + if (elements == NIL) > + return false; > + > + if (!query_id_merge_values) > + { > + /* Merging is disabled, process everything one by one */ > + return false; > + } > + > + firstElem = linitial(elements); > + > + /* > + * If the first expression is a constant, verify if the following > elements > + * are constants as well. If yes, the list is eligible for merging. > + */ > + if (IsMergeableConst(firstElem)) > + { > + foreach(temp, elements) > + { > + Node *element = lfirst(temp); > + > + if (!IsMergeableConst(element)) > + return false; > + } > + > + *firstExpr = firstElem; > + *lastExpr = llast(elements); > + return true; > + } Why not just scan the list in the straightforward way, that is foreach(temp, elements) { if (!IsMergeableConst(lfirst(temp))) return false; } *firstExpr = linitial(elements); *lastExpr = llast(elements); return true; Is there something being optimized here specifically for the first element? I don't see it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
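A rough SQL illustration of the intended effect of the patch under review, using the GUC name as currently posted (the table is made up, and the exact normalized query text is still being ironed out per the thread):

SET query_id_merge_values = on;
-- These two differ only in the number of constants in the IN list; with the
-- patch they are meant to collapse into a single pg_stat_statements entry
-- rather than one entry per list length.
SELECT * FROM orders WHERE id IN (1, 2, 3);
SELECT * FROM orders WHERE id IN (4, 5, 6, 7, 8, 9, 10);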
Re: Using Expanded Objects other than Arrays from plpgsql
On Tue, 11 Feb 2025 at 21:50, Tom Lane wrote: > > Andrey Borodin writes: > > On 7 Feb 2025, at 02:05, Tom Lane wrote: > >> Do you have any further comments on this patch? > > > No, all steps of the patch set look good to me. > > Pushed then. Thanks for reviewing! Thanks to Michel and everyone working on the patch! Regards, Pavel Borisov
Re: Adding NetBSD and OpenBSD to Postgres CI
Hi, On Wed, 12 Feb 2025 at 17:49, Andres Freund wrote: > I finally pushed this. The meson fix backpatched to 16. > > I did some very minor polishing, reordering the OS lists to stay alphabetical, > instead of adding netbsd/openbsd somewhere to the front of lists. Thanks! > Obviously not your fault, but I do think it's pretty crazy that with the same > available resources, netbsd and openbsd take considerably longer than linux > and freebsd, which both do a lot more (linux tests 32 and 64 bit with ubsan > with autoconf and asan with meson, freebsd tests a bunch of debugging > options). I wonder what is going wrong. I suspect we might be waiting for > the filesystem a lot, according to cirrus-ci's CPU usage graph we're not CPU > bound during the test phase. Or maybe they just scale badly. Yes, I could not find the reason why either. > Any chance you're interested in rebasing and expanding > https://postgr.es/m/20240413021221.hg53rvqlvldqh57i%40awork3.anarazel.de > > It'd be nice if we could enable these tasks on cfbot, where we bring our own > compute, while leaving them on manual for everyone else. Sure, I will work on this. Thank you for mentioning it. -- Regards, Nazir Bilal Yavuz Microsoft
Re: Virtual generated columns
On Tue, Feb 11, 2025 at 10:34 AM Richard Guo wrote: > > On Mon, Feb 10, 2025 at 1:16 PM Zhang Mingli wrote: > > I believe virtual columns should behave like stored columns, except they > > don't actually use storage. > > Virtual columns are computed when the table is read, and they should adhere > > to the same rules of join semantics. > > I agree with Richard, the result seems incorrect. The right outcome should > > be: > > gpadmin=# SELECT t2.a, t2.b FROM t t1 LEFT JOIN t t2 ON FALSE; > > a | b > > --+-- > > NULL | NULL > > NULL | NULL > > (2 rows) > > Yeah, I also feel that the virtual generated columns should adhere to > outer join semantics, rather than being unconditionally replaced by > the generation expressions. But maybe I'm wrong. > > If that's the case, this incorrect-result issue isn't limited to > constant expressions; it could also occur with non-strict ones. > > CREATE TABLE t (a int, b int GENERATED ALWAYS AS (COALESCE(a, 100))); > INSERT INTO t VALUES (1); > INSERT INTO t VALUES (2); > > # SELECT t2.a, t2.b FROM t t1 LEFT JOIN t t2 ON FALSE; > a | b > ---+- >| 100 >| 100 > (2 rows) > Now I agree with you. I think the following two should return the same result. SELECT t2.a, t2.b FROM t t1 LEFT JOIN t t2 ON FALSE; SELECT t2.a, t2.b FROM t t1 LEFT JOIN (select * from t) t2 ON FALSE; atatch refined patch solves the failure to copy the nullingrel bits for the virtual generated columns. in ReplaceVarsFromTargetList_callback. I tried to just use add_nulling_relids, but failed, so I did the similar thing as SetVarReturningType. I didn't solve the out join semantic issue. i am wondering, can we do the virtual generated column expansion in the rewrite stage as is, and wrap the expressions in PHVs if the virtual generated columns come from the nullable side of an outer join. I am looking at pullup_replace_vars_callback, but it seems not very helpful to us. From 2f9bd9ec737227b1b2dc52a4800d006bfa228003 Mon Sep 17 00:00:00 2001 From: jian he Date: Thu, 13 Feb 2025 11:28:57 +0800 Subject: [PATCH v2 1/1] fix expand virtual generated column Var node varnullingrels field To expand the virtual generated column Var node, we need to copy the fields of the original virtual generated column Var node, including varnullingrels, varlevelsup, varreturningtype, and varattno, to the newly expanded Var node. ReplaceVarsFromTargetList_callback didn't taken care of varnullingrels, this patch fix this issue. discussion: https://postgr.es/m/75eb1a6f-d59f-42e6-8a78-124ee808c...@gmail.com --- src/backend/rewrite/rewriteManip.c| 62 +++ src/include/rewrite/rewriteManip.h| 3 + .../regress/expected/generated_virtual.out| 45 ++ src/test/regress/sql/generated_virtual.sql| 11 src/tools/pgindent/typedefs.list | 1 + 5 files changed, 122 insertions(+) diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c index a115b217c91..69701ce6ecd 100644 --- a/src/backend/rewrite/rewriteManip.c +++ b/src/backend/rewrite/rewriteManip.c @@ -884,6 +884,64 @@ IncrementVarSublevelsUp_rtable(List *rtable, int delta_sublevels_up, QTW_EXAMINE_RTES_BEFORE); } +/* + * SetVarNullingrels - adjust Var nodes for a specified varnullingrels. + * + * Find all Var nodes referring to the specified varno in the given + * expression and set their varnullingrels to the specified value. 
+ */ +typedef struct +{ + int sublevels_up; + int varno; + Bitmapset *varnullingrels; +} SetVarNullingrels_context; + +static bool +SetVarNullingrels_walker(Node *node, + SetVarNullingrels_context *context) +{ + if (node == NULL) + return false; + + if (IsA(node, Var)) + { + Var *var = (Var *) node; + + if (var->varlevelsup == context->sublevels_up && + var->varno == context->varno) + var->varnullingrels = bms_union(context->varnullingrels, + var->varnullingrels); + + return false; + } + + if (IsA(node, Query)) + { + /* Recurse into subselects */ + bool result; + + context->sublevels_up++; + result = query_tree_walker((Query *) node, SetVarNullingrels_walker, + context, 0); + context->sublevels_up--; + return result; + } + return expression_tree_walker(node, SetVarNullingrels_walker, context); +} +void +SetVarNullingrels(Node *node, int sublevels_up, int varno, Bitmapset *varnullingrels) +{ + SetVarNullingrels_context context; + + context.sublevels_up = sublevels_up; + context.varno = varno; + context.varnullingrels = varnullingrels; + + /* Expect to start with an expression */ + SetVarNullingrels_walker(node, &context); +} + /* * SetVarReturningType - adjust Var nodes for a specified varreturningtype. * @@ -1832,6 +1890,10 @@ ReplaceVarsFromTargetList_callback(Var *var, if (var->varlevelsup > 0) IncrementVarSublevelsUp((Node *) newnode, var->varlevelsup, 0); + if (var->varnullingrels != NULL) + SetVarNullingrels((Node *) newno
Re: Get rid of WALBufMappingLock
13.02.2025 12:34, Alexander Korotkov пишет: > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov wrote: >> 08.02.2025 13:07, Alexander Korotkov пишет: >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov >>> wrote: Good, thank you. I think 0001 patch is generally good, but needs some further polishing, e.g. more comments explaining how does it work. >> >> I tried to add more comments. I'm not good at, so recommendations are >> welcome. >> >>> Two things comes to my mind worth rechecking about 0001. >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be >>> sensitive to that. If so, I would propose to explicitly comment that >>> and add corresponding asserts. >> >> They're certainly page aligned, since they are page borders. >> I added assert on alignment of InitializeReserved for the sanity. >> >>> 2) Check if there are concurrency issues between >>> AdvanceXLInsertBuffer() and switching to the new WAL file. >> >> There are no issues: >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses >> GetXLogBuffer to zero-out WAL page. >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, >> so switching wal is not concurrent. (Although, there is no need in this >> exclusiveness, imho.) > > Good, thank you. I've also revised commit message and comments. > > But I see another issue with this patch. In the worst case, we do > XLogWrite() by ourselves, and it could potentially could error out. > Without patch, that would cause WALBufMappingLock be released and > XLogCtl->InitializedUpTo not advanced. With the patch, that would > cause other processes infinitely waiting till we finish the > initialization. > > Possible solution would be to save position of the page to be > initialized, and set it back to XLogCtl->InitializeReserved on error > (everywhere we do LWLockReleaseAll()). We also must check that on > error we only set XLogCtl->InitializeReserved to the past, because > there could be multiple concurrent failures. Also we need to > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. The single place where AdvanceXLInsertBuffer is called outside of critical section is in XLogBackgroundFlush. All other call stacks will issue server restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. XLogBackgroundFlush explicitely avoids writing buffers by passing opportunistic=true parameter. Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer since server will shutdown/restart. Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call to XLogWrite here? --- regards Yura Sokolov aka funny-falcon
Re: Restrict publishing of partitioned table with a foreign table as partition
On Thu, 13 Feb 2025 at 11:25, vignesh C wrote: > > On Tue, 11 Feb 2025 at 16:55, Shlok Kyal wrote: > > I have handled the above cases and added tests for the same. > > There is a concurrency issue with the patch: > +check_partrel_has_foreign_table(Form_pg_class relform) > +{ > + boolhas_foreign_tbl = false; > + > + if (relform->relkind == RELKIND_PARTITIONED_TABLE) > + { > + List *relids = NIL; > + > + relids = find_all_inheritors(relform->oid, NoLock, NULL); > + > + foreach_oid(relid, relids) > + { > + Relationrel = table_open(relid, > AccessShareLock); > + > + if (RelationGetForm(rel)->relkind == > RELKIND_FOREIGN_TABLE) > + has_foreign_tbl = true; > + > + table_close(rel, AccessShareLock); > + > + if (has_foreign_tbl) > + break; > + } > + } > + > + return has_foreign_tbl; > +} > > In an ideal scenario, the creation of a foreign table should fail if > there is an associated publication, as demonstrated below: > CREATE TABLE t(id int) PARTITION BY RANGE(id); > CREATE TABLE part1 PARTITION OF t FOR VALUES FROM (0) TO (5); > CREATE TABLE part2 PARTITION OF t FOR VALUES FROM (5) TO (15) > PARTITION BY RANGE(id); > CREATE PUBLICATION pub1 FOR TABLE t with (publish_via_partition_root = true); > > postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES > FROM (10) TO (15) SERVER fdw; > ERROR: cannot create table foreign partition "part22" > DETAIL: partition table "part2" is published with option > publish_via_partition_root > > Consider a scenario where the publication is being created and after > the check_partrel_has_foreign_table execution is done, concurrently > creation of foreign table is executed, then the creation will be > successful. > postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES > FROM (10) TO (15) SERVER fdw; > CREATE FOREIGN TABLE > > I felt the problem here is that you have released the lock: > + if (RelationGetForm(rel)->relkind == > RELKIND_FOREIGN_TABLE) > + has_foreign_tbl = true; > + > + table_close(rel, AccessShareLock); > > We should retain the lock to fix this issue: > + if (RelationGetForm(rel)->relkind == > RELKIND_FOREIGN_TABLE) > + has_foreign_tbl = true; > + > + table_close(rel, NoLock); > Hi Vignesh, I have fixed the issue. Attached the updated v6 patch. Thanks and Regards, Shlok Kyal v6-0001-Restrict-publishing-of-partitioned-table-with-for.patch Description: Binary data
Re: Restrict copying of invalidated replication slots
On Tue, 4 Feb 2025 at 15:27, Shlok Kyal wrote: > > Hi, > > Currently, we can copy an invalidated slot using the function > 'pg_copy_logical_replication_slot'. As per the suggestion in the > thread [1], we should prohibit copying of such slots. > > I have created a patch to address the issue. This patch does not fix all the copy_replication_slot scenarios completely, there is a very corner concurrency case where an invalidated slot still gets copied: + /* We should not copy invalidated replication slots */ + if (src_isinvalidated) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), +errmsg("cannot copy an invalidated replication slot"))); Consider the following scenario: step 1) Set up streaming replication between the primary and standby nodes. step 2) Create a logical replication slot (test1) on the standby node. step 3) Have a breakpoint in InvalidatePossiblyObsoleteSlot if cause is RS_INVAL_WAL_LEVEL, no need to hold other invalidation causes or add a sleep in InvalidatePossiblyObsoleteSlot function like below: if (cause == RS_INVAL_WAL_LEVEL) { while (bsleep) sleep(1); } step 4) Reduce wal_level on the primary to replica and restart the primary node. step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2'); -- It will wait till the lock held by InvalidatePossiblyObsoleteSlot is released while trying to create a slot. step 6) Increase wal_level back to logical on the primary node and restart the primary. step 7) Now allow the invalidation to happen (continue the breakpoint held at step 3), the replication control lock will be released and the invalidated slot will be copied After this: postgres=# SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2'); ?column? -- copy (1 row) -- The invalidated slot (test1) is copied successfully: postgres=# select * from pg_replication_slots ; slot_name |plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phas e | inactive_since | conflicting | invalidation_reason | failover | synced ---+---+---++--+---+++--+--+-+-++---+- --+--+-++--+ test1 | test_decoding | logical | 5 | postgres | f | f || | 745 | 0/4029060 | 0/4029098 | lost | | f | 2025-02-13 15:26:54.666725+05:30 | t | wal_level_insufficient | f| f test2 | test_decoding | logical | 5 | postgres | f | f || | 745 | 0/4029060 | 0/4029098 | reserved | | f | 2025-02-13 15:30:30.477836+05:30 | f | | f| f (2 rows) -- A subsequent attempt to decode changes from the invalidated slot (test2) fails: postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL); WARNING: detected write past chunk end in TXN 0x5e77e6c6f300 ERROR: logical decoding on standby requires "wal_level" >= "logical" on the primary -- Alternatively, the following error may occur: postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL); WARNING: detected write past chunk end in TXN 0x582d1b2d6ef0 data BEGIN 744 COMMIT 744 (2 rows) This is an edge case that can occur under specific conditions involving replication slot invalidation when there is a huge lag between primary and standby. There might be a similar concurrency case for wal_removed too. Regards, Vignesh
Re: Small memory fixes for pg_createsubcriber
On Wed, Feb 12, 2025 at 18:17, Tom Lane wrote: > Ranier Vilela writes: > > Coverity has some reports about pg_createsubcriber. > > > CID 1591322: (#1 of 1): Resource leak (RESOURCE_LEAK) > > 10. leaked_storage: Variable dbname going out of scope leaks the storage > it > > points to. > > FTR, the security team's Coverity instance also complained about that. > I was planning to fix it after the release freeze lifted, but you > beat me to it, which is fine. Our report turned up a couple other > things that I just pushed fixes for. > Yeah, I see the commits, thanks for that. I still have some reports that I could post that Coverity thinks are bugs. They are not, but I think it is worth the effort to fix them because the code is confusing. I think it would improve readability and future maintainability. > > (It seems like Coverity must've updated their rules recently, > because we also got a bunch of false-positive reports that were > not there before, mostly in pre-existing code.) > I believe they are trying to innovate at some point. Many of these false positives come from a risky coding style; I am much more cautious in my analyses. Best regards, Ranier Vilela
Re: [PATCH] Optionally record Plan IDs to track plan changes for a query
On 2025-Feb-12, Sami Imseih wrote: > Greg S. Mullane wrote: > > > I agree fingerprint is the right final word. But "jumble" conveys > > the *process* better than "fingerprinting". I view it as jumbling > > produces an object that can be fingerprinted. > > hmm, "jumble" describes something that is scrambled > or not in order, such as the 64-bit hash produced. It > sounds like the final product. I don't understand why we would change any naming here at all. I think you should be looking at a much broader consensus and plus-ones that a renaming is needed. -1 from me. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "No renuncies a nada. No te aferres a nada."
Re: Non-text mode for pg_dumpall
Thanks Jian. On Wed, 12 Feb 2025 at 12:45, jian he wrote: > > On Wed, Feb 12, 2025 at 1:17 AM Mahendra Singh Thalor > wrote: > > > > > > > > There are some tests per https://commitfest.postgresql.org/52/5495, I > > > will check it later. > > hi. > the cfbot failure is related to function _tocEntryRequired > > if (strcmp(te->desc, "DATABASE") == 0 || > strcmp(te->desc, "DATABASE PROPERTIES") == 0) > { > - if (ropt->createDB) > + if (ropt->createDB || AH->format != archNull) > return REQ_SCHEMA; > else > return 0; > > for restoring multiple databases: > in v16 implementation: pg_restore even if you do not specify --create, > it actually did what pg_restore --create option does. > > if there are multiple databases in the archive: > to make the pg_restore --file output is usable, the output file need > have \connect and CREATE DATABASE > command. that is exactly what --create option would do. > pg_restore --file behavior need align with pg_restore --dbname. > therefore pg_restore restoring multiple databases will use --create option. > > > we can either error out (pg_fatal) saying > restoring multiple databases requires the pg_restore --create option. > Or we can add a pg_log_info saying > pg_restore --create option will be set to true while restoring > multiple databases. In my earlier version, I was giving an error if --create option was not specified. I think it will be good and more preferable if we give an error without the --create option if dump was taken from pg_dumpall. Even though there is a single database in the dump of pg_dumpall, it is possible that a particular database hasn't been created. Ex: -d postgres and we have db1 dump in file. In this case, we have only one database dump but this database has not been created. If the user wants to restore a single database, then the user should use a single database dump file. Forcefully adding --create option is not a good idea, instead we will give an error to the user and let him correct the inputs. Apart from the above handling, I fixed all the pending review comments in this patch and made some more changes. Here, I am attaching an updated patch for review and testing. -- Thanks and Regards Mahendra Singh Thalor EnterpriseDB: http://www.enterprisedb.com v17_pg_dumpall-with-non-text_format-13th_feb.patch Description: Binary data
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
On 2025-Feb-13, Aleksander Alekseev wrote: > Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to > me that it merely wastes space in SlruCtlData. On top of that I'm not > 100% sure if all the supported platforms have C99 compilers with > designated initializers support. They do -- we use them quite extensively nowadays. > Wouldn't it be simpler to pass the callback straight to > BootStrapSlruPage()? Yeah, maybe this would be easier. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Just treat us the way you want to be treated + some extra allowance for ignorance."(Michael Brusser)
Re: SQL:2011 application time
On 23.01.25 16:40, Peter Eisentraut wrote: I think my interpretation of what RESTRICT should do is different. The clause "Execution of referential actions" in the SQL standard only talks about referenced and referencing columns, not periods. So this would mean you can change the period columns all you want (as long as they maintain referential integrity). So it would be like the NO ACTION case. But you can't change any of the non-period columns on the primary key if they are referenced by any referencing columns, even if the respective periods are disjoint. Maybe this makes sense, or maybe this is a mistake (neglected to update this part when periods were introduced?). But in any case, I can't get from this to what the patch does. When I apply the tests in the patch without the code changes, what I would intuitively like are more errors than the starting state, but your patch results in fewer errors. After staring at this a bit more, I think my interpretation above was not correct. This seems better: The clause "Execution of referential actions" in the SQL standard only talks about referenced and referencing columns, not periods. The RESTRICT error is raised when a "matching row" exists in the referencing table. The "matching row" is determined purely by looking at the "normal" columns of the key, not the period columns. So in our implementation in ri_restrict(), ISTM, we just need to ignore the period/range columns when doing the RESTRICT check. Attached is a quick patch that demonstrates how this could work. I think the semantics of this are right and make sense. From 7d2ae226da67838825120f759a912362cf91e065 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 13 Feb 2025 13:15:23 +0100 Subject: [PATCH] WIP: Fix RESTRICT behavior for temporal foreign keys --- src/backend/utils/adt/ri_triggers.c | 12 1 file changed, 12 insertions(+) diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c index 8473448849c..bbd1d5712f2 100644 --- a/src/backend/utils/adt/ri_triggers.c +++ b/src/backend/utils/adt/ri_triggers.c @@ -787,6 +787,12 @@ ri_restrict(TriggerData *trigdata, bool is_no_action) Oid pk_type = RIAttType(pk_rel, riinfo->pk_attnums[i]); Oid fk_type = RIAttType(fk_rel, riinfo->fk_attnums[i]); + if (riinfo->hasperiod && !is_no_action) + { + if (i == riinfo->nkeys - 1) + continue; + } + quoteOneName(attname, RIAttName(fk_rel, riinfo->fk_attnums[i])); sprintf(paramname, "$%d", i + 1); @@ -2734,6 +2740,12 @@ ri_ReportViolation(const RI_ConstraintInfo *riinfo, Datum datum; boolisnull; + if (riinfo->hasperiod && is_restrict) + { + if (idx == riinfo->nkeys - 1) + continue; + } + name = NameStr(att->attname); datum = slot_getattr(violatorslot, fnum, &isnull); -- 2.48.1
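A sketch of the semantics under that interpretation, using the temporal key syntax from this patch series (table names and data are made up; whether RESTRICT really ignores the period columns is precisely what the WIP patch above implements):

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- for the mixed integer/range temporal key

CREATE TABLE pk (
  id int,
  valid_at daterange,
  PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
);

CREATE TABLE fk (
  id int,
  valid_at daterange,
  FOREIGN KEY (id, PERIOD valid_at) REFERENCES pk (id, PERIOD valid_at)
    ON DELETE RESTRICT
);

INSERT INTO pk VALUES (1, daterange('2020-01-01', '2021-01-01'));
INSERT INTO pk VALUES (1, daterange('2021-01-01', '2022-01-01'));
INSERT INTO fk VALUES (1, daterange('2020-03-01', '2020-06-01'));

-- With RESTRICT as described above this errors, because a referencing row with
-- id = 1 exists, even though its period is disjoint from the deleted row's.
-- NO ACTION would allow it, since integrity is still satisfied by the other pk row.
DELETE FROM pk WHERE id = 1 AND valid_at = daterange('2021-01-01', '2022-01-01');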
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
Hi Alvaro, > > Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to > > me that it merely wastes space in SlruCtlData. On top of that I'm not > > 100% sure if all the supported platforms have C99 compilers with > > designated initializers support. > > They do -- we use them quite extensively nowadays. Got it, thanks. I was a bit worried about Visual Studio in particular. -- Best regards, Aleksander Alekseev
Fwd: [Feature Request] Per-Database Transaction Logs for Enhanced Isolation and New Capabilities
Introduce per-database transaction logs (WAL) and transaction ID spaces to improve database isolation, enable hot-mounting/unmounting, selective replication, and open new possibilities in PostgreSQL. Business Use-case: With modern SSDs offering high throughput and low latency, maintaining a single *global* transaction log across all databases in a PostgreSQL instance is becoming an unnecessary constraint. By allowing *each database to have its own transaction log and transaction ID space*, PostgreSQL could achieve significant improvements in performance, isolation, and flexibility. *Key Benefits*: - *Better isolation between databases*: - A long-running transaction in one database would no longer prevent vacuuming of tables in another. - No risk of transaction wraparound issues in one database affecting others. - *Hot-mounting/unmounting databases*: - Ability to attach/detach databases dynamically at the filesystem level without impacting the rest of the cluster. - Faster database restores and migrations by simply copying database files and starting the instance. - *Selective replication*: - Currently, logical replication can be done at the table level, but physical replication applies to the entire cluster. - With per-database WAL, it would be possible to *replicate only specific databases* without requiring complex logical replication setups. - *More flexible backup & restore*: - Ability to back up and restore *individual databases* with transaction consistency, instead of full-cluster backups. - Faster recovery and better disaster recovery options. - *Better integration with cloud and containerized environments*: Would enable dynamically adding and removing databases in cloud environments without cluster-wide restarts. User impact with the change: - Users with large multi-database clusters would see *better transaction isolation*, fewer maintenance conflicts, and *more flexible database management*. - Organizations running *multi-tenant* environments or *per-database replication* setups would gain *easier and more efficient ways to manage databases*. - PostgreSQL would become much more *modular and cloud-friendly*, aligning it with modern high-availability and container-based deployments. Implementation details: - Requires modifying PostgreSQL's WAL and transaction system to support per-database transaction logs. - WAL archiving, replication, and recovery logic would need adjustments to support per-database operations. - Needs careful handling of catalog metadata (such as pg_database) to ensure atomicity when attaching/detaching databases. Estimated Development Time: I do not know PostgreSQL's internal architecture well enough to assess the full impact of such a change. However, taking a step back, it seems that rather than deeply modifying the core engine, an alternative approach could be to spawn a separate PostgreSQL engine per database. In this case, the main entry point would act more like a connection bouncer, routing requests to individual database engines. Opportunity Window Period: As SSD and cloud-based infrastructures become the norm, this change would provide *major competitive advantages* for PostgreSQL in multi-tenant, high-performance, and cloud-native use cases. Budget Money: ... Contact Information: Sebastien Caunes sebast...@pixseed.fr
[Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
One-line Summary: Introduce an INSERT FROZEN feature to bypass vacuum processing for large-scale cold data imports, reducing the impact on system performance post-import. For large imports, migrations and major version upgrades. Business Use-case: When importing massive datasets (e.g., 6-10TB) into PostgreSQL on a heavily loaded server, we observed that the system struggled significantly weeks later when the autovacuum process had to freeze all the imported data pages. This led to severe performance degradation, requiring manual intervention to prioritize vacuum jobs to complete them as quickly as possible. This issue is particularly critical during database *migrations* or *version upgrades*, where a full data reload is often necessary. Each time a major PostgreSQL upgrade occurs, users must reimport large datasets, leading to the same problem of vacuum storms post-import. An INSERT FROZEN feature would allow importing data that is known to be immutable, preventing unnecessary vacuum overhead and reducing system strain. User impact with the change: - Users importing large, cold datasets (initial loads, migrations, version upgrades) can mark them as "frozen" during insertion, so pages are directly marked as frozen. - Reduced risk of autovacuum storms weeks after large imports. - More predictable system performance post-import and post-upgrade. - Avoid unnecessary rewriting of all pages after. - Significant improvement for users who perform regular major version upgrades and need to reload data. Implementation details: - A new INSERT FROZEN option could be introduced, similar to COPY FREEZE, allowing direct insertion of tuples in a frozen state. - This would likely require changes in heap storage logic to ensure tuples are written with a frozen XID at insert time. - Consideration should be given to transaction semantics and WAL logging to ensure consistency and crash recovery integrity. Estimated Development Time: Unknown (would require input from core developers to assess complexity). But I think it's not that much and pretty straightforward to implement experimentally. Then. Opportunity Window Period: ... Budget Money: ... Contact Information: If you have further question regarding the issues I experienced that this would solve, feel free to contact me Sébastien Caunes bokan...@gmail.com Thank you for your attention. -- Sébastien Caunes +33 6 7 229 229 7
Re: Restrict publishing of partitioned table with a foreign table as partition
On Thu, 13 Feb 2025 at 15:50, Shlok Kyal wrote: > > > I have fixed the issue. Attached the updated v6 patch. There is another concurrency issue: In case of create publication for all tables with publish_via_partition_root we will call check_foreign_tables: @@ -876,6 +876,10 @@ CreatePublication(ParseState *pstate, CreatePublicationStmt *stmt) /* Associate objects with the publication. */ if (stmt->for_all_tables) { + /* Check if any foreign table is a part of partitioned table */ + if (publish_via_partition_root) + check_foreign_tables(stmt->pubname); At the time of check in check_foreign_tables, there are no foreign tables so this check will be successful: +check_foreign_tables_in_schema(Oid schemaid, char *pubname) +{ + RelationclassRel; + ScanKeyData key[2]; + TableScanDesc scan; + HeapTuple tuple; + + classRel = table_open(RelationRelationId, AccessShareLock); + + ScanKeyInit(&key[0], + Anum_pg_class_relnamespace, + BTEqualStrategyNumber, F_OIDEQ, + schemaid); + ScanKeyInit(&key[1], + Anum_pg_class_relkind, + BTEqualStrategyNumber, F_CHAREQ, + CharGetDatum(RELKIND_PARTITIONED_TABLE)); + + scan = table_beginscan_catalog(classRel, 2, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) Now immediately after execution of this, create a foreign table: postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES FROM (10) TO (15) SERVER fdw; CREATE FOREIGN TABLE And then continue execution of create publication, it will also be successful: postgres=# create publication pub1 for all tables with ( publish_via_partition_root =true); CREATE PUBLICATION One probable way to fix this is to do the search similar to check_foreign_tables_in_schema where we can skip including schemaid key for all tables. How about something like the attached patch. 
Regards, Vignesh diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c index 520fa2f382..3214104dec 100644 --- a/src/backend/catalog/pg_publication.c +++ b/src/backend/catalog/pg_publication.c @@ -1388,21 +1388,24 @@ check_foreign_tables_in_schema(Oid schemaid, char *pubname) { Relation classRel; ScanKeyData key[2]; + int keycount = 0; TableScanDesc scan; HeapTuple tuple; classRel = table_open(RelationRelationId, AccessShareLock); - ScanKeyInit(&key[0], -Anum_pg_class_relnamespace, -BTEqualStrategyNumber, F_OIDEQ, -schemaid); - ScanKeyInit(&key[1], + ScanKeyInit(&key[keycount++], Anum_pg_class_relkind, BTEqualStrategyNumber, F_CHAREQ, CharGetDatum(RELKIND_PARTITIONED_TABLE)); - - scan = table_beginscan_catalog(classRel, 2, key); + + if (OidIsValid(schemaid)) + ScanKeyInit(&key[keycount++], + Anum_pg_class_relnamespace, + BTEqualStrategyNumber, F_OIDEQ, + schemaid); + + scan = table_beginscan_catalog(classRel, keycount, key); while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); @@ -1436,46 +1439,3 @@ check_foreign_tables_in_schema(Oid schemaid, char *pubname) table_endscan(scan); table_close(classRel, AccessShareLock); } - -/* Check if any foreign table is a partition table */ -void -check_foreign_tables(char *pubname) -{ - Relation classRel; - ScanKeyData key[1]; - TableScanDesc scan; - HeapTuple tuple; - - classRel = table_open(RelationRelationId, AccessShareLock); - - ScanKeyInit(&key[0], -Anum_pg_class_relkind, -BTEqualStrategyNumber, F_CHAREQ, -CharGetDatum(RELKIND_FOREIGN_TABLE)); - - scan = table_beginscan_catalog(classRel, 1, key); - while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) - { - Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); - - if (relForm->relispartition) - { - Oid parent_oid; - char *parent_name; - List *ancestors = get_partition_ancestors(relForm->oid); - - parent_oid = llast_oid(ancestors); - parent_name = get_rel_name(parent_oid); - - ereport(ERROR, - (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("cannot set parameter \"%s\" to true for publication \"%s\"", - "publish_via_partition_root", pubname), - errdetail("partition table \"%s\" in publication contains a foreign partition", - parent_name))); - } - } - - table_endscan(scan); - table_close(classRel, AccessShareLock); -} diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c index e71d5408c7..b8da4ed130 100644 --- a/src/backend/commands/publicationcmds.c +++ b/src/backend/commands/publicationcmds.c @@ -878,7 +878,7 @@ CreatePublication(ParseState *pstate, CreatePublicationStmt *stmt) { /* Check if any foreign table is a part of
Re: [PATCH] Add regression tests of ecpg command notice (error / warning)
On 2025/02/06 8:57, Ryo Kanbayashi wrote: On Wed, Feb 5, 2025 at 9:31 PM Ryo Kanbayashi wrote: Hi hackers, When I wrote patch of ecpg command notice bug, I recognized needs of regression tests for ecpg command notices and I say that I write the tests. Thanks for working on this! I explain about implementation of this patch. What is this patch - add regression tests which test ecpg command notices such as warning and errors - test notices implemented in ecpg.addons file Basic policy on implementation - do in a way that matches the method using the existing pg_regress command as much as possible - avoid methods that increase the scope of influence Next, I list answers to points that are likely to be pointed out in advance below :) - shell scripts and bat files is used due to ... avoid non zero exit code of ecpg command makes tests failure avoid increasing C code for executing binary which cares cross platform - python code is used because I couldn't write meson.build appropriately describe dependency about materials which is used on tests without it. please help me... - as you said, kick this kind of tests by pg_regress accompanied with needless PG server process execution. but pg_regress doesn't execute test without it and making pg_regress require modification which has not small scope of influence Wouldn't it be simpler to use the existing TAP test mechanism, as shown in the attached patch? Please note that this patch is very WIP, so there would be many places that need further implementation and refinement. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index 84199a9a5d..d0e3852a87 100644 --- a/src/interfaces/ecpg/preproc/Makefile +++ b/src/interfaces/ecpg/preproc/Makefile @@ -81,6 +81,9 @@ ecpg_keywords.o: ecpg_kwlist_d.h c_keywords.o: c_kwlist_d.h keywords.o: $(top_srcdir)/src/include/parser/kwlist.h +check: + $(prove_check) + install: all installdirs $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)' diff --git a/src/interfaces/ecpg/preproc/t/001_ecpg.pl b/src/interfaces/ecpg/preproc/t/001_ecpg.pl new file mode 100644 index 00..bc52d19a7f --- /dev/null +++ b/src/interfaces/ecpg/preproc/t/001_ecpg.pl @@ -0,0 +1,37 @@ + +# Copyright (c) 2021-2025, PostgreSQL Global Development Group + +use strict; +use warnings FATAL => 'all'; +use PostgreSQL::Test::Utils; +use Test::More; + +program_help_ok('ecpg'); +program_version_ok('ecpg'); +program_options_handling_ok('ecpg'); +command_fails(['ecpg'], 'ecpg without arguments fails'); + +command_checks_all( + [ 'ecpg', 't/notice.pgc' ], + 3, + [qr//], + [ + qr/ERROR: AT option not allowed in CONNECT statement/, + qr/ERROR: AT option not allowed in DISCONNECT statement/, + qr/ERROR: AT option not allowed in SET CONNECTION statement/, + qr/ERROR: AT option not allowed in TYPE statement/, + qr/ERROR: AT option not allowed in WHENEVER statement/, + qr/ERROR: AT option not allowed in VAR statement/, + qr/WARNING: COPY FROM STDIN is not implemented/, + qr/ERROR: using variable "cursor_var" in different declare statements is not supported/, + qr/ERROR: cursor "duplicate_cursor" is already defined/, + qr/ERROR: SHOW ALL is not implemented/, + qr/WARNING: no longer supported LIMIT/, + qr/WARNING: cursor "duplicate_cursor" has been declared but not opened/, + qr/WARNING: cursor "duplicate_cursor" has been declared but not opened/, + qr/WARNING: cursor ":cursor_var" has been declared but not opened/, 
+ qr/WARNING: cursor ":cursor_var" has been declared but not opened/ + ], + 'ecpg with warnings'); + +done_testing(); diff --git a/src/interfaces/ecpg/preproc/t/notice.pgc b/src/interfaces/ecpg/preproc/t/notice.pgc new file mode 100644 index 00..7e54627fcb --- /dev/null +++ b/src/interfaces/ecpg/preproc/t/notice.pgc @@ -0,0 +1,42 @@ +/* Test ECPG notice/warning/error messages */ + +#include + +int +main(void) +{ + EXEC SQL BEGIN DECLARE SECTION; + char *cursor_var = "mycursor"; + short a; + EXEC SQL END DECLARE SECTION; + + /* For consistency with other tests */ + EXEC SQL CONNECT TO testdb AS con1; + + /* Test AT option errors */ + EXEC SQL AT con1 CONNECT TO testdb2; + EXEC SQL AT con1 DISCONNECT; + EXEC SQL AT con1 SET CONNECTION TO testdb2; + EXEC SQL AT con1 TYPE string IS char[11]; + EXEC SQL AT con1 WHENEVER NOT FOUND CONTINUE; + EXEC SQL AT con1 VAR a IS int; + + /* Test COPY FROM STDIN warning */ + EXEC SQL COPY test FROM stdin; + + /* Test same variable in multi declare statement */ + EXEC SQL DE
Re: Address the bug in 041_checkpoint_at_promote.pl
> > Anyway, how did you find that? Did you see a pattern when running the > > test on a very slow machine or just from a read? That was a good > > catch. > +1. I was wondering the same. I was writing a TAP test to reproduce a crash recovery issue and used parts of 041_checkpoint_at_promote.pl. Unfortunately, my test wasn't waiting for the desired message to appear in the log. I then realized there was a mistake in log_contains(), which I had copied from the existing test. After testing 041_checkpoint_at_promote.pl multiple times to see if it worked as expected, I noticed differences in some iterations. Best Regards, Nitin Jadhav Azure Database for PostgreSQL Microsoft On Thu, Feb 13, 2025 at 11:18 AM Ashutosh Bapat wrote: > > On Thu, Feb 13, 2025 at 5:08 AM Michael Paquier wrote: > > > > Anyway, how did you find that? Did you see a pattern when running the > > test on a very slow machine or just from a read? That was a good > > catch. > +1. I was wondering the same. > > > -- > Best Wishes, > Ashutosh Bapat
Simplify the logic a bit (src/bin/scripts/reindexdb.c)
Hi. Coverity complained about a possible null pointer dereference in the *reindex_one_database* function. That's not really true, but the logic is unnecessarily complicated. Let's simplify it for both humans and machines. Patch attached. Best regards, Ranier Vilela simplifies-reindex-one-database-reindexdb.patch Description: Binary data
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Feb 13, 2025 at 7:08 AM Tomas Vondra wrote: > > > > On 2/13/25 01:40, Melanie Plageman wrote: > > On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra wrote: > >> > >> For the nvme RAID (device: raid-nvme), it's looks almost exactly the > >> same, except that with parallel query (page 27) there's a clear area of > >> regression with eic=1 (look for "column" of red cells). That's a bit > >> unfortunate, because eic=1 is the default value. > > > > So, I feel pretty confident after even more analysis today with Thomas > > Munro that all of the parallel cases with effective_io_concurrency == > > 1 are because master cheats effective_io_concurrency. Therefore, I'm > > not too worried about those for now. We probably will have to increase > > effective_io_concurrency before merging this patch, though. > > > > Thomas mentioned this to me off-list, and I think he's right. We > > likely need to rethink the way parallel bitmap heap scan workers get > > block assignments for reading and prefetching to make it more similar > > to parallel sequential scan. The workers should probably get > > assignments of a range of blocks. On master, each worker does end up > > issuing reads/fadvises for a bunch of blocks in a row -- even though > > that isn't the intent of the parallel bitmap table scan > > implementation. We are losing some of that with the patch -- but only > > because it is behaving as you would expect given the implementation > > and design. I don't consider that a blocker, though. > > > > Agreed. If this is due to master doing something wrong (and either > prefetching more, or failing to prefetch), that's not the fault of this > patch. People might perceive this as a regression, but increasing the > default eic value would fix that. > > BTW I recall we ran into issues with prefetching and parallel workers > about a year ago [1]. Is this the same issue, or something new? > > [1] > https://www.postgresql.org/message-id/20240315211449.en2jcmdqxv5o6tlz%40alap3.anarazel.de Th postfetching is the same issue as you linked in [1]. > >> For the SATA SSD RAID (device: raid-sata), it's similar - a couple > >> regressions for eic=0, just like for NVMe. But then there's also a > >> couple regressions for higher eic values, usually for queries with > >> selective conditions. > > > > I started trying to reproduce the regressions you saw on your SATA > > devices for higher eic values. I was focusing just on serial bitmap > > heap scan -- since that doesn't have the problems mentioned above. I > > don't have a SATA drive, but I used dmsetup to add a 10ms delay to my > > nvme drive. Unfortunately, that did the opposite of reproduce the > > issue. With dm delay 10ms, the patch is 5 times faster than master. > > > > I'm not quite sure at which point does dm-setup add the delay. Does it > affect the fadvise call too, or just the I/O if the page is not already > in page cache? So dmsetup adds the delay only when the page is not found in kernel buffer cache -- actually below the filesystem before issuing it to the block device. reads, writes, and flushes can be delayed (I delayed all three by 10ms). fadvises can't since they are above that level. - Melanie
Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options
On Mon, Feb 3, 2025 at 11:46 AM Tatsuo Ishii wrote: > > > I've looked at it again and I think the code is correct, but I > > miswrote that the array needs to be sorted. The above query returns: > > x | y | nth_value > > ---+---+--- > > 1 | 1 | 2 > > 2 | 2 | 1 > > 3 | | 2 > > 4 | 4 | > > 5 | | 4 > > 6 | 6 | 7 > > 7 | 7 | 6 > > (7 rows) > > > > This is correct, for values of x: > > > > 1: The first non-null value of y is at position 0, however we have > > EXCLUDE CURRENT ROW so it picks the next non-null value at position 1 > > and stores it in the array, returning 2. > > 2: We can now take the first non-null value of y at position 0 and > > store it in the array, returning 1. > > 3. We take 1 preceding, using the position stored in the array, returning 2. > > 4. 1 preceding and 1 following are both null, and we exclude the > > current row, so returning null. > > 5. 1 preceding is at position 3, store it in the array, returning 4. > > 6. 1 preceding is null and we exclude the current row, so store > > position 6 in the array, returning 7. > > 7. 1 preceding is at position 5, store it in the array and return 6. > > > > It will be unordered when the EXCLUDE clause is used but the code > > should handle this correctly. > > I ran this query (not using IGNORE NULLS) and get a result. > > SELECT > x, > nth_value(x,2) OVER w > FROM generate_series(1,5) g(x) > WINDOW w AS (ORDER BY x ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE > CURRENT ROW); > x | nth_value > ---+--- > 1 | 3 > 2 | 3 > 3 | 2 > 4 | 3 > 5 | 4 > (5 rows) > > Since there's no NULL in x column, I expected the same result using > IGNORE NULLS, but it was not: > > SELECT > x, > nth_value(x,2) IGNORE NULLS OVER w > FROM generate_series(1,5) g(x) > WINDOW w AS (ORDER BY x ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE > CURRENT ROW); > x | nth_value > ---+--- > 1 | 3 > 2 | 4 > 3 | 4 > 4 | 3 > 5 | 4 > (5 rows) > > I suspect the difference is in the code path of > ignorenulls_getfuncarginframe and the code path in > WinGetFuncArgInFrame, which takes care of EXCLUDE like this. > > case FRAMEOPTION_EXCLUDE_CURRENT_ROW: > if (abs_pos >= winstate->currentpos && > winstate->currentpos >= > winstate->frameheadpos) > abs_pos++; Attached version doesn't use the nonnulls array if an Exclude is specified, as I think it's not going to work with exclusions (as it's only an optimization, this is ok and can be taken out entirely if you prefer). I've also added your tests above to the tests. 0007-ignore-nulls.patch Description: Binary data
Re: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
=?UTF-8?Q?S=C3=A9bastien?= writes: > Implementation details: >- A new INSERT FROZEN option could be introduced, similar to COPY FREEZE, >allowing direct insertion of tuples in a frozen state. >- This would likely require changes in heap storage logic to ensure >tuples are written with a frozen XID at insert time. >- Consideration should be given to transaction semantics and WAL logging >to ensure consistency and crash recovery integrity. That last is exactly why this won't happen. A frozen tuple would be considered committed and visible the instant it appears in the table, thus completely breaking both atomicity and integrity of the transaction. There has been work going on recently to reduce the impact of freezing massive amounts of data by spreading the work more effectively [1]. I don't say that that particular commit has completely solved the problem, but I think that continued effort in that direction is more likely to yield usable results than what you're suggesting. BTW, this might or might not be usable in your particular workflow, but: there have long been some optimizations for data load into a table created in the same transaction. The idea there is that if the transaction rolls back, the table will never have been visible to any other transaction at all, so that maintaining atomicity/integrity of its contents is moot. regards, tom lane [1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=052026c9b
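For context, the table-created-in-the-same-transaction optimization mentioned above is what COPY FREEZE already builds on; a minimal sketch, with a hypothetical file path:

BEGIN;
CREATE TABLE big_import (id bigint, payload text);
-- FREEZE is only accepted because big_import was created in this same
-- transaction, so nothing becomes visible if the transaction aborts.
COPY big_import FROM '/tmp/big_import.csv' WITH (FORMAT csv, FREEZE);
COMMIT;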
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Tue, Feb 11, 2025 at 9:56 PM Shlok Kyal wrote: > > On Tue, 11 Feb 2025 at 09:51, Shubham Khanna > wrote: > > > > On Fri, Feb 7, 2025 at 7:46 AM Hayato Kuroda (Fujitsu) > > wrote: > > > > > > Dear Shubham, > > > > > > Thanks for updating the patch. > > > > > > Previously you told that you had a plan to extend the patch to drop other > > > replication > > > objects [1], but I think it is not needed. pg_createsubscriber has > > > already been > > > able to drop the existing subscrisubscriptions in > > > check_and_drop_existing_subscriptions(). > > > As for the replication slot, I have told in [2], it would be created > > > intentionally > > > thus I feel it should not be dropped. > > > Thus I regard the patch does not have concrete extending plan. > > > > > > Below part contains my review comment. > > > > > > 01. Option name > > > > > > Based on the above discussion, "--cleanup-publisher-objects" is not > > > suitable because > > > it won't drop replication slots. How about "--cleanup-publications"? > > > > > > > I have changed the name of the option to "--cleanup-existing-publications" > > > > > 02. usage() > > > ``` > > > + printf(_(" -C --cleanup-publisher-objects drop all publications > > > on the logical replica\n")); > > > ``` > > > > Fixed. > > > > > s/logical replica/subscriber > > > > > > 03. drop_all_publications > > > ``` > > > +/* Drops all existing logical replication publications from all > > > subscriber > > > + * databases > > > + */ > > > +static void > > > ``` > > > > > > Initial line of the comment must be blank [3]. > > > > > > > Removed this function. > > > > > 04. main > > > ``` > > > + {"cleanup-publisher-objects", no_argument, NULL, 'C'}, > > > ``` > > > > > > Is there a reason why upper case is used? I feel lower one is enough. > > > > > > > Fixed. > > > > > 05. main > > > ``` > > > + /* Drop publications from the subscriber if requested */ > > > + if (opt.cleanup_publisher_objects) > > > + drop_all_publications(dbinfo); > > > ``` > > > > > > After considering more, I noticed that we have already called > > > drop_publication() > > > in the setup_subscriber(). Can we call drop_all_publications() there > > > instead when > > > -C is specified? > > > > > > > I agree with you on this. I have removed the drop_all_publication() > > and added the "--cleanup-existing-publications" option to the > > drop_publication() > > > > > 06. 040_pg_createsubscriber.pl > > > > > > ``` > > > +$node_s->start; > > > +# Create publications to test it's removal > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL TABLES;"); > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL TABLES;"); > > > + > > > +# Verify the existing publications > > > +my $pub_count_before = > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > +is($pub_count_before, '2', > > > + 'two publications created before --cleanup-publisher-objects is > > > run'); > > > + > > > +$node_s->stop; > > > ``` > > > > > > I feel it requires unnecessary startup and shutdown. IIUC, creating > > > publications and check > > > counts can be before stopping the node_s, around line 331. > > > > > > > Fixed. > > > > > 07. 
040_pg_createsubscriber.pl > > > > > > ``` > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL TABLES;"); > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL TABLES;"); > > > + > > > +# Verify the existing publications > > > +my $pub_count_before = > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > +is($pub_count_before, '2', > > > + 'two publications created before --cleanup-publisher-objects is > > > run'); > > > + > > > ``` > > > > > > Also, there is a possibility that CREATE PUBLICATION on node_p is not > > > replicated yet > > > when SELECT COUNT(*) is executed. Please wait for the replay. > > > > > > [1]: > > > https://www.postgresql.org/message-id/CAHv8RjL4OvoYafofTb_U_JD5HuyoNowBoGpMfnEbhDSENA74Kg%40mail.gmail.com > > > [2]: > > > https://www.postgresql.org/message-id/OSCPR01MB1496664FDC38DA40A441F449FF5EE2%40OSCPR01MB14966.jpnprd01.prod.outlook.com > > > [3]: https://www.postgresql.org/docs/devel/source-format.html > > > > > > > Fixed. > > > > The attached Patch contains the suggested changes. > > > > Hi Shubham, > > I have some comments for v4 patch: > 1. I think we should update the comment for the function > 'drop_publication'. As its usage is changed with this patch > Currently it states: > /* > * Remove publication if it couldn't finish all steps. > */ > Fixed. > 2. In case when --cleanup_existing_publications is not specified the > info message has two double quotes. > > pg_createsubscriber: dropping publication > ""pg_createsubscriber_5_aa3c31f2"" in database "postgres" > > The code: > + appendPQExpBufferStr(targets, > +PQescapeIdentifier(conn, dbinfo->pubname, > +
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Wed, Feb 12, 2025 at 12:57 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Shubham, > > Thanks for updating the patch! Few comments. > > 01. pg_createsubscriber.sgml > ``` > + > + Remove all existing publications on the subscriber node before > creating > + a subscription. > + > + > ``` > Fixed. > I think this is not accurate. Publications on databases which are not target > will > retain. Also, I'm not sure it is important to clarify "before creating a > subscription.". > > How about: Remove all existing publications from specified databases. > > 02. main() > ``` > + opt.cleanup_existing_publications = false; > ``` > > ISTM initialization is done with the same ordering with > CreateSubscriberOptions. > Thus this should be at after recovery_timeout. > Fixed. > 03. 040_pg_createsubscriber.pl > ``` > +$node_p->wait_for_replay_catchup($node_s); > + > +# Verify the existing publications > +my $pub_count_before = > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > +is($pub_count_before, '2', > + 'two publications created before --cleanup-existing-publications is > run'); > ``` > > I think no need to declare $pub_count_before. How about: > > ``` > ok( $node_s->poll_query_until( > $db1, "SELECT COUNT(*) = 2 FROM pg_publication"), > 'two publications created before --cleanup-existing-publications is > run'); > ``` > Fixed. > 04. 040_pg_createsubscriber.pl > ``` +my $pub_count_after = > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > +is($pub_count_after, '0', > + 'all publications dropped after --cleanup-existing-publications is > run'); > + > ``` > > I think no need to declare $pub_count_after. How about: > > ``` > is($node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"), '0', > 'all publications dropped after --cleanup-existing-publications is > run'); > ``` > Fixed. > 05. > > According to the document [1], we must do double-quote for each identifiers. > But current > patch seems not to do. Below example shows the case when three publications > exist. > > ``` > pg_createsubscriber: dropping publications "aaa, bbb, ccc" in database > "postgres" > ``` > > I think the output must be `"aaa", "bbb", "ccc"`. This means > drop_publication() should be refactored. > > [1]: > https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-QUOTES > Fixed. The attached patch at [1] contains the suggested changes. [1] - https://www.postgresql.org/message-id/CAHv8RjKGywu%2B3cAv9MPgxi1_TUXBT8yzUsW%2Bf%3Dg5UsgeJ8Uk6g%40mail.gmail.com Thanks and regards, Shubham Khanna.
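For reference, the per-identifier quoting requested in comment 05 falls out naturally if each publication name is passed through PQescapeIdentifier() on its own and the already-quoted results are joined with commas; PQescapeIdentifier() supplies the surrounding double quotes itself, which is also why wrapping its result in extra quotes produces the ""..."" output seen earlier in the thread. A minimal sketch (the helper name and signature are illustrative, not the patch's actual refactoring of drop_publication()):

```
#include <string.h>

#include "libpq-fe.h"
#include "pqexpbuffer.h"

/*
 * Append publication names to "buf", quoting each identifier separately,
 * producing e.g.: "aaa", "bbb", "ccc"
 */
static void
append_quoted_publications(PGconn *conn, PQExpBuffer buf,
                           const char **pubnames, int npubs)
{
    for (int i = 0; i < npubs; i++)
    {
        char   *quoted = PQescapeIdentifier(conn, pubnames[i],
                                            strlen(pubnames[i]));

        if (quoted == NULL)
            return;             /* out of memory or encoding error */
        if (i > 0)
            appendPQExpBufferStr(buf, ", ");
        appendPQExpBufferStr(buf, quoted);
        PQfreemem(quoted);
    }
}
```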
Re: Get rid of WALBufMappingLock
On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov wrote: > 13.02.2025 12:34, Alexander Korotkov пишет: > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov > > wrote: > >> 08.02.2025 13:07, Alexander Korotkov пишет: > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > >>> wrote: > Good, thank you. I think 0001 patch is generally good, but needs some > further polishing, e.g. more comments explaining how does it work. > >> > >> I tried to add more comments. I'm not good at, so recommendations are > >> welcome. > >> > >>> Two things comes to my mind worth rechecking about 0001. > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > >>> sensitive to that. If so, I would propose to explicitly comment that > >>> and add corresponding asserts. > >> > >> They're certainly page aligned, since they are page borders. > >> I added assert on alignment of InitializeReserved for the sanity. > >> > >>> 2) Check if there are concurrency issues between > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > >> > >> There are no issues: > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > >> GetXLogBuffer to zero-out WAL page. > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > >> so switching wal is not concurrent. (Although, there is no need in this > >> exclusiveness, imho.) > > > > Good, thank you. I've also revised commit message and comments. > > > > But I see another issue with this patch. In the worst case, we do > > XLogWrite() by ourselves, and it could potentially could error out. > > Without patch, that would cause WALBufMappingLock be released and > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > cause other processes infinitely waiting till we finish the > > initialization. > > > > Possible solution would be to save position of the page to be > > initialized, and set it back to XLogCtl->InitializeReserved on error > > (everywhere we do LWLockReleaseAll()). We also must check that on > > error we only set XLogCtl->InitializeReserved to the past, because > > there could be multiple concurrent failures. Also we need to > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > The single place where AdvanceXLInsertBuffer is called outside of critical > section is in XLogBackgroundFlush. All other call stacks will issue server > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > XLogBackgroundFlush explicitely avoids writing buffers by passing > opportunistic=true parameter. > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > since server will shutdown/restart. > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > to XLogWrite here? You're correct. I just reflected this in the next revision of the patch. -- Regards, Alexander Korotkov Supabase v6-0001-Get-rid-of-WALBufMappingLock.patch Description: Binary data v6-0002-several-attempts-to-lock-WALInsertLocks.patch Description: Binary data
Re: Windows meson build
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, failed Spec compliant: tested, failed Documentation: tested, failed Hi! This patch clarifies the options used by meson to build PostgreSQL and how to include libraries and include files from dependencies: 1) the environment variable PKG_CONFIG_PATH 2) the options -Dextra_include_dirs= and -Dextra_lib_dirs This information is useful for Windows builds and any other system where the meson build is used. +1 for commit The new status of this patch is: Ready for Committer
Re: explain analyze rows=%.0f
On 12.02.2025 22:56, Robert Haas wrote: On Wed, Feb 12, 2025 at 2:55 PM Andrei Lepikhov wrote: On 13/2/2025 01:40, Tom Lane wrote: I was idly speculating yesterday about letting the Ryu code print the division result, so that we get a variable number of digits. Realistically, that'd probably result in many cases in more digits than anybody wants, so it's not a serious proposal. I'm cool with the fixed-two-digits approach to start with. Okay, since no one else voted for the meaningful-numbers approach, I would say that fixed size is better than nothing. It may cover some of my practical cases, but unfortunately, not the most problematic ones. I don't love it either, but I do think it is significantly better than nothing. I'm in favor of having some improvement rather than nothing at all—otherwise, we might never reach a consensus. 1. Documentation (v9-0001-Clarify-display-of-rows-as-decimal-fractions-DOC.patch) One thing that bothers me is that the documentation explains how to compute total time, but it does not clarify how to compute total rows. Maybe this was obvious to others before, but now that we are displaying |rows| as a fraction, we should explicitly document how to interpret it alongside total time. I believe it would be helpful to show a non-integer rows value in an example query. However, to achieve this, we need the index scan results to vary across iterations. One way to accomplish this is by using the condition t1.unique2 > t2.unique2. Additionally, I suggest making loops a round number (e.g., 100) for better readability, which can be achieved using t1.thousand < 10. The final query would look like this: EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; I believe this is an ideal example for the documentation because it not only demonstrates fractional rows, but also keeps the execution plan nearly unchanged. While the estimated and actual average row counts become slightly rounded, I don't see another way to ensure different results for each index scan. I'm open to any feedback or suggestions for a better example to use in the documentation or additional explaining fractional rows in the text. 2. Code and tests (v9-0002-Clarify-display-of-rows-as-decimal-fractions.patch) I left the code and tests unchanged since we agreed on a fixed format of two decimal places. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 789d98383988ef6d3b0bc2c6b9b5c73a83ffd6d4 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 13 Feb 2025 11:19:03 +0300 Subject: [PATCH v9] Clarify display of rows as decimal fractions When loops > 1, the average rows value is now displayed as a decimal fraction with two digits after the decimal point in EXPLAIN ANALYZE. Previously, the average rows value was always rounded to an integer, which could lead to misleading results when estimating total rows by multiplying rows by loop. This change ensures that users get a more accurate representation of the data flow through execution nodes. 
--- doc/src/sgml/perform.sgml | 41 --- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index a502a2aaba..dd61ae4507 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -717,26 +717,26 @@ FROM tenk1 t1 WHERE t1.ten = (SELECT (random() * 10)::integer); EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 -WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; - - QUERY PLAN &zwsp;-- - Nested Loop (cost=4.65..118.50 rows=10 width=488) (actual time=0.017..0.051 rows=10 loops=1) - Buffers: shared hit=36 read=6 - -> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.38 rows=10 width=244) (actual time=0.009..0.017 rows=10 loops=1) - Recheck Cond: (unique1 < 10) - Heap Blocks: exact=10 - Buffers: shared hit=3 read=5 written=4 - -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.004..0.004 rows=10 loops=1) - Index Cond: (unique1 < 10) +WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; + + QUERY PLAN +---&zwsp;- + Nested Loop (cost=5.40..11571.44 rows=356667 width=488) (actual time=0.042..117.205 rows=513832 loops=1) + Buffers: shared hit=19377 read=29 + -> Bitmap Heap Scan on tenk1 t1 (cost=5.11..233.60 rows=107 width=244) (actual time=0.021..0.103 rows=100 loops=1) + Recheck Cond: (thousand < 10) + Heap Blocks: exact=90 + Buffers: shared hit=92 + -> Bitmap Index Scan on te
Re: Get rid of WALBufMappingLock
On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov wrote: > 08.02.2025 13:07, Alexander Korotkov пишет: > > On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > > wrote: > >> Good, thank you. I think 0001 patch is generally good, but needs some > >> further polishing, e.g. more comments explaining how does it work. > > I tried to add more comments. I'm not good at, so recommendations are welcome. > > > Two things comes to my mind worth rechecking about 0001. > > 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > sensitive to that. If so, I would propose to explicitly comment that > > and add corresponding asserts. > > They're certainly page aligned, since they are page borders. > I added assert on alignment of InitializeReserved for the sanity. > > > 2) Check if there are concurrency issues between > > AdvanceXLInsertBuffer() and switching to the new WAL file. > > There are no issues: > 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > GetXLogBuffer to zero-out WAL page. > 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > so switching wal is not concurrent. (Although, there is no need in this > exclusiveness, imho.) Good, thank you. I've also revised commit message and comments. But I see another issue with this patch. In the worst case, we do XLogWrite() by ourselves, and it could potentially could error out. Without patch, that would cause WALBufMappingLock be released and XLogCtl->InitializedUpTo not advanced. With the patch, that would cause other processes infinitely waiting till we finish the initialization. Possible solution would be to save position of the page to be initialized, and set it back to XLogCtl->InitializeReserved on error (everywhere we do LWLockReleaseAll()). We also must check that on error we only set XLogCtl->InitializeReserved to the past, because there could be multiple concurrent failures. Also we need to broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > Regarding 0002 patch, it looks generally reasonable. But are 2 > > attempts always optimal? Are there cases of regression, or cases when > > more attempts are even better? Could we have there some > > self-adjusting mechanism like what we have for spinlocks? > > Well, I chose to perform 3 probes (2 conditional attempts + 1 > unconditional) based on intuition. I have some experience in building hash > tables, and cuckoo-hashing theory tells 2 probes is usually enough to reach > 50% fill-rate, and 3 probes enough for ~75% fill rate. Since each probe is > cache miss, it is hardly sensible to do more probes. > > 3 probes did better than 2 in other benchmark [1], although there were > NUM_XLOGINSERT_LOCK increased. > > Excuse me for not bencmarking different choices here. I'll try to do > measurements in next days. > > [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru Ok, let's wait for your measurements. -- Regards, Alexander Korotkov Supabase v5-0002-several-attempts-to-lock-WALInsertLocks.patch Description: Binary data v5-0001-Get-rid-of-WALBufMappingLock.patch Description: Binary data
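To make the failure scenario above easier to follow, here is a deliberately simplified sketch of the reservation protocol the patch introduces (illustrative only; the real logic lives in AdvanceXLInsertBuffer() and also has to handle out-of-order completion): a backend claims the next page with a compare-and-swap on InitializeReserved, initializes it, then publishes it via InitializedUpTo and wakes waiters on the condition variable. An ERROR between the claim and the publish step is exactly what would leave other backends waiting, hence the discussion about critical sections and rolling InitializeReserved back.

```
/*
 * Simplified sketch of the lock-free page-initialization protocol
 * discussed above.  Field names follow the patch; error handling and
 * out-of-order completion are omitted.
 */
static void
reserve_and_init_one_page(void)
{
    XLogRecPtr  reserved;

    /* Claim the next uninitialized page with an atomic compare-and-swap. */
    reserved = pg_atomic_read_u64(&XLogCtl->InitializeReserved);
    while (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
                                           &reserved,
                                           reserved + XLOG_BLCKSZ))
    {
        /* on failure, "reserved" now holds the current value; retry */
    }

    /* ... zero the page here, possibly calling XLogWrite() to evict ... */

    /*
     * Publish the page.  In the real code InitializedUpTo only advances
     * once all earlier pages are initialized too.  If an ERROR were thrown
     * before this point outside a critical section, InitializeReserved
     * would have to be rolled back, otherwise waiters would hang.
     */
    pg_atomic_write_u64(&XLogCtl->InitializedUpTo, reserved + XLOG_BLCKSZ);
    ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
}
```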
Re: pg_stat_statements and "IN" conditions
> On Wed, Feb 12, 2025 at 08:48:03PM GMT, Dmitry Dolgov wrote: > > On Wed, Feb 12, 2025 at 07:39:39PM GMT, Álvaro Herrera wrote: > > The nastiness level of this seems quite low, compared to what happens to > > this other example if we didn't handle these easy cases: > > > > create table t (a float); > > select i from t where i in (1, 2); > > select i from t where i in (1, '2'); > > select i from t where i in ('1', 2); > > select i from t where i in ('1', '2'); > > select i from t where i in (1.0, 1.0); > > Yep, the current version I've got so far produces the same > pg_stat_statements entry for all of those queries. I'm going to move out > the renamed GUC and post the new patch tomorrow. Here is how it looks like (posting only the first patch, since we concentrate on it). This version handles just a little more to cover simpe cases like the implicit convertion above. The GUC is also moved out from pgss and renamed to query_id_merge_values. On top I've added more tests showing the impact, as well as sometimes awkward looking normalized query I was talking about. I'm going to experiment how to iron out the latter. >From ef4115248cd7213494e2bdf8175eaa930aa41640 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Tue, 3 Dec 2024 14:55:45 +0100 Subject: [PATCH v23] Prevent jumbling of every element in ArrayExpr pg_stat_statements produces multiple entries for queries like SELECT something FROM table WHERE col IN (1, 2, 3, ...) depending on the number of parameters, because every element of ArrayExpr is jumbled. In certain situations it's undesirable, especially if the list becomes too large. Make an array of Const expressions contribute only the first/last elements to the jumble hash. Allow to enable this behavior via the new pg_stat_statements parameter query_id_merge_values with the default value off. 
Reviewed-by: Zhihong Yu, Sergey Dudoladov, Robert Haas, Tom Lane, Michael Paquier, Sergei Kornilov, Alvaro Herrera, David Geier, Sutou Kouhei, Sami Imseih Tested-by: Chengxi Sun, Yasuo Honda --- contrib/pg_stat_statements/Makefile | 2 +- .../pg_stat_statements/expected/merging.out | 432 ++ contrib/pg_stat_statements/meson.build| 1 + .../pg_stat_statements/pg_stat_statements.c | 47 +- contrib/pg_stat_statements/sql/merging.sql| 169 +++ doc/src/sgml/config.sgml | 27 ++ doc/src/sgml/pgstatstatements.sgml| 28 +- src/backend/nodes/gen_node_support.pl | 21 +- src/backend/nodes/queryjumblefuncs.c | 167 ++- src/backend/postmaster/launch_backend.c | 3 + src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 - src/include/nodes/nodes.h | 3 + src/include/nodes/primnodes.h | 2 +- src/include/nodes/queryjumble.h | 8 +- 15 files changed, 895 insertions(+), 26 deletions(-) create mode 100644 contrib/pg_stat_statements/expected/merging.out create mode 100644 contrib/pg_stat_statements/sql/merging.sql diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index 241c02587b..eef8d69cc4 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS)) REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf REGRESS = select dml cursors utility level_tracking planning \ user_activity wal entry_timestamp privileges extended \ - parallel cleanup oldextversions + parallel cleanup oldextversions merging # Disabled because these tests require "shared_preload_libraries=pg_stat_statements", # which typical installcheck users do not have (e.g. buildfarm clients). NO_INSTALLCHECK = 1 diff --git a/contrib/pg_stat_statements/expected/merging.out b/contrib/pg_stat_statements/expected/merging.out new file mode 100644 index 00..e7815a200c --- /dev/null +++ b/contrib/pg_stat_statements/expected/merging.out @@ -0,0 +1,432 @@ +-- +-- Const merging functionality +-- +CREATE EXTENSION pg_stat_statements; +CREATE TABLE test_merge (id int, data int); +-- IN queries +-- No merging is performed, as a baseline result +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9); + id | data ++-- +(0 rows) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); + id | data ++-- +(0 rows) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11); + id | data ++-- +(0 rows) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; +query | calls +-+--- + SELECT * FROM test_merge WHE
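For readers following the patch, the essence of the change is a check on the ArrayExpr element list: if every element is a plain constant, only the first and last elements feed the jumble, guarded by the new query_id_merge_values GUC. A rough sketch of that decision (the helper and the do_jumble() stand-in are invented names; the real change is in queryjumblefuncs.c):

```
/* do_jumble() stands in for the real node-jumbling routine. */
static void do_jumble(JumbleState *jstate, Node *node);

/* Can this ArrayExpr element list be "merged" for query-ID purposes? */
static bool
array_elements_are_mergeable(List *elements)
{
    ListCell   *lc;

    foreach(lc, elements)
    {
        Node   *element = (Node *) lfirst(lc);

        if (!IsA(element, Const))
            return false;       /* e.g. a Var or expression: jumble normally */
    }
    return true;
}

static void
jumble_array_elements(JumbleState *jstate, ArrayExpr *expr, bool merge_values)
{
    if (merge_values && array_elements_are_mergeable(expr->elements))
    {
        /* only the first and last constants contribute to the hash */
        do_jumble(jstate, (Node *) linitial(expr->elements));
        do_jumble(jstate, (Node *) llast(expr->elements));
    }
    else
    {
        /* default behaviour: every element is jumbled */
        do_jumble(jstate, (Node *) expr->elements);
    }
}
```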
Re: Windows meson build
Vladlen Popolitov wrote on 2025-02-13 17:58: The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, failed Spec compliant: tested, failed Documentation: tested, failed Hi! The previous email contains the text "failed"; it was generated by the "send review" function, and I do not know how to change it (I tried other browsers). Nothing failed in this patch. -- Best regards, Vladlen Popolitov.
Re: Improve CRC32C performance on SSE4.2
On Thu, Feb 13, 2025 at 4:18 AM Nathan Bossart wrote: > > I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer > overhead if possible. I looked at switching to always using runtime checks > for this stuff, and we concluded that we'd better not [0]. > > [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13 For short lengths, I tried unrolling the loop into a switch statement, as in the attached v5-0006 (the other new patches are fixes for CI). That usually looks faster for me, but not on the length used under the WAL insert lock. Usual caveat: Using small fixed-sized lengths in benchmarks can be misleading, because branches are more easily predicted. It seems like for always using runtime checks we'd need to use branching, rather than function pointers, as has been proposed elsewhere. master: 20 latency average = 3.622 ms latency average = 3.573 ms latency average = 3.599 ms 64 latency average = 7.791 ms latency average = 7.920 ms latency average = 7.888 ms 80 latency average = 8.076 ms latency average = 8.140 ms latency average = 8.150 ms 96 latency average = 8.853 ms latency average = 8.897 ms latency average = 8.914 ms 112 latency average = 9.867 ms latency average = 9.825 ms latency average = 9.869 ms v5: 20 latency average = 4.550 ms latency average = 4.327 ms latency average = 4.320 ms 64 latency average = 5.064 ms latency average = 4.934 ms latency average = 5.020 ms 80 latency average = 4.904 ms latency average = 4.786 ms latency average = 4.942 ms 96 latency average = 5.392 ms latency average = 5.376 ms latency average = 5.367 ms 112 latency average = 5.730 ms latency average = 5.859 ms latency average = 5.734 ms -- John Naylor Amazon Web Services From acb63cddd8c8220db97ae0b012bf4f2fb5174e8a Mon Sep 17 00:00:00 2001 From: John Naylor Date: Wed, 12 Feb 2025 17:07:49 +0700 Subject: [PATCH v5 5/8] Improve CRC32C performance on x86_64 The current SSE4.2 implementation of CRC32C relies on the native CRC32 instruction, which operates on 8 bytes at a time. We can get a substantial speedup on longer inputs by using carryless multiplication on SIMD registers, processing 64 bytes per loop iteration. The PCLMULQDQ instruction has been widely available since 2011 (almost as old as SSE 4.2), so this commit now requires that, as well as SSE 4.2, to build pg_crc32c_sse42.c. The MIT-licensed implementation was generated with the "generate" program from https://github.com/corsix/fast-crc32/ Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" V. Gopal, E. 
Ozturk, et al., 2009 Author: Raghuveer Devulapalli Author: John Naylor Discussion: PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com">https://postgr.es/m/PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com --- config/c-compiler.m4 | 7 ++- configure | 7 ++- meson.build | 7 +-- src/port/pg_crc32c_sse42.c| 4 src/port/pg_crc32c_sse42_choose.c | 9 ++--- 5 files changed, 27 insertions(+), 7 deletions(-) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 8534cc54c1..8b255b5cc8 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -557,14 +557,19 @@ AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS], [define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32], [Ac_cachevar], [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +#include #if defined(__has_attribute) && __has_attribute (target) -__attribute__((target("sse4.2"))) +__attribute__((target("sse4.2,pclmul"))) #endif static int crc32_sse42_test(void) + { + __m128i x1 = _mm_set1_epi32(1); unsigned int crc = 0; crc = _mm_crc32_u8(crc, 0); crc = _mm_crc32_u32(crc, 0); + x1 = _mm_clmulepi64_si128(x1, x1, 0x00); // pclmul + crc = crc + _mm_extract_epi32(x1, 1); /* return computed value, to prevent the above being optimized away */ return crc == 0; }], diff --git a/configure b/configure index 0ffcaeb436..3f2a2a515e 100755 --- a/configure +++ b/configure @@ -17059,14 +17059,19 @@ else cat confdefs.h - <<_ACEOF >conftest.$ac_ext /* end confdefs.h. */ #include +#include #if defined(__has_attribute) && __has_attribute (target) -__attribute__((target("sse4.2"))) +__attribute__((target("sse4.2,pclmul"))) #endif static int crc32_sse42_test(void) + { + __m128i x1 = _mm_set1_epi32(1); unsigned int crc = 0; crc = _mm_crc32_u8(crc, 0); crc = _mm_crc32_u32(crc, 0); + x1 = _mm_clmulepi64_si128(x1, x1, 0x00); + crc = crc + _mm_extract_epi32(x1, 1); /* return computed value, to prevent the above being optimized away */ return crc == 0; } diff --git a/meson.build b/meson.build index 1ceadb9a83..456c3fafc3 100644 --- a/meson.build +++ b/meson.build @@ -2227,15
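On the function-pointer-versus-branch question: a branch-based dispatcher would look roughly like the sketch below, where the CPUID result is cached once and every call branches on it instead of jumping through the pg_comp_crc32c pointer. Whether that wins depends on how well the branch predicts for the short, WAL-insert-lock-sized inputs; this is only an illustration of the idea, not proposed code (have_sse42 is a made-up flag).

```
/*
 * Illustration of branch-based runtime dispatch for CRC32C.
 * pg_comp_crc32c_sse42() and pg_comp_crc32c_sb8() are the existing
 * implementations; have_sse42 would be resolved once at startup.
 */
static bool have_sse42 = false;     /* set once via CPUID at startup */

static inline pg_crc32c
comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
    if (have_sse42)
        return pg_comp_crc32c_sse42(crc, data, len);

    return pg_comp_crc32c_sb8(crc, data, len);
}
```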
Re: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
On Thu, Feb 13, 2025 at 9:45 AM Sébastien wrote: > This issue is particularly critical during database *migrations* or *version > upgrades*, where a full data reload is often necessary. Each time a major > PostgreSQL upgrade occurs, users must reimport large datasets, leading to > the same problem of vacuum storms post import. > What major upgrade process are you doing that uses INSERT? Cheers, Greg -- Crunchy Data - https://www.crunchydata.com Enterprise Postgres Software Products & Tech Support
Re: pg17.3 PQescapeIdentifier() ignores len
Em qui., 13 de fev. de 2025 às 16:05, Tom Lane escreveu: > Ranier Vilela writes: > > Interesting, Coverity has some new reports regarding PQescapeIdentifier. > > > CID 1591290: (#1 of 1): Out-of-bounds access (OVERRUN) > > 2. alloc_strlen: Allocating insufficient memory for the terminating null > of > > the string. [Note: The source code implementation of the function has > been > > overridden by a builtin model.] > > That's not new, we've been seeing those for awhile. I've been > ignoring them on the grounds that (a) if the code actually had such a > problem, valgrind testing would have found it, and (b) the message is > saying in so many words that they're ignoring our code in favor of > somebody's apparently-inaccurate model of said code. > Thanks Tom, extra care is needed when analyzing these reports. best regards, Ranier Vilela
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Feb 13, 2025 at 11:28 AM Tomas Vondra wrote: > > On 2/13/25 17:01, Melanie Plageman wrote: > > On Thu, Feb 13, 2025 at 10:46 AM Tomas Vondra wrote: > >> > >> I reviewed v29 today, and I think it's pretty much ready to go. > >> > >> The one part where I don't quite get is 0001, which replaces a > >> FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, > >> but I don't quite see the benefits / clarity. And I think Thomas might > >> be right we may want to make this dynamic, to save memory. > >> > >> Not a blocker, but I'd probably skip 0001 (unless it's required by the > >> later parts, I haven't checked/tried). > > > > So, on master, it already pallocs an array of size MAX_TUPLES_PER_PAGE > > (which is hard-coded in the tidbitmap API to MaxHeapTuplesPerPage) -- > > see tbm_begin_private_iterate(). > > > > So we always palloc the same amount of memory. The reason I changed it > > from a flexible sized array to a fixed size is that we weren't using > > the flexibility and having a flexible sized array in the > > TBMIterateResult meant it couldn't be nested in another struct. Since > > I have to separate the TBMIterateResult and TBMIterator to implement > > the read stream API for BHS, once I separate them, I nest the > > TBMIterateResult in the GinScanEntry. If the array of offsets is > > flexible sized, then I would have to manage that memory separately in > > GIN code for the TBMIterateResult.. > > > > So, 0001 isn't a change in the amount of memory allocated. > > > > With the read stream API, these TBMIterateResults are palloc'd just > > like we palloc'd one in master. However, we have to have more than one > > at a time. > > > > I know it's not changing how much memory we allocate (compared to > master). I haven't thought about the GinScanEntry - yes, flexible array > member would make this a bit more complex. Oh, I see. I didn't understand Thomas' proposal. I don't know how hard it would be to make tidbitmap allocate the offsets on-demand. I'd need to investigate more. But probably not worth it for this patch. - Melanie
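To make the 0001 trade-off concrete: with a flexible array member the offsets live at the end of a separately palloc'd TBMIterateResult, so the struct cannot be embedded by value in something like a GinScanEntry; with a fixed-size array it can, and the total allocation stays the same since the array was always sized for MaxHeapTuplesPerPage anyway. A simplified before/after sketch (field lists trimmed and names adjusted, so not the exact definitions):

```
/* Before (sketch): flexible array member, must always be palloc'd separately */
typedef struct TBMIterateResultFlexible
{
    BlockNumber blockno;
    int         ntuples;
    bool        recheck;
    OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];    /* cannot be nested */
} TBMIterateResultFlexible;

/* After (sketch): fixed-size array, same total size as the old palloc */
typedef struct TBMIterateResultFixed
{
    BlockNumber blockno;
    int         ntuples;
    bool        recheck;
    OffsetNumber offsets[MaxHeapTuplesPerPage];
} TBMIterateResultFixed;

/* ...which is what allows embedding the result by value, e.g.: */
typedef struct GinScanEntrySketch
{
    /* ... other scan-entry fields ... */
    TBMIterateResultFixed matchResult;  /* no separate palloc or pointer */
} GinScanEntrySketch;
```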
Is pgAdmin the only front-end to PostgreSQL debugger ? And is "a working pl/pgsql debugger" something core should care to maintain ?
Hello PostgreSQL Hackers, We recently discovered an error where pgAdmin fails when stepping into nested function calls ( https://github.com/pgadmin-org/pgadmin4/issues/8443 ). So while waiting for this to be fixed, I would like to know whether there are other debugger front-ends that could be used to do basic debugging of pl/pgsql code. And would these need a separate debugging extension, or is "debugging" meant to be a generic PostgreSQL server feature that has a well-defined API? I know there used to be another debugger as part of OmniDB which had its own server-side extension as well, but I wanted to know what the general thinking on this is. Is debugging pl/pgsql a generic feature of PostgreSQL core, considering that pl/pgsql itself kind of is? Should there be something similar for debugging plain SQL and possibly other PL languages? Should there perhaps be debugging support in psql? -- Hannu
Re: pg17.3 PQescapeIdentifier() ignores len
Em qui., 13 de fev. de 2025 às 13:51, Justin Pryzby escreveu: > I found errors in our sql log after upgrading to 17.3. > > error_severity | ERROR > message| schema > "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" does not exist > query | copy > "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214"."44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" > from stdin > > The copy command is from pygresql's inserttable(), which does: > > do { > t = strchr(s, '.'); > if (!t) > t = s + strlen(s); > table = PQescapeIdentifier(self->cnx, s, (size_t)(t - s)); > fprintf(stderr, "table %s len %ld => %s\n", s, t-s, table); > if (bufpt < bufmax) > bufpt += snprintf(bufpt, (size_t)(bufmax - bufpt), "%s", > table); > PQfreemem(table); > s = t; > if (*s && bufpt < bufmax) > *bufpt++ = *s++; > } while (*s); > > The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its > len. > Interesting, Coverity has some new reports regarding PQescapeIdentifier. CID 1591290: (#1 of 1): Out-of-bounds access (OVERRUN) 2. alloc_strlen: Allocating insufficient memory for the terminating null of the string. [Note: The source code implementation of the function has been overridden by a builtin model.] Until now, I was in disbelief. best regards, Ranier Vilela
Patch: Log parameter types in detailed query logging
Dear PostgreSQL Hackers, This patch adds the ability to log the types of parameters used in queries when detailed query logging is enabled. Currently, detailed logging only shows the parameter values, which often requires further investigation or asking the client to determine the data types. This enhancement will significantly aid in debugging problematic queries, especially when data type mismatches are suspected. The patch modifies the detailed logging output to include the data type of each parameter, making it easier to understand the context of the query and diagnose potential issues without additional communication overhead. Here's an example of the new logging format: ``` 2025-02-10 21:05:42.622 +07 [3702286] LOG: duration: 0.008 ms execute P_1: SELECT u.username, u.email, r.role_name, o.order_date, p.product_name, oi.quantity, ur.role_id, (p.price * oi.quantity) AS total_price FROM users u JOIN user_roles ur ON u.user_id = ur.user_id JOIN roles r ON ur.role_id = r.role_id JOIN orders o ON u.user_id = o.user_id JOIN order_items oi ON o.order_id = oi.order_id JOIN products p ON oi.product_id = p.product_id where ur.role_id = $1 and u.user_id = $2 and oi.order_id = $3::bigint ORDER BY o.order_date; 2025-02-10 21:05:42.622 +07 [3702286] DETAIL: Parameters: $1 = (integer)'11', $2 = (integer)'86', $3 = (bigint)'14' ``` As you can see, the DETAIL log message now includes the data type in parentheses before the parameter value. I believe this addition will greatly improve the usefulness of detailed query logging. I welcome your feedback and suggestions. Thank you for your time and consideration. Best regards, Stepan Neretin
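A rough idea of how a DETAIL string like the one above could be assembled, assuming the parameter type OIDs are taken from the ParamListInfo (a sketch of the approach only, not the posted patch; quoting and escaping of the values is omitted):

```
/*
 * Sketch: build a "Parameters: ..." detail string that also shows each
 * parameter's type, e.g.  $1 = (integer)'11'.
 */
static char *
build_params_detail(ParamListInfo params)
{
    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfoString(&buf, "Parameters: ");

    for (int i = 0; i < params->numParams; i++)
    {
        ParamExternData *prm = &params->params[i];

        if (i > 0)
            appendStringInfoString(&buf, ", ");

        if (prm->isnull)
            appendStringInfo(&buf, "$%d = NULL", i + 1);
        else
        {
            Oid     typoutput;
            bool    typisvarlena;
            char   *val;

            getTypeOutputInfo(prm->ptype, &typoutput, &typisvarlena);
            val = OidOutputFunctionCall(typoutput, prm->value);

            appendStringInfo(&buf, "$%d = (%s)'%s'",
                             i + 1,
                             format_type_be(prm->ptype),   /* the added type */
                             val);
        }
    }

    return buf.data;
}
```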
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/13/25 17:01, Melanie Plageman wrote: > On Thu, Feb 13, 2025 at 10:46 AM Tomas Vondra wrote: >> >> I reviewed v29 today, and I think it's pretty much ready to go. >> >> The one part where I don't quite get is 0001, which replaces a >> FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, >> but I don't quite see the benefits / clarity. And I think Thomas might >> be right we may want to make this dynamic, to save memory. >> >> Not a blocker, but I'd probably skip 0001 (unless it's required by the >> later parts, I haven't checked/tried). > > So, on master, it already pallocs an array of size MAX_TUPLES_PER_PAGE > (which is hard-coded in the tidbitmap API to MaxHeapTuplesPerPage) -- > see tbm_begin_private_iterate(). > > So we always palloc the same amount of memory. The reason I changed it > from a flexible sized array to a fixed size is that we weren't using > the flexibility and having a flexible sized array in the > TBMIterateResult meant it couldn't be nested in another struct. Since > I have to separate the TBMIterateResult and TBMIterator to implement > the read stream API for BHS, once I separate them, I nest the > TBMIterateResult in the GinScanEntry. If the array of offsets is > flexible sized, then I would have to manage that memory separately in > GIN code for the TBMIterateResult.. > > So, 0001 isn't a change in the amount of memory allocated. > > With the read stream API, these TBMIterateResults are palloc'd just > like we palloc'd one in master. However, we have to have more than one > at a time. > I know it's not changing how much memory we allocate (compared to master). I haven't thought about the GinScanEntry - yes, flexible array member would make this a bit more complex. I don't want to bikeshed this - I'll leave the decision about 0001 to you. I'm OK with both options. regards -- Tomas Vondra
Re: Get rid of WALBufMappingLock
Hi, Yura and Alexander! On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov wrote: > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov > wrote: > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov > > > wrote: > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > > >>> wrote: > > Good, thank you. I think 0001 patch is generally good, but needs some > > further polishing, e.g. more comments explaining how does it work. > > >> > > >> I tried to add more comments. I'm not good at, so recommendations are > > >> welcome. > > >> > > >>> Two things comes to my mind worth rechecking about 0001. > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > >>> sensitive to that. If so, I would propose to explicitly comment that > > >>> and add corresponding asserts. > > >> > > >> They're certainly page aligned, since they are page borders. > > >> I added assert on alignment of InitializeReserved for the sanity. > > >> > > >>> 2) Check if there are concurrency issues between > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > >> > > >> There are no issues: > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > >> GetXLogBuffer to zero-out WAL page. > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > > >> so switching wal is not concurrent. (Although, there is no need in this > > >> exclusiveness, imho.) > > > > > > Good, thank you. I've also revised commit message and comments. > > > > > > But I see another issue with this patch. In the worst case, we do > > > XLogWrite() by ourselves, and it could potentially could error out. > > > Without patch, that would cause WALBufMappingLock be released and > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > cause other processes infinitely waiting till we finish the > > > initialization. > > > > > > Possible solution would be to save position of the page to be > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > error we only set XLogCtl->InitializeReserved to the past, because > > > there could be multiple concurrent failures. Also we need to > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > The single place where AdvanceXLInsertBuffer is called outside of critical > > section is in XLogBackgroundFlush. All other call stacks will issue server > > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > > > XLogBackgroundFlush explicitely avoids writing buffers by passing > > opportunistic=true parameter. > > > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > > since server will shutdown/restart. > > > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > > to XLogWrite here? > > You're correct. I just reflected this in the next revision of the patch. I've looked at the patchset v6. The overall goal to avoid locking looks good. Both patches look right to me. The only thing I'm slightly concerned about if patchset could demonstrate performance differences in any real workload. I appreciate it as a beautiful optimization anyway. I have the following proposals to change the code and comments: For patch 0001: - Struct XLBlocks contains just one pg_atomic_uint64 member. 
Is it still needed as a struct? These changes make a significant volume of changes to the patch, being noop. Maybe it was inherited from v1 and not needed anymore. - Furthermore when xlblocks became a struct comments like: > and xlblocks values certainly do. xlblocks values are changed need to be changed to xlblocks.bound. This could be avoided by changing back xlblocks from type XLBlocks * to pg_atomic_uint64 * - It's worth more detailed commenting InitializedUpTo/InitializedUpToCondVar than: +* It is updated to successfully initialized buffer's identities, perhaps +* waiting on conditional variable bound to buffer. "perhaps waiting" could also be in style "maybe/even while AAA waits BBB" "lock-free with cooperation with" -> "lock-free accompanied by changes to..." - Comment inside typedef struct XLogCtlData: /* 1st byte ptr-s + XLOG_BLCKSZ (and condvar * for) */ need to be returned back /* 1st byte ptr-s + XLOG_BLCKSZ */ - Commented out code for cleanup in the final patch: //ConditionVariableBroadcast(&XLogCtl->xlblocks[nextidx].condvar); - in AdvanceXLInsertBuffer() npages initialised to 0 but it is not increased anymore Block under > if (XLOG_DEBUG && npages > 0) became unreachable (InvalidXLogRecPtr + 1) is essentially 0+1 and IMO this semantically calls for adding #define FirstValidXLogRecPtr 1 Typo in a commit message: %s/usign/using/g For patch 0002: I think Yura
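To illustrate the first two suggestions above: dropping the single-member XLBlocks struct in favour of a plain array of atomics keeps call sites unchanged, and the magic (InvalidXLogRecPtr + 1) gets a name. A sketch of the proposed shape (fragments only, not compilable on their own):

```
/* Keep xlblocks as a plain atomic array instead of a one-member struct: */
pg_atomic_uint64 *xlblocks;     /* 1st byte ptr-s + XLOG_BLCKSZ */

/* ...so existing call sites keep reading it directly: */
expectedEndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

/* And give the first valid position a name instead of InvalidXLogRecPtr + 1: */
#define FirstValidXLogRecPtr    ((XLogRecPtr) 1)
```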
DROP CONSTRAINT, printing a notice and/or skipping when no action is taken
Hello. I noticed a small opportunity for a possible enhancement to DROP DEFAULT, and wanted to share the idea. Apologies if this idea was suggested before, I tried a basic search for pgsql-hackers similar things but didn’t find a hit. I noticed when running an ALTER TABLE with DROP DEFAULT, whether the column default exists or not, ALTER TABLE is always printed as the result. This is arguably slightly confusing, because it’s unclear if anything was done. In the scenario where there is no column default, there isn’t a message saying “skipped” or something equivalent, indicating that there was no default that was dropped. Some of the commands in Postgres do have this kind of feedback, so it seems like an opportunity for greater consistency. For example: if I create a column default, or repeatedly run the following ALTER TABLE statements for the "id_new" column, I always get ALTER TABLE back. ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT; ALTER TABLE ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT; ALTER TABLE ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT; ALTER TABLE An opportunity would be to add a NOTICE type of message when ALTER TABLE ALTER COLUMN DROP DEFAULT is issued, at least when no column default exists, and no action was taken. In that scenario, the operation could possibly be skipped altogether, which might have some additional benefits. As a refreshed on a “Notice” type of message example, here’s one when adding an index and using the "if not exists" clause (an equivalent "if not exists" clause does not exist for DROP DEFAULT to my knowledge): -- an index called “foo” already exists psql> create index if not exists foo on organizations (id); NOTICE: relation "foo" already exists, skipping CREATE INDEX The message being “NOTICE: relation "foo" already exists, skipping” A similar message for DROP DEFAULT might look like: “NOTICE: default does not exist, skipping” Or an alternative that includes the column name might look like: “NOTICE: default does not exist for column id_new, skipping” Or another alternative might be a new (non-standard?) "if exists" clause for DROP DEFAULT. Example: ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT IF EXISTS; -- Or an alternative placement of the "if exists" clause, because I don’t really know where it would go: ALTER TABLE my_table ALTER COLUMN id_new DROP IF EXISTS DEFAULT; Thanks! - Andrew
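On the server side, the skip-with-notice behaviour would amount to an early exit in the ALTER TABLE path that handles DROP DEFAULT, along the lines of the fragment below (a sketch only; attribute lookup and the existing default-removal code are elided, and the message wording is just an example):

```
/* Sketch: DROP DEFAULT on a column that has no default becomes a no-op. */
if (newDefault == NULL && !attTup->atthasdef)
{
    ereport(NOTICE,
            (errmsg("column \"%s\" of relation \"%s\" has no default, skipping",
                    colName, RelationGetRelationName(rel))));
    return;
}
```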
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Wed, Feb 12, 2025 at 6:55 AM Peter Eisentraut wrote: > This just depends on how people have built their libcurl, right? > > Do we have any information whether the async-dns-free build is a common > configuration? I don't think the annual Curl survey covers that, unfortunately. On Wed, Feb 12, 2025 at 6:59 AM Peter Eisentraut wrote: > I think you can just remove that comment. It's pretty established that > internal errors don't need translation, so it would be understood from > looking at the code. Okay, will do. Thanks, --Jacob
Remove an unnecessary argument from execute_extension_script()
Hi, The attached patch removes an unnecessary argument, "schemaOid", from the static function execute_extension_script(). It might have been intended for some use initially, but it is actually unused, since schemaName is sufficient. Regards, Yugo Nagata -- Yugo Nagata diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index ba540e3de5b..ec1d38b2172 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -982,7 +982,7 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, const char *from_version, const char *version, List *requiredSchemas, - const char *schemaName, Oid schemaOid) + const char *schemaName) { bool switch_to_superuser = false; char *filename; @@ -1788,7 +1788,7 @@ CreateExtensionInternal(char *extensionName, execute_extension_script(extensionOid, control, NULL, versionName, requiredSchemas, - schemaName, schemaOid); + schemaName); /* * If additional update scripts have to be executed, apply the updates as @@ -3380,7 +3380,7 @@ ApplyExtensionUpdates(Oid extensionOid, execute_extension_script(extensionOid, control, oldVersionName, versionName, requiredSchemas, - schemaName, schemaOid); + schemaName); /* * Update prior-version name and loop around. Since
pg17.3 PQescapeIdentifier() ignores len
I found errors in our sql log after upgrading to 17.3. error_severity | ERROR message| schema "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" does not exist query | copy "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214"."44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" from stdin The copy command is from pygresql's inserttable(), which does: do { t = strchr(s, '.'); if (!t) t = s + strlen(s); table = PQescapeIdentifier(self->cnx, s, (size_t)(t - s)); fprintf(stderr, "table %s len %ld => %s\n", s, t-s, table); if (bufpt < bufmax) bufpt += snprintf(bufpt, (size_t)(bufmax - bufpt), "%s", table); PQfreemem(table); s = t; if (*s && bufpt < bufmax) *bufpt++ = *s++; } while (*s); The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its len. python3 -c "import pg; db=pg.DB('postgres'); db.inserttable('child.a', [1])") table child.a len 5 => "child.a" table a len 13 => "a" -- Justin
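For comparison, here is a minimal standalone program showing the behaviour the pygresql code relies on: PQescapeIdentifier() is supposed to escape only the first len bytes, so splitting "child.a" at the dot should produce "child" and "a" rather than quoting the whole string (the conninfo string is a placeholder, and error checks are trimmed):

```
#include <stdio.h>
#include <string.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder */
    const char *qualified = "child.a";
    const char *dot = strchr(qualified, '.');
    char       *schema;
    char       *table;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Escape only the first (dot - qualified) bytes: expected "child" */
    schema = PQescapeIdentifier(conn, qualified, (size_t) (dot - qualified));
    /* Escape the remainder: expected "a" */
    table = PQescapeIdentifier(conn, dot + 1, strlen(dot + 1));

    printf("%s.%s\n", schema, table);   /* expected: "child"."a" */

    PQfreemem(schema);
    PQfreemem(table);
    PQfinish(conn);
    return 0;
}
```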
Re: Is pgAdmin the only front-end to PostgreSQL debugger ? And is "a working pl/pgsql debugger" something core should care to maintain ?
Hi On Thu, Feb 13, 2025 at 18:00, Hannu Krosing wrote: > Hello PostgreSQL Hackers, > > > We recently discovered an error where pgAdmin fails when stepping into > nested function calls ( > https://github.com/pgadmin-org/pgadmin4/issues/8443 ). > > So while waiting for this to be fixed I would like to know if there > are other debugger front-ends that could be used to do basic debugging > of pl/pgsql code? If I remember correctly, there was an independent application that was a plpgsql debugger, but that was in the time before pgAdmin 4. Maybe DBeaver uses it too. I think some commercial GUIs have support for debugging plpgsql, but I am not sure. > > And would these need a separate debugging extension, or is "debugging" > meant to be a generic PostgreSQL server feature that has a > well-defined API? > Unfortunately, the built-in API has no external interface. So in every case you need an extension that uses the internal plpgsql debug API or provides access to some plpgsql data like the stack or variables. These extensions can differ - plpgsql_check uses this API, as do plprofiler and pldebugger. If I remember well, the development of pldebugger and plprofiler was a fully independent project outside the Postgres core, and there was no attempt to merge it upstream. The internal built-in part is really minimalistic - just a few hooks. > I know there used to be another debugger as part of OmniDB which had > its own server-side extension as well, but wanted to know what is the > general thinking on this. > > Is debugging pl/pgsql a generic feature of PostgreSQL core considering > that pl/pgsql itself kind of is? > Debugging plpgsql is independent of the PostgreSQL core. > Should there be something similar for debugging plain SQL and possibly > other PL languages? > I can imagine this - there could be some common API supported by the language handlers. > > Should there perhaps be debugging support in psql? > psql does not support an asynchronous API, so an implementation in psql would not be simple. Moreover, I doubt that users want to use a debugger with a UI similar to gdb, and it isn't possible to design a better UI directly in psql. I can imagine supporting external debuggers executed inside or outside the same terminal that communicate with psql over some bidirectional pipe (psql would be used like a proxy). Then an application similar to the Borland debugger could be designed. This could be a very nice feature with probably not too much code, but handling bidirectional pipes isn't trivial. Tuning the reading from a pipe in pspg for some less usual patterns was not pleasant work (and there the communication is significantly simpler than it would be for a debugger). Regards Pavel > > -- > Hannu > > >
Re: pg17.3 PQescapeIdentifier() ignores len
Ranier Vilela writes: > Interesting, Coverity has some new reports regarding PQescapeIdentifier. > CID 1591290: (#1 of 1): Out-of-bounds access (OVERRUN) > 2. alloc_strlen: Allocating insufficient memory for the terminating null of > the string. [Note: The source code implementation of the function has been > overridden by a builtin model.] That's not new, we've been seeing those for awhile. I've been ignoring them on the grounds that (a) if the code actually had such a problem, valgrind testing would have found it, and (b) the message is saying in so many words that they're ignoring our code in favor of somebody's apparently-inaccurate model of said code. regards, tom lane
Re: pg_stat_statements and "IN" conditions
Hi, Thanks for the updated patch! I spent some time looking at v24 today, and I have some findings/comments. 1/ Constants passed as parameters to a prepared statement will not be handled as expected. I did not not test explicit PREPARE/EXECUTE statement, but I assume it will have the same issue. postgres=# show query_id_squash_values; query_id_squash_values on (1 row) postgres=# select from foo where col_bigint in ($1, $2, $3, $4) \bind 1 2 3 4 postgres-# ; -- (0 rows) postgres=# select from foo where col_bigint in ($1, $2, $3, $4, $5) \bind 1 2 3 4 5 postgres-# ; -- (0 rows) postgres=# select query, queryid, calls from pg_stat_statements where query like 'select%from foo where%' order by stats_since asc; query | queryid| calls --+--+--- select from foo where col_bigint in ($1, $2, $3, $4) | -1169585827903667511 | 1 select from foo where col_bigint in ($1, $2, $3, $4, $5) | -5591703027615838766 | 1 (2 rows) I think the answer is here is to also check for "Param" when deciding if an element should be merged. i.e. if (!IsA(element, Const) && !IsA(element, Param)) 2/ This case with an array passed to aa function seems to cause a regression in pg_stat_statements query text. As you can see the text is incomplete. CREATE OR REPLACE FUNCTION arrtest(i int[]) RETURNS void AS $$ BEGIN NULL; END; $$ LANGUAGE plpgsql; postgres=# select arrtest(array[1, 2]) from foo where col_bigint in (1, 2, 3); arrtest - (0 rows) postgres=# select query from pg_stat_statements; query --- select arrtest(array[...) (1 row) it should otherwise look like this: postgres=# select query from pg_stat_statements; query - select arrtest(array[$1, $2]) from foo where col_bigint in ($3, $4, $5) (1 row) 3/ A typo in the docs. c/lenght/length +occurrence with an array of different lenght. 4/ + +Specifies how an array of constants (e.g. for an IN +clause) contributes to the query identifier computation. Is this parameter specific to only useful to merge the values of an IN list. Should the documentation be more specific and say that only IN lists will benefit from this parameter? Also, if there is only 1 value in the list, it will have a different queryId than that of the same query in which more than 1 value is passed to the IN list. Should the documentation be clear about that? 5/ pg_node_attr of query_jumble_merge is doing something very specific to the elements list of an ArrayExpr. The merge code likely cannot be used for other node types. /* the array elements or sub-arrays */ - List *elements; + List *elements pg_node_attr(query_jumble_merge); Why are we creating a new node attribute rather than following the existing pattern of using the "custom_query_jumble" attribute on ArrayExpr and creating a custom jumble function like we do with _jumbleVariableSetStmt? Regards, Sami
Re: Confine vacuum skip logic to lazy_scan_skip
On Tue, Feb 11, 2025 at 5:10 PM Melanie Plageman wrote: > > On Thu, Feb 6, 2025 at 1:06 PM Melanie Plageman > wrote: > > > > On Wed, Feb 5, 2025 at 5:26 PM Melanie Plageman > > wrote: > > > > > > Yes, looking at these results, I also feel good about it. I've updated > > > the commit metadata in attached v14, but I could use a round of review > > > before pushing it. > > > > I've done a bit of self-review and updated these patches. > > This needed a rebase in light of 052026c9b90. > v16 attached has an additional commit which converts the block > information parameters to heap_vac_scan_next_block() into flags > because we can only get one piece of information per block from the > read stream API. This seemed nicer than a cumbersome struct. > Sorry for the late chiming in. I've reviewed the v16 patch set, and the patches mostly look good. Here are some comments mostly about cosmetic things: 0001: - boolall_visible_according_to_vm, - was_eager_scanned = false; + uint8 blk_flags = 0; Can we probably declare blk_flags inside the main loop? 0002: In lazy_scan_heap(), we have a failsafe check at the beginning of the main loop, which is performed before reading the first block. Isn't it better to move this check after scanning a block (especially after incrementing scanned_pages)? Otherwise, we would end up calling lazy_check_wraparound_failsafe() at the very first loop, which previously didn't happen without the patch. Since we already called lazy_check_wraparound_failsafe() just before calling lazy_scan_heap(), the extra check would not make much sense. --- + /* Set up the read stream for vacuum's first pass through the heap */ + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, + vacrel->bstrategy, + vacrel->rel, + MAIN_FORKNUM, + heap_vac_scan_next_block, + vacrel, + sizeof(bool)); Is there any reason to use sizeof(bool) instead of sizeof(uint8) here? --- /* * Vacuum the Free Space Map to make newly-freed space visible on -* upper-level FSM pages. Note we have not yet processed blkno. +* upper-level FSM pages. Note that blkno is the previously +* processed block. */ FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno); Given the blkno is already processed, should we pass 'blkno + 1' instead of blkno? 0003: - while ((iter_result = TidStoreIterateNext(iter)) != NULL) I think we can declare iter_result in the main loop of lazy_vacuum_heap_rel(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Confine vacuum skip logic to lazy_scan_skip
On Tue, Feb 11, 2025 at 8:10 PM Melanie Plageman wrote: > > On Thu, Feb 6, 2025 at 1:06 PM Melanie Plageman > wrote: > > > > On Wed, Feb 5, 2025 at 5:26 PM Melanie Plageman > > wrote: > > > > > > Yes, looking at these results, I also feel good about it. I've updated > > > the commit metadata in attached v14, but I could use a round of review > > > before pushing it. > > > > I've done a bit of self-review and updated these patches. > > This needed a rebase in light of 052026c9b90. > v16 attached has an additional commit which converts the block > information parameters to heap_vac_scan_next_block() into flags > because we can only get one piece of information per block from the > read stream API. This seemed nicer than a cumbersome struct. I've done some clean-up including incorporating a few off-list pieces of minor feedback from Andres. - Melanie From da251546b0e7a748884efc9fb6d27fab0b7a452d Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 13 Feb 2025 17:34:18 -0500 Subject: [PATCH v17 3/3] Use streaming read I/O in VACUUM's third phase Make vacuum's third phase (its second pass over the heap), which reaps dead items collected in the first phase and marks them as reusable, use the read stream API. This commit adds a new read stream callback, vacuum_reap_lp_read_stream_next(), that looks ahead in the TidStore and returns the next block number to read for vacuum. Author: Melanie Plageman Co-authored-by: Thomas Munro Discussion: https://postgr.es/m/CA%2BhUKGKN3oy0bN_3yv8hd78a4%2BM1tJC9z7mD8%2Bf%2ByA%2BGeoFUwQ%40mail.gmail.com --- src/backend/access/heap/vacuumlazy.c | 53 +--- 1 file changed, 48 insertions(+), 5 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 32083a92c31..f85530a28f3 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -2639,6 +2639,32 @@ lazy_vacuum_all_indexes(LVRelState *vacrel) return allindexes; } +/* + * Read stream callback for vacuum's third phase (second pass over the heap). + * Gets the next block from the TID store and returns it or InvalidBlockNumber + * if there are no further blocks to vacuum. + */ +static BlockNumber +vacuum_reap_lp_read_stream_next(ReadStream *stream, +void *callback_private_data, +void *per_buffer_data) +{ + TidStoreIter *iter = callback_private_data; + TidStoreIterResult *iter_result; + + iter_result = TidStoreIterateNext(iter); + if (iter_result == NULL) + return InvalidBlockNumber; + + /* + * Save the TidStoreIterResult for later, so we can extract the offsets. + * It is safe to copy the result, according to TidStoreIterateNext(). 
+ */ + memcpy(per_buffer_data, iter_result, sizeof(*iter_result)); + + return iter_result->blkno; +} + /* * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy * @@ -2659,6 +2685,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel) static void lazy_vacuum_heap_rel(LVRelState *vacrel) { + ReadStream *stream; BlockNumber vacuumed_pages = 0; Buffer vmbuffer = InvalidBuffer; LVSavedErrInfo saved_err_info; @@ -2679,7 +2706,17 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) InvalidBlockNumber, InvalidOffsetNumber); iter = TidStoreBeginIterate(vacrel->dead_items); - while ((iter_result = TidStoreIterateNext(iter)) != NULL) + + /* Set up the read stream */ + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, + vacrel->bstrategy, + vacrel->rel, + MAIN_FORKNUM, + vacuum_reap_lp_read_stream_next, + iter, + sizeof(TidStoreIterResult)); + + while (true) { BlockNumber blkno; Buffer buf; @@ -2690,9 +2727,15 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) vacuum_delay_point(false); - blkno = iter_result->blkno; - vacrel->blkno = blkno; + buf = read_stream_next_buffer(stream, (void **) &iter_result); + + /* The relation is exhausted */ + if (!BufferIsValid(buf)) + break; + vacrel->blkno = blkno = BufferGetBlockNumber(buf); + + Assert(iter_result); num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets)); Assert(num_offsets <= lengthof(offsets)); @@ -2704,8 +2747,6 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) visibilitymap_pin(vacrel->rel, blkno, &vmbuffer); /* We need a non-cleanup exclusive lock to mark dead_items unused */ - buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL, - vacrel->bstrategy); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); lazy_vacuum_heap_page(vacrel, blkno, buf, offsets, num_offsets, vmbuffer); @@ -2718,6 +2759,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) RecordPageWithFreeSpace(vacrel->rel, blkno, freespace); vacuumed_pages++; } + + read_stream_end(stream); TidStoreEndIterate(iter); vacrel->blkno = InvalidBlockNumber; -- 2.34.1 From 6abfb26af7cd2ff55c9b13cadcef2d56b6460a28 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 1
Re: Assignment before assert
> On 13 Feb 2025, at 18:08, Dmitry Koval wrote: > > Hi! > Function EvalPlanQualFetchRowMark contains an assignment > > ExecRowMark *erm = earm->rowmark; > > before assert > > Assert(earm != NULL); > > Maybe these lines need to be swapped? That does admittedly look a bit odd; that assertion can't be reached if earm is null since we've already dereferenced it by then. I'll have another look after coffee but something along the lines of your patch looks right (or just remove the Assertion perhaps). -- Daniel Gustafsson
Re: dblink: Add SCRAM pass-through authentication
On Wed, Feb 12, 2025 at 11:54 AM Matheus Alcantara wrote: > Currently dblink_connstr_check and dblink_security_check only check if the > password is present on connection options, in case of not superuser. They also check for GSS delegation. I think SCRAM passthrough should ideally be considered a second form of credentials delegation, from the perspective of those functions. > I added > this logic because the password is not required for SCRAM but I agree with you > that it sounds strange. Maybe these functions could check if the SCRAM is > being used and then skip the password validation if needed internally? As long as the end result is to enforce that the credentials must come from the end user, I think that would be fine in theory. > I also agree that we should enforce the use of the SCRAM on the remote for > safety. To do this I think that we could set require_auth=scram-sha-256 on > connection options if SCRAM pass-through is being used, with this we will get > a > connection error. WYT? We would need to verify that the user mapping can't overwrite that with its own (less trusted) `require_auth` setting. (I think that should be true already, but I'm not 100% sure.) Hardcoding to scram-sha-256 would also prohibit the use of GSS or standard password auth, now that I think about it. The docs currently have a note about being able to choose... Should we add the other permitted authentication types, i.e. `password,md5,scram-sha-256,gss`? Or should we prohibit the use of other auth types if you've set use_scram_passthrough? Or maybe there's an easier way to enforce the use of the SCRAM keys, that better matches the current logic in dblink_security_check? > I can create a new patch to fix this on postgres_fdw too once we define the > approach to this here on dblink. Sounds good to me. Thanks, --Jacob
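To make that idea concrete, a hypothetical sketch (not the actual patch) of where such an enforcement could sit, assuming dblink keeps assembling the outgoing connection string in a StringInfo as it does today; use_scram_passthrough is the option discussed above, the rest of the names are stand-ins:

    /*
     * Hypothetical: when SCRAM pass-through is enabled for this connection,
     * require the remote server to authenticate us with SCRAM.  The user
     * mapping must not be allowed to override this with a weaker
     * require_auth setting of its own.
     */
    if (use_scram_passthrough)
        appendStringInfoString(&buf, " require_auth=scram-sha-256");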
Re: describe special values in GUC descriptions more consistently
On Wed, Feb 12, 2025 at 04:10:53PM -0700, David G. Johnston wrote: > I presume it doesn't affect the actual output which just concatenates the > fragments together but the source placement probably should be made > consistent; the line containing the initial default value specification > begins its own quoted fragment. The following violate that convention. Eh, most of the other descriptions with multiple sentences don't do that, so IMHO there's no need for the special values to go in their own fragment. You are correct that there should be no difference in the actual output. > Also, maybe put the rules in the commit message into a comment in the file, > or a README, instead. I added a comment to guc_tables.h in v6. -- nathan >From 747384f1e0fe9f56cd6cf8363c74920b99028a4f Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 12 Feb 2025 14:26:00 -0600 Subject: [PATCH v6 1/1] Describe special values in GUC descriptions more consistently. Many GUCs accept special values like -1 or an empty string to disable the feature, use a system default, etc. While the documentation consistently lists these special values, the GUC descriptions do not. Many such descriptions fail to mention the special values, and those that do vary in phrasing and placement. This commit aims to bring some consistency to this area by applying the following rules: * Special values should be listed at the end of the long description. * Descriptions should use numerals (e.g., "0") instead of words (e.g., "zero"). * Special value mentions should be concise and direct (e.g., "0 disables the timeout.", "An empty string means use the operating system setting."). * Multiple special values should be listed in ascending order. Of course, there are exceptions, such as max_pred_locks_per_relation and search_path, whose special values are too complex to include. And there are cases like listen_addresses, where the meaning of an empty string is arguably too obvious to include. In those cases, I've refrained from adding special value information to the GUC description. Reviewed-by: Peter Smith Reviewed-by: "David G. 
Johnston" Reviewed-by: Daniel Gustafsson Discussion: https://postgr.es/m/Z6aIy4aywxUZHAo6%40nathan --- src/backend/utils/misc/guc_tables.c | 138 ++-- src/include/utils/guc_tables.h | 15 +++ 2 files changed, 84 insertions(+), 69 deletions(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 226af43fe23..408bd5969b8 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2119,7 +2119,7 @@ struct config_int ConfigureNamesInt[] = {"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING, gettext_noop("Sets the amount of time to wait before forcing a " "switch to the next WAL file."), - NULL, + gettext_noop("0 disables the timeout."), GUC_UNIT_S }, &XLogArchiveTimeout, @@ -2196,7 +2196,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_pool_size", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of individuals in the population."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_pool_size, @@ -2206,7 +2206,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_generations", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of iterations of the algorithm."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_generations, @@ -2229,7 +2229,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_archive_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }, &max_standby_archive_delay, @@ -2240,7 +2240,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_streaming_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing streamed WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }
Re: Remove an unnecessary argument from execute_extension_script()
On Thu, Feb 13, 2025 at 03:18:51PM -0300, Fabrízio de Royes Mello wrote: > On Thu, Feb 13, 2025 at 1:02 PM Yugo Nagata wrote: >> The attached patch is to remove an unnecessary argument "schemaOid" >> from the static function execute_extension_script(). It might have >> been intended to be used in some way initially, but actually this is >> not used since schemaName is sufficient. > > LGTM. Interesting. This parameter seems to have appeared between v30 [0] and v31 [1] of the original extension patch, and even then it wasn't used. And from a quick skim, I don't see any discussion about it. I'll plan on committing this shortly. [0] https://postgr.es/m/m24o8nhd69.fsf%402ndQuadrant.fr [1] https://postgr.es/m/4171.1297135840%40sss.pgh.pa.us -- nathan
Re: DOCS - inactive_since field readability
On Tue, Feb 11, 2025 at 10:10 PM Amit Kapila wrote: > ... > The change in 0001 looks odd after seeing it in HTML format. We should > either add one empty line between two paragraphs otherwise it doesn't > appear good. Did you see multi-paragraphs in any other column > definitions? > > For the 0002 patch, I find the following changes as improvements, > * > > -Note that for slots on the standby > to > +For standby slots > > * > On standby, this is useful for slots > -that are being synced from a primary server (whose > -synced field is true) > -so they know when the slot stopped being synchronized. > to > This helps standby slots > +track when synchronization was interrupted. > > Other than that, I find the current wording okay. > Hi Amit, thanks for the feedback! // Patch 0001 The pg_replication_slots system view is unusual in that there can be entirely different descriptions of the same field depending on the context, such as: a- for logical slots b- for physical slots c- for primary servers versus standby servers IIUC your 0001 feedback says that a blank line might be ok, but just doing it for 'active_since' and nothing else makes it look odd. So, I have modified the 0001 patch to add blank lines separating those different contexts (e.g. a,b,c) for all fields. I think it improves the readability. In passing. - changed null to NULL - changed true to true - changed false to false // Patch 0002 Modified the text as suggested. == Kind Regards, Peter Smith. Fujitsu Australia v2-0001-DOCS-pg_replication_slots.-Add-blank-lines.patch Description: Binary data v2-0002-DOCS-pg_replication_slots.-Improve-the-active_sin.patch Description: Binary data
Re: pg17.3 PQescapeIdentifier() ignores len
On Thu, Feb 13, 2025 at 02:00:09PM -0500, Tom Lane wrote: > Justin Pryzby writes: >> The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its len. > > Ugh, yes. Need something like the attached. Your patch looks right to me. -- nathan
Re: describe special values in GUC descriptions more consistently
On Thu, Feb 13, 2025 at 3:18 PM Nathan Bossart wrote: > On Wed, Feb 12, 2025 at 04:10:53PM -0700, David G. Johnston wrote: > > I presume it doesn't affect the actual output which just concatenates the > > fragments together but the source placement probably should be made > > consistent; the line containing the initial default value specification > > begins its own quoted fragment. The following violate that convention. > > Eh, most of the other descriptions with multiple sentences don't do that, > Apples and oranges. so IMHO there's no need for the special values to go in their own fragment. > The examples shown look sloppy, IMHO; standing out since the other 50, mostly due to the defaults being the only sentence in the gettext_noop, do line up nicely with either a number or "The empty string" as the lead. the worst offender is: - "lost before a connection is considered dead. A value of 0 uses the " - "system default."), + "lost before a connection is considered dead. 0 " + "means use the system default."), Even if the diff has logic to it - only remove stuff from the first line, only add stuff to the second, it isn't a big enough gain to offset leaving the source ugly, IMHO. David J.
Re: Remove an unnecessary argument from execute_extension_script()
Committed. -- nathan
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Feb 14, 2025 at 5:52 AM Melanie Plageman wrote: > On Thu, Feb 13, 2025 at 11:28 AM Tomas Vondra wrote: > > On 2/13/25 17:01, Melanie Plageman wrote: > > I know it's not changing how much memory we allocate (compared to > > master). I haven't thought about the GinScanEntry - yes, flexible array > > member would make this a bit more complex. > > Oh, I see. I didn't understand Thomas' proposal. I don't know how hard > it would be to make tidbitmap allocate the offsets on-demand. I'd need > to investigate more. But probably not worth it for this patch. No, what I meant was: It would be nice to try to hold only one uncompressed result set in memory at a time, like we achieved in the vacuum patches. The consumer expands them from a tiny object when the associated buffer pops out the other end. That should be possible here too, right, because the bitmap is immutable and long lived, so you should be able to stream (essentially) pointers into its internal guts. The current patch streams the uncompressed data itself, and thus has to reserve space for the maximum possible amount of it, and also forces you to think about fixed sizes. I think if you want "consumer does the expanding" and also "dynamic size" and also "consumer provides memory to avoid palloc churn", then you might need two new functions: "how much memory would I need to expand this thing?" and a "please expand it right here, it has the amount of space you told me!". Then I guess the consumer could keep recycling the same piece of memory, and repalloc() if it's not big enough. Or something like that. Yeah I guess you could in theory also stream pointers to individual uncompressed result objects allocated with palloc(), that is point a point in the per-buffer-data and make the consumer free it, but that has other problems (less locality, allocator churn, need cleanup/destructor mechanism for when the streams is reset or destroyed early, still has lots of uncompressed copies of data in memory *sometimes*) and is not what I was imagining.
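A rough shape of the two calls described above, with hypothetical names (tbm_expansion_size/tbm_expand_into do not exist in tidbitmap.c); this only sketches the consumer-side recycling loop:

    /* Ask how much space the compressed per-buffer result needs. */
    size_t      needed = tbm_expansion_size(compressed);   /* hypothetical */

    if (needed > result_space)
    {
        /* Grow the single reusable buffer only when necessary. */
        result = repalloc(result, needed);
        result_space = needed;
    }

    /* Expand the compressed result into the caller-provided memory. */
    tbm_expand_into(compressed, result, needed);            /* hypothetical */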
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Feb 14, 2025 at 11:50 AM Thomas Munro wrote: > Yeah I guess you could in theory also stream pointers to individual > uncompressed result objects allocated with palloc(), that is point a > point in the per-buffer-data and make the consumer free it, but that > has other problems (less locality, allocator churn, need Sorry, typing too fast, I meant to write: "that is, put a pointer in the per-buffer-data and make the consumer free it".
Re: describe special values in GUC descriptions more consistently
On Thu, Feb 13, 2025 at 03:50:08PM -0700, David G. Johnston wrote: > Even if the diff has logic to it - only remove stuff from the first line, > only add stuff to the second, it isn't a big enough gain to offset leaving > the source ugly, IMHO. Okay, I took your suggestions in v7. -- nathan >From 7526dcc25c992be8e2bc6aed2a030dcdd1d536e2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 12 Feb 2025 14:26:00 -0600 Subject: [PATCH v7 1/1] Describe special values in GUC descriptions more consistently. Many GUCs accept special values like -1 or an empty string to disable the feature, use a system default, etc. While the documentation consistently lists these special values, the GUC descriptions do not. Many such descriptions fail to mention the special values, and those that do vary in phrasing and placement. This commit aims to bring some consistency to this area by applying the following rules: * Special values should be listed at the end of the long description. * Descriptions should use numerals (e.g., "0") instead of words (e.g., "zero"). * Special value mentions should be concise and direct (e.g., "0 disables the timeout.", "An empty string means use the operating system setting."). * Multiple special values should be listed in ascending order. Of course, there are exceptions, such as max_pred_locks_per_relation and search_path, whose special values are too complex to include. And there are cases like listen_addresses, where the meaning of an empty string is arguably too obvious to include. In those cases, I've refrained from adding special value information to the GUC description. Reviewed-by: Peter Smith Reviewed-by: "David G. Johnston" Reviewed-by: Daniel Gustafsson Discussion: https://postgr.es/m/Z6aIy4aywxUZHAo6%40nathan --- src/backend/utils/misc/guc_tables.c | 139 ++-- src/include/utils/guc_tables.h | 15 +++ 2 files changed, 84 insertions(+), 70 deletions(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 226af43fe23..42728189322 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2119,7 +2119,7 @@ struct config_int ConfigureNamesInt[] = {"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING, gettext_noop("Sets the amount of time to wait before forcing a " "switch to the next WAL file."), - NULL, + gettext_noop("0 disables the timeout."), GUC_UNIT_S }, &XLogArchiveTimeout, @@ -2196,7 +2196,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_pool_size", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of individuals in the population."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_pool_size, @@ -2206,7 +2206,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_generations", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of iterations of the algorithm."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_generations, @@ -2229,7 +2229,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_archive_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }, &max_standby_archive_delay, @@ -2240,7 +2240,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_streaming_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets 
the maximum delay before canceling queries when a hot standby server is processing streamed WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }, &max_standby_streaming_delay, @@ -2273,7 +2273,7 @@ struct config_int ConfigureNamesInt[] = { {"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum wait time to receive data from the sending server."), - NULL, + gettext_noop("0 disables the timeout."), GUC_UNIT_MS
Re: [PoC] Federated Authn/z with OAUTHBEARER
> On 13 Feb 2025, at 22:23, Jacob Champion > wrote: > > On Wed, Feb 12, 2025 at 6:55 AM Peter Eisentraut wrote: >> This just depends on how people have built their libcurl, right? >> >> Do we have any information whether the async-dns-free build is a common >> configuration? > > I don't think the annual Curl survey covers that, unfortunately. We should be able to get a decent idea by inspecting the packaging scripts for the major distributions I think. -- Daniel Gustafsson
Re: DOCS - Question about pg_sequences.last_value notes
On Thu, Feb 13, 2025 at 03:59:45PM +1100, Peter Smith wrote: > I noticed the pg_sequences system-view DOCS page [1] has a note about > the 'last_value' field. But the note is not within the row for that > field. Indeed, it is not even within the table. > > Is it deliberate? Apparently, this changed in commit 3cb2f13 [2], so > CC-ing author Nathan. Yes, this was intentional. I felt this was too much information to squeeze into the table, and so I followed the example of pg_user_mappings [0] and placed it below. [0] https://www.postgresql.org/docs/devel/view-pg-user-mappings.html -- nathan
Re: Simplify the logic a bit (src/bin/scripts/reindexdb.c)
On 2025-Feb-13, Ranier Vilela wrote: > Hi. > > Coverity complained about possible dereference null pointer > in *reindex_one_database* function. > That's not really true. > But the logic is unnecessarily complicated. Hmm, this code looks quite suspect, but I wonder if instead of (what looks more or less like) a straight revert of cc0e7ebd304a as you propose, a better fix wouldn't be to split get_parallel_object_list in two: get_parallel_table_list for the DATABASE and SCHEMA cases, and get_parallel_tabidx_list (or whatever) for the INDEX case. In the first case we just return a list of values, but in the latter case we also meddle with the input list which becomes an output list ... -- Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/ ¡Ay, ay, ay! Con lo mucho que yo lo quería (bis) se fue de mi vera ... se fue para siempre, pa toíta ... pa toíta la vida ¡Ay Camarón! ¡Ay Camarón! (Paco de Lucía)
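Roughly, the split could end up with prototypes along these lines (the parameter lists are illustrative only, not the actual reindexdb.c signatures):

    /* DATABASE and SCHEMA cases: simply return the list of tables to process. */
    static SimpleStringList *get_parallel_table_list(PGconn *conn,
                                                     ReindexType type,
                                                     SimpleStringList *user_list,
                                                     bool echo);

    /*
     * INDEX case: build the table list for parallel processing and reorder
     * the caller's index list to match it (the input list doubles as an
     * output list, as described above).
     */
    static SimpleStringList *get_parallel_tabidx_list(PGconn *conn,
                                                      SimpleStringList *index_list,
                                                      bool echo);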
Assignment before assert
Hi! Function EvalPlanQualFetchRowMark contains an assignment ExecRowMark *erm = earm->rowmark; before assert Assert(earm != NULL); Maybe these lines need to be swapped? -- With best regards, Dmitry Koval Postgres Professional: http://postgrespro.com diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 39d80ccfbad..963aa390620 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -2662,13 +2662,15 @@ bool EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot) { ExecAuxRowMark *earm = epqstate->relsubs_rowmark[rti - 1]; - ExecRowMark *erm = earm->rowmark; + ExecRowMark *erm; Datum datum; boolisNull; Assert(earm != NULL); Assert(epqstate->origslot != NULL); + erm = earm->rowmark; + if (RowMarkRequiresRowShareLock(erm->markType)) elog(ERROR, "EvalPlanQual doesn't support locking rowmarks");
Re: Proposal - Allow extensions to set a Plan Identifier
Thanks for the feedback! > I think that makes sense and then the idea would be later on to move to > something > like 5fd9dfa5f50, but for the "planId": is my understanding correct? correct. This is adding infrastructure to eventually have an in-core planId; but in the meanwhile give extensions that already compute a planId the ability to broadcast the planId in pg_stat_activity. > > Proposal Overview: > > > > 1/ Add a planId field to PgBackendStatus. > > 2/ Add a planId field to PlannedStmt. > > 3/ Create APIs for extensions to set the current planId. > > Note sure 3 is needed (MyBEEntry is available publicly and PlannedStmt also > through some hooks). But did not look close enough and would probably depends > of the implementation, so let's see. I don't think direct setting of values is a good idea. We will need an API similar to pgstat_report_query_id which ensures we are only reporting top level planIds -and- in the case of multiple extensions with the capability to set a planId, only the first one in the stack wins. pgstat_report_query_id does allow for forcing a queryId ( force flag which is false by default ), and I think this new API should allow the same. > > Storing planId in PlannedStmt ensures it is available with > > cached plans. > > > > One consideration is whether reserving a 64-bit integer in PgBackendStatus > > and PlannedStmt is acceptable, given that planId may not always be used. > > From what I can see PgBackendStatus is currently 432 bytes and PlannedStmt > is 152 bytes. It looks like that adding an uint64 at the end of those structs > would not add extra padding (that's what "ptype /o struct " tells me) > so would increase to 440 bytes and 160 bytes respectively. I don't think > that's > an issue. Correct, thanks for adding this detail. -- Sami
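A sketch of what such a reporting function might look like, modeled on pgstat_report_query_id(); the function name and the st_plan_id field are assumptions here, not committed code:

    void
    pgstat_report_plan_id(uint64 plan_id, bool force)
    {
        volatile PgBackendStatus *beentry = MyBEEntry;

        if (!beentry || !pgstat_track_activities)
            return;

        /*
         * Only report a top-level plan ID: if one is already set, later
         * callers (e.g. a second extension further down the stack) lose,
         * unless they explicitly pass force = true.
         */
        if (beentry->st_plan_id != 0 && !force)
            return;

        PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
        beentry->st_plan_id = plan_id;
        PGSTAT_END_WRITE_ACTIVITY(beentry);
    }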
Re: [PATCH] Optionally record Plan IDs to track plan changes for a query
> I don't understand why we would change any naming here at all. I think > you should be looking at a much broader consensus and plus-ones that a > renaming is needed. -1 from me. The reason for the change is that "query jumble" will no longer make sense if the jumble code can now be used for other types of trees, such as Plan. I do agree that this needs a single-threaded discussion to achieve a consensus. -- Sami
Re: BitmapHeapScan streaming read user and prelim refactoring
Hi, Thomas, there's a bit relevant to you at the bottom. Melanie chatted with me about the performance regressions in Tomas' benchmarks of the patch. I experimented some and I think I found a few interesting pieces. I looked solely at cyclic, wm=4096, matches=8, eic=16 as I wanted to narrow down what I was looking at as much as possible, and as that seemed a significant regression. I was able to reproduce the regression with the patch, although the noise was very high. My first attempt at reducing the noise was using MAP_POPULATE, as I was seeing a lot of time spent in page fault related code, and that help stabilize the output some. After I couldn't really make sense of the different perf characteristics looking at the explain analyze or iostat, so I switched to profiling. As part of my profiling steps, I: - bind postgres to a single core - set the cpu governor to performance - disable CPU idle just for that core (disabling it for all cores sometimes leads to slowness due to less clock boost headroom) - try both with turbo boost on/off With all of those applied, the performance difference almost vanished. Weird, huh! Normally CPUs only clock down when idle. That includes waiting for IO if that wait period is long enough. That made me look more at strace. Which shows this interesting difference between master and the patch: The chosen excerpts are randomly picked, but the pattern is similar throughout execution of the query: master: pread64(52, "\0\0\0\0\260\7|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339271680) = 8192 <0.10> fadvise64(52, 339402752, 8192, POSIX_FADV_WILLNEED) = 0 <0.08> pread64(52, "\0\0\0\0h\10|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339279872) = 8192 <0.11> fadvise64(52, 339410944, 8192, POSIX_FADV_WILLNEED) = 0 <0.08> pread64(52, "\0\0\0\0 \t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339288064) = 8192 <0.11> fadvise64(52, 339419136, 8192, POSIX_FADV_WILLNEED) = 0 <0.11> pread64(52, "\0\0\0\0\330\t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339296256) = 8192 <0.11> fadvise64(52, 339427328, 8192, POSIX_FADV_WILLNEED) = 0 <0.08> With the patch: fadvise64(52, 332365824, 131072, POSIX_FADV_WILLNEED) = 0 <0.21> pread64(52, "\0\0\0\0(\223x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329220096) = 131072 <0.000161> pread64(52, "\0\0\0\0\250\236x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329351168) = 131072 <0.000401> pread64(52, "\0\0\0\0@\252x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 329482240) = 65536 <0.44> pread64(52, "\0\0\0\0\0\250y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332365824) = 131072 <0.79> pread64(52, "\0\0\0\0\200\263y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332496896) = 131072 <0.000336> fadvise64(52, 335781888, 131072, POSIX_FADV_WILLNEED) = 0 <0.21> pread64(52, "\0\0\0\0\0\277y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332627968) = 131072 <0.91> pread64(52, "\0\0\0\0\230\312y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332759040) = 131072 <0.000399> pread64(52, "\0\0\0\0\30\326y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 332890112) = 65536 <0.46> pread64(52, "\0\0\0\0\220\324z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335781888) = 131072 <0.81> pread64(52, 
"\0\0\0\0(\340z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335912960) = 131072 <0.000335> fadvise64(52, 339197952, 131072, POSIX_FADV_WILLNEED) = 0 <0.21> There are a few interesting observations here: - The patch does allow us to make larger reads, nice! - With the patch we do *not* fadvise some of the blocks, the read stream sequentialness logic prevents it. - With the patch there are occasional IOs that are *much* slower. E.g. the last pread64 is a lot slower than the two preceding ones. Which I think explains what's happening. Because we don't fadvise all the time, we have to synchronously wait for some IOs. Because of those slower IOs, the CPU has time to go into an idle state. Slowing the whole query down. If I disable that: diff --git i/src/backend/storage/aio/read_stream.c w/src/backend/storage/aio/read_stream.c index e4414b2e915..e58585c4e02 100644 --- i/src/backend/storage/aio/read_stream.c +++ w/src/backend/storage/aio/read_stream.c @@ -242,8 +242,9 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice) * isn't a strictly sequential pattern, then we'll issue advice. */ if (!suppress_advice && -stream->advice_enabled && -stream->pending_read_blocknum != stream->seq_blocknum) +stream->advice_enabled +// && stream->
Re: pg_attribute_noreturn(), MSVC, C11
On 22.01.25 19:16, Peter Eisentraut wrote: On 06.01.25 15:52, Peter Eisentraut wrote: On 03.01.25 21:51, Dagfinn Ilmari Mannsåker wrote: Peter Eisentraut writes: I suggest we define pg_noreturn as 1. If C11 is supported, then _Noreturn, else 2. If GCC-compatible, then __attribute__((noreturn)), else Would it be worth also checking __has_attribute(noreturn)? Or do all compilers that have __attribute__((noreturn)) claim to be GCC? I don't think that would expand the set of supported compilers in a significant way. We can always add it if we find one, of course. In fact, as another thought, we could even drop #2. Among the GCC- compatible compilers, both GCC and Clang have supported #1 for ages, and the only other candidate I could find on the build farm is the Solaris compiler, which also supports C11 by default, per its documentation. 3. If MSVC, then __declspec((noreturn)) Here is an updated patch set that contains the above small change and fixes some conflicts that have arisen in the meantime. From 8464526cede39d032f52239c5dcdfdbada189691 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 13 Feb 2025 16:01:06 +0100 Subject: [PATCH v2 1/4] pg_noreturn to replace pg_attribute_noreturn() We want to support a "noreturn" decoration on more compilers besides just GCC-compatible ones, but for that we need to move the decoration in front of the function declaration instead of either behind it or wherever, which is the current style afforded by GCC-style attributes. Also rename the macro to "pg_noreturn" to be similar to the C11 standard "noreturn". pg_noreturn is now supported on all compilers that support C11 (using _Noreturn), as well as MSVC (using __declspec). (When PostgreSQL requires C11, the latter variant can be dropped.) (We don't need the previously used variant for GCC-compatible compilers using __attribute__, because all reasonable candidates already support C11, so that variant would be dead code in practice.) Now, all supported compilers effectively support pg_noreturn, so the extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped. This also fixes a possible problem if third-party code includes stdnoreturn.h, because then the current definition of #define pg_attribute_noreturn() __attribute__((noreturn)) would cause an error. Note that the C standard does not support a noreturn attribute on function pointer types. So we have to drop these here. There are only two instances at this time, so it's not a big loss. 
Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw --- contrib/dblink/dblink.c | 6 ++-- contrib/pgcrypto/px.h | 2 +- src/backend/access/transam/xlogrecovery.c | 3 +- src/backend/backup/basebackup_incremental.c | 6 ++-- src/backend/postmaster/autovacuum.c | 2 +- src/backend/postmaster/launch_backend.c | 2 +- src/backend/postmaster/postmaster.c | 2 +- src/backend/replication/logical/tablesync.c | 3 +- src/backend/replication/walsender.c | 2 +- src/backend/utils/adt/ri_triggers.c | 8 +++--- src/backend/utils/fmgr/dfmgr.c| 4 +-- src/backend/utils/hash/dynahash.c | 2 +- src/backend/utils/mmgr/slab.c | 2 +- src/bin/pg_combinebackup/load_manifest.c | 6 ++-- src/bin/pg_dump/pg_backup_utils.h | 2 +- src/bin/pg_upgrade/pg_upgrade.h | 2 +- src/bin/pg_verifybackup/pg_verifybackup.c | 6 ++-- src/bin/pg_verifybackup/pg_verifybackup.h | 4 +-- src/bin/pgbench/pgbench.h | 12 src/include/bootstrap/bootstrap.h | 4 +-- src/include/c.h | 28 --- src/include/commands/defrem.h | 2 +- src/include/common/parse_manifest.h | 3 +- src/include/mb/pg_wchar.h | 6 ++-- src/include/parser/parse_relation.h | 6 ++-- src/include/parser/scanner.h | 2 +- src/include/postmaster/autovacuum.h | 4 +-- src/include/postmaster/bgworker_internals.h | 2 +- src/include/postmaster/bgwriter.h | 4 +-- src/include/postmaster/pgarch.h | 2 +- src/include/postmaster/postmaster.h | 4 +-- src/include/postmaster/startup.h | 2 +- src/include/postmaster/syslogger.h| 2 +- src/include/postmaster/walsummarizer.h| 2 +- src/include/postmaster/walwriter.h| 2 +- src/include/replication/slotsync.h| 2 +- src/include/replication/walreceiver.h | 2 +- src/include/replication/walsender_private.h | 2 +- src/include/storage/ipc.h | 2 +- src/include/storage/lock.h| 2 +- src/include/tcop/backend_startup.h| 2 +- src/include/tcop/tcopprot.h | 12 --
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/10/25 23:31, Melanie Plageman wrote: > On Mon, Feb 10, 2025 at 4:22 PM Melanie Plageman > wrote: >> >> I don't really know what to do about this. The behavior of master >> parallel bitmap heap scan can be emulated with the patch by increasing >> effective_io_concurrency. But, IIRC we didn't want to do that for some >> reason? >> Not only does effective_io_concurrency == 1 negatively affect read >> ahead, but it also prevents read combining regardless of the >> io_combine_limit. > > Would a sensible default be something like 4? > Could be. It would perform much better for most systems I think, but it's still a fairly conservative value so unlikely to cause regressions. If we wanted to validate the value, what tests do you think would be appropriate? What about just running the same tests with different eic values, and picking a reasonable value based on that? One that helps, but before it starts causing regressions. FWIW I don't think we need to do this before pushing the patches, as long as we do both. AFAIK there'll be an "apparent regression" no matter the order of changes. > I did some self-review on the patches and cleaned up the commit > messages etc. Attached is v29. > I reviewed v29 today, and I think it's pretty much ready to go. The one part I don't quite get is 0001, which replaces a FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, but I don't quite see the benefits / clarity. And I think Thomas might be right that we may want to make this dynamic, to save memory. Not a blocker, but I'd probably skip 0001 (unless it's required by the later parts, I haven't checked/tried). Other than that, I think this is ready. regards -- Tomas Vondra
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Feb 13, 2025 at 10:46 AM Tomas Vondra wrote: > > I reviewed v29 today, and I think it's pretty much ready to go. > > The one part where I don't quite get is 0001, which replaces a > FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, > but I don't quite see the benefits / clarity. And I think Thomas might > be right we may want to make this dynamic, to save memory. > > Not a blocker, but I'd probably skip 0001 (unless it's required by the > later parts, I haven't checked/tried). So, on master, it already pallocs an array of size MAX_TUPLES_PER_PAGE (which is hard-coded in the tidbitmap API to MaxHeapTuplesPerPage) -- see tbm_begin_private_iterate(). So we always palloc the same amount of memory. The reason I changed it from a flexible sized array to a fixed size is that we weren't using the flexibility and having a flexible sized array in the TBMIterateResult meant it couldn't be nested in another struct. Since I have to separate the TBMIterateResult and TBMIterator to implement the read stream API for BHS, once I separate them, I nest the TBMIterateResult in the GinScanEntry. If the array of offsets is flexible sized, then I would have to manage that memory separately in GIN code for the TBMIterateResult.. So, 0001 isn't a change in the amount of memory allocated. With the read stream API, these TBMIterateResults are palloc'd just like we palloc'd one in master. However, we have to have more than one at a time. - Melanie
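The C constraint being worked around here is that a struct whose last member is a flexible array member cannot itself be embedded in another struct (C11 6.7.2.1), which is exactly what nesting the result in GinScanEntry requires. A standalone illustration with simplified names (not the actual tidbitmap.h definitions):

    typedef unsigned int BlockNumber;
    typedef unsigned short OffsetNumber;

    #define MAX_OFFSETS 291             /* stand-in for MaxHeapTuplesPerPage */

    /*
     * Not embeddable: a struct ending in a flexible array member may not be
     * a member of another struct, so the compiler would reject nesting it:
     *
     *     typedef struct { BlockNumber blkno; int ntuples; OffsetNumber offsets[]; } FlexResult;
     *     typedef struct { FlexResult result; } ScanEntry;
     */

    /* Embeddable: with a fixed-size array the nesting is legal. */
    typedef struct FixedResult
    {
        BlockNumber  blkno;
        int          ntuples;
        OffsetNumber offsets[MAX_OFFSETS];
    } FixedResult;

    typedef struct ScanEntry
    {
        FixedResult result;             /* e.g. nested in GinScanEntry */
    } ScanEntry;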
Re: Draft for basic NUMA observability
Hi, On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote: > As I have promised to Andres on the Discord hacking server some time > ago, I'm attaching the very brief (and potentially way too rushed) > draft of the first step into NUMA observability on PostgreSQL that was > based on his presentation [0]. It might be rough, but it is to get us > started. The patches were not really even basically tested, they are > more like input for discussion - rather than solid code - to shake out > what should be the proper form of this. > > Right now it gives: > > postgres=# select numa_zone_id, count(*) from pg_buffercache group by > numa_zone_id; > NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.00 > numa_zone_id | count > --+--- > | 16127 > 6 | 256 > 1 | 1 Thanks for the patch! Not doing a code review but sharing some experimentation. First, I had to: @@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) Sizeos_page_size; void**os_page_ptrs; int *os_pages_status; - int os_page_count; + uint64 os_page_count; and - os_page_count = (NBuffers * BLCKSZ) / os_page_size; + os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size; to make it work with non tiny shared_buffers. Observations: when using 2 sessions: Session 1 first loads buffers (e.g., by querying a relation) and then runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;' Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;' I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear as unmapped when viewed from a process that hasn't accessed them, even though those same pages appear as allocated on a NUMA node in session 1. To double check, I created a function pg_buffercache_pages_from_pid() that is exactly the same as pg_buffercache_pages() (with your patch) except that it takes a pid as input and uses it in move_pages(, …). Let me show the results: In session 1 (that "accessed/loaded" the ~65K buffers): postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177310 0 | 65192 -2 | 378 (3 rows) postgres=# select pg_backend_pid(); pg_backend_pid 1662580 In session 2: postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177301 0 | 85 -2 | 65494 (3 rows) ^ postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177301 0 | 90 -2 | 65489 (3 rows) But when session's 1 pid is used: postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177301 0 | 65195 -2 | 384 (3 rows) Results show: Correct NUMA distribution in session 1 Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid() with the pid of session 1 as a parameter (the session that actually accessed the buffers) Which makes me wondering if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though. 
The initial idea that you shared on discord was to use get_mempolicy() but as Andres stated: " One annoying thing about get_mempolicy() is this: If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread had performed a read (load) access to that address, and return the ID of the node where that page was allocated. Forcing the allocation to happen inside a monitoring function is decidedly not great. " The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p ") while using get_mempolicy(). But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
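A sketch of the gating suggested above, assuming <numaif.h> plus the usual buffer-header definitions from buf_internals.h; the surrounding loop, the values/nulls arrays, and os_page_ptr are hypothetical stand-ins rather than the posted patch:

    uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);

    if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
    {
        int         node = -1;

        /* MPOL_F_NODE | MPOL_F_ADDR asks which node the page resides on. */
        if (get_mempolicy(&node, NULL, 0, os_page_ptr,
                          MPOL_F_NODE | MPOL_F_ADDR) == 0)
            values[i] = Int32GetDatum(node);
        else
            nulls[i] = true;
    }
    else
        nulls[i] = true;                /* don't fault in pages of unused buffers */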
Re: Remove an unnecessary argument from execute_extension_script()
On Thu, Feb 13, 2025 at 1:02 PM Yugo Nagata wrote: > > Hi, > > The attached patch is to remove an unnecessary argument "schemaOid" > from the static function execute_extension_script(). It might have > been intended to be used in some way initially, but actually this is > not used since schemaName is sufficient. > LGTM. -- Fabrízio de Royes Mello
Re: Patch: Log parameter types in detailed query logging
Степан writes: > This patch adds the ability to log the types of parameters used in queries > when detailed query logging is enabled. If there's a patch in what you sent, I'm not seeing it. Looks like a webpage dump or something. However, I suppose that it must add some catalog lookup operations to the logging operations, and I'm feeling resistant to that, because of the added cycles and risk-of-failure. I think you need to make a much stronger case for the value of this information than you've done here. regards, tom lane
Re: explain analyze rows=%.0f
On Thu, Feb 13, 2025 at 4:05 AM Ilia Evdokimov wrote: > 1. Documentation > (v9-0001-Clarify-display-of-rows-as-decimal-fractions-DOC.patch) > > One thing that bothers me is that the documentation explains how to compute > total time, but it does not clarify how to compute total rows. Maybe this was > obvious to others before, but now that we are displaying rows as a fraction, > we should explicitly document how to interpret it alongside total time. > > I believe it would be helpful to show a non-integer rows value in an example > query. However, to achieve this, we need the index scan results to vary > across iterations. One way to accomplish this is by using the condition > t1.unique2 > t2.unique2. Additionally, I suggest making loops a round number > (e.g., 100) for better readability, which can be achieved using t1.thousand < > 10. The final query would look like this: > > EXPLAIN ANALYZE SELECT * > FROM tenk1 t1, tenk2 t2 > WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; > > I believe this is an ideal example for the documentation because it not only > demonstrates fractional rows, but also keeps the execution plan nearly > unchanged. While the estimated and actual average row counts become slightly > rounded, I don't see another way to ensure different results for each index > scan. Anyone else have an opinion on this? I see Ilia's point, but a non-equality join is probably an atypical case. > 2. Code and tests (v9-0002-Clarify-display-of-rows-as-decimal-fractions.patch) > > I left the code and tests unchanged since we agreed on a fixed format of two > decimal places. This still has the HAS_DECIMAL() thing to which I objected. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Get rid of WALBufMappingLock
Hi, Pavel! On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov wrote: > On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov wrote: > > > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov > > wrote: > > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov > > > > wrote: > > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > > > >>> wrote: > > > Good, thank you. I think 0001 patch is generally good, but needs > > > some > > > further polishing, e.g. more comments explaining how does it work. > > > >> > > > >> I tried to add more comments. I'm not good at, so recommendations are > > > >> welcome. > > > >> > > > >>> Two things comes to my mind worth rechecking about 0001. > > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > > >>> sensitive to that. If so, I would propose to explicitly comment that > > > >>> and add corresponding asserts. > > > >> > > > >> They're certainly page aligned, since they are page borders. > > > >> I added assert on alignment of InitializeReserved for the sanity. > > > >> > > > >>> 2) Check if there are concurrency issues between > > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > > >> > > > >> There are no issues: > > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > > >> GetXLogBuffer to zero-out WAL page. > > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion > > > >> locks, > > > >> so switching wal is not concurrent. (Although, there is no need in this > > > >> exclusiveness, imho.) > > > > > > > > Good, thank you. I've also revised commit message and comments. > > > > > > > > But I see another issue with this patch. In the worst case, we do > > > > XLogWrite() by ourselves, and it could potentially could error out. > > > > Without patch, that would cause WALBufMappingLock be released and > > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > > cause other processes infinitely waiting till we finish the > > > > initialization. > > > > > > > > Possible solution would be to save position of the page to be > > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > > error we only set XLogCtl->InitializeReserved to the past, because > > > > there could be multiple concurrent failures. Also we need to > > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > > > The single place where AdvanceXLInsertBuffer is called outside of critical > > > section is in XLogBackgroundFlush. All other call stacks will issue server > > > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > > > > > XLogBackgroundFlush explicitely avoids writing buffers by passing > > > opportunistic=true parameter. > > > > > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > > > since server will shutdown/restart. > > > > > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before > > > call > > > to XLogWrite here? > > > > You're correct. I just reflected this in the next revision of the patch. > > I've looked at the patchset v6. Oh, sorry, I really did wrong. I've done git format-patch for wrong local branch for v5 and v6. Patches I've sent for v5 and v6 are actually the same as my v1. This is really pity. 
Please find the correct version of the patchset attached. -- Regards, Alexander Korotkov Supabase v7-0001-Get-rid-of-WALBufMappingLock.patch Description: Binary data v7-0002-several-attempts-to-lock-WALInsertLocks.patch Description: Binary data
BackgroundPsql swallowing errors on windows
Hi, One of the remaining tasks for AIO was to convert the tests to be proper tap tests. I did that and was thanked by the tests fairly randomly failing on windows. Which test fails changes from run to run. The symptom is that BackgroundPsql->query() sometimes would simply swallow errors that were clearly generated by the backend. Which then would cause tests to fail, because testing for errors was the whole point of $test. At first I thought the issue was that psql didn't actually flush stderr after displaying errors. And while that may be an issue, it doesn't seem to be the one causing problems for me. Lots of hair-pulling later, I have a somewhat confirmed theory for what's happening: BackgroundPsql::query() tries to detect if the passed in query has executed by adding a "banner" after the query and using pump_until() to wait until that banner has been "reached". That seems to work reasonably well on !windows. On windows however, it looks like there's no guarantee that if stdout has been received by IPC::Run, stderr also has been received, even if the stderr content has been generated first. I tried to add an extra ->pump_nb() call to query(), thinking that maybe IPC::Run just didn't get input that had actually arrived, due to waiting on just one pipe. But no success. My understanding is that IPC::Run uses a proxy process on windows to execute subprocesses and then communicates with that over TCP (or something along those lines). I suspect what's happening is that the communication with the external process allows for reordering between stdout/stderr. And indeed, changing BackgroundPsql::query() to emit the banner on both stdout and stderr and waiting on both seems to fix the issue. One complication is that I found that just waiting for the banner, without also its newline, sometimes lead to unexpected newlines causing later queries to fail. I think that happens if the trailing newline is read separately from the rest of the string. However, matching the newlines caused tests to fail on some machines. After a lot of cursing I figured out that for interactive psql we output \r\n, causing my regex match to fail. I.e. tests failed whenever IO::PTY was availble... It's also not correct, as we did before, to just look for the banner, it has to be anchored to either the start of the output or a newline, otherwise the \echo (or \warn) command itself will be matched by pump_until() (but then the replacing the command would fail). Not sure that could cause active problems without the addition of \warn (which is also echoed on stdout), but it certainly could after. The banner being the same between queries made it hard to understand if a banner that appeared in the output was from the current query or a past query. Therefore I added a counter to it. For debugging I added a "note" that shows stdout/stderr after executing the query, I think it may be worth keeping that, but I'm not sure. This was a rather painful exercise. 
Greetings, Andres Freund >From 1aeb60e0687339b9f7524c3b347915bb53ed1284 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 13 Feb 2025 10:09:49 -0500 Subject: [PATCH v1] BackgroundPsql: Fix potential for lost errors on windows Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- .../perl/PostgreSQL/Test/BackgroundPsql.pm| 55 +++ 1 file changed, 43 insertions(+), 12 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm index dbfd82e4fac..7e76845307d 100644 --- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm +++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm @@ -89,7 +89,8 @@ sub new 'stdin' => '', 'stdout' => '', 'stderr' => '', - 'query_timer_restart' => undef + 'query_timer_restart' => undef, + 'query_cnt' => 1, }; my $run; @@ -219,27 +220,57 @@ sub query my ($self, $query) = @_; my $ret; my $output; + my $query_cnt = $self->{query_cnt}++; + local $Test::Builder::Level = $Test::Builder::Level + 1; - note "issuing query via background psql: $query"; + note "issuing query $query_cnt via background psql: $query"; $self->{timeout}->start() if (defined($self->{query_timer_restart})); # Feed the query to psql's stdin, followed by \n (so psql processes the # line), by a ; (so that psql issues the query, if it doesn't include a ; - # itself), and a separator echoed with \echo, that we can wait on. - my $banner = "background_psql: QUERY_SEPARATOR"; - $self->{stdin} .= "$query\n;\n\\echo $banner\n"; - - pump_until($self->{run}, $self->{timeout}, \$self->{stdout}, qr/$banner/); + # itself), and a separator echoed both with \echo and \warn, that we can + # wait on. + # + # To avoid somehow confusing the separator from separately issued queries, + # and to make it easier to debug, we include a per-psql query counter in + # the separator. + # + # We need both \echo (printing to stdout) and \warn (printing to stderr), + # because
Re: pg17.3 PQescapeIdentifier() ignores len
Justin Pryzby writes: > The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its len. Ugh, yes. Need something like the attached. FTR, 5dc1e42b4 et al were quite subtle patches done under extreme time pressure. I wonder if they have any other issues. More eyes on those patches would be welcome, now that they are public. regards, tom lane diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index e97ad02542..120d4d032e 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -4224,7 +4224,7 @@ PQescapeInternal(PGconn *conn, const char *str, size_t len, bool as_ident) char *rp; int num_quotes = 0; /* single or double, depending on as_ident */ int num_backslashes = 0; - size_t input_len = strlen(str); + size_t input_len = strnlen(str, len); size_t result_size; char quote_char = as_ident ? '"' : '\''; bool validated_mb = false; @@ -4274,7 +4274,7 @@ PQescapeInternal(PGconn *conn, const char *str, size_t len, bool as_ident) if (!validated_mb) { if (pg_encoding_verifymbstr(conn->client_encoding, s, remaining) - != strlen(s)) + != remaining) { libpq_append_conn_error(conn, "invalid multibyte character"); return NULL;
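As a reminder of what respecting len means for callers, a small usage sketch (assumes an established PGconn *conn); only the first three bytes of the input should be escaped:

    char   *quoted = PQescapeIdentifier(conn, "abcdef", 3);

    if (quoted == NULL)
        fprintf(stderr, "escaping failed: %s", PQerrorMessage(conn));
    else
    {
        printf("%s\n", quoted);         /* expect "abc" in double quotes, not "abcdef" */
        PQfreemem(quoted);
    }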
Re: Patch: Log parameter types in detailed query logging
On Fri, Feb 14, 2025 at 1:03 AM Tom Lane wrote: > =?UTF-8?B?0KHRgtC10L/QsNC9?= writes: > > This patch adds the ability to log the types of parameters used in > queries > > when detailed query logging is enabled. > > If there's a patch in what you sent, I'm not seeing it. Looks like a > webpage dump or something. > > However, I suppose that it must add some catalog lookup operations to > the logging operations, and I'm feeling resistant to that, because of > the added cycles and risk-of-failure. I think you need to make a much > stronger case for the value of this information than you've done here. > > regards, tom lane > My apologies, it seems I am still experiencing some difficulty in accurately transmitting the code change details. I am re-submitting the diff.patch prepared by Stepan Neretin now, and will ensure the formatting is correct. Regarding your concerns, I understand your hesitation about adding catalog lookups due to potential performance impacts and failure risks. However, I believe that the benefits of including parameter types in detailed query logging will significantly outweigh the costs. Often, when analyzing problematic queries, we lack crucial information about the parameter types used. This lack of information forces us to request details from the client, which adds unnecessary delays and complexity to the debugging process. This patch addresses that directly. Best regards, Stepan Neretin diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c index 3acc7508e71..49df56f0135 100644 --- a/src/backend/nodes/params.c +++ b/src/backend/nodes/params.c @@ -21,11 +21,11 @@ #include "nodes/params.h" #include "parser/parse_node.h" #include "storage/shmem.h" +#include "utils/builtins.h" #include "utils/datum.h" #include "utils/lsyscache.h" #include "utils/memutils.h" - static void paramlist_parser_setup(ParseState *pstate, void *arg); static Node *paramlist_param_ref(ParseState *pstate, ParamRef *pref); @@ -367,9 +367,10 @@ BuildParamLogString(ParamListInfo params, char **knownTextValues, int maxlen) ParamExternData *param = ¶ms->params[paramno]; appendStringInfo(&buf, - "%s$%d = ", + "%s$%d = (%s)", paramno > 0 ? ", " : "", - paramno + 1); + paramno + 1, + format_type_extended(param->ptype, -1, FORMAT_TYPE_ALLOW_INVALID)); if (param->isnull || !OidIsValid(param->ptype)) appendStringInfoString(&buf, "NULL");
Re: pg_stat_statements and "IN" conditions
> On Thu, Feb 13, 2025 at 01:47:01PM GMT, Álvaro Herrera wrote: > Also, how wed are you to > "query_id_merge_values" as a name? It's not in any way obvious that > this is about values in arrays. How about query_id_squash_arrays? Or > are you thinking in having values in other types of structures squashed > as well, and that this first patch does it for arrays only but you want > the GUC to also control some future feature? > > (I think I prefer "squash" here as a verb to "merge").

Yeah, when choosing the name I was trying to keep it a bit generic. The high level goal is to reduce repeated non-essential parts, and arrays of constants are one clear scenario, but there could be more to it. Having said that I don't have any particular plans for extending this logic so far. I've ended up with query_id_squash_values, how does this sound?

> I think calling func_volatile potentially once per array element is not > good; this might cause dozens/thousands of identical syscache lookups. > Maybe we can pass an initially NIL list from IsMergeableConstList (as > List **), which IsMergeableConst fills with OIDs of functions that have > been checked and found acceptable. Then the second time around we > search the list first and only do func_volatile() after not finding a > match.

Good point, added.

> Another thing I didn't quite understand is why you did this rather > baroque-looking list scan:

I'm pretty sure there was some reason behind it, but when you pointed it out that reason has promptly vanished in a puff of confusion. Fixed.

From 212f5534fb0f99e5daa74ea4464231faec157a58 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 3 Dec 2024 14:55:45 +0100
Subject: [PATCH v24] Prevent jumbling of every element in ArrayExpr

pg_stat_statements produces multiple entries for queries like

    SELECT something FROM table WHERE col IN (1, 2, 3, ...)

depending on the number of parameters, because every element of ArrayExpr is jumbled. In certain situations it's undesirable, especially if the list becomes too large.

Make an array of Const expressions contribute only the first/last elements to the jumble hash. Allow to enable this behavior via the new pg_stat_statements parameter query_id_squash_values with the default value off.
Reviewed-by: Zhihong Yu, Sergey Dudoladov, Robert Haas, Tom Lane, Michael Paquier, Sergei Kornilov, Alvaro Herrera, David Geier, Sutou Kouhei, Sami Imseih
Tested-by: Chengxi Sun, Yasuo Honda
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../pg_stat_statements/expected/merging.out   | 432 ++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  47 +-
 contrib/pg_stat_statements/sql/merging.sql    | 169 +++
 doc/src/sgml/config.sgml                      |  27 ++
 doc/src/sgml/pgstatstatements.sgml            |  28 +-
 src/backend/nodes/gen_node_support.pl         |  21 +-
 src/backend/nodes/queryjumblefuncs.c          | 165 ++-
 src/backend/postmaster/launch_backend.c       |   3 +
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   2 +-
 src/include/nodes/nodes.h                     |   3 +
 src/include/nodes/primnodes.h                 |   2 +-
 src/include/nodes/queryjumble.h               |   8 +-
 15 files changed, 894 insertions(+), 26 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/merging.out
 create mode 100644 contrib/pg_stat_statements/sql/merging.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..eef8d69cc4 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel cleanup oldextversions merging
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/merging.out b/contrib/pg_stat_statements/expected/merging.out
new file mode 100644
index 00..881174d0ca
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/merging.out
@@ -0,0 +1,432 @@
+--
+-- Const merging functionality
+--
+CREATE EXTENSION pg_stat_statements;
+CREATE TABLE test_merge (id int, data int);
+-- IN queries
+-- No merging is performed, as a baseline result
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9);
+ id | data
+----+------
+(0 rows)
+
+SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
+ id | data
+----+-
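As an aside on the func_volatile() point discussed above, here is a minimal sketch of the OID-memoization pattern Álvaro suggested, with invented names rather than the patch's actual code: the caller passes down a list of already-vetted function OIDs, so the syscache lookup happens at most once per distinct function while walking a long constant list.

/* Sketch only; backend-side code, not taken from the patch. */
#include "postgres.h"

#include "catalog/pg_proc.h"
#include "nodes/pg_list.h"
#include "utils/lsyscache.h"

/*
 * Return true if funcid is immutable, remembering vetted OIDs in
 * *checked_funcs so func_volatile() (a syscache lookup) is not repeated
 * for every element of the expression list being examined.
 */
static bool
func_is_immutable_cached(Oid funcid, List **checked_funcs)
{
	if (list_member_oid(*checked_funcs, funcid))
		return true;			/* vetted earlier while walking this list */

	if (func_volatile(funcid) != PROVOLATILE_IMMUTABLE)
		return false;

	*checked_funcs = lappend_oid(*checked_funcs, funcid);
	return true;
}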
Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
On Tue, Feb 11, 2025 at 11:23 PM Michael Paquier wrote:

> +be_tls_get_peer_subject_name(MyProcPort, lsslstatus.ssl_client_dn, NAMEDATALEN);
> +be_tls_get_peer_serial(MyProcPort, lsslstatus.ssl_client_serial, NAMEDATALEN);
> +be_tls_get_peer_issuer_name(MyProcPort, lsslstatus.ssl_issuer_dn, NAMEDATALEN);

> But are these three actually OK to have showing up in the catalogs > this early? This data comes from a peer certificate, that we may know > nothing about, become set at a very early stage with > secure_initialize().

I guess I'm going to zero in on your definition of "may know nothing about". If that is true, something is very wrong IMO. My understanding of the backend code was that port->peer is only set after OpenSSL has verified that the certificate has been issued by an explicitly trusted CA. (Our verify_cb() doesn't override the internal checks to allow failed certificates through.) That may not be enough to authorize entry into the server, but it also shouldn't be untrusted data.

If a CA is issuing Subject data that is somehow dangerous to the operation of the server, I think that's a security problem in and of itself: there are clientcert HBA modes that don't validate the Subject, but they're still going to push that data into the catalogs, aren't they?

So if we're concerned that Subject data is dangerous at this point in the code, I agree that my patch makes it even more dangerous and I'll modify it -- but then I'm going to split off another thread to try to fix that underlying issue. A user should not have to be authorized to access the server in order for signed authentication data from the transport layer to be considered trustworthy. Being able to monitor those separately is important for auditability.

> As a whole we still have a gap between what could be OK, what could > not be OK, and the fact that pgstat_bestart_security() is called twice > makes that confusing.

My end goal is that all of this _should_ always be OK, so calling it once or twice or thirty times should be safe. (But again, if that's not actually true today, I'll remove it for now.)

> > v8-0003 shows this approach. For the record, I think it's materially > worse than v7-0003. IMO it increases the cognitive load for very > little benefit and makes it more work for a newcomer to refactor the > cleanup code for those routines. I think it's enough that you can see > a separate LOG message for each failure case, if you want to know why > we're unbinding. > > That's more verbose, as well. As Robert said, that makes the output > easier to debug with a 1:1 mapping between the event and a code path.

I agree with Robert's goal in general. I just think that following that rule to *this* extent is counterproductive. But I won't die on that hill; in the end, I just want to be able to see when LDAP calls hang.

Thanks!
--Jacob
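For context on the verify_cb() behavior referenced above: a callback that simply returns OpenSSL's own verdict cannot let a failed chain through, so a peer certificate visible after the handshake has already validated against the configured CA. A minimal sketch of that pattern, for illustration only and not the actual be-secure-openssl.c code:

#include <openssl/ssl.h>

/*
 * Non-overriding verify callback: returning "ok" unchanged means a
 * chain-verification failure still aborts the handshake; we merely
 * observe the result, never accept a certificate OpenSSL rejected.
 */
static int
verify_cb(int ok, X509_STORE_CTX *ctx)
{
	return ok;
}

static void
configure_verification(SSL_CTX *ctx)
{
	/* Request a peer certificate and run the callback at each chain step. */
	SSL_CTX_set_verify(ctx, SSL_VERIFY_PEER, verify_cb);
}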