RE: Parallel heap vacuum

2024-12-10 Thread Hayato Kuroda (Fujitsu)
Dear Tomas,

> 1) I really like the idea of introducing "compute_workers" callback to
> the heap AM interface. I faced a similar issue with calculating workers
> for index builds, because right now plan_create_index_workers is doing
> that, and the logic works for btree but not really for brin etc. It didn't occur
> to me we might make this part of the index AM ...

+1, so let's keep the proposed style. Or could we even propose adding the idea
to the table/index access method API?
I've thought about it a bit, and the point seemed to be which arguments should
be required.

> 4) I think it would be good to have some sort of README explaining how
> the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
> while to realize how the workers coordinate which blocks to scan.

I love the idea, it is quite helpful for reviewers like me.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Wrong results with right-semi-joins

2024-12-10 Thread Richard Guo
On Mon, Dec 9, 2024 at 11:01 PM Melanie Plageman
 wrote:
> Thanks for finding and fixing this. Just for my own benefit, could you
> explain more about the minimal repro? Specifically, if you just need a
> subplan in the hash side of a right semi-join, why do you also need
> the outer part of the query that produces the initplan?
>
>  Seq Scan on tbl_rs t1
>    Filter: ((SubPlan 3) >= 0)
>    SubPlan 3
>      ->  Limit
>            InitPlan 2
>              ->  Hash Right Semi Join

Upon further consideration, I believe the initplan is unnecessary.
What we really want from the plan is to reuse the hash table during
hash-right-semi-join rescans.  To achieve this, we just need to ensure
that it's a single-batch join and that there are no parameter changes
on the inner side.

I spent some time on this and came up with a simpler query to
reproduce the issue.

explain (costs off)
select * from tbl_rs t1 join
  lateral (select * from tbl_rs t2 where t2.a in
(select t1.a+t3.a from tbl_rs t3) and t2.a < 5)
  on true;
                  QUERY PLAN
----------------------------------------------
 Nested Loop
   ->  Seq Scan on tbl_rs t1
   ->  Hash Right Semi Join
         Hash Cond: ((t1.a + t3.a) = t2.a)
         ->  Seq Scan on tbl_rs t3
         ->  Hash
               ->  Seq Scan on tbl_rs t2
                     Filter: (a < 5)
(8 rows)

Without the fix, this query returns 3 rows rather than the expected 6.

Maybe I should update the test case introduced in 5668a857d to this
one.

Thanks
Richard




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-12-10 Thread Peter Smith
Hi Nisha.

Here are some review comments for patch v54-0002.

(I had also checked patch v54-0001, but have no further review
comments for that one).

==
doc/src/sgml/config.sgml

1.
+   
+Slot invalidation due to idle timeout occurs during checkpoint.
+If the checkpoint_timeout exceeds
+idle_replication_slot_timeout, the slot
+invalidation will be delayed until the next checkpoint is triggered.
+To avoid delays, users can force a checkpoint to promptly invalidate
+inactive slots. The duration of slot inactivity is calculated
using the slot's
+pg_replication_slots.inactive_since
+value.
+   
+

The wording of "If the checkpoint_timeout exceeds
idle_replication_slot_timeout, the slot invalidation will be delayed
until the next checkpoint is triggered." seems slightly misleading,
because AFAIK it is not conditional on the GUC value differences like
that -- i.e. slot invalidation is *always* delayed until the next
checkpoint occurs.

SUGGESTION:
Slot invalidation due to idle timeout occurs during checkpoint.
Because checkpoints happen at checkpoint_timeout intervals, there can
be some lag between when the idle_replication_slot_timeout was
exceeded and when the slot invalidation is triggered at the next
checkpoint. To avoid such lags, users can force...

===
src/backend/replication/slot.c

2. GENERAL

+/* Invalidate replication slots idle beyond this time; '0' disables it */
+int idle_replication_slot_timeout_ms = 0;

I noticed this patch is using a variety of ways of describing the same thing:
* guc var: Invalidate replication slots idle beyond this time...
* guc_tables: ... the amount of time a replication slot can remain
idle before it will be invalidated.
* docs: means that the slot has remained idle beyond the duration
specified by the idle_replication_slot_timeout parameter
* errmsg: ... slot has been invalidated because inactivity exceeded
the time limit set by ...
* etc..

They are all the same, but they are all worded slightly differently:
* "idle" vs "inactivity" vs ...
* "time" vs "amount of time" vs "duration" vs "time limit" vs ...

There may not be a one-size-fits-all, but still, it might be better to
try to search for all different phrasing and use common wording as
much as possible.

~~~

CheckPointReplicationSlots:

3.
+ * XXX: Slot invalidation due to 'idle_timeout' occurs only for
+ * released slots, based on 'idle_replication_slot_timeout'. Active
+ * slots in use for replication are excluded, preventing accidental
+ * invalidation. Slots where communication between the publisher and
+ * subscriber is down are also excluded, as they are managed by the
+ * 'wal_sender_timeout'.

Maybe a slight rewording like below is better. Maybe not. YMMV.

SUGGESTION:
XXX: Slot invalidation due to 'idle_timeout' applies only to released
slots, and is based on the 'idle_replication_slot_timeout' GUC. Active
slots
currently in use for replication are excluded to prevent accidental
invalidation.  Slots...

==
src/bin/pg_upgrade/server.c

4.
+ /*
+ * Use idle_replication_slot_timeout=0 to prevent slot invalidation due to
+ * inactive_timeout by checkpointer process during upgrade.
+ */
+ if (GET_MAJOR_VERSION(cluster->major_version) >= 1800)
+ appendPQExpBufferStr(&pgoptions, " -c idle_replication_slot_timeout=0");
+

/inactive_timeout/idle_timeout/

==
src/test/recovery/t/043_invalidate_inactive_slots.pl

5.
+# Wait for slot to first become idle and then get invalidated
+sub wait_for_slot_invalidation
+{
+ my ($node, $slot, $offset, $idle_timeout) = @_;
+ my $node_name = $node->name;

AFAICT this 'idle_timeout' parameter is passed units of "seconds", so
it would be better to call it something like 'idle_timeout_s' to make
the units clear.

~~~

6.
+# Trigger slot invalidation and confirm it in the server log
+sub trigger_slot_invalidation
+{
+ my ($node, $slot, $offset, $idle_timeout) = @_;
+ my $node_name = $node->name;
+ my $invalidated = 0;

Ditto above review comment #5 -- better to call it something like
'idle_timeout_s' to make the units clear.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Skip collecting decoded changes of already-aborted transactions

2024-12-10 Thread Dilip Kumar
On Wed, Dec 11, 2024 at 3:18 AM Masahiko Sawada  wrote:
>
> On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar  wrote:

> > >
> > > If the largest transaction is non-streamable, won't the transaction
> > > returned by ReorderBufferLargestTXN() in the other case already
> > > suffice the need?
> >
> > I see your point, but I don’t think it’s quite the same. When
> > ReorderBufferCanStartStreaming() is true, the function
> > ReorderBufferLargestStreamableTopTXN() looks for the largest
> > transaction among those that have a base_snapshot. So, if the largest
> > transaction is aborted but hasn’t yet received a base_snapshot, it
> > will instead select the largest transaction that does have a
> > base_snapshot, which could be significantly smaller than the largest
> > aborted transaction.
>
> IIUC the transaction entries in reorderbuffer have the base snapshot
> before decoding the first change (see SnapBuildProcessChange()). In
> which case the transaction doesn't have the base snapshot and has the
> largest amount of changes? Subtransaction entries could transfer its
> base snapshot to its parent transaction entry but such subtransactions
> will be picked by ReorderBufferLargestTXN().
>
IIRC, there could be cases where a transaction's reorder buffer can
grow in size without having a base snapshot; I think transactions
doing DDL and generating a lot of invalidation messages could fall into
such a category.  And that was one of the reasons why we were using
txns_by_base_snapshot_lsn inside
ReorderBufferLargestStreamableTopTXN().

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: long-standing data loss bug in initial sync of logical replication

2024-12-10 Thread Shlok Kyal
On Tue, 8 Oct 2024 at 11:11, Shlok Kyal  wrote:
>
> Hi Kuroda-san,
>
> > > I have also modified the tests in 0001 patch. These changes are only
> > > related to syntax of writing tests.
> >
> > LGTM. I found small improvements, please find the attached.
>
> I have applied the changes and updated the patch.
>
The patches needed a rebase. Attached is the rebased patch.

Thanks and Regards,
Shlok Kyal


v15-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch
Description: Binary data


v15-0002-Selective-Invalidation-of-Cache.patch
Description: Binary data


Re: [PoC] Reducing planning time when tables have many partitions

2024-12-10 Thread Yuya Watari
Hello Ashutosh and Alvaro,

I appreciate you replying to the email.

On Tue, Dec 3, 2024 at 7:38 PM Alvaro Herrera  wrote:
>
> I don't think planning time in assert-enabled builds is something we
> should worry about, at all.  Planning time in production builds is the
> important one.

Thank you for your reply. Making debug builds too slow is not good for
developers, so I'd like to see how these patches behave with
assert-enabled builds.

On Tue, Dec 3, 2024 at 1:12 PM Ashutosh Bapat
 wrote:
>
> Hi Yuya,
> For one of the earlier versions, I had reported a large memory
> consumption in all cases and increase in planning time for Assert
> enabled builds. How does the latest version perform in those aspects?

1. Experimental results

I ran experiments to measure memory consumption during planning. These
are done with the release build. In the experiments, I used the
rebased version of your patch [1], which is attached to this email.

Table 1: Memory consumption when planning query A (without
partition-wise join (PWJ), MiB)
---------------------------------
    n | Master |    v23 |    v28
---------------------------------
    1 |  0.132 |  0.137 |  0.137
    2 |  0.158 |  0.166 |  0.166
    4 |  0.220 |  0.234 |  0.235
    8 |  0.347 |  0.375 |  0.375
   16 |  0.596 |  0.652 |  0.653
   32 |  1.104 |  1.221 |  1.223
   64 |  2.126 |  2.392 |  2.396
  128 |  4.245 |  4.917 |  4.925
  256 |  8.742 | 10.651 | 10.663
  384 | 13.603 | 17.159 | 17.176
  512 | 18.758 | 24.827 | 24.850
  640 | 23.924 | 32.223 | 32.253
  768 | 30.050 | 41.843 | 41.879
  896 | 36.224 | 51.937 | 51.978
 1024 | 42.923 | 64.058 | 64.105
---------------------------------

Table 2: Memory consumption when planning query A (with PWJ, MiB)
------------------------------------
    n |  Master |     v23 |     v28
------------------------------------
    1 |   0.190 |   0.194 |   0.195
    2 |   0.276 |   0.284 |   0.284
    4 |   0.461 |   0.475 |   0.475
    8 |   0.844 |   0.871 |   0.871
   16 |   1.584 |   1.640 |   1.641
   32 |   3.085 |   3.202 |   3.204
   64 |   6.099 |   6.365 |   6.369
  128 |  12.261 |  12.934 |  12.941
  256 |  25.061 |  26.970 |  26.982
  384 |  38.542 |  42.098 |  42.116
  512 |  52.541 |  58.610 |  58.633
  640 |  66.579 |  74.878 |  74.908
  768 |  82.421 |  94.214 |  94.250
  896 |  98.483 | 114.196 | 114.237
 1024 | 115.074 | 136.208 | 136.255
------------------------------------

Table 3: Memory consumption when planning query B (without PWJ, MiB)
----------------------------------
   n | Master |     v23 |     v28
----------------------------------
   1 | 16.019 |  16.404 |  16.404
   2 | 15.288 |  15.708 |  15.708
   4 | 15.674 |  16.360 |  16.360
   8 | 16.554 |  17.784 |  17.786
  16 | 18.221 |  19.954 |  19.958
  32 | 21.630 |  25.609 |  25.617
  64 | 28.913 |  39.419 |  39.427
 128 | 45.331 |  77.015 |  77.030
 256 | 86.127 | 192.884 | 192.916
----------------------------------

Table 4: Memory consumption when planning query B (with PWJ, MiB)
--------------------------------------
   n |   Master |      v23 |      v28
--------------------------------------
   1 |   33.623 |   34.008 |   34.008
   2 |   50.285 |   50.705 |   50.705
   4 |   85.562 |   86.247 |   86.247
   8 |  156.465 |  157.695 |  157.697
  16 |  298.692 |  300.424 |  300.428
  32 |  585.713 |  589.692 |  589.699
  64 | 1169.396 | 1179.901 | 1179.909
 128 | 2375.592 | 2407.275 | 2407.291
 256 | 4942.295 | 5049.053 | 5049.084
--------------------------------------

Next, I measured the planning times using the debug build with
assertions. In this experiment, I set CFLAGS to "-O0" and also used
the attached patch that removes assertions in Bitmapset-based indexes.

Table 5: Planning time of query A (debug build, ms)
------------------------------------------
    n |  Master |     v28 | v28 w/ patch
------------------------------------------
    1 |   0.648 |   0.664 |        0.665
    2 |   0.788 |   0.810 |        0.800
    4 |   0.891 |   0.936 |        0.931
    8 |   1.202 |   1.301 |        1.268
   16 |   1.973 |   2.145 |        2.042
   32 |   3.668 |   4.000 |        3.638
   64 |   8.093 |   8.597 |        7.167
  128 |  20.015 |  19.641 |       14.274
  256 |  57.634 |  51.008 |       29.930
  384 | 114.280 |  94.760 |       46.449
  512 | 196.492 | 154.230 |       63.758
  640 | 315.037 | 240.142 |       82.476
  768 | 466.149 | 338.043 |      101.318
  896 | 679.029 | 511.097 |      134.854
 1024 | 897.806 | 592.823 |      141.852
------------------------------------------

Table 6: Planning time of query B (debug build, ms)
------------------------------------------
   n |   Master |      v28 | v28 w/ patch
------------------------------------------
   1 |   43.788 |   46.364 |       45.418
   2 |   42.637 |   45.750 |       44.093
   4 |   43.842 |   48.109 |       45.000
   8 |   47.504 |   54.410 |       48.199
  16 |   55.682 |   67.242 |       53.895
  32 |   77.736 |   98.507 |       66.877
  64 |  144.772 |  185.697 | 

Re: Conflict detection for update_deleted in logical replication

2024-12-10 Thread Nisha Moond
Here are the test steps and analysis for epoch-related handling
(Tested with v15 Patch-Set).

In the 'update_deleted' detection design, the launcher process
compares XIDs to track the minimum XID, and the apply workers maintain
the oldest running XIDs. The launcher also requests the publisher status
at regular intervals, which includes the epoch info. So proper
epoch handling is required for smooth functioning during XID
wraparound.

Functions requiring epoch handling:
1)  'get_candidate_xid()': Tracks the node's oldest running XID and
identifies the next candidate XID to advance the oldest non-removable
XID of an apply worker.
2) 'wait_for_publisher_status()': Tracks the publisher’s oldest and
next XIDs for monitoring concurrent remote transactions.
-- To test the epoch handling, I added extra LOG statements in above
functions and the launcher code. The patches for these changes are
attached (applies atop v15-0005).

The tests confirmed that epoch handling works correctly during XID
wraparound on both the publisher and subscriber sides. Detailed test
steps and results are provided below.


Setup:
- Created two nodes, 'Pub' and 'Sub', with logical replication.
- On both nodes, configured 'autovacuum_naptime = 1s' to allow
frequent vacuuming while consuming XIDs rapidly.
- On Sub, created a subscription for a table subscribed to all changes
from Pub.
- Installed and enabled the 'xid_wraparound' extension on both nodes:
  CREATE EXTENSION xid_wraparound;


-
Case-1: When XID Wraparound Happens on Sub
-
Scenario:
In 'get_candidate_xid()', 'oldest_running_xid' and 'next_full_xid'
have different epochs, meaning an old-epoch transaction is running
and an XID wraparound happens on the subscriber.

Test Steps:
Perform below steps at Sub node:
1. Consume 4 Billion XIDs in Batches (400M each).
  -- script "consume_4B_xids.sh" is attached which is used to consume the xids.
2. Set 'detect_update_deleted=ON' for the subscription.
3. Hold a transaction with an old XID before Wraparound:
  -- Start a new session, begin a transaction, and leave it open. This
transaction will have an XID close to 4 billion.

4. In another session, trigger wraparound by consuming remaining XIDs
(2^32 - 4B):
 SELECT consume_xids('294966530');

-- At Sub, the newly added log will show that wraparound happened and
epoch was handled correctly by choosing the correct
candidate_full_xid.
LOG: XXX: oldest_running_xid = 400762
LOG: XXX: next_full_xid = 766
LOG: XXX: xid WRAPAROUND happened!!!
LOG: XXX: candidate full_xid = 400762


5. Confirm launcher updates "pg_conflict_detection" slot's xmin with new epoch:
  - End the open transaction.
  - Verify that the oldest running xid is now updated to the new epoch xid

LOG: XXX: oldest_running_xid = 766
LOG: XXX: next_full_xid = 766
LOG: XXX: candidate full_xid = 766

  - Confirm the launcher updates the new epoch xid as xmin:

LOG: XXX: launcher new_xid = 766
LOG: XXX: launcher current slot xmin = 400762
LOG: XXX: launcher full_xmin = 400762
LOG: XXX: launcher updated xmin = 766

 postgres=# SELECT slot_name, slot_type, active, xmin FROM
pg_replication_slots;
slot_name   | slot_type | active | xmin
 ---+---++--
  pg_conflict_detection | physical  | t  |  766
 (1 row)
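
For context, the epoch fix-up these logs exercise boils down to something
like the sketch below. This is a simplified illustration, not the patch's
exact code; it only assumes the existing FullTransactionId helpers from
access/transam.h.

#include "postgres.h"
#include "access/transam.h"

/*
 * Simplified sketch of a wraparound-aware candidate XID computation.
 * If the oldest running XID is numerically larger than the 32-bit part of
 * the next full XID, it must belong to the previous epoch, so the epoch is
 * decremented before building the candidate FullTransactionId.
 */
static FullTransactionId
candidate_full_xid(TransactionId oldest_running_xid)
{
	FullTransactionId next_full_xid = ReadNextFullTransactionId();
	uint32		epoch = EpochFromFullTransactionId(next_full_xid);

	/* A wraparound happened since the oldest transaction started. */
	if (oldest_running_xid > XidFromFullTransactionId(next_full_xid))
		epoch--;

	return FullTransactionIdFromEpochAndXid(epoch, oldest_running_xid);
}

With the Case-1 values above (oldest_running_xid = 400762 and next XID = 766
after the wraparound), the epoch is decremented and the candidate stays in
the old epoch, matching the logged candidate full_xid = 400762.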


-
Case-2: When XID Wraparound Happens on Pub
-
Scenario:
In 'wait_for_publisher_status()', 'data->last_phase_at' (oldest
committing remote XID) and 'remote_next_xid' have different epochs,
meaning an old-epoch transaction is in the commit phase on the remote (Pub)
and a wraparound happens on the publisher node.

Test Steps:
1. Consume 4 Billion XIDs in Batches (400M each) on the Publisher node.
  -- script "consume_4B_xids.sh" is attached which is used to consume the xids.
2. At sub, set 'detect_update_deleted=ON' for the subscription.
3. Confirm the latest remote XIDs are updated on Sub:

LOG: XXX: last_phase_at = 400796
LOG: XXX: remote_oldestxid = 400796
LOG: XXX: remote_nextxid = 400796
LOG: XXX: remote_full_xid = 400796
LOG: XXX: remote concurrent txn completed

4. Hold a transaction in the commit phase:
  - Attach a debugger to a session, start a transaction, and hold it
at 'XactLogCommitRecord()'.
  - This step is required because the launcher at sub tracks remote
concurrent transactions which are currently committing.

5. In another session, trigger wraparound by consuming remaining XIDs
(2^32 - 4B):
 SELECT consume_xids('294966530');

  -- At sub, the logs confirm that wraparound happened on Pub node:

LOG: XXX: last_phase_at = 400797
LOG: XXX: remote_oldestxid = 400796
LOG

Memory leak in pg_logical_slot_{get,peek}_changes

2024-12-10 Thread vignesh C
Hi,

I'm starting a new thread for one of the issues reported by Sawada-san at [1].

This is a memory leak on CacheMemoryContext when using pgoutput via SQL APIs:
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);

entry->attrmap = build_attrmap_by_name_if_req(indesc, outdesc, false);

MemoryContextSwitchTo(oldctx);
RelationClose(ancestor);

entry->attrmap is pfree'd only when the RelationSyncEntry is validated,
so it remains allocated even after the logical decoding API calls.

This issue can be reproduced with the following test:
-- Create table
create table t1( c1 int, c2 int, c3 int) PARTITION BY LIST (c1);
create table t1_1( c2 int, c1 int, c3 int);
ALTER TABLE t1 ATTACH PARTITION t1_1 FOR VALUES IN (0, 1, 2, 3);

-- Create publication
create publication pub1 FOR TABLE t1, t1_1 WITH
(publish_via_partition_root = true);

-- Create replication slot
SELECT 'init' FROM pg_create_logical_replication_slot('test', 'pgoutput');

-- Run the below steps between Test start and Test end many times to
see the memory usage getting increased
-- Test start
insert into t1 values( 1);
SELECT count(*) FROM pg_logical_slot_get_binary_changes('test', NULL,
NULL, 'proto_version', '4', 'publication_names', 'pub1');

-- Memory increases after each invalidation and insert
SELECT * FROM pg_backend_memory_contexts where name = 'CacheMemoryContext';
-- Test end

The attached patch resolves a memory leak by ensuring that the
attribute map is properly freed during plugin shutdown. This process
is triggered by the SQL API when the decoding context is being
released. Additionally, I am conducting a review to identify and
address any similar memory leaks that may exist elsewhere in the code.

[1] - 
https://www.postgresql.org/message-id/CAD21AoDkAhQVSukOfH3_reuF-j4EU0-HxMqU3dU%2BbSTxsqT14Q%40mail.gmail.com

Regards,
Vignesh
From 36e5c10105d934da0d51474b0ad7c5bd9087e2aa Mon Sep 17 00:00:00 2001
From: Vignesh 
Date: Wed, 11 Dec 2024 08:57:06 +0530
Subject: [PATCH v1] Fix memory leak in pgoutput relation attribute map

The pgoutput module caches the relation attribute map and frees it upon
invalidation.  However, it was not getting freed in the case of SQL functions like
pg_logical_slot_{get,peek}_changes(), which would bloat the cache memory usage.
Every pg_logical_slot_{get,peek}_changes() call for changes on a partitioned table
creates more bloat. To address this, the relation attribute map is freed when
the output plugin is shut down.
---
 src/backend/replication/pgoutput/pgoutput.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index b50b3d62e3..f1ef13d313 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1747,6 +1747,16 @@ pgoutput_shutdown(LogicalDecodingContext *ctx)
 {
 	if (RelationSyncCache)
 	{
+		RelationSyncEntry *entry;
+		HASH_SEQ_STATUS status;
+
+		hash_seq_init(&status, RelationSyncCache);
+		while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
+		{
+			if (entry->attrmap)
+				free_attrmap(entry->attrmap);
+		}
+
 		hash_destroy(RelationSyncCache);
 		RelationSyncCache = NULL;
 	}
-- 
2.43.0



Re: attndims, typndims still not enforced, but make the value within a sane threshold

2024-12-10 Thread Michael Paquier
On Thu, Dec 05, 2024 at 06:34:31PM -0500, Tom Lane wrote:
> You have a mighty optimistic view of what will happen.  I predict
> that if we do step (1), exactly nothing will happen in applications,
> and step (2) will remain just as painful for them.  (Assuming that
> we remember to do step (2), which is no sure thing given our track
> record for following through "in a few years".)

FWIW, my first thought after reading this paragraph is that you sound
too dramatic here, especially after looking at codesearch to note that
the PHP core code stores attndims but does not actually use it.

Now, I've looked at GitHub and noted that there's a large number of
repositories (in the hundreds) that rely on a zero or non-zero value
of *attndims* to check if they're dealing with an array, saving
what look like extra catalog lookups.

The same search done for typndims offers the opposite conclusion:
there's no real code in the open that uses this value for similar
checks, most likely because domains are less used in the wild.  So I
get the drama for the removal of attndims and I'd agree to keep it
around, but I also see a very good point in removing typndims.
--
Michael


signature.asc
Description: PGP signature


Re: Changing the state of data checksums in a running cluster

2024-12-10 Thread Michael Paquier
On Tue, Nov 26, 2024 at 11:07:12PM +0100, Tomas Vondra wrote:
> I spent a bit more time doing some testing on the last version of the
> patch from [1]. And I ran into this assert in PostmasterStateMachine()
> when stopping the cluster:
> 
>   /* All types should be included in targetMask or remainMask */
>   Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
> 
> At first I was puzzled as this happens on every shutdown, but that's
> because these checks were introduced by a78af0427015 a week ago. So it's
> more a matter of rebasing.

Looks like the CI is not really happy about this point..  (Please make
sure to refresh the patch status after a review.)
--
Michael


signature.asc
Description: PGP signature


Re: Subscription sometimes loses txns after initial table sync

2024-12-10 Thread Shlok Kyal
On Tue, 10 Dec 2024 at 07:24, Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, December 9, 2024 9:21 PM Pritam Baral  
> wrote:
> > To: pgsql-hackers 
> > Subject: Subscription sometimes loses txns after initial table sync
> >
> > This was discovered when testing the plan for a major version upgrade via
> > logical replication. Said plan requires that some tables be synced before
> > others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ...
> > followed
> > by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness
> > revealed
> > that sometimes, for some tables added this way, txns after the initial data 
> > copy
> > are lost by the subscription.
> >
> > A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> > even 12.22 (on either side of the replication setup). The script runs at a
> > default scale of 100 tables with 10k inserts each. This scale is enough to
> > demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
> >
> > In attempts to analyse why this happens, it has been observed that the 
> > sender
> > sometimes does not pick up a published table, even when the receiver that
> > started the sender process has seen the table as available (as returned by
> > pg_get_publication_tables()) and has thus begun COPYing its data. When the
> > COPY
> > finishes (and the tablesync worker is finished), the apply loop on the 
> > receiver
> > expects to receive (and apply) subsequent changes for such tables, but 
> > simply
> > isn't sent any. This was observed by dumping every CopyData message sent
> > over
> > the wire.
> >
> > The attached script (like the original migration plan) uses a single 
> > publication
> > and adds tables to it successively. Curiously, when the script was changed 
> > to
> > use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ...
> > ADD
> > PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION),
> > the no. of
> > tables with data loss jumped to 100%.
>
> Thanks for reporting the issue.
>
> The described behavior looks similar to another bug discussed in [1]. If
> possible, could you please check if the latest patch in that thread can fix 
> the
> bug you reported ?
>
> If it does, it would be helpful to share the feedback in that thread.
>
> [1] 
> https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
>

Hi,

I tried to reproduce the issue on the HEAD and REL_17_STABLE branches, and
found that the issue is intermittent for me. I ran the script
provided in [1] 50 times on both branches and was able to reproduce
the issue 4 times and 5 times respectively.
Then I tested both branches after applying the patches in [2] and ran
the script 50 times. I was not able to reproduce the issue with the patches applied.

I think as Hou-san suggested, the patches in [2] can fix this issue.

[1]: 
https://www.postgresql.org/message-id/8b595156-d8b6-4b53-a788-7d945726cd2f%40pritambaral.com
[2]: 
https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Thanks and Regards,
Shlok Kyal




Re: Conflict detection for update_deleted in logical replication

2024-12-10 Thread Amit Kapila
On Fri, Dec 6, 2024 at 1:28 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, December 5, 2024 6:00 PM Amit Kapila  
> wrote:
> >
> >
> > A few more comments:
> > 1.
> > +static void
> > +wait_for_local_flush(RetainConflictInfoData *data)
> > {
> > ...
> > +
> > + data->phase = RCI_GET_CANDIDATE_XID;
> > +
> > + maybe_advance_nonremovable_xid(data);
> > +}
> >
> > Isn't it better to reset all the fields of data before the next round of
> > GET_CANDIDATE_XID phase? If we do that then we don't need to reset
> > data->remote_lsn = InvalidXLogRecPtr; and data->last_phase_at =
> > InvalidFullTransactionId; individually in request_publisher_status() and
> > get_candidate_xid() respectively. Also, it looks clean and logical to me 
> > unless I
> > am missing something.
>
> The remote_lsn was used to determine whether a status is received, so was 
> reset
> each time in request_publisher_status. To make it more straightforward, I 
> added
> a new function parameter 'status_received', which would be set to true when
> calling maybe_advance_nonremovable_xid() on receving the status. After this
> change, there is no need to reset the remote_lsn.
>

As part of the above comment, I had asked for three things (a) avoid
setting data->remote_lsn = InvalidXLogRecPtr; in
request_publisher_status(); (b) avoid setting data->last_phase_at
=InvalidFullTransactionId; in get_candidate_xid(); (c) reset data in
wait_for_local_flush() after wait is over. You only did (a) in the
patch and didn't mention anything about (b) or (c). Is that
intentional? If so, what is the reason?

*
+static bool
+can_advance_nonremovable_xid(RetainConflictInfoData *data)
+{
+

Isn't it better to make this an inline function as it contains just one check?

*
+ /*
+ * The non-removable transaction ID for a subscription is centrally
+ * managed by the main apply worker.
+ */
+ if (!am_leader_apply_worker())

I have tried to improve this comment in the attached.

-- 
With Regards,
Amit Kapila.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a75a3691dc..b1b77e4a1e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4078,8 +4078,9 @@ maybe_advance_nonremovable_xid(RetainConflictInfoData *data,
 								   bool status_received)
 {
 	/*
-	 * The non-removable transaction ID for a subscription is centrally
-	 * managed by the main apply worker.
+	 * It is sufficient to manage non-removable transaction ID for a
+	 * subscription by the main apply worker to detect update_deleted conflict
+	 * even for table sync or parallel apply workers.
 	 */
 	if (!am_leader_apply_worker())
 		return;


Re: warn if GUC set to an invalid shared library

2024-12-10 Thread Michael Paquier
On Mon, Nov 11, 2024 at 12:09:15PM +0200, Heikki Linnakangas wrote:
> I had a quick glance at this, and I agree with Robert's comment earlier that
> this requires a lot of context to understand.

All this feedback was 4 weeks ago, and is unanswered, so I've marked
the CF entry as returned with feedback.  Feel free to resubmit if
necessary.
--
Michael


signature.asc
Description: PGP signature


Re: attndims, typndims still not enforced, but make the value within a sane threshold

2024-12-10 Thread Tom Lane
Michael Paquier  writes:
> FWIW, my first thought after reading this paragraph is that you sound
> too dramatic here, especially after looking at codesearch to note that
> the PHP core code stores attndims but does not actually use it.

It appeared to me that it fetches it in order to return it in an
application inquiry function [1].  So to really gauge the damage,
you'd have to find out how many PHP applications pay attention
to the "array dims" field of pg_meta_data().  Maybe it's a trivial
number, or maybe not --- I didn't have the energy to search.

regards, tom lane

[1] 
https://sources.debian.org/src/php8.2/8.2.26-4/ext/pgsql/pgsql.c/?hl=4245#L4245




Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-10 Thread vignesh C
On Tue, 10 Dec 2024 at 23:36, Masahiko Sawada  wrote:
>
> On Mon, Dec 9, 2024 at 6:52 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 10, 2024 at 2:17 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Dec 9, 2024 at 2:06 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Dec 5, 2024 at 2:56 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > >
> > > > > > I realized that this patch cannot be backpatched because it 
> > > > > > introduces a new
> > > > > > field into the public PGOutputData structure. Therefore, I think we 
> > > > > > may need to
> > > > > > use Alvaro's version [1] for the back branches.
> > > > >
> > > > > FWIW for back branches, I prefer using the foreach-pfree pattern
> > > > > Michael first proposed, just in case. It's not elegant but it can
> > > > > solve the problem while there is no risk of breaking non-core
> > > > > extensions.
> > > > >
> > > >
> > > > It couldn't solve the problem completely even in back-branches. The
> > > > SQL API case I mentioned and tested by Hou-San in the email [1] won't
> > > > be solved.
> > >
> > > True. There seems another place where we possibly leak memory on
> > > CacheMemoryContext when using pgoutput via SQL APIs:
> > >
> > > /* Map must live as long as the session does. */
> > > oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > >
> > > entry->attrmap = build_attrmap_by_name_if_req(indesc, outdesc, 
> > > false);
> > >
> > > MemoryContextSwitchTo(oldctx);
> > > RelationClose(ancestor);
> > >
> > > entry->attrmap is pfree'd only when validating the RelationSyncEntry
> > > so remains even after logical decoding API calls.
> > >
> >
> > We have also noticed this but it needs more analysis on the fix which
> > one of my colleagues is doing. I think we can fix this as a separate
> > issue unless you think otherwise.
>
> I agree to fix this as a separate patch.

Thanks Sawada-san, I have started a new thread with a test case which
can reproduce this issue at [1]:
[1] - 
https://www.postgresql.org/message-id/flat/CALDaNm1hewNAsZ_e6FF52a%3D9drmkRJxtEPrzCB6-9mkJyeBBqA%40mail.gmail.com

Regards,
Vignesh




VACUUM cleaning of CF 2024-11

2024-12-10 Thread Michael Paquier
Hi all,

I've done a pass over the CF app, and did some routine vacuuming of
the whole.  Here are the final scores:
Committed: 78
Moved to next CF: 236.
Withdrawn: 11
Rejected: 2
Returned with Feedback: 9
Total: 336

As usual, a lot of entries had an outdated status, more than the last
time I looked at all that (I didn't keep track of actual numbers, just
an impression).  If you feel that your patch has not been classified
correctly, feel free to complain on this thread, to contact me or just
to change the status of your patch(es) to reflect what its correct
status should be.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: Fix bank selection logic in SLRU

2024-12-10 Thread Dilip Kumar
On Tue, 10 Dec 2024 at 7:24 PM, Andrey M. Borodin 
wrote:

>
>
> > On 10 Dec 2024, at 16:26, Dilip Kumar  wrote:
> >
> > IIUC, we do check that it should be in multiple of bank size (i.e.)
> > which is multiple of 2, right?  Am I missing something?
>
> Bank selection code assumes that number of buffers is power of 2.
> If the number of buffers is not power of 2 - only subset of buffers will
> be used. In worst case, e.g. 65 buffers, everything will be buffered only
> in bank 64.


But why would that be the case? The acceptable values for the GUCs that
configure the SLRU buffers are multiples of 16 (the bank size); we have a
check enforcing that on the GUC values.

—
Dilip



Re: XMLDocument (SQL/XML X030)

2024-12-10 Thread Andrew Dunstan



On 2024-12-10 Tu 2:48 AM, Jim Jones wrote:

On 04.12.24 17:18, Jim Jones wrote:

I'd like to propose the implementation of XMLDocument (SQL/XML X030).
It basically returns an XML document from a given XML expression, e.g.

SELECT
   xmldocument(
     xmlelement(NAME foo,
   xmlattributes(42 AS att),
   xmlelement(NAME bar,
     xmlconcat('va', 'lue'))
     )
   );

              xmldocument
---------------------------------------
 <foo att="42"><bar>value</bar></foo>
(1 row)

v1 attached attempts to implement XMLDocument() as described above.

Feedback welcome.



LGTM at a first glance.


Please add this to the next CommitFest if you haven't done already.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fix bank selection logic in SLRU

2024-12-10 Thread Robert Haas
On Tue, Dec 10, 2024 at 8:58 AM Dilip Kumar  wrote:
>> Bank selection code assumes that number of buffers is power of 2.
>> If the number of buffers is not power of 2 - only subset of buffers will be 
>> used. In worst case, e.g. 65 buffers, everything will be buffered only in 
>> bank 64.
>
> But why that would be the case? the acceptable values for GUC to configure 
> the slru buffers are in multiple of 16(bank size) we have that check to check 
> the GUC values.

"Must be a multiple of 16" and "must be a power of 2" are different
criteria. For example, 48 is a multiple of 16 but it is not a power of
2. If the code assumes that we have an actual power of 2, the check
you quoted in your previous email is insufficient.
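
To illustrate with a standalone sketch (hypothetical names, not the actual
slru.c code): a mask-based bank lookup silently skips banks as soon as the
bank count stops being a power of 2, even though the value passes the
multiple-of-16 check.

#include <stdio.h>

#define BANK_SIZE 16

/* Mask-based selection: only correct when nbanks is a power of 2. */
static int
bank_for_page(unsigned int pageno, int nbuffers)
{
	int			nbanks = nbuffers / BANK_SIZE;

	return pageno & (nbanks - 1);
}

int
main(void)
{
	/* 48 buffers = 3 banks: a multiple of 16, but not a power of 2. */
	for (unsigned int pageno = 0; pageno < 6; pageno++)
		printf("page %u -> bank %d\n", pageno, bank_for_page(pageno, 48));

	return 0;
}

With 48 buffers the mask is 2 (binary 10), so bank 1 can never be selected
and the remaining banks absorb all the traffic.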

-- 
Robert Haas
EDB: http://www.enterprisedb.com




timeline problem when recovery

2024-12-10 Thread Fabrice Chapuis
Hi, I got this message on a standby after a failover of a PG14 cluster with 3 nodes.

user=,db=,client=,application= LOG:  new timeline 20 forked off current
database system timeline 19 before current recovery point CC8/164E9350

This message comes from the rescanLatestTimeLine function in xlog.c:

if (currentTle->end < EndRecPtr)
{
	ereport(LOG,
			(errmsg("new timeline %u forked off current database system timeline %u before current recovery point %X/%X",
					newtarget,
					ThisTimeLineID,
					LSN_FORMAT_ARGS(EndRecPtr))));
	return false;
}

To complete recovery of the standby, a rewind was used to come back to the fork point.
Does that mean the standby was ahead of the new primary
(new TL), potentially losing data on this primary?

Regards,

Fabrice


Re: Pass ParseState as down to utility functions.

2024-12-10 Thread jian he
Added parser_errposition to some places in
transformTableConstraint and transformColumnDefinition
where v8 didn't have it.
From b49e1b74b5d479b854599c0f9d6b6df1e61aa67c Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 10 Dec 2024 22:36:45 +0800
Subject: [PATCH v9 2/2] Print out error position for number of DDL commands.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With this patch, the following functions will print out the error position for certain error cases.
This can be particularly helpful when working with a sequence of DDL commands,
such as `create schema create schema_element`.
It also makes it easier to quickly identify the relevant error area in a single DDL command.

For example, error position reporting is now supported for CREATE
DOMAIN, ALTER TABLE ... ALTER COLUMN ..., ALTER TABLE ... OF ... and
others.
More cases can be found in regression tests.

CREATE DOMAIN testdomain_i AS int COLLATE "sv_SE";
ERROR:  collations are not supported by type integer
LINE 1: CREATE DOMAIN testdomain_i AS int COLLATE "sv_SE";
  ^

The ATExecAlterColumnType code change may not be necessary, since
ATPrepAlterColumnType will catch most of the errors.  The AlterType function change
has no effect since struct AlterTypeStmt doesn't have location info.

ATExecAlterColumnType and AlterType, for the above-mentioned reasons, don't
have regression tests.  All others have tests.

Author: Kirill Reshke 
Author: Jian He 
Reviewed-By: Michaël Paquier 
Reviewed-By: Álvaro Herrera 

discussion: https://postgr.es/m/CALdSSPhqfvKbDwqJaY=yeepi_aq61gmmpw88i6zh7cmg_2z...@mail.gmail.com
---
 src/backend/commands/tablecmds.c  | 43 +
 src/backend/commands/typecmds.c   | 64 +++
 src/backend/parser/parse_utilcmd.c| 21 --
 src/backend/tcop/utility.c|  4 +-
 src/include/commands/typecmds.h   |  4 +-
 src/test/regress/expected/alter_table.out | 10 +++
 .../regress/expected/collate.icu.utf8.out |  2 +
 .../regress/expected/collate.linux.utf8.out   |  2 +
 src/test/regress/expected/collate.out |  2 +
 .../expected/collate.windows.win1252.out  |  2 +
 src/test/regress/expected/constraints.out | 14 
 src/test/regress/expected/domain.out  | 36 +++
 src/test/regress/expected/float8.out  |  2 +
 src/test/regress/expected/identity.out|  2 +
 src/test/regress/expected/typed_table.out |  4 ++
 15 files changed, 160 insertions(+), 52 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6ccae4cb4a..efa38b1470 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -593,7 +593,8 @@ static void ATPrepAlterColumnType(List **wqueue,
   AlterTableUtilityContext *context);
 static bool ATColumnChangeRequiresRewrite(Node *expr, AttrNumber varattno);
 static ObjectAddress ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
-		   AlterTableCmd *cmd, LOCKMODE lockmode);
+		   AlterTableCmd *cmd, LOCKMODE lockmode,
+		   AlterTableUtilityContext *context);
 static void RememberAllDependentForRebuilding(AlteredTableInfo *tab, AlterTableType subtype,
 			  Relation rel, AttrNumber attnum, const char *colName);
 static void RememberConstraintForRebuilding(Oid conoid, AlteredTableInfo *tab);
@@ -639,7 +640,9 @@ static ObjectAddress ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCK
 static ObjectAddress ATExecDropInherit(Relation rel, RangeVar *parent, LOCKMODE lockmode);
 static void drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid,
    DependencyType deptype);
-static ObjectAddress ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode);
+static ObjectAddress ATExecAddOf(Relation rel, const TypeName *ofTypename,
+ LOCKMODE lockmode,
+ AlterTableUtilityContext *context);
 static void ATExecDropOf(Relation rel, LOCKMODE lockmode);
 static void ATExecReplicaIdentity(Relation rel, ReplicaIdentityStmt *stmt, LOCKMODE lockmode);
 static void ATExecGenericOptions(Relation rel, List *options);
@@ -5413,7 +5416,7 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			break;
 		case AT_AlterColumnType:	/* ALTER COLUMN TYPE */
 			/* parse transformation was done earlier */
-			address = ATExecAlterColumnType(tab, rel, cmd, lockmode);
+			address = ATExecAlterColumnType(tab, rel, cmd, lockmode, context);
 			break;
 		case AT_AlterColumnGenericOptions:	/* ALTER COLUMN OPTIONS */
 			address =
@@ -5537,7 +5540,7 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			address = ATExecDropInherit(rel, (RangeVar *) cmd->def, lockmode);
 			break;
 		case AT_AddOf:
-			address = ATExecAddOf(rel, (TypeName *) cmd->def, lockmode);
+			address = ATExecAddOf(rel, (TypeName *) cmd->def, lockmode, context);
 			break;
 		case AT_DropOf:
 			ATExecDropOf(rel, lockmode);
@@ -13218,10 +132

Re: SQL:2011 application time

2024-12-10 Thread Peter Eisentraut

On 07.12.24 20:29, Paul Jungwirth wrote:

These five patches all look good to me.

Note that my tests already include a section for REPLICA IDENTITY FULL, 
which passed. But the subscriber was using a SeqScan to look up tuples 
to update.


Here are the steps (mostly just because it was confusing for me at 
first): First in FindUsableIndexForReplicaIdentityFull, we would call 
IsIndexUsableForReplicaIdentityFull, get back false, and decide there 
was no index to use. Then in FindReplTupleInLocalRel, localidxoid was 0, 
so we woudln't call IsIndexUsableForReplicaIdentityFull at all.


After applying the five patches, I can see that we choose the index and 
call IsIndexUsableForReplicaIdentityFull from both sites. This should 
make applying changes a lot faster.


I have committed these.  I will continue with reviewing v45-0002 and 
following now.






Re: Fix bank selection logic in SLRU

2024-12-10 Thread Andrey M. Borodin



> On 10 Dec 2024, at 16:26, Dilip Kumar  wrote:
> 
> IIUC, we do check that it should be in multiple of bank size (i.e.)
> which is multiple of 2, right?  Am I missing something?

Bank selection code assumes that number of buffers is power of 2.
If the number of buffers is not power of 2 - only subset of buffers will be 
used. In worst case, e.g. 65 buffers, everything will be buffered only in bank 
64.


Best regards, Andrey Borodin.



Re: [PATCH] Support Int64 GUCs

2024-12-10 Thread Pavel Borisov
On Tue, 10 Dec 2024 at 19:13, Tom Lane  wrote:

> Pavel Borisov  writes:
> > I see Aleksander worked on this patch (with many other hackers) and
> marked
> > it as rejected. And now Evgeniy, a new developer wants to continue and
> > rebase this patch and also the patches in another 64-xid thread
> > disregarding the fact that the patch is rejected. It's not a crime. It's
> > rejected, true.
>
> No, but it's a waste of time.  Just rebasing the patch does nothing to
> counter the arguments that made us reject it before.
>

Tom, I agree! Just wrote this because of probable confusion of names by
Wenhui above.

Regards,
Pavel


Re: Support for unsigned integer types

2024-12-10 Thread Peter Eisentraut

On 06.12.24 19:45, Jack Bay wrote:

Unsigned integers are a widely used type and can be important for
compact and well-aligned data structures, but are not currently
available in PostgreSQL.

In particular unsigned 64-bit integers and unsigned 32-bit integers
are desirable as identifiers. They unambiguously correspond to
specific hexadecimal or other human-readable encoded representations.
Although signed integers can be used for this purpose, there is the
potential for human error in confusing a positive value for a negative
value, corner cases around maximum and minimum values (which are
statistically certain to be encountered when random bits are used for
the integer), the potential for human error in interconverting hex and
other encoded representations, text representation nonuniformity (the
need for a space for the minus sign), and a variety of associated
nuisances.

Would it be possible to add support for unsigned 64-bit and unsigned
32-bit integers to postgresql?


Here is an extension that implements this: https://github.com/petere/pguint

You can use this in production and perhaps also as the basis for
experimentation with the different behaviors and trade-offs that have been
mentioned.






Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Robert Haas
On Mon, Dec 9, 2024 at 7:31 PM Andres Freund  wrote:
> Pretty unexcited about all of these - XFS is fairly widely used for PG, but
> this problem doesn't seem very common. It seems to me that we're missing
> something that causes this to only happen in a small subset of cases.

I wonder if this is actually pretty common on XFS. I mean, we've
already hit this with at least one EDB customer, and Michael's report
is, as far as I know, independent of that; and he points to a
pgsql-general thread which, AFAIK, is also independent. We don't get
three (or more?) independent reports of that many bugs, so I think
it's not crazy to think that the problem is actually pretty common.
It's probably workload dependent somehow, but for all we know today it
seems like the workload could be as simple as "do enough file
extension and you'll get into trouble eventually" or maybe "do enough
file extension with some level of concurrency and you'll get into
trouble eventually".

> I think the source of this needs to be debugged further before we try to apply
> workarounds in postgres.

Why? It seems to me that this has to be a filesystem bug, and we
should almost certainly adopt one of these ideas from Michael Harris:

 - Providing a way to configure PG not to use posix_fallocate at runtime

 - In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)

Maybe we need some more research to figure out which of those two
things we should do -- I suspect the second one is better but if that
fails then we might need to do the first one -- but I doubt that we
can wait for XFS to fix whatever the issue is here. Our usage of
posix_fallocate doesn't look to be anything more than plain vanilla,
so as between these competing hypotheses:

(1) posix_fallocate is and always has been buggy and you can't rely on it, or
(2) we use posix_fallocate in a way that nobody else has and have hit
an incredibly obscure bug as a result, which will be swiftly patched

...the first seems much more likely.
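
To be concrete, the second idea above (falling back to zero-filling on
ENOSPC) amounts to something like the following sketch. The helper name is
hypothetical and this is not the actual FileFallocate/FileZero code, just an
illustration of the control flow.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Extend a file by "len" bytes at "offset".  Try posix_fallocate() first;
 * if it reports ENOSPC, retry by writing zeroes, so a spurious ENOSPC from
 * the filesystem does not abort the extension while a genuine out-of-space
 * condition still fails in the write path.
 */
static int
extend_with_fallback(int fd, off_t offset, off_t len)
{
	int			rc = posix_fallocate(fd, offset, len);

	if (rc == 0)
		return 0;
	if (rc != ENOSPC)
	{
		errno = rc;				/* some other error: report it as-is */
		return -1;
	}

	/* ENOSPC: fall back to plain zero-filling, as pre-v16 releases did. */
	{
		char		zeroes[8192] = {0};
		off_t		written = 0;

		while (written < len)
		{
			size_t		chunk = sizeof(zeroes);
			ssize_t		n;

			if ((off_t) chunk > len - written)
				chunk = (size_t) (len - written);
			n = pwrite(fd, zeroes, chunk, offset + written);
			if (n < 0)
				return -1;		/* really out of space (or another I/O error) */
			written += n;
		}
	}
	return 0;
}

If the ENOSPC really means the disk is full, the fallback write fails too,
so nothing is silently lost.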

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Andres Freund
Hi,

On 2024-12-10 12:36:40 -0500, Robert Haas wrote:
> On Mon, Dec 9, 2024 at 7:31 PM Andres Freund  wrote:
> > Pretty unexcited about all of these - XFS is fairly widely used for PG, but
> > this problem doesn't seem very common. It seems to me that we're missing
> > something that causes this to only happen in a small subset of cases.
>
> I wonder if this is actually pretty common on XFS. I mean, we've
> already hit this with at least one EDB customer, and Michael's report
> is, as far as I know, independent of that; and he points to a
> pgsql-general thread which, AFAIK, is also independent. We don't get
> three (or more?) independent reports of that many bugs, so I think
> it's not crazy to think that the problem is actually pretty common.

Maybe. I think we would have gotten a lot more reports if it were common. I
know of quite a few very busy installs using xfs.

I think there must be some as-yet-unknown condition gating it, e.g. that
the filesystem was created a while ago and has some now-on-by-default
options disabled.


> > I think the source of this needs to be debugged further before we try to 
> > apply
> > workarounds in postgres.
>
> Why? It seems to me that this has to be a filesystem bug,

Adding workarounds for half-understood problems tends to lead to code that we
can't evolve in the future, as we a) don't understand b) can't reproduce the
problem.

Workarounds could also mask some bigger / worse issues.  We e.g. have blamed
ext4 for a bunch of bugs that then turned out to be ours in the past. But we
didn't look for a long time, because it was convenient to just blame ext4.


> and we should almost certainly adopt one of these ideas from Michael Harris:
>
>  - Providing a way to configure PG not to use posix_fallocate at runtime

I'm not strongly opposed to that. That's testable without access to an
affected system.  I wouldn't want to automatically do that when detecting an
affected system though, that'll make behaviour way less predictable.


>  - In the case of posix_fallocate failing with ENOSPC, fall back to
> FileZero (worst case that will fail as well, in which case we will
> know that we really are out of space)

I doubt that that's a good idea. What if fallocate failing is an indicator of
a problem? What if you turn on AIO + DIO and suddenly get a much more
fragmented file?

Greetings,

Andres Freund




Re: Assert failure on running a completed portal again

2024-12-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 10, 2024 at 2:53 PM Tom Lane  wrote:
>> I'm thinking about something like this:

> That seems pretty good, although the last sentence seems like it might
> be a little too alarming. Maybe add "although we know of no specific
> problem" or something like that.

OK, I'll tone it down a bit.

> The portal mechanism is a hot mess, IMHO, and needs some serious
> redesign, or at least cleanup. For example, the fact that
> portal->cleanup is always either PortalCleanup or NULL is an
> "interesting" design choice.

While I'm not here to defend that code particularly, that part
doesn't seem totally insane to me.  The intent clearly was to
support more than one form of cleanup.  Maybe after 25 years
we can conclude that we'll never need more than one form, but
I'm not going to fault whoever wrote it for not having an
operational crystal ball.

> MarkPortalDone() and MarkPortalFailed()
> looked like they do the same amount of cleanup but, ha ha, no they
> don't, because the one and only cleanup hook peeks behind the curtain
> to figure out who is calling it.

If you're talking about

 * Shut down executor, if still running.  We skip this during error abort,
 * since other mechanisms will take care of releasing executor resources,
 * and we can't be sure that ExecutorEnd itself wouldn't fail.

it's hardly the fault of the Portal logic that ExecutorEnd is unsafe
to call during abort.  (ExecutorEnd shares that property with a
boatload of other code, too.)

Anyway, if you feel like rewriting that stuff, step right up.
My feeling about it is that the law of conservation of cruft
will prevent a replacement from being all that much cleaner,
but maybe I'm wrong.

regards, tom lane




Re: Vectored IO in XLogWrite()

2024-12-10 Thread Michael Paquier
On Thu, Aug 08, 2024 at 01:25:47PM -0400, Robert Haas wrote:
> I wondered whether the regression tests actually hit the iovcnt == 2
> case, and it turns out that they do, rather frequently actually.
> Making that case a FATAL causes ~260 regression test failure. However,
> on larger systems, we'll often end up with wal_segment_size=16MB and
> wal_buffers=16MB and then it seems like we don't hit the iovcnt==2
> case. Which I guess just reinforces the point that this is
> theoretically better but practically not much different.
> 
> Any other votes on what to do here?

Perhaps some micro-benchmarking to prove that the patch shows benefits
when this code path is taken alone?  I could get behind that.

I'd also document the maths of the patch a bit more, it is always kind
of hard to figure out if iovec is set up right or not.  And the
formulas used in the patch look magical.
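
For reference, the iovcnt == 2 case is just the WAL buffer ring wrapping
around; a minimal POSIX sketch of the idea (illustrative names only, not the
patch's code) looks like this:

#include <sys/uio.h>
#include <unistd.h>

/*
 * Write "nbytes" starting at ring offset "start" to the WAL file at
 * "file_offset".  If the span wraps past the end of the circular buffer,
 * one pwritev() call issues both pieces in a single syscall.
 */
static ssize_t
write_wal_span(int fd, const char *ring, size_t ring_size,
			   size_t start, size_t nbytes, off_t file_offset)
{
	struct iovec iov[2];
	int			iovcnt = 1;

	iov[0].iov_base = (void *) (ring + start);
	iov[0].iov_len = nbytes;

	if (start + nbytes > ring_size)
	{
		/* Wrapped: first piece is the tail of the ring, second its start. */
		iov[0].iov_len = ring_size - start;
		iov[1].iov_base = (void *) ring;
		iov[1].iov_len = nbytes - iov[0].iov_len;
		iovcnt = 2;
	}

	return pwritev(fd, iov, iovcnt, file_offset);
}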
--
Michael


signature.asc
Description: PGP signature


Re: Add ExprState hashing for GROUP BY and hashed SubPlans

2024-12-10 Thread David Rowley
On Wed, 11 Dec 2024 at 11:36, David Rowley  wrote:
> I've now pushed v7-0001.

... and, after a few comment tweaks, I've also pushed the v7-0002 patch.

Thanks for your reviews, Andrei.

David




Re: Fix early elog(FATAL)

2024-12-10 Thread Nathan Bossart
On Sat, Dec 07, 2024 at 07:46:14PM -0800, Noah Misch wrote:
> main() says:
> 
>   /*
>* Fire up essential subsystems: error and memory management
>*
>* Code after this point is allowed to use elog/ereport, though
>* localization of messages may not work right away, and messages won't 
> go
>* anywhere but stderr until GUC settings get loaded.
>*/
>   MemoryContextInit();
> 
> However, appending elog(ERROR, "whoops") breaks like:
> 
> $ initdb -D discard_me
> FATAL:  whoops
> PANIC:  proc_exit() called in child process
> no data was returned by command ""/home/nm/sw/nopath/pghead/bin/postgres" -V"
> child process was terminated by signal 6: Aborted
> 
> So does the ereport(FATAL) in ClosePostmasterPorts().  The "called in child
> process" check (added in commit 97550c0 of 2023-10) reads MyProcPid, which we
> set later.  Three ways to fix this:

I noticed that you committed a fix for this.  Sorry for not responding
earlier.

> 1. Call InitProcessGlobals() earlier.  This could also reduce the total call
>sites from 3 to 2 (main() and post-fork).
> 
> 2. Move MyProcPid init out of InitProcessGlobals(), to main() and post-fork.
>This has less to go wrong in back branches.  While probably irrelevant,
>this avoids calling pg_prng_strong_seed() in processes that will exit after
>help() or GucInfoMain().
> 
> 3. Revert 97550c0, as commit 3b00fdb anticipated.

I did partially revert 97550c0 in commit 8fd0498, but we decided to leave
some of the checks [0].

> I don't think the choice matters much, so here is (2).

FWIW I'd probably vote for option 1.  That keeps the initialization of the
globals together, reduces the call sites, and fixes the bug.  I'd worry a
little about moving the MyProcPid assignments out of that function without
adding a bunch of commentary to explain why.

[0] https://postgr.es/m/2023115945.3kgclsgz5lqmtnan%40awork3.anarazel.de

-- 
nathan




Re: [18] Fix a few issues with the collation cache

2024-12-10 Thread Jeff Davis
On Tue, 2024-12-10 at 15:44 +0900, Michael Paquier wrote:
> On Fri, Sep 20, 2024 at 05:28:48PM -0700, Jeff Davis wrote:
> > Updated and rebased.
> 
> The patch has been failing to apply for a couple of weeks now.  Could
> you rebase please?

I committed some of the patches and fixed problem #1.

The way I used ResourceOwners to fix problems #2 and #3 is a bit
awkward. I'm not sure if it's worth the ceremony to try to avoid leaks
during OOM. And other paths that leak could be fixed more simply by
freeing it directly rather than relying on the resowner mechanism.
I think I'll withdraw this patch and submit a separate patch to do it
the simpler way.

Regards,
Jeff Davis





Re: [18] Fix a few issues with the collation cache

2024-12-10 Thread Michael Paquier
On Tue, Dec 10, 2024 at 03:34:50PM -0800, Jeff Davis wrote:
> I committed some of the patches and fixed problem #1.
> 
> The way I used ResourceOwners to fix problems #2 and #3 is a bit
> awkward. I'm not sure if it's worth the ceremony to try to avoid leaks
> during OOM. And other paths that leak could be fixed more simply by
> freeing it directly rather than relying on the resowner mechanism.
> I think I'll withdraw this patch and submit a separate patch to do it
> the simpler way.

Okay, thanks for the update.
--
Michael


signature.asc
Description: PGP signature


Re: fix deprecation mention for age() and mxid_age()

2024-12-10 Thread John Naylor
On Wed, Nov 20, 2024 at 12:23 PM Michael Paquier  wrote:
>
> On Tue, Nov 19, 2024 at 06:52:40AM +, Bertrand Drouvot wrote:
> > Thanks for the feedback!
>
> Done.  The section mistake in REL_16_STABLE was..  Interesting.

Hi, there is a CF entry for this -- is it ready to mark committed?

-- 
John Naylor
Amazon Web Services




Re: long-standing data loss bug in initial sync of logical replication

2024-12-10 Thread Michael Paquier
On Tue, Dec 10, 2024 at 01:50:16PM -0800, Masahiko Sawada wrote:
> Sorry I lost track of this thread. I'll check the test results and
> patch soon.

Thanks.
--
Michael


signature.asc
Description: PGP signature


Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Michael Harris
Hi Jakub

On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
 wrote:
> Yay, reflink=0, that's pretty old fs ?!

This particular filesystem was created on Centos 7, and retained when
the system was upgraded to RL9. So yes probably pretty old!

> Could you get us maybe those below commands too? (or from any other directory 
> exhibiting such errors)
>
> stat pg_tblspc/16401/PG_16_202307071/17643/
> ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
> time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess timing 
> of getdents() call as that may something about that directory indirectly

# stat pg_tblspc/16402/PG_16_202307071/49163/
  File: pg_tblspc/16402/PG_16_202307071/49163/
  Size: 5177344 Blocks: 14880  IO Block: 4096   directory
Device: fd02h/64770d    Inode: 4299946593  Links: 2
Access: (0700/drwx--)  Uid: (   26/postgres)   Gid: (   26/postgres)
Access: 2024-12-11 09:39:42.467802419 +0900
Modify: 2024-12-11 09:51:19.813948673 +0900
Change: 2024-12-11 09:51:19.813948673 +0900
 Birth: 2024-11-25 17:37:11.812374672 +0900

# time ls -1 pg_tblspc/16402/PG_16_202307071/49163/ | wc -l
179000

real    0m0.474s
user    0m0.439s
sys     0m0.038s

> 3. Maybe somehow there is a bigger interaction between posix_fallocate() and 
> delayed XFS's dynamic speculative preallocation from many processes all 
> writing into different partitions ? Maybe try "allocsize=1m" mount option for 
> that /fs and see if that helps.  I'm going to speculate about XFS speculative 
> :) pre allocations, but if we have fdcache and are *not* closing fds, how XFS 
> might know to abort its own speculation about streaming write ? (multiply 
> that up to potentially the number of opened fds to get an avalanche of 
> "preallocations").

I will try to organize that. They are production systems so it might
take some time.

> 4. You can also try compiling with patch from Alvaro from [2] 
> "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up 
> having more clarity in offsets involved. If not then you could use 'strace -e 
> fallocate -p ' to get the exact syscall.

I'll take a look at Alvaro's patch. strace sounds good, but how to
arrange to start it on the correct PG backends? There will be a
large-ish number of PG backends going at a time, only some of which
are performing imports, and they will be coming and going every so
often as the ETL application scales up and down with the load.

> 5. Another idea could be catching the kernel side stacktrace of fallocate() 
> when it is hitting ENOSPC. E.g. with XFS fs and attached bpftrace eBPF tracer 
> I could get the source of the problem in my artificial reproducer, e.g

OK, I will look into that also.

Cheers
Mike




Re: Expand applicability of aggregate's sortop optimization

2024-12-10 Thread Michael Paquier
On Fri, Nov 08, 2024 at 01:05:04PM +0700, Andrei Lepikhov wrote:
> Rebase onto current master and slight improvement.

Your patch is failing in the CI, please rebase.  I have it to the next
CF for now, waiting on author.
--
Michael


signature.asc
Description: PGP signature


Re: Separate memory contexts for relcache and catcache

2024-12-10 Thread Michael Paquier
On Tue, Nov 26, 2024 at 05:11:16PM +0300, Melih Mutlu wrote:
> That said, these changes come with a cost, and it may not be worth it to
> separate every single cache into its own context. IIUC, introducing
> contexts for heavily used caches results in much less fragmentation. If
> that’s the case, then I believe we should focus on RelCache and CatCache,
> as they are heavily used since the backend starts. I see that you and
> Ashutosh [2] mentioned that PlanCacheContext is less likely to be heavily
> used, so we could consider leaving that context out for now.

Looking at the tail of this thread, some feedback has been provided.
These are not that easy to look at, so I've marked the patch as
returned with feedback for now, and there were a few concerns raised
as well.  Feel free to adjust as you feel if you think this is not
adapted.
--
Michael


signature.asc
Description: PGP signature


Re: Serverside SNI support in libpq

2024-12-10 Thread Michael Paquier
On Wed, Dec 04, 2024 at 02:44:18PM +0100, Daniel Gustafsson wrote:
> No worries, I know you have a big path on your plate right now.  The attached
> v3 fixes a small buglet in the tests and adds silly reload testing to try and
> stress out any issues.

Looks like this still fails quite heavily in the CI..  You may want to
look at that.
--
Michael


signature.asc
Description: PGP signature


Re: Document NULL

2024-12-10 Thread David G. Johnston
v5 Attached, v5-0001 is just v4-0001 rebased; v5-0002 is the rework over
v4-0001.  There is no formatting-only patch this round.

Wiki tracker: https://wiki.postgresql.org/wiki/Documenting_NULL#ToDo_Note

On Tue, Dec 10, 2024 at 11:52 AM Jeff Davis  wrote:

> On Mon, 2024-12-09 at 15:27 -0700, David G. Johnston wrote:
> > This results in falsifying the common-sense
> > notion
> > that "p OR NOT p" is always true.
> >
>
> Thank you.
>
> I might reword the final sentence as more of an example, like: "Unknown
> values can lead to surprising behavior, for instance "NULL OR NOT NULL"
> evaluates to the null value."
>

I went with an example instead.


I got rid of the row counts on the examples and did \pset null NULL for the
newly added example.  Then I wrote the part about sometimes NULL is the
empty string and sometimes it is NULL.  When I finalize the examples I'm
probably going to \pset null .

I'm a bit hung up on the "section" terminology.  The automatic naming calls
everything a Section but I'm calling the sect1 the Section, sect2s
Sub-Sections and am rewording things to avoid having to make a choice for
sect3s.  Am I overthinking this?  Related, right now I have a mix of
Section links and "natural language" links.  I'm inclined to make a pass
converting probably everything to "natural language" links - or maybe those
pointing within the same sect1 but leaving outbound links more formal.

David J.
From c453d82a425f4d0655233e410dd9a864e3e3fa6d Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Mon, 9 Dec 2024 09:58:33 -0700
Subject: [PATCH 2/2] v5 - changes over v4

---
 doc/src/sgml/nullvalues.sgml | 371 +--
 1 file changed, 223 insertions(+), 148 deletions(-)

diff --git a/doc/src/sgml/nullvalues.sgml b/doc/src/sgml/nullvalues.sgml
index 6c4c29e305..8750404ce1 100644
--- a/doc/src/sgml/nullvalues.sgml
+++ b/doc/src/sgml/nullvalues.sgml
@@ -12,6 +12,13 @@
   can be executed so long as the following table and rows are created first.
  
 
+ 
+  Throughout this section the discussion of null values will be limited to
+  the SQL language unless otherwise noted.  The JSON-related data types, and the
+  non-SQL procedural languages, have their own behaviors documented in their
+  respective areas.
+ 
+
  
   CREATE TABLE null_examples (
 id bigint PRIMARY KEY,
@@ -30,58 +37,175 @@
even possible.  The null value also takes on a literal meaning of "not found"
when produced as the result of an outer join.
   
+  
+   In different programming languages, some of which are accessible in the server
+   as procedural languages, the null value is represented in other ways.
+   Of those included in PostgreSQL,
+   these are:
+   None in Python,
+   undefined in Perl,
+   and the empty string in TCL.
+  
  
 
  
   Usage
   
A null value, like all values, must have a data type, and is valid for all data types.
+   It must also be printed as text.  This section discusses null values at the boundaries
+   of the system as well as how they can come into existence due to the design of a query.
   
-  
-   As noted in the synatx chapter,
-   a null value literal is written using the NULL keyword.
-   Its type is the pseudo-type unknown
-   but can be cast to any concrete data type.
-  
-  
-   
-   SELECT
-NULL AS "Literal Null Value",
-pg_typeof(null) AS "Type of Null",
-pg_typeof(NuLl::text) AS "Type of Cast null",
-cast(null as text) AS "Cast null value";
-   
-   
- Literal Null Value | Type of Null | Type of Cast null | Cast null value
-+--+---+-
-| unknown  | text  |
-(1 row)
-   
-  
-  
-   
-   SELECT text NULL;
-   
-   
-   ERROR:  column "text" does not exist
-   LINE 1: select text NUll;
-   
-  
-  
-   The presence of null values in the system results in three-valued logic.
-   In binary logic every outcome is either true or false.  In
-   three-valued logic unknown, represented using a null value, is
-   also an outcome.  Put a bit more formally, the
-   Law of the Excluded Middle does not hold: i.e.,
-   p OR NOT(p) != true; for all p.
-  
-  
-   Aspects of the system that branch based upon
-   whether a condition variable is true or false must therefore
-   decide how to behave when the condition is a null value.
-   The remaining sub-sections summarize these decisions, as well
-   as other behaviors.
-  
+  
+   Null Value Input
+   
+A null value can be used as input to any function, operator, or expression.
+The system will then decide how to behave based on the rules described in the
+rest of this section.  The system will also decide how to behave when a null value
+is used as a parameter to a function that does not accept null values.
+   
+   
+As noted in the syntax chapter,
+a null value literal is written using the NULL keyword.
+Its type is the pseudo-type unknown
+but can b

RE: Memory leak in WAL sender with pgoutput (v10~)

2024-12-10 Thread Zhijie Hou (Fujitsu)
On Wednesday, December 11, 2024 2:14 AM Masahiko Sawada  
wrote:
> 
> On Tue, Dec 10, 2024 at 1:16 AM Amit Kapila 
> wrote:
> >
> > On Tue, Dec 10, 2024 at 11:24 AM Amit Kapila 
> wrote:
> > >
> > > On Tue, Dec 10, 2024 at 8:54 AM vignesh C 
> wrote:
> > > >
> > > > On Tue, 10 Dec 2024 at 04:56, Michael Paquier 
> wrote:
> > > > >
> > > > > On Mon, Dec 09, 2024 at 03:36:15PM +0530, Amit Kapila wrote:
> > > > > > It couldn't solve the problem completely even in back-branches. The
> > > > > > SQL API case I mentioned and tested by Hou-San in the email [1]
> won't
> > > > > > be solved.
> > > > > >
> > > > > > [1] -
> https://www.postgresql.org/message-id/OS0PR01MB57166A4DA0ABBB94F
> 2FBB28694362%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> > > > >
> > > > > Yeah, exactly (wanted to reply exactly that yesterday but lacked time,
> > > > > thanks!).
> > > >
> > > > Yes, that makes sense. How about something like the attached patch.
> > > >
> > >
> > > - oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > > - if (data->publications)
> > > - {
> > > - list_free_deep(data->publications);
> > > - data->publications = NIL;
> > > - }
> > > + static MemoryContext pubctx = NULL;
> > > +
> > > + if (pubctx == NULL)
> > > + pubctx = AllocSetContextCreate(CacheMemoryContext,
> > > +"logical replication publication list context",
> > > +ALLOCSET_SMALL_SIZES);
> > > + else
> > > + MemoryContextReset(pubctx);
> > > +
> > > + oldctx = MemoryContextSwitchTo(pubctx);
> > >
> > > Considering the SQL API case, why is it okay to allocate this context
> > > under CacheMemoryContext?
> > >
> >
> > On further thinking, we can't allocate it under
> > LogicalDecodingContext->context because once that is freed at the end
> > of SQL API pg_logical_slot_get_changes(), pubctx will be pointing to a
> > dangling memory. One idea is that we use
> > MemoryContextRegisterResetCallback() to invoke a reset callback
> > function where we can reset pubctx but not sure if we want to go there
> > in back branches. OTOH, the currently proposed fix won't leak memory
> > on repeated calls to pg_logical_slot_get_changes(), so that might be
> > okay as well.
> >
> > Thoughts?
> 
> Alternative idea is to declare pubctx as a file static variable. And
> we create the memory context under LogicalDecodingContext->context in
> the startup callback and free it in the shutdown callback.

I think when an ERROR occurs during the execution of the pg_logical_slot_xx()
API, the shutdown callback function is not invoked. This would result in the
static variable not being reset, which, I think, is why Amit mentioned the use
of MemoryContextRegisterResetCallback().
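
A rough sketch of that reset-callback idea (hypothetical, not from any posted
patch -- the helper and callback names are invented; assumes utils/memutils.h,
utils/palloc.h and replication/logical.h) could be:

static MemoryContext pubctx = NULL;

/* Called when the decoding context is reset or deleted. */
static void
pubctx_reset_cb(void *arg)
{
	pubctx = NULL;		/* forget the soon-to-be-dangling pointer */
}

static void
ensure_pubctx(LogicalDecodingContext *ctx)
{
	if (pubctx == NULL)
	{
		MemoryContextCallback *cb;

		pubctx = AllocSetContextCreate(ctx->context,
									   "logical replication publication list context",
									   ALLOCSET_SMALL_SIZES);

		/* allocate the callback in the parent so it lives exactly as long */
		cb = MemoryContextAlloc(ctx->context, sizeof(MemoryContextCallback));
		cb->func = pubctx_reset_cb;
		cb->arg = NULL;
		MemoryContextRegisterResetCallback(ctx->context, cb);
	}
	else
		MemoryContextReset(pubctx);
}

That way, even an ERROR that skips the shutdown callback cannot leave the
static variable pointing into freed memory, which would also cover the SQL API
case.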

Best Regards,
Hou zj


advanced patch feedback session at FOSDEM, responses needed

2024-12-10 Thread Peter Eisentraut

Hi all,

I'm trying to organize an advanced patch feedback session adjacent to
FOSDEM.  Right now, I'm looking to gauge interest so that concrete
plans can be made.  Specifically, if you would like to participate and
have your patches reviewed, please respond to this email.

The session would be similar to [0] in that we construct groups each
containing one or more committer volunteers and several or fewer
non-committer participants who have patches about which they want
feedback.  The committers look at the patches from the non-committers
in their group in advance of the session, and then give whatever
feedback they have.  Non-committers could also review each other's
patches in a similar manner.  All this can be for long-term general
education but also for short-term tactical purposes, given the time in
the development cycle.

[0]: 
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/224-advanced-patch-feedback-session-invite-only/


None of the logistics are set or confirmed, and the whole thing might
not happen at all, but if it were, the idea would be to have it on
Friday, the 31st of January 2025, at the same venue as and
concurrently to FOSDEM PGDay [1] in Brussels.  That means, you would
need to register for the conference (and pay or otherwise gain
entrance), and you would probably miss at least one session of the
regularly scheduled program.  (I think the exact times and lengths
could be flexible and negotiated between the participants.)

[1]: https://2025.fosdempgday.org/

So, if you would like to participate in this as either a patch author
who wants their patch reviewed or as a reviewer/committer/group
leader, please respond to this email (privately).  If applying as a
patch author, please also include a commitfest link or two or similar
about what patches you have pending and would like to focus on.  To be
considered, a patch should have some nontrivial level of size or
complexity or difficulty so that it's worth having a meeting about it.

If you have any logistical caveats, like you are still waiting for
travel approval, or you can only participate in the morning, or you
are an accepted speaker in the regular program and therefore have
constrained availability, please include that.  Beyond those kinds of
things, please don't sign up unless you really can and want to come
and make your own way there.  It would be very wasteful and
disappointing if a bunch of people signed up and then didn't show up.

Note again: This event is not confirmed, none of the logistics are
confirmed, it might not happen, I also don't know if we could
accommodate everyone who applies if we got overwhelmed.

Given that FOSDEM is just about 7 weeks away, I would like to get this
first round of feedback in the next 7 days so that planning can
proceed.  Thanks.




Re: fix deprecation mention for age() and mxid_age()

2024-12-10 Thread Bertrand Drouvot
Hi,

On Wed, Dec 11, 2024 at 09:01:24AM +0900, Michael Paquier wrote:
> On Wed, Dec 11, 2024 at 06:54:23AM +0700, John Naylor wrote:
> > Hi, there is a CF entry for this -- is it ready to mark committed?
> 
> Oops.  I've missed that there was an entry in CF 51.  Updated that
> now.  Thanks for the poke.

D'oh I missed it too! 

> Thanks for the poke.

+1

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Make use of pg_memory_is_all_zeros() in more places

2024-12-10 Thread Bertrand Drouvot
Hi,

On Wed, Dec 11, 2024 at 03:03:34PM +0900, Michael Paquier wrote:
> On Tue, Dec 10, 2024 at 02:18:33PM +, Bertrand Drouvot wrote:
> > While searching for memcmp() calls in "*stat*.c" files (due to [1]), it 
> > appeared
> > that we could $SUBJECT. Please find attached a patch doing so.
> 
> -SockAddrzero_clientaddr;
> -memset(&zero_clientaddr, 0, sizeof(zero_clientaddr));
> -if (memcmp(&(beentry->st_clientaddr), &zero_clientaddr,
> -   sizeof(zero_clientaddr)) == 0)
> +if (pg_memory_is_all_zeros(&(beentry->st_clientaddr),
> +   sizeof(beentry->st_clientaddr)))
> 
> It also means that you're removing these variables used only for the
> all-zero comparisons.  Makes sense to me.

Yeap and also given its size (136 bytes), it looks like that it could provide
some performance benefits too (see check_SockAddr_is_zeroes.c attached):

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc -O2 check_SockAddr_is_zeroes.c -o check_SockAddr_is_zeroes; 
./check_SockAddr_is_zeroes
memcmp: done in 21460 nanoseconds
pg_memory_is_all_zeros: done in 1307 nanoseconds (16.4193 times faster than 
memcmp)

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -O2 check_SockAddr_is_zeroes.c -o 
check_SockAddr_is_zeroes; ./check_SockAddr_is_zeroes
memcmp: done in 21012 nanoseconds
pg_memory_is_all_zeros: done in 4039 nanoseconds (5.20228 times faster than 
memcmp)

That's not a hot path but still interesting to see.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
#include 
#include 
#include 
#include 
#include 
#include  

#define LOOPS 1000

typedef struct
{
struct sockaddr_storage addr;
socklen_t   salen;
} SockAddr;

static inline bool
pg_memory_is_all_zeros(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));

if (len < sizeof(size_t))
{
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}

/* "len" in the [sizeof(size_t), sizeof(size_t) * 8 - 1] range */
if (len < sizeof(size_t) * 8)
{
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}

/*
 * Compare remaining size_t-aligned chunks.
 *
 * There is no risk to read beyond the memory area, as "aligned_end"
 * cannot be higher than "end".
 */
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}

/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}

/* "len" in the [sizeof(size_t) * 8, inf) range */

/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;

if (*p++ != 0)
return false;
}

/*
 * Compare 8 * sizeof(size_t) chunks at once.
 *
 * For performance reasons, we manually unroll this loop and purposefully
 * use bitwise-ORs to combine each comparison.  This prevents boolean
 * short-circuiting and lets the compiler know that it's safe to access
 * all 8 elements regardless of the result of the other comparisons.  This
 * seems to be enough to coax a few compilers into using SIMD
 * instructions.
 */
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}

/*
 * Compare remaining size_t-aligned chunks.
 *
 * There is no risk to read beyond the memory area, as "aligned_end"
 * cannot be higher than "end".
 */
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}

/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}

return true;
}

#define NANOSEC_PER_SEC 1000000000

// Returns difference i

Re: Fix comments related to pending statistics

2024-12-10 Thread Bertrand Drouvot
Hi,

On Wed, Dec 11, 2024 at 02:56:08PM +0900, Michael Paquier wrote:
> On Tue, Dec 10, 2024 at 02:16:14PM +, Bertrand Drouvot wrote:
> > 
> > 2. One linked to PgStat_FunctionCounts pending stats, mentioning the use of
> > memcmp() against zeroes to detect whether there are any pending stats: I 
> > think
> > it has never been the case.
> 
>   * PgStat_FunctionCounts The actual per-function counts kept by a backend
>   *
> - * This struct should contain only actual event counters, because we memcmp
> - * it against zeroes to detect whether there are any pending stats.
> + * This struct should contain only actual event counters, because pending 
> stats
> + * always has non-zero content.
> 
> pgstat_function_flush_cb() in pgstat_function.c says since
> 5891c7a8ed8f, so that seems like a remnant from the original patch
> that has introduced this inconsistency across its reviews:
> "/* localent always has non-zero content */"

Thanks for looking at it!

Yeah and same in pgstat_subscription_flush_cb(). 

> Your suggestion does not look completely right to me.  There is
> nothing preventing us from using something else than event counters
> since we don't use memcpy() and there is no comparison work, no?  It
> seems to me that we could remove the entire sentence instead.

Do you mean also remove the comments in pgstat_function_flush_cb() and
pgstat_subscription_flush_cb()? Those comments look fine to me given
the places where those pending entries are created meaning in
pgstat_init_function_usage() for the functions and 
pgstat_report_subscription_error()
and pgstat_report_subscription_conflict() for the subscriptions.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Avoid unnecessary wrapping for more complex expressions

2024-12-10 Thread Richard Guo
In commit f64ec81a8 we introduced an optimization that avoids wrapping
for Vars and PHVs if they are lateral references to something outside
the subquery, and the referenced rel is under the same lowest nulling
outer join.  It could be beneficial to get rid of such PHVs because
they imply lateral dependencies, which force us to resort to nestloop
joins.

As mentioned in that thread, I feel that we can apply a similar
optimization to more complex non-var expressions: if a strict
expression contains any variables of rels that are under the same
lowest nulling outer join as the subquery, we can also avoid wrapping
it.

The rationale behind is that if the subquery variable is forced to
NULL by the outer join, the variables of rels that are under the same
lowest nulling outer join will also be forced to NULL, resulting in
the expression evaluating to NULL as well.  So it's not necessary to
force the expression to be evaluated below the outer join.  As an
example, consider

explain (costs off)
select * from t t1 left join
 (t t2 inner join
  lateral (select t2.a+1 as x, * from t t3) s on t2.a = s.a)
 on t1.b = t2.b;
 QUERY PLAN

 Hash Right Join
   Hash Cond: (t2.b = t1.b)
   ->  Hash Join
 Hash Cond: (t2.a = t3.a)
 ->  Seq Scan on t t2
 ->  Hash
   ->  Seq Scan on t t3
   ->  Hash
 ->  Seq Scan on t t1
(9 rows)

If s.x is forced to NULL by the left join, t2.a will also be forced to
NULL, and 't2.a+1' will come out as NULL as well because of the
restriction to strict constructs, so we do not need to wrap it in a
PHV.

Any thoughts?

Thanks
Richard


v1-0001-Avoid-unnecessary-wrapping-for-more-complex-expressions.patch
Description: Binary data


Re: long-standing data loss bug in initial sync of logical replication

2024-12-10 Thread Masahiko Sawada
On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal  wrote:
>
> On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada  wrote:
> >
> > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila  wrote:
> > >
> > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C  wrote:
> > > >
> > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C  wrote:
> > > > >
> > > > > BTW, I noticed that we don't take any table-level locks for Create
> > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create
> > > > > a similar problem? I haven't tested so not sure but even if there is a
> > > > > problem for the Create case, it should lead to some ERROR like missing
> > > > > publication.
> > > >
> > > > I tested these scenarios, and as you expected, it throws an error for
> > > > the create publication case:
> > > > 2024-07-17 14:50:01.145 IST [481526] 481526  ERROR:  could not receive
> > > > data from WAL stream: ERROR:  publication "pub1" does not exist
> > > > CONTEXT:  slot "sub1", output plugin "pgoutput", in the change
> > > > callback, associated LSN 0/1510CD8
> > > > 2024-07-17 14:50:01.147 IST [481450] 481450  LOG:  background worker
> > > > "logical replication apply worker" (PID 481526) exited with exit code
> > > > 1
> > > >
> > > > The steps for this process are as follows:
> > > > 1) Create tables in both the publisher and subscriber.
> > > > 2) On the publisher: Create a replication slot.
> > > > 3) On the subscriber: Create a subscription using the slot created by
> > > > the publisher.
> > > > 4) On the publisher:
> > > > 4.a) Session 1: BEGIN; INSERT INTO T1;
> > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
> > > > 4.c) Session 1: COMMIT;
> > > >
> > > > Since we are throwing out a "publication does not exist" error, there
> > > > is no inconsistency issue here.
> > > >
> > > > However, an issue persists with DROP ALL TABLES publication, where
> > > > data continues to replicate even after the publication is dropped.
> > > > This happens because the open transaction consumes the invalidation,
> > > > causing the publications to be revalidated using old snapshot. As a
> > > > result, both the open transactions and the subsequent transactions are
> > > > getting replicated.
> > > >
> > > > We can reproduce this issue by following these steps in a logical
> > > > replication setup with an "ALL TABLES" publication:
> > > > On the publisher:
> > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
> > > > In another session on the publisher:
> > > > Session 2: DROP PUBLICATION
> > > > Back in Session 1 on the publisher:
> > > > COMMIT;
> > > > Finally, in Session 1 on the publisher:
> > > > INSERT INTO T1 VALUES (val2);
> > > >
> > > > Even after dropping the publication, both val1 and val2 are still
> > > > being replicated to the subscriber. This means that both the
> > > > in-progress concurrent transaction and the subsequent transactions are
> > > > being replicated.
> > > >
> > > > I don't think locking all tables is a viable solution in this case, as
> > > > it would require asking the user to refrain from performing any
> > > > operations on any of the tables in the database while creating a
> > > > publication.
> > > >
> > >
> > > Indeed, locking all tables in the database to prevent concurrent DMLs
> > > for this scenario also looks odd to me. The other alternative
> > > previously suggested by Andres is to distribute catalog modifying
> > > transactions to all concurrent in-progress transactions [1] but as
> > > mentioned this could add an overhead. One possibility to reduce
> > > overhead is that we selectively distribute invalidations for
> > > catalogs-related publications but I haven't analyzed the feasibility.
> > >
> > > We need more opinions to decide here, so let me summarize the problem
> > > and solutions discussed. As explained with an example in an email [1],
> > > the problem related to logical decoding is that it doesn't process
> > > invalidations corresponding to DDLs for the already in-progress
> > > transactions. We discussed preventing DMLs in the first place when
> > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
> > > progress. The solution discussed was to acquire
> > > ShareUpdateExclusiveLock for all the tables being added via such
> > > commands. Further analysis revealed that the same handling is required
> > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
> > > the tables in the specified schemas. Then DROP PUBLICATION also seems
> > > to have similar symptoms which means in the worst case (where
> > > publication is for ALL TABLES) we have to lock all the tables in the
> > > database. We are not sure if that is good so the other alternative we
> > > can pursue is to distribute invalidations in logical decoding
> > > infrastructure [1] which has its downsides.
> > >
> > > Thoughts?
> >
> > Thank you for summarizing the problem and solutions!
> >
> > I think it's worth

Re: CRC32C Parallel Computation Optimization on ARM

2024-12-10 Thread John Naylor
I wrote:
>
> 1. I looked at a couple implementations of this idea, and found that
> the constants used in the carryless multiply are tied to the length of
> the blocks. With a lookup table we can do the 3-way algorithm on any
> portion of a full block length, rather than immediately fall to doing
> CRC serially. That would be faster on average. See for example
> https://github.com/komrad36/CRC/tree/master , but I don't think we
> actually have to fully unroll the loop like they do there.
>
> 2. With the above, we can use a larger full block size, and so on
> average less time would be spent in the carryless multiply. With that,
> we could possibly get away with an open coded loop in normal C rather
> than a new intrinsic (also found in the above repo). That would be
> more portable.

I added a port to x86 and poked at it, with the intent to have an easy
on-ramp to that at least accelerates computation of CRCs on FPIs.

The 0008 patch only worked on chunks of 1024 at a time. At that size,
the presence of hardware carryless multiplication is not that
important. I removed the hard-coded constants in favor of a lookup
table, so now it can handle anything up to 8400 bytes in a single
pass.

There are still some "taste" issues, but I like the overall shape here
and how light it was. With more hardware support, we can go much lower
than 1024 bytes, but that can be left for future work.
-- 
John Naylor
Amazon Web Services
From b889235866ad2593e7d093e6a74181a821b68430 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 11 Dec 2024 11:30:33 +0700
Subject: [PATCH v9 2/2] Implement interleaved CRC calculation, combined via
 carryless multiplication

Changes from 0008:
- Use a lookup table to get the precomputed CRC instead of using constants.
This allows a single pass to handle up to 8400 bytes before combining CRCs.
On large inputs, we can just use a simple loop to emulate CLMUL
- Both Arm and x86 support, with no additional config or runtime checks.

Xiang Gao and John Naylor
---
 src/include/port/pg_crc32c.h | 29 +++
 src/port/pg_crc32c_armv8.c   | 34 +
 src/port/pg_crc32c_sb8.c | 94 
 src/port/pg_crc32c_sse42.c   | 44 -
 4 files changed, 199 insertions(+), 2 deletions(-)

diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..aa06bc201a 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -33,6 +33,7 @@
 #ifndef PG_CRC32C_H
 #define PG_CRC32C_H
 
+#include "port/pg_bitutils.h"
 #include "port/pg_bswap.h"
 
 typedef uint32 pg_crc32c;
@@ -107,4 +108,32 @@ extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len)
 
 #endif
 
+
+/* semi-private to files in src/port that interleave CRC instructions */
+/* WIP: separate header in src/port? */
+
+#define CRC_BYTES_PER_ITER (3 * sizeof(uint64))
+#define CRC_MAX_BLOCK_LEN 350 /* can compute 8k inputs in a single pass */
+
+extern PGDLLIMPORT const uint64 combine_crc_lookup[CRC_MAX_BLOCK_LEN];
+
+/*
+ * Fallback for platforms that don't support intrinsics for carryless multiplication.
+ */
+static inline uint64
+pg_clmul(uint32 a, uint32 b)
+{
+	uint64		result = 0;
+
+	while (a)
+	{
+		int			pos = pg_rightmost_one_pos32(a);
+
+		result ^= (uint64) b << (pos);
+		a &= a - 1;
+	}
+
+	return result;
+}
+
 #endif			/* PG_CRC32C_H */
diff --git a/src/port/pg_crc32c_armv8.c b/src/port/pg_crc32c_armv8.c
index d47d838c50..f9f155b65e 100644
--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -23,6 +23,8 @@ pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len)
 {
 	const unsigned char *p = data;
 	const unsigned char *pend = p + len;
+	const size_t min_blocklen = 42; /* Min size to consider interleaving */
+	const pg_crc32c orig_crc = crc; // XXX not for commit
 
 	/*
 	 * ARMv8 doesn't require alignment, but aligned memory access is
@@ -48,6 +50,36 @@ pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len)
 		p += 4;
 	}
 
+	/* See pg_crc32c_sse42.c for explanation */
+	while (p + min_blocklen * CRC_BYTES_PER_ITER <= pend)
+	{
+		const size_t block_len = Min(CRC_MAX_BLOCK_LEN, (pend - p) / CRC_BYTES_PER_ITER);
+		const uint64 *in64 = (const uint64 *) (p);
+		pg_crc32c	crc0 = crc,
+	crc1 = 0,
+	crc2 = 0;
+		uint64		mul0,
+	mul1,
+	precompute;
+
+		for (int i = 0; i < block_len; i++, in64++)
+		{
+			crc0 = __crc32cd(crc0, *(in64));
+			crc1 = __crc32cd(crc1, *(in64 + block_len));
+			crc2 = __crc32cd(crc2, *(in64 + block_len * 2));
+		}
+
+		precompute = combine_crc_lookup[block_len - 1];
+		mul0 = pg_clmul(crc0, (uint32) precompute);
+		mul1 = pg_clmul(crc1, (uint32) (precompute >> 32));
+
+		crc0 = __crc32cd(0, mul0);
+		crc1 = __crc32cd(0, mul1);
+		crc = crc0 ^ crc1 ^ crc2;
+
+		p += block_len * CRC_BYTES_PER_ITER;
+	}
+
 	/* Process eight bytes at a time, as far as we can. */
 	while (p + 8 <= pend)
 	{
@@ -71,5 +103,7 @@ pg_comp_crc32c_armv8(pg_crc32c crc, c

Re: Track the amount of time waiting due to cost_delay

2024-12-10 Thread Bertrand Drouvot
Hi,

On Tue, Dec 10, 2024 at 11:55:41AM -0600, Nathan Bossart wrote:
> On Mon, Dec 09, 2024 at 04:41:03PM +, Bertrand Drouvot wrote:
> > +   time_delayed bigint
> 
> I think it's also worth considering names like total_delay and
> cumulative_delay.

That's fine by me. Then I think that total_delay is the way to go (I don't see
any existing "cumulative_").

> > +   Total amount of time spent in milliseconds waiting during <xref linkend="guc-vacuum-cost-delay"/>
> > +   or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
> > +   vacuum the reported time is across all the workers and the leader. The
> > +   workers update the column no more frequently than once per second, so it
> > +   could show slightly old values.
> 
> I wonder if it makes sense to provide this value as an interval instead of
> the number of milliseconds to make it more human-readable. 

Yeah we could do so, but that would mean:

1. Write a dedicated "pg_stat_get_progress_info()" function for VACUUM. Indeed,
the current pg_stat_get_progress_info() is shared across multiple "commands" and
then we wouldn't be able to change its output types in pg_proc.dat.

Or

2. Make use of make_interval() in the pg_stat_progress_vacuum view creation.

I don't like 1. that much and given that that would be as simple as: 

"
select make_interval(secs => time_delayed / 1000) from pg_stat_progress_vacuum; 
"

for an end user to display an interval, I'm not sure we should provide an 
interval
by default.

That said, I agree that milliseconds is not really human-readable and
does not provide that much added value (except flexibility), so I'd vote for 2.
if you feel we should provide an interval by default.

> I might also
> suggest some changes to the description:
> 
>   Total accumulated time spent sleeping due to the cost-based vacuum
>   delay settings (e.g., vacuum_cost_delay, vacuum_cost_limit).  This
>   includes the time that any associated parallel workers have slept, too.
>   However, parallel workers report their sleep time no more frequently
>   than once per second, so the reported value may be slightly stale.
> 

Yeah I like it, thanks! Now, I'm wondering if we should not also add something
like this:

"
Since multiple workers can sleep simultaneously, the total sleep time can exceed
the actual duration of the vacuum operation.
"

As that could be surprising to see this behavior in action.

Thoughts?

I'll provide an updated patch version once we agree on the above points.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: Conflict detection for update_deleted in logical replication

2024-12-10 Thread Zhijie Hou (Fujitsu)
On Wednesday, December 11, 2024 1:06 PM Amit Kapila  
wrote:
> On Fri, Dec 6, 2024 at 1:28 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Thursday, December 5, 2024 6:00 PM Amit Kapila
>  wrote:
> > >
> > >
> > > A few more comments:
> > > 1.
> > > +static void
> > > +wait_for_local_flush(RetainConflictInfoData *data)
> > > {
> > > ...
> > > +
> > > + data->phase = RCI_GET_CANDIDATE_XID;
> > > +
> > > + maybe_advance_nonremovable_xid(data);
> > > +}
> > >
> > > Isn't it better to reset all the fields of data before the next
> > > round of GET_CANDIDATE_XID phase? If we do that then we don't need
> > > to reset
> > > data->remote_lsn = InvalidXLogRecPtr; and data->last_phase_at =
> > > InvalidFullTransactionId; individually in request_publisher_status()
> > > and
> > > get_candidate_xid() respectively. Also, it looks clean and logical
> > > to me unless I am missing something.
> >
> > The remote_lsn was used to determine whether a status is received, so
> > was reset each time in request_publisher_status. To make it more
> > straightforward, I added a new function parameter 'status_received',
> > which would be set to true when calling
> > maybe_advance_nonremovable_xid() on receiving the status. After this
> change, there is no need to reset the remote_lsn.
> >
> 
> As part of the above comment, I had asked for three things (a) avoid setting
> data->remote_lsn = InvalidXLogRecPtr; in request_publisher_status(); (b)
> avoid setting data->last_phase_at =InvalidFullTransactionId; in
> get_candidate_xid(); (c) reset data in
> wait_for_local_flush() after wait is over. You only did (a) in the patch and 
> didn't
> mention anything about (b) or (c). Is that intentional? If so, what is the 
> reason?

I think I misunderstood the intention, so will address in next version.

> 
> *
> +static bool
> +can_advance_nonremovable_xid(RetainConflictInfoData *data) {
> +
> 
> Isn't it better to make this an inline function as it contains just one check?

Agreed. Will address in next version.

> 
> *
> + /*
> + * The non-removable transaction ID for a subscription is centrally
> + * managed by the main apply worker.
> + */
> + if (!am_leader_apply_worker())
> 
> I have tried to improve this comment in the attached.

Thanks, will check and merge the next version.

Best Regards,
Hou zj



Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Michael Harris
Hi again

On Wed, 11 Dec 2024 at 12:09, Michael Harris  wrote:
> But another system I can access has multiple databases with ongoing
> imports, yet all the errors bar one relate to one directory.
> I will collect some data from that system and post it shortly.

I've attached the same set of data collected from an RHEL8 system.

Unfortunately the 'agresv' subcommand does not exist in the version of
xfs_db that is available on RHEL8, so I was not able to implement that
suggestion.

I thought I had one *L8 system that had an XFS filesystem and had not
experienced this issue, but it turns out it had - just at a much lower
frequency.

Cheers
Mike


rhel8_fallocate_fail.log
Description: Binary data


Re: Assert failure on running a completed portal again

2024-12-10 Thread Robert Haas
On Tue, Dec 10, 2024 at 12:14 PM Tom Lane  wrote:
> > Did you look at the commit message for
> > 691b8d59281b5177f16fe80858df921f77a8e955? IMHO, that's pretty clear
> > about what the goals of this were.
>
> Well, it's not very clear about what implementation limitations we
> are trying to protect.

OK, I see that I was vague about that. The core problem here is that
when we start parallel workers, we do a one-time state transfer from
the leader to the workers. If the leader makes any changes to any of
that state, or the workers do, then they're out of sync, and very bad
things will happen. To prevent that, we need a mechanism to keep any
of that state from changing while there are active workers, and that
mechanism is parallel mode. The leader enters parallel mode (via
EnterParallelMode()) before launching any workers and must only exit
parallel mode after they're all gone; the workers run in parallel mode
the whole time. Hence, we are (hopefully) guaranteed that the relevant
state can't change under us.
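
For readers who haven't seen it, the bracketing looks roughly like this
(paraphrasing src/backend/access/transam/README.parallel; the library and
entry-point names below are placeholders):

	/* assumes access/parallel.h and access/xact.h */
	ParallelContext *pcxt;

	EnterParallelMode();		/* freeze the state we're about to copy */
	pcxt = CreateParallelContext("mylib", "my_worker_main", nworkers);
	InitializeParallelDSM(pcxt);	/* one-time state transfer into the DSM segment */
	LaunchParallelWorkers(pcxt);
	/* ... leader runs its share of the plan ... */
	WaitForParallelWorkersToFinish(pcxt);	/* all workers must be gone before ... */
	DestroyParallelContext(pcxt);
	ExitParallelMode();			/* ... we allow the copied state to change again */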

Hence, the most fundamental restriction here is that ExecutePlan() had
better not reach the call to ExitParallelMode() at the end of the
function while there are still workers active. Since workers don't
exit until they've run their portion of the plan to completion (except
in case of error), we need to run the plan to completion in the leader
too before ExecutePlan() escapes from the loop; otherwise,
some worker could have filled up its output queue with tuples and then
blocked waiting for us to read them and we never did. In the case at
issue, if the plan was already run to completion, then there shouldn't
be any workers active any more and trying to re-execute the completed
tree should not launch any new ones, so in theory there's no problem,
but...

> Hmm.  As committed, the re-execution will happen with
> use_parallel_mode disabled, which might make that safer, or maybe it's
> less safe?  Do you think we need something closer to the upthread
> proposal to not run the executor at all during the "re-execution"?
> (I'd still be inclined to put that short-circuit into ExecutePlan
> not anything higher-level.)

...hypothetically, there could be code which doesn't enjoy being
called on the same plan tree first in parallel mode and then later in
non-parallel mode. I can't really see any good reason for such code to
actually exist. For example, ExecGather() bases its decision about
whether to launch workers on gather->num_workers > 0 &&
estate->es_use_parallel_mode, but it looks to me like all the
decisions about whether to read from workers or close down workers are
based on what workers actually got launched without further reference
to the criteria that ExecGather() used to decide whether to launch
them in the first place. That seems like the right way for things to
be coded, and I think that probably all the things are actually coded
that way. I just wanted to point out that it wasn't theoretically
impossible for this change to run into some lower-level problem.

> My recollection is that even before parallelism we had some plan node
> types that didn't cope well with being called again after they'd once
> returned EOF (ie a null tuple).  So maybe a defense against trying to
> do that would be wise.  It could probably be as simple as checking
> estate->es_finished.

It's not a crazy idea, but it might be better to fix those node types.
It might just be me, but I find the comments in the executor to be
pretty lacking and there's a good amount of boilerplate that seems to
have been copy-and-pasted around without the author totally
understanding what was happening. That never gets better if we just
paper over the bugs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Assert failure on running a completed portal again

2024-12-10 Thread Tom Lane
I wrote:
> My recollection is that even before parallelism we had some plan node
> types that didn't cope well with being called again after they'd once
> returned EOF (ie a null tuple).  So maybe a defense against trying to
> do that would be wise.

I spent a little time trying to run that recollection to ground,
and after awhile arrived at this comment in heapam.c:

 * Note: when we fall off the end of the scan in either direction, we
 * reset rs_inited.  This means that a further request with the same
 * scan direction will restart the scan, which is a bit odd, but a
 * request with the opposite scan direction will start a fresh scan
 * in the proper direction.  The latter is required behavior for cursors,
 * while the former case is generally undefined behavior in Postgres
 * so we don't care too much.

So indeed nodes types as basic as SeqScan will misbehave if that
happens.  However, there is also logic in pquery.c that intends
to provide defenses against this, for instance in PortalRunSelect:

 * Determine which direction to go in, and check to see if we're already
 * at the end of the available tuples in that direction.  If so, set the
 * direction to NoMovement to avoid trying to fetch any tuples.  (This
 * check exists because not all plan node types are robust about being
 * called again if they've already returned NULL once.)  Then call the
 * executor (we must not skip this, because the destination needs to see a
 * setup and shutdown even if no tuples are available).  Finally, update
 * the portal position state depending on the number of tuples that were
 * retrieved.
 */
if (forward)
{
if (portal->atEnd || count <= 0)
{
direction = NoMovementScanDirection;
count = 0;/* don't pass negative count to executor */
}

And ExecutorRun skips ExecutePlan if direction is
NoMovementScanDirection.  So unless there are gaps in pquery.c's EOF
checks, I think we're OK: the "re-execution" call will not reach
ExecutePlan.  This could probably do with more commentary though.

regards, tom lane




Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-10 Thread Masahiko Sawada
On Tue, Dec 10, 2024 at 1:16 AM Amit Kapila  wrote:
>
> On Tue, Dec 10, 2024 at 11:24 AM Amit Kapila  wrote:
> >
> > On Tue, Dec 10, 2024 at 8:54 AM vignesh C  wrote:
> > >
> > > On Tue, 10 Dec 2024 at 04:56, Michael Paquier  wrote:
> > > >
> > > > On Mon, Dec 09, 2024 at 03:36:15PM +0530, Amit Kapila wrote:
> > > > > It couldn't solve the problem completely even in back-branches. The
> > > > > SQL API case I mentioned and tested by Hou-San in the email [1] won't
> > > > > be solved.
> > > > >
> > > > > [1] - 
> > > > > https://www.postgresql.org/message-id/OS0PR01MB57166A4DA0ABBB94F2FBB28694362%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> > > >
> > > > Yeah, exactly (wanted to reply exactly that yesterday but lacked time,
> > > > thanks!).
> > >
> > > Yes, that makes sense. How about something like the attached patch.
> > >
> >
> > - oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > - if (data->publications)
> > - {
> > - list_free_deep(data->publications);
> > - data->publications = NIL;
> > - }
> > + static MemoryContext pubctx = NULL;
> > +
> > + if (pubctx == NULL)
> > + pubctx = AllocSetContextCreate(CacheMemoryContext,
> > +"logical replication publication list context",
> > +ALLOCSET_SMALL_SIZES);
> > + else
> > + MemoryContextReset(pubctx);
> > +
> > + oldctx = MemoryContextSwitchTo(pubctx);
> >
> > Considering the SQL API case, why is it okay to allocate this context
> > under CacheMemoryContext?
> >
>
> On further thinking, we can't allocate it under
> LogicalDecodingContext->context because once that is freed at the end
> of SQL API pg_logical_slot_get_changes(), pubctx will be pointing to a
> dangling memory. One idea is that we use
> MemoryContextRegisterResetCallback() to invoke a reset callback
> function where we can reset pubctx but not sure if we want to go there
> in back branches. OTOH, the currently proposed fix won't leak memory
> on repeated calls to pg_logical_slot_get_changes(), so that might be
> okay as well.
>
> Thoughts?

Alternative idea is to declare pubctx as a file static variable. And
we create the memory context under LogicalDecodingContext->context in
the startup callback and free it in the shutdown callback.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Separate GUC for replication origins

2024-12-10 Thread Euler Taveira
Hi,

We usually use max_replication_slots as the maximum number of replication
origins. It predates the logical replication infrastructure (commit
5aa2350426c). However, it is confusing (if it is the first time you are dealing
with logical replication. Why do I need to set replication slots on subscriber
if the replication slots are stored on publisher?) and it has a limitation (you
cannot have a setup where you want to restrict the number of replication slots
and have a high number of subscriptions or vice-versa). The initial commit also
mentions that it is not adequate (origin.c).

/*
* XXX: max_replication_slots is arguably the wrong thing to use, as here
* we keep the replay state of *remote* transactions. But for now it seems
* sufficient to reuse it, rather than introduce a separate GUC.
*/

and another comment suggests that we should have a separate GUC for it.

/*  
* Base address into a shared memory array of replication states of size
* max_replication_slots.   
*  
* XXX: Should we use a separate variable to size this rather than  
* max_replication_slots?   
*/

The commit a8500750ca0 is an attempt to clarify that the max_replication_slots
can have different meanings depending on the node role (publisher or
subscriber).

I'm attaching a patch that adds max_replication_origins. It basically replaces
all of the places that refer to max_replication_slots on the subscriber. It
uses the same default value as max_replication_slots (10). I did nothing to
keep backward compatibility, i.e. a special value (like -1) that means the
same as max_replication_slots. Is it worth it? (If not, it should be
mentioned in the commit message.)
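
If we did want such a special value, a minimal sketch (hypothetical, not in the
attached patch; the helper name is made up) would be to resolve it wherever the
origin array is sized:

	/* -1 means "fall back to max_replication_slots" */
	static int
	effective_max_replication_origins(void)
	{
		return (max_replication_origins >= 0) ?
			max_replication_origins : max_replication_slots;
	}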


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 5f9e43ce48b523e3bca8e1ef8ee9a427c676381c Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Tue, 3 Sep 2024 12:10:20 -0300
Subject: [PATCH v1] Separate GUC for replication origins

This feature already exists but it is provided by an existing GUC:
max_replication_slots. The new GUC (max_replication_origins) defines the
maximum number of replication origins that can be tracked
simultaneously. The max_replication_slots was used for this purpose but
it is confusing (when you are learning about logical replication) and
introduces a limitation (you cannot have a small number of replication
slots and a high number of subscriptions). It uses the same default
value as max_replication_slots.
---
 doc/src/sgml/config.sgml  | 26 ++--
 doc/src/sgml/logical-replication.sgml |  6 +-
 src/backend/replication/logical/launcher.c|  4 +-
 src/backend/replication/logical/origin.c  | 65 +--
 src/backend/utils/misc/guc_tables.c   | 12 
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/bin/pg_basebackup/pg_createsubscriber.c   | 18 ++---
 .../t/040_pg_createsubscriber.pl  |  4 +-
 src/bin/pg_upgrade/check.c| 14 ++--
 src/bin/pg_upgrade/t/004_subscription.pl  | 14 ++--
 src/include/replication/origin.h  |  3 +
 11 files changed, 83 insertions(+), 85 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e0c8325a39..779fcddd34 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4493,13 +4493,6 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
  to replica or higher to allow replication slots to
  be used.
 
-
-
- Note that this parameter also applies on the subscriber side, but with
- a different meaning. See 
- in  for more
- details.
-

   
 
@@ -5202,10 +5195,10 @@ ANY num_sync ( 
-  max_replication_slots (integer)
+ 
+  max_replication_origins (integer)

-max_replication_slots configuration parameter
+max_replication_origins configuration parameter
 in a subscriber

   
@@ -5217,18 +5210,13 @@ ANY num_sync ( pg_replication_origin_status)
-will prevent the server from starting.
-max_replication_slots must be set to at least the
+will prevent the server from starting. The default is 10. This parameter
+can only be set at server start.
+
+max_replication_origins must be set to at least the
 number of subscriptions that will be added to the subscriber, plus some
 reserve for table synchronization.

-
-   
-Note that this parameter also applies on a sending server, but with
-a different meaning. See 
-in  for more
-details.
-   
   
  
 
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replic

Re: Assert failure on running a completed portal again

2024-12-10 Thread Robert Haas
On Tue, Dec 10, 2024 at 1:37 PM Tom Lane  wrote:
> And ExecutorRun skips ExecutePlan if direction is
> NoMovementScanDirection.  So unless there are gaps in pquery.c's EOF
> checks, I think we're OK: the "re-execution" call will not reach
> ExecutePlan.  This could probably do with more commentary though.

Thanks for the research, and +1 if you feel like adding more commentary.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Document NULL

2024-12-10 Thread Jeff Davis
On Mon, 2024-12-09 at 15:27 -0700, David G. Johnston wrote:
> I have not done this.  This is already a large patch and this kind of
> example doesn't seem like our norm.  I'm not opposed to more content
> like this but for now would leave considering it as something an
> interested party can propose once this goes in.

Fair enough

Though I think it's a great example and I'd like to find some place to
put it.

> 
>    
>     The presence of null values in the system results in three-valued
> logic.
>     In conventional two-valued (binary) logic every outcome is either
> true or false.
>     In three-valued logic the concept of unknown, represented using
> the null value, is
>     also an outcome.  This results in falsifying the common-sense
> notion
>     that "p OR NOT p" is always true.
>    

Thank you.

I might reword the final sentence as more of an example, like: "Unknown
values can lead to surprising behavior, for instance "NULL OR NOT NULL"
evaluates to the null value."

> 
> When executing an aggregate or window function the state tracking
> component
>    (which may be initialized to a non-null value, e.g., 0 for the
> count function)
>    will remain unchanged even if the underlying processing
>    function returns a null value, whether from being defined strict
>    or it simply returns a null value upon execution.

Thank you.

> Since count doesn't have an input function to check the only way to
> see zero such rows is if the underlying thing being counted is empty.

While true for COUNT(*), technically that's incorrect for COUNT(x),
which counts the rows for which x is non-null. That doesn't invalidate
your point, though: the initial state is unchanged either way.

Regards,
Jeff Davis





Re: [PATCH] Support Int64 GUCs

2024-12-10 Thread Pavel Borisov
Hi, Wenhui!

On Tue, 10 Dec 2024 at 17:32, wenhui qiu  wrote:

> Hi aleksander
>Didn't you mark this patch as rejected last time? Do we still need to
> continue this path?
>
>
> Thanks
>
> On Sun, Dec 8, 2024 at 10:16 PM Evgeny Voropaev 
> wrote:
>
>> Hi hackers!
>>
>> Upgraded the "Int64 GUC" patch in order to conform to f3f06b13308e3
>> updates. Rebased and tested upon the current master (3f9b9621766). The
>> patch is still needed to be up to date as a part of the xid64 solution.
>>
>> Best regards,
>> Evgeny Voropaev,
>> TantorLabs, LLC.
>> 
>> 
>
>
I see Aleksander worked on this patch (with many other hackers) and marked
it as rejected. And now Evgeniy, a new developer wants to continue and
rebase this patch and also the patches in another 64-xid thread
disregarding the fact that the patch is rejected. It's not a crime. It's
rejected, true.

Regards,
Pavel Borisov


Re: XMLDocument (SQL/XML X030)

2024-12-10 Thread Jim Jones
Hi Andrew

On 10.12.24 14:59, Andrew Dunstan wrote:
> LGTM at a first glance.
>
>
> Please add this to the next CommitFest if you haven't done already. 

Thanks!

This is the CF entry: https://commitfest.postgresql.org/51/5431/

Best, Jim





Fix comments related to pending statistics

2024-12-10 Thread Bertrand Drouvot
Hi hackers,

while working on [1], I came across 2 comments in pgstat.h that I think are 
not correct.

1. One linked to PgStat_TableCounts pending stats, mentioning the use of
memcmp() against zeroes to detect whether there are any stats updates to apply.

This is not true anymore as of 07e9e28b56.

2. One linked to PgStat_FunctionCounts pending stats, mentioning the use of
memcmp() against zeroes to detect whether there are any pending stats: I think
it has never been the case.

Please find attached a patch to fix those comments.

[1]: 
https://www.postgresql.org/message-id/flat/ZlGYokUIlERemvpB%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From a7a026fa0f183bf4d66d85ea05463b69422d20a8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 10 Dec 2024 12:11:35 +
Subject: [PATCH v1] Fix comments related to pending statistics

The comment linked to the PgStat_TableCounts pending stats mentioning the use of
memcmp() against zeroes to detect whether there are any stats updates to apply
is not true anymore as of 07e9e28b56.

The one linked to memcmp() usage for the PgStat_FunctionCounts pending stats has
probably never been correct.
---
 src/include/pgstat.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
 100.0% src/include/

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 59c28b4aca..795e45653e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -128,8 +128,8 @@ typedef int64 PgStat_Counter;
 /* --
  * PgStat_FunctionCounts	The actual per-function counts kept by a backend
  *
- * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any pending stats.
+ * This struct should contain only actual event counters, because pending stats
+ * always have non-zero content.
  *
  * Note that the time counters are in instr_time format here.  We convert to
  * microseconds in PgStat_Counter format when flushing out pending statistics.
@@ -172,8 +172,9 @@ typedef struct PgStat_BackendSubEntry
 /* --
  * PgStat_TableCounts			The actual per-table counts kept by a backend
  *
- * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any stats updates to apply.
+ * This struct should contain only actual event counters, because we make use
+ * of pg_memory_is_all_zeros() to detect whether there are any stats updates to
+ * apply.
  * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
-- 
2.34.1



Make use of pg_memory_is_all_zeros() in more places

2024-12-10 Thread Bertrand Drouvot
Hi hackers,

While searching for memcmp() calls in "*stat*.c" files (due to [1]), it appeared
that we could $SUBJECT. Please find attached a patch doing so.

While at it, I did a quick check on all the memcmp() calls (even those not in
"*stat*.c" files) and it looks to me that there is no others that could be 
replaced
with pg_memory_is_all_zeros().
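
For illustration only (a standalone sketch, not the actual PostgreSQL
definitions; the ExampleAddr type and function names below are invented),
the two idioms compare like this:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct ExampleAddr		/* stand-in for a struct such as SockAddr */
{
	unsigned char bytes[16];
} ExampleAddr;

/* old idiom: memset() a zeroed copy and memcmp() against it */
static bool
is_unset_memcmp(const ExampleAddr *addr)
{
	ExampleAddr zero;

	memset(&zero, 0, sizeof(zero));
	return memcmp(addr, &zero, sizeof(zero)) == 0;
}

/* byte-wise check of the kind pg_memory_is_all_zeros() provides */
static bool
is_unset_all_zeros(const ExampleAddr *addr)
{
	const unsigned char *p = (const unsigned char *) addr;

	for (size_t i = 0; i < sizeof(ExampleAddr); i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

int
main(void)
{
	ExampleAddr a = {{0}};

	printf("%d %d\n", is_unset_memcmp(&a), is_unset_all_zeros(&a));	/* prints "1 1" */
	return 0;
}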

[1]: 
https://www.postgresql.org/message-id/flat/Z1hNLvcPgVLPxCoc%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 9b58b74506e27f2cd3a0e7d00fa1c32c7aaf2212 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 10 Dec 2024 11:01:40 +
Subject: [PATCH v1] Make use of pg_memory_is_all_zeros() in more places

Some places are using a memset()/memcmp() combination to check that a structure
contains all zeros: make use of pg_memory_is_all_zeros() instead.
---
 src/backend/utils/adt/pgstatfuncs.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)
 100.0% src/backend/utils/adt/

diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 60a397dc56..089c8ad965 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -361,7 +361,6 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 		/* Values only available to role member or pg_read_all_stats */
 		if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
 		{
-			SockAddr	zero_clientaddr;
 			char	   *clipped_activity;
 
 			switch (beentry->st_state)
@@ -483,9 +482,8 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 nulls[11] = true;
 
 			/* A zeroed client addr means we don't know */
-			memset(&zero_clientaddr, 0, sizeof(zero_clientaddr));
-			if (memcmp(&(beentry->st_clientaddr), &zero_clientaddr,
-	   sizeof(zero_clientaddr)) == 0)
+			if (pg_memory_is_all_zeros(&(beentry->st_clientaddr),
+	   sizeof(beentry->st_clientaddr)))
 			{
 nulls[12] = true;
 nulls[13] = true;
@@ -880,7 +878,6 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
 {
 	int32		procNumber = PG_GETARG_INT32(0);
 	PgBackendStatus *beentry;
-	SockAddr	zero_clientaddr;
 	char		remote_host[NI_MAXHOST];
 	int			ret;
 
@@ -891,9 +888,8 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
 		PG_RETURN_NULL();
 
 	/* A zeroed client addr means we don't know */
-	memset(&zero_clientaddr, 0, sizeof(zero_clientaddr));
-	if (memcmp(&(beentry->st_clientaddr), &zero_clientaddr,
-			   sizeof(zero_clientaddr)) == 0)
+	if (pg_memory_is_all_zeros(&(beentry->st_clientaddr),
+			   sizeof(beentry->st_clientaddr)))
 		PG_RETURN_NULL();
 
 	switch (beentry->st_clientaddr.addr.ss_family)
@@ -925,7 +921,6 @@ pg_stat_get_backend_client_port(PG_FUNCTION_ARGS)
 {
 	int32		procNumber = PG_GETARG_INT32(0);
 	PgBackendStatus *beentry;
-	SockAddr	zero_clientaddr;
 	char		remote_port[NI_MAXSERV];
 	int			ret;
 
@@ -936,9 +931,8 @@ pg_stat_get_backend_client_port(PG_FUNCTION_ARGS)
 		PG_RETURN_NULL();
 
 	/* A zeroed client addr means we don't know */
-	memset(&zero_clientaddr, 0, sizeof(zero_clientaddr));
-	if (memcmp(&(beentry->st_clientaddr), &zero_clientaddr,
-			   sizeof(zero_clientaddr)) == 0)
+	if (pg_memory_is_all_zeros(&(beentry->st_clientaddr),
+			   sizeof(beentry->st_clientaddr)))
 		PG_RETURN_NULL();
 
 	switch (beentry->st_clientaddr.addr.ss_family)
-- 
2.34.1



Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Andres Freund
On 2024-12-10 11:34:15 -0500, Andres Freund wrote:
> On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
> >  On Tue, Dec 10, 2024 at 7:34 AM Michael Harris  wrote:
> > 2.
> > 
> > > # xfs_info /dev/mapper/ippvg-ipplv
> > > meta-data=/dev/mapper/ippvg-ipplv isize=512agcount=4,
> > agsize=262471424 blks
> > > =   sectsz=512   attr=2, projid32bit=1
> > > =   crc=1finobt=0, sparse=0, rmapbt=0
> > > =   reflink=0bigtime=0 inobtcount=0
> > nrext64=0
> > 
> > Yay, reflink=0, that's pretty old fs ?!
> 
> I think that only started to default to on more recently (2019, plus time to
> percolate into RHEL). The more curious cases are finobt=0 (turned on by default
> since 2015) and to a lesser degree sparse=0 (turned on by default since 2018).

One thing that might be interesting is to compare xfs_info of affected and
non-affected servers...




Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Andres Freund
Hi,

On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
>  On Tue, Dec 10, 2024 at 7:34 AM Michael Harris  wrote:
> 1. Well it doesn't look like XFS AG fragmentation to me (we had a customer
> with a huge number of AGs with small space in them) reporting such errors
> after upgrading to 16, but not for earlier versions (somehow
> posix_fallocate() had to be the culprit).

Given that the workload expires old partitions, I'm not sure we can conclude a
whole lot from the current state :/


> 2.
> 
> > # xfs_info /dev/mapper/ippvg-ipplv
> > meta-data=/dev/mapper/ippvg-ipplv isize=512agcount=4,
> agsize=262471424 blks
> > =   sectsz=512   attr=2, projid32bit=1
> > =   crc=1finobt=0, sparse=0, rmapbt=0
> > =   reflink=0bigtime=0 inobtcount=0
> nrext64=0
> 
> Yay, reflink=0, that's pretty old fs ?!

I think that only started to default to on more recently (2019, plus time to
percolate into RHEL). The more curious cases are finobt=0 (turned on by default
since 2015) and to a lesser degree sparse=0 (turned on by default since 2018).


> > ERROR:  could not extend file
> "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No
> space left on device
> 
> 2. This indicates it was allocating 1GB for such a table (".1"), on
> tablespace that was created more than a year ago. Could you get us maybe
> those below commands too? (or from any other directory exhibiting such
> errors)

The date in the directory is the catversion of the server, which is just
determined by the major version being used, not the creation time of the
tablespace.

andres@awork3:~/src/postgresql$ git grep CATALOG_VERSION_NO 
upstream/REL_16_STABLE src/include/catalog/catversion.h
upstream/REL_16_STABLE:src/include/catalog/catversion.h:#define 
CATALOG_VERSION_NO   202307071

Greetings,

Andres Freund




Re: Assert failure on running a completed portal again

2024-12-10 Thread Robert Haas
On Thu, Dec 5, 2024 at 8:09 PM Tom Lane  wrote:
> After looking at this further, I think this whole "run_once"
> business is badly designed, worse documented, and redundant
> (as well as buggy).  It can be replaced with three self-contained
> lines in ExecutePlan, as per the attached.

Did you look at the commit message for
691b8d59281b5177f16fe80858df921f77a8e955? IMHO, that's pretty clear
about what the goals of this were.

I didn't consider the scenario mentioned by Nagata-san in his original
post, namely, executing a portal with a row count of 0 and then doing
the same thing again. So in that sense the assertion fulfilled its
intended purpose: alerting somebody that there is a case which is
reachable but which the original programmer believed to be
unreachable.

I haven't tested, but I'm guessing that what happens in Nagata-san's
case with the committed fix is that we execute the plan once to
completion in parallel mode; and then the second execute message
executes the same plan tree again in non-parallel mode but this time
it ends up returning zero tuples because it's already been executed to
completion. If that is correct, then I think this is subtly changing
the rules for parallel Plan trees: before, the rule as I understood
it, was "only complete execution must supported" but now it's
"complete executed must be supported and, in addition, a no-op
re-execution after complete execution must be supported". Off the top
of my head that seems like it should be fine.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fix bank selection logic in SLRU

2024-12-10 Thread Dilip Kumar
On Tue, Dec 10, 2024 at 6:32 PM Andrey M. Borodin  wrote:
>
>
>
> > On 10 Dec 2024, at 15:39, Yura Sokolov  wrote:
> >
> > It is not critical bug, since it doesn't hurt correctness just performance. 
> > In worst case only one bank will be used.
>
> Ugh... yeah. IMO the problem is that we do not have protection that rejects 
> values that are not power of 2.
> If other values given system operates as if there are 2^(popcount(n)-1) 
> banks. So if we just round down value to nearest power of 2 - we will help 
> incorrectly configured systems to use proper amount of memory and keep 
> performance of properly configured systems.
>
> IMO doing modulo is not necessary. And hash function is pure waste of CPU 
> cycles.


IIUC, we do check that it should be in multiples of the bank size, which is
a multiple of 2, right?  Am I missing something?

/*
 * Helper function for GUC check_hook to check whether slru buffers are in
 * multiples of SLRU_BANK_SIZE.
 */
bool
check_slru_buffers(const char *name, int *newval)
{
	/* Valid values are multiples of SLRU_BANK_SIZE */
	if (*newval % SLRU_BANK_SIZE == 0)
		return true;

	GUC_check_errdetail("\"%s\" must be a multiple of %d", name,
						SLRU_BANK_SIZE);
	return false;
}



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Sample rate added to pg_stat_statements

2024-12-10 Thread Ilia Evdokimov

Hi everyone,

I attached the previous sampling patch for pg_stat_statements (v4).
Suggestions like sampling based on execution time remain unfeasible, as
pg_stat_statements decides whether to track a query during query planning,
before its execution time is known.


To evaluate the implementation, I ran a benchmark creating 1,000 random 
tables and executing randomized JOIN queries on a small machine. When 
pg_stat_statements is enabled, performance decreases, but reducing the 
sampling rate helps mitigate the impact and improves performance.


I’d be interested in hearing your new thoughts. Are there areas where 
this patch could be improved, or other ideas worth exploring?


--
Best regards.
Ilia Evdokimov,
Tantor Labs LLC.
From 68f5451019b261bf03a222f5a05ac57cd0fb9a24 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov 
Date: Thu, 21 Nov 2024 11:24:03 +0300
Subject: [PATCH] Allow setting sample rate for pg_stat_statements

New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries, to reduce the amount
of tracking.
---
 .../pg_stat_statements/pg_stat_statements.c   | 23 ++-
 doc/src/sgml/pgstatstatements.sgml| 18 +++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 49c657b3e0..d06e0d8a44 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -49,6 +49,7 @@
 
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
@@ -294,12 +295,16 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 			 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
 
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled)
 
 #define record_gc_qtexts() \
 	do { \
@@ -414,6 +419,19 @@ _PG_init(void)
 			NULL,
 			NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+			 "Fraction of queries to process.",
+			 NULL,
+			 &pgss_sample_rate,
+			 1.0,
+			 0.0,
+			 1.0,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 			 "Selects which statements are tracked by pg_stat_statements.",
 			 NULL,
@@ -835,6 +853,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
+	if (nesting_level == 0)
+		current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);
+
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
 		return;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..d06349d097 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
 

 
+   
+
+ pg_stat_statements.sample_rate (real)
+ 
+  pg_stat_statements.sample_rate configuration parameter
+ 
+
+
+
+ 
+  pg_stat_statements.sample_rate causes pg_stat_statements to only
+  track a fraction of the statements in each session.  The default is 1,
+  meaning track all the queries. In case of nested statements,
+  either all will be tracked or none. Only superusers can change this setting.
+ 
+
+   
+

 
  pg_stat_statements.save (boolean)
-- 
2.34.1



Re: Fix bank selection logic in SLRU

2024-12-10 Thread Andrey M. Borodin



> On 10 Dec 2024, at 16:58, Dilip Kumar  wrote:
> 
> slru buffers are in multiple of 16(bank size)

Yes, my example with 64 buffers is incorrect.
The worst-case scenario is when a user configures 80, 144, 528 or 1040 buffers: 
then only two banks (32 buffers) will be used.


Best regards, Andrey Borodin.



Re: Assert failure on running a completed portal again

2024-12-10 Thread Tom Lane
Robert Haas  writes:
> On Thu, Dec 5, 2024 at 8:09 PM Tom Lane  wrote:
>> After looking at this further, I think this whole "run_once"
>> business is badly designed, worse documented, and redundant
>> (as well as buggy).  It can be replaced with three self-contained
>> lines in ExecutePlan, as per the attached.

> Did you look at the commit message for
> 691b8d59281b5177f16fe80858df921f77a8e955? IMHO, that's pretty clear
> about what the goals of this were.

Well, it's not very clear about what implementation limitations we
are trying to protect.

> I haven't tested, but I'm guessing that what happens in Nagata-san's
> case with the committed fix is that we execute the plan once to
> completion in parallel mode; and then the second execute message
> executes the same plan tree again in non-parallel mode but this time
> it ends up returning zero tuples because it's already been executed to
> completion. If that is correct, then I think this is subtly changing
> the rules for parallel Plan trees: before, the rule as I understood
> it, was "only complete execution must supported" but now it's
> "complete executed must be supported and, in addition, a no-op
> re-execution after complete execution must be supported". Off the top
> of my head that seems like it should be fine.

Hmm.  As committed, the re-execution will happen with
use_parallel_mode disabled, which might make that safer, or maybe it's
less safe?  Do you think we need something closer to the upthread
proposal to not run the executor at all during the "re-execution"?
(I'd still be inclined to put that short-circuit into ExecutePlan
not anything higher-level.)

My recollection is that even before parallelism we had some plan node
types that didn't cope well with being called again after they'd once
returned EOF (ie a null tuple).  So maybe a defense against trying to
do that would be wise.  It could probably be as simple as checking
estate->es_finished.
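
For what it's worth, here is a toy standalone model (purely illustrative,
not executor code) of that kind of guard: once the state is marked
finished, a repeated run request becomes a no-op, much like checking
estate->es_finished (which standard_ExecutorFinish() sets) before running
the plan.

#include <stdbool.h>
#include <stdio.h>

struct exec_state
{
	bool		finished;		/* plays the role of EState.es_finished */
	int			rows_left;
};

static void
run_plan(struct exec_state *estate)
{
	if (estate->finished)
		return;					/* EOF already reached: don't touch the plan again */

	while (estate->rows_left > 0)
		printf("emit row, %d left afterwards\n", --estate->rows_left);

	estate->finished = true;	/* what ExecutorFinish() records in the real executor */
}

int
main(void)
{
	struct exec_state es = {false, 3};

	run_plan(&es);				/* runs to completion */
	run_plan(&es);				/* second execute message: harmless no-op */
	return 0;
}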

regards, tom lane




Re: Refactoring postmaster's code to cleanup after child exit

2024-12-10 Thread Heikki Linnakangas

On 09/12/2024 22:55, Heikki Linnakangas wrote:
Not sure how to fix this. A small sleep in the test would work, but in 
principle there's no delay that's guaranteed to be enough. A more robust 
solution would be to run a "select count(*) from pg_stat_activity" and 
wait until the number of connections are what's expected. I'll try that 
and see how complicated that gets..


Checking pg_stat_activity doesn't help, because the backend doesn't 
register itself in pg_stat_activity until later. A connection that's 
rejected due to connection limits never shows up in pg_stat_activity.


Some options:

0. Do nothing

1. Add a small sleep to the test

2. Move the pgstat_bestart() call earlier in the startup sequence, so 
that a backend shows up in pg_stat_activity before it acquires a PGPROC 
entry, and stays visible until after it has released its PGPROC entry. 
This would give more visibility to backends that are starting up.


3. Rearrange the FATAL error handling so that the process removes itself 
from PGPROC before sending the error to the client. That would be kind 
of nice anyway. Currently, if sending the rejection error message to the 
client blocks, you are holding up a PGPROC slot until the message is 
sent. The error message packet is short, so it's highly unlikely to 
block, but still.


Option 3 seems kind of nice in principle, but looking at the code, it's 
a bit awkward to implement.  Easiest way to implement it would be to 
modify send_message_to_frontend() to not call pq_flush() on FATAL 
errors, and flush the data in socket_close() instead. Not a lot of code, 
but it's a pretty ugly special case.


Option 2 seems nice too, but seems like a lot of work.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

2024-12-10 Thread Guillaume Lelarge
Hello,

Le mar. 10 déc. 2024 à 03:57, David Rowley  a écrit :

> On Thu, 21 Nov 2024 at 08:30, Guillaume Lelarge 
> wrote:
> > OK, I'm fine with this. v4 patch attached with one plan showing read,
> written, and dirtied buffers.
>
> Today I spent more time polishing this patch. There were a few cases
> in the docs that displayed EXPLAIN ANALYZE output that you'd not
> adjusted to include the buffers output or adjusted to do BUFFERS OFF.
> I think I've got all these now. Tom went to some effort to fix some
> outdated EXPLAIN outputs for v17 in 5caa05749, so I think we owe it to
> him not to let these go out of date so soon after that change.
>
>
You're right and I completely forgot to check the whole documentation. I
just looked at perform.sgml which was the obvious file for explain plans.
Anyway, sorry about this, and thanks a lot for your work on this patch.

I also was thinking again about what Robert mentioned about
> auto_explain.log_buffers should now also be on by default.  I'm less
> certain than him about this change. It seems like a separate
> consideration that we could apply many of the same arguments for the
> main change to.  In any case, I extracted that change from the 0001
> patch and put it in a 0002 patch as it doesn't seem like something
> that should be a sidenote in the commit message. I felt doing that
> increases the chances that it would be overlooked in the release
> notes.
>
> I was very close to pushing 0001 today, but will hold off until
> tomorrow to see if anyone has final comments.
>
>
No more comments. I'm fine with both patches.


> For 0002, I'd really like to see a bit more justification for it.  For
> the record, I'm not against 0002, it's just that my personal arguments
> for wanting 0001 don't apply to 0002.
>
>
I guess consistency is the key word here. But I agree that 0001 is the one
that's really important to me.

Thanks again for your work on this.


-- 
Guillaume.


Fix bank selection logic in SLRU

2024-12-10 Thread Yura Sokolov

Good day, hackers.

Due to the long discussion about SLRU size configuration [1] and code 
evolution, a non-serious bug was introduced:


- intermediate versions assumed the cache size is always a power of 2,
- therefore, to determine the bank, a simple binary AND with a mask was used:

bankno = pageno & ctl->bank_mask;

- the final merged version allows an arbitrary cache size for every cache
  type, but the code for bankno was not fixed.

It is not a critical bug, since it doesn't hurt correctness, just 
performance. In the worst case only one bank will be used.
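
A quick standalone illustration (not PostgreSQL code; the bank size of 16
mirrors SLRU_BANK_SIZE): with 80 buffers there are 5 banks, the old mask is
4 (binary 100), so "pageno & mask" can only ever produce bank 0 or bank 4,
while "pageno % nbanks" spreads pages across all five banks.

#include <stdio.h>

int
main(void)
{
	long		nslots = 80;			/* valid GUC value: a multiple of 16 */
	long		nbanks = nslots / 16;	/* 5 banks */
	long		mask = nbanks - 1;		/* 4 = 0b100: not a valid mask for 5 banks */

	for (long pageno = 0; pageno < 10; pageno++)
		printf("pageno %ld: mask -> bank %ld, modulo -> bank %ld\n",
			   pageno, pageno & mask, pageno % nbanks);
	return 0;
}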


I attach the patch, which changes SlruCtlData->bank_mask to ->nbanks,
and changes the calculation to a modulo operation.

bankno = pageno % ctl->nbanks;

Probably, instead of the modulo operation, a multiplication could be used:

bankno = ((uint64) murmurhash32(pageno) * ctl->nbanks) >> 32;

But I didn't bother to measure whether it pays off or not.

[1] 
https://www.postgresql.org/message-id/flat/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com


Regards,
Yura Sokolov aka funny-falconFrom 4b438e67a79d77bf2caf3e5a0386bb7700d329ba Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Tue, 10 Dec 2024 15:38:02 +0300
Subject: [PATCH v1] Fix SLRU bank selection.

Due to code evolution, nbanks is not power of two any more. But still
binary & were used to obtain bankno.
---
 src/backend/access/transam/slru.c | 6 +++---
 src/include/access/slru.h | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index e7f73bf4275..afedb5c039f 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -343,7 +343,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 	ctl->shared = shared;
 	ctl->sync_handler = sync_handler;
 	ctl->long_segment_names = long_segment_names;
-	ctl->bank_mask = (nslots / SLRU_BANK_SIZE) - 1;
+	ctl->nbanks = nbanks;
 	strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
 }
 
@@ -606,7 +606,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int64 pageno, TransactionId xid)
 {
 	SlruShared	shared = ctl->shared;
 	LWLock	   *banklock = SimpleLruGetBankLock(ctl, pageno);
-	int			bankno = pageno & ctl->bank_mask;
+	int			bankno = pageno % ctl->nbanks;
 	int			bankstart = bankno * SLRU_BANK_SIZE;
 	int			bankend = bankstart + SLRU_BANK_SIZE;
 
@@ -1177,7 +1177,7 @@ SlruSelectLRUPage(SlruCtl ctl, int64 pageno)
 		int			bestinvalidslot = 0;	/* keep compiler quiet */
 		int			best_invalid_delta = -1;
 		int64		best_invalid_page_number = 0;	/* keep compiler quiet */
-		int			bankno = pageno & ctl->bank_mask;
+		int			bankno = pageno % ctl->nbanks;
 		int			bankstart = bankno * SLRU_BANK_SIZE;
 		int			bankend = bankstart + SLRU_BANK_SIZE;
 
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 97e612cd100..fabea220290 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -129,9 +129,9 @@ typedef struct SlruCtlData
 	SlruShared	shared;
 
 	/*
-	 * Bitmask to determine bank number from page number.
+	 * Number of banks to determine bank number from page number.
 	 */
-	bits16		bank_mask;
+	bits16		nbanks;
 
 	/*
 	 * If true, use long segment file names.  Otherwise, use short file names.
@@ -179,7 +179,7 @@ SimpleLruGetBankLock(SlruCtl ctl, int64 pageno)
 {
 	int			bankno;
 
-	bankno = pageno & ctl->bank_mask;
+	bankno = pageno % ctl->nbanks;
 	return &(ctl->shared->bank_locks[bankno].lock);
 }
 
-- 
2.43.0



Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-10 Thread Masahiko Sawada
On Mon, Dec 9, 2024 at 6:52 PM Amit Kapila  wrote:
>
> On Tue, Dec 10, 2024 at 2:17 AM Masahiko Sawada  wrote:
> >
> > On Mon, Dec 9, 2024 at 2:06 AM Amit Kapila  wrote:
> > >
> > > On Thu, Dec 5, 2024 at 2:56 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > >
> > > > > I realized that this patch cannot be backpatched because it 
> > > > > introduces a new
> > > > > field into the public PGOutputData structure. Therefore, I think we 
> > > > > may need to
> > > > > use Alvaro's version [1] for the back branches.
> > > >
> > > > FWIW for back branches, I prefer using the foreach-pfree pattern
> > > > Michael first proposed, just in case. It's not elegant but it can
> > > > solve the problem while there is no risk of breaking non-core
> > > > extensions.
> > > >
> > >
> > > It couldn't solve the problem completely even in back-branches. The
> > > SQL API case I mentioned and tested by Hou-San in the email [1] won't
> > > be solved.
> >
> > True. There seems another place where we possibly leak memory on
> > CacheMemoryContext when using pgoutput via SQL APIs:
> >
> > /* Map must live as long as the session does. */
> > oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> >
> > entry->attrmap = build_attrmap_by_name_if_req(indesc, outdesc, 
> > false);
> >
> > MemoryContextSwitchTo(oldctx);
> > RelationClose(ancestor);
> >
> > entry->attrmap is pfree'd only when validating the RelationSyncEntry
> > so remains even after logical decoding API calls.
> >
>
> We have also noticed this but it needs more analysis on the fix which
> one of my colleagues is doing. I think we can fix this as a separate
> issue unless you think otherwise.

I agree to fix this as a separate patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-12-10 Thread Nathan Bossart
On Mon, Dec 09, 2024 at 04:41:03PM +, Bertrand Drouvot wrote:
> +   time_delayed bigint

I think it's also worth considering names like total_delay and
cumulative_delay.

> +   Total amount of time spent in milliseconds waiting during  linkend="guc-vacuum-cost-delay"/>
> +   or . In case of 
> parallel
> +   vacuum the reported time is across all the workers and the leader. The
> +   workers update the column no more frequently than once per second, so 
> it
> +   could show slightly old values.

I wonder if it makes sense to provide this value as an interval instead of
the number of milliseconds to make it more human-readable.  I might also
suggest some changes to the description:

Total accumulated time spent sleeping due to the cost-based vacuum
delay settings (e.g., vacuum_cost_delay, vacuum_cost_limit).  This
includes the time that any associated parallel workers have slept, too.
However, parallel workers report their sleep time no more frequently
than once per second, so the reported value may be slightly stale.

-- 
nathan




Re: Add Pipelining support in psql

2024-12-10 Thread Anthonin Bonnefoy
An improved version with simplifications and refinements.

num_queries (2nd element in the pipeline status prompt) is now used to
track queued queries that were not flushed (with a flush request or
sync) to the server. It used to count both unflushed queries and
flushed queries.

Code in ExecQueryAndProcessResults should be simpler now.
- DiscardAbortedPipelineResults function handles both discarding of
results until a synchronisation point is reached or discarding of
results until there's no more pending results.
- The logic to process the pipeline's results and getting the next
results fit more with the existing flow.
- Tests didn't cover chunked results, so I've added additional tests to
cover the use of pipelining + FETCH_COUNT.


v04-0001-Add-pipelining-support-in-psql.patch
Description: Binary data


Re: Proposal to add a new URL data type.

2024-12-10 Thread Victor Yegorov
чт, 5 дек. 2024 г. в 17:02, Alexander Borisov :

> My name is Alexander Borisov, I want to propose/discuss
> adding a new URL type as an extension in PostgreSQL contrib.
> I think everyone here knows/seen what URL/URI and their basics.
> If someone is interested in the basics, you can read the original
> RFC 1738 [1].
>
> ...
>
> I am willing to take care of the implementation of the new data type
> and its further support.  If it's of interest to the community.
>

Hey, I had a look at this patch and found its functionality mature and
performant.

As Peter mentioned pguri, I used it to compare with the proposed extension.
This brought up
the following differences:
- pguri (uriparser 0.9.8) doesn't support Chinese symbols in the host part
of URI (uri_test1.sh):

ERROR:  invalid input syntax for type uri at or near "事例.com#comments
"

  Therefore, I avoided Chinese or Cyrillic symbols in the pguri test script.
- There are no SET functions in pguri, so changing specific portions of a
URI is troublesome. I used
  replace() in the test, but this is an error-prone approach.
- It's even more troublesome to set parts of the URI that are not initially
there. Probably, a full decomposition
  into parts and a subsequent re-assembly is needed.
The suggested extension has no such limitations. Additionally, pguri extracts
userinfo as a whole,
while the suggested extension can get/set the user and password individually.


Running tests (attached) I got the following numbers:
$ ./url_test.sh
NOTICE:  extension "url" already exists, skipping
tps = 13068.287423 (without initial connection time)
tps = 12888.937747 (without initial connection time)
tps = 12830.642558 (without initial connection time)
tps = 12846.341411 (without initial connection time)
tps = 13187.955601 (without initial connection time)

$ ./uri_test2.sh
NOTICE:  extension "uri" already exists, skipping
tps = 2441.934308 (without initial connection time)
tps = 2513.277660 (without initial connection time)
tps = 2484.641673 (without initial connection time)
tps = 2519.312395 (without initial connection time)
tps = 2512.364492 (without initial connection time)

So it's 12.9k vs 2.5k, or 6x faster for a case where we replace 5 parts of
the original URL.

Given its performance and functionality, I find the suggested URL extension
better than pguri.


Now to the review part.

1. Applying the patch causes indentation warnings; please bring the spacing
in line with the project's policy.

$ git apply ~/0001-Add-url-data-type-to-contrib.patch
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:837: indent with
spaces.
return lexbor_calloc(1, sizeof(lexbor_array_t));
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:843: indent with
spaces.
if (array == NULL) {
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:844: indent with
spaces.
return LXB_STATUS_ERROR_OBJECT_IS_NULL;
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:845: indent with
spaces.
}
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:847: indent with
spaces.
if (size == 0) {
warning: squelched 6350 whitespace errors
warning: 6355 lines add whitespace errors.

2. There's a lexbor/ library that contains core and url parts. Feels like
some commentary about what's
  inside is required.

3. Do you think it's possible to adjust your code to use existing postgres
infrastructure instead? I don't
  think having its own infrastructure for a single extension is a good
thing. Also, this might be a source
  for performance improvements in the core.

4. There's no user visible documentation, please, add one.

I've created a commitfest entry for the patch:
https://commitfest.postgresql.org/51/5432/
I was not able to find you; please register a community account and set
yourself as an author of the patch.

-- 
Victor Yegorov
#!/bin/bash

export PATH=$PATH:$PWD/bin
dbname=postgres
secs=30
psql -c "alter system set max_parallel_workers_per_gather = 0;" $dbname > 
/dev/null
psql -c "select pg_reload_conf();" $dbname > /dev/null
psql -c "create extension if not exists url;" $dbname > /dev/null


cat << EOSQL > bench.sql
WITH u(url) AS MATERIALIZED (
  SELECT 'https://example.com#comments'::url
)
SELECT url_fragment_set(
   url_path_set(
 url_port_set(
   url_host_set(
 url_password_set(
   url_username_set(
 url_scheme_set( url, 'wss' )
   , 'guest' )
 , '12345' )
   , '事例.com' )
 , '8080' )
   , '/а/б/в' )
   , 'comment' )
  FROM u;
EOSQL

for i in {1..5}
do
pgbench -n -f bench.sql -M prepared -T $secs $dbname | grep tps
done
#!/bin/bash

export PATH=$PATH:$PWD/bin
dbname=postgres
secs=30
psql -c "alter system set max_parallel_workers_per_gather = 0;" $dbname > 
/dev/null
psql -c "select pg_reload_conf();" $dbname > /dev/null
psql -c "create extension if not exists uri;" $dbname > /dev/null


cat << EOSQL > bench.sql

Re: [PATCH] Support Int64 GUCs

2024-12-10 Thread Tom Lane
Pavel Borisov  writes:
> I see Aleksander worked on this patch (with many other hackers) and marked
> it as rejected. And now Evgeniy, a new developer wants to continue and
> rebase this patch and also the patches in another 64-xid thread
> disregarding the fact that the patch is rejected. It's not a crime. It's
> rejected, true.

No, but it's a waste of time.  Just rebasing the patch does nothing to
counter the arguments that made us reject it before.

regards, tom lane




Re: Fix bank selection logic in SLRU

2024-12-10 Thread Dilip Kumar
On Tue, 10 Dec 2024 at 7:30 PM, Robert Haas  wrote:

> On Tue, Dec 10, 2024 at 8:58 AM Dilip Kumar  wrote:
> >> Bank selection code assumes that number of buffers is power of 2.
> >> If the number of buffers is not power of 2 - only subset of buffers
> will be used. In worst case, e.g. 65 buffers, everything will be buffered
> only in bank 64.
> >
> > But why that would be the case? the acceptable values for GUC to
> configure the slru buffers are in multiple of 16(bank size) we have that
> check to check the GUC values.
>
> "Must be a multiple of 16" and "must be a power of 2" are different
> criteria. For example, 48 is a multiple of 16 but it is not a power of
> 2. If the code assumes that we have an actual power of 2, the check
> you quoted in your previous email is insufficient.


Yeah I see it’s an issue.  Thanks for clarifying.

—
Dilip

>


Re: Define STATS_MIN_ROWS for minimum rows of stats in ANALYZE

2024-12-10 Thread Ilia Evdokimov


On 09.12.2024 16:10, Ilia Evdokimov wrote:

Hi hackers,

The repeated use of the number 300 in the ANALYZE-related code creates 
redundancy and relies on scattered, sometimes unclear, comments to 
explain its purpose. This can make the code harder to understand, 
especially for new contributors who might not immediately understand 
its significance. To address this, I propose introducing a macro 
STATS_MIN_ROWS to represent this value and consolidating its 
explanation in a single place, making the code more consistent and 
readable.


--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.



Hi everyone,

Currently, the value 300 is used as the basis for determining the number 
of rows sampled during ANALYZE, both for single-column and extended 
statistics. While this value has a well-established rationale for 
single-column statistics, its suitability for extended statistics 
remains uncertain, as no specific research has confirmed that this is an 
optimal choice for them. To better reflect this distinction, I propose 
introducing two macros: STATS_MIN_ROWS for single-column statistics and 
EXT_STATS_MIN_ROWS for extended statistics.
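
For reference, the 300 multiplier that STATS_MIN_ROWS would name comes from
the bound quoted in std_typanalyze()'s comment (Chaudhuri, Motwani and
Narasayya, 1998); evaluating it at the parameters used there, f = 0.5,
gamma = 0.01, n = 10^6, gives roughly 300 rows per unit of statistics target:

    r = \frac{4 k \ln(2 n / \gamma)}{f^2}
      = \frac{4 k \ln(2 \cdot 10^6 / 0.01)}{0.5^2}
      = 16 k \ln(2 \cdot 10^8)
      \approx 16 k \times 19.11
      \approx 305.8\, k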


This change separates the concerns of single-column and extended 
statistics sampling, making the code more explicit and easier to adapt 
if future research suggests a different approach for extended 
statistics. The values remain the same for now, but the introduction of 
distinct macros improves clarity and prepares the codebase for potential 
refinements.


Does this seem like a reasonable approach to handling these differences?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
From 164535616040447207c36ea47c52b206f4d77dcf Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov 
Date: Tue, 10 Dec 2024 16:25:29 +0300
Subject: [PATCH v2] Define macros for minimum rows of stats in ANALYZE

This introduces two macros, STATS_MIN_ROWS and EXT_STATS_MIN_ROWS,
to represent the default minimum number of rows sampled in ANALYZE.
STATS_MIN_ROWS is used for single-column statistics,
while EXT_STATS_MIN_ROWS is intended for extended statistics.
Both macros replace the hardcoded value of 300, improving clarity.
---
 src/backend/commands/analyze.c| 31 +--
 src/backend/statistics/extended_stats.c   |  2 +-
 src/backend/tsearch/ts_typanalyze.c   |  4 +--
 src/backend/utils/adt/rangetypes_typanalyze.c |  7 ++---
 .../statistics/extended_stats_internal.h  | 11 +++
 src/include/statistics/statistics.h   | 23 ++
 6 files changed, 47 insertions(+), 31 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282..e358b0d828 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1897,42 +1897,25 @@ std_typanalyze(VacAttrStats *stats)
 	{
 		/* Seems to be a scalar datatype */
 		stats->compute_stats = compute_scalar_stats;
-		/*
-		 * The following choice of minrows is based on the paper
-		 * "Random sampling for histogram construction: how much is enough?"
-		 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
-		 * Proceedings of ACM SIGMOD International Conference on Management
-		 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
-		 * says that for table size n, histogram size k, maximum relative
-		 * error in bin size f, and error probability gamma, the minimum
-		 * random sample size is
-		 *		r = 4 * k * ln(2*n/gamma) / f^2
-		 * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
-		 *		r = 305.82 * k
-		 * Note that because of the log function, the dependence on n is
-		 * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
-		 * bin size error with probability 0.99.  So there's no real need to
-		 * scale for n, which is a good thing because we don't necessarily
-		 * know it at this point.
-		 *
-		 */
-		stats->minrows = 300 * stats->attstattarget;
 	}
 	else if (OidIsValid(eqopr))
 	{
 		/* We can still recognize distinct values */
 		stats->compute_stats = compute_distinct_stats;
-		/* Might as well use the same minrows as above */
-		stats->minrows = 300 * stats->attstattarget;
 	}
 	else
 	{
 		/* Can't do much but the trivial stuff */
 		stats->compute_stats = compute_trivial_stats;
-		/* Might as well use the same minrows as above */
-		stats->minrows = 300 * stats->attstattarget;
 	}
 
+	/*
+	 * For the scalar types, STATS_MIN_ROWS is derived from research on
+	 * optimal sample sizes for histograms. For other cases,
+	 * we assume the same value.
+	 */
+	stats->minrows = (STATS_MIN_ROWS * stats->attstattarget);
+
 	return true;
 }
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 99fdf208db..8cd024b5ce 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -320,7 +320,7 @@ ComputeExtStatisticsRows(Relation onerel,
 	MemoryContextDelete(cxt);
 
 	/* compute sampl

Re: [PATCH] Support Int64 GUCs

2024-12-10 Thread wenhui qiu
Hi aleksander
   Didn't you mark this patch as rejected last time? Do we still need to
continue this path?


Thanks

On Sun, Dec 8, 2024 at 10:16 PM Evgeny Voropaev 
wrote:

> Hi hackers!
>
> Upgraded the "Int64 GUC" patch in order to conform to f3f06b13308e3
> updates. Rebased and tested upon the current master (3f9b9621766). The
> patch is still needed to be up to date as a part of the xid64 solution.
>
> Best regards,
> Evgeny Voropaev,
> TantorLabs, LLC.
> 
> 


Re: Fix bank selection logic in SLRU

2024-12-10 Thread Dilip Kumar
On Tue, 10 Dec 2024 at 6:32 PM, Andrey M. Borodin 
wrote:

>
>
> > On 10 Dec 2024, at 15:39, Yura Sokolov  wrote:
> >
> > It is not critical bug, since it doesn't hurt correctness just
> performance. In worst case only one bank will be used.
>
> Ugh... yeah. IMO the problem is that we do not have protection that
> rejects values that are not power of 2.
> If other values given system operates as if there are 2^(popcount(n)-1)
> banks. So if we just round down value to nearest power of 2 - we will help
> incorrectly configured systems to use proper amount of memory and keep
> performance of properly configured systems.


+1


>
> IMO doing modulo is not necessary. And hash function is pure waste of CPU
> cycles.


I agree

—
Dilip

>


Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Andres Freund
Hi,

On 2024-12-10 17:28:21 +1100, Michael Harris wrote:
> On Tue, 10 Dec 2024 at 11:31, Andres Freund  wrote:
> > It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> > xfs_spaceman -c 'freesp -s' /mountpoint
> > xfs_spaceman -c 'health' /mountpoint
> > and df.
>
> I gathered this info from one of the systems that is currently on RL9.
> This system is relatively small compared to some of the others that
> have exhibited this issue, but it is the only one I can access right
> now.

I think it's implied, but I just want to be sure: This was one of the affected
systems?


> # uname -a
> Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
> 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>
> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512agcount=4, agsize=262471424 
> blks
>  =   sectsz=512   attr=2, projid32bit=1
>  =   crc=1finobt=0, sparse=0, rmapbt=0
>  =   reflink=0bigtime=0 inobtcount=0 nrext64=0
> data =   bsize=4096   blocks=1049885696, imaxpct=5
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
> log  =internal log   bsize=4096   blocks=512639, version=2
>  =   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0

It might be interesting that finobt=0, sparse=0 and nrext64=0. Those all
affect space allocation to some degree, and more recently created filesystems
will have them set to different values, which could explain why you, but not
that many others, hit this issue.

Any chance to get df output? I'm mainly curious about the number of used
inodes.

Could you show the mount options that end up being used?
   grep /var/opt /proc/mounts

I rather doubt it is, but it'd sure be interesting if inode32 were used.


I assume you have never set XFS options for the PG directory or files within
it?  Could you show
  xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc
?


> # for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; 
> done
>from  to extents  blockspct
>   1   1   37502   37502   0.15
>   2   3   62647  148377   0.59
>   4   7   87793  465950   1.85
>   8  15  135529 1527172   6.08
>  16  31  184811 3937459  15.67
>  32  63  165979 7330339  29.16
>  64 127  101674 8705691  34.64
> 128 255   15123 2674030  10.64
> 256 511 973  307655   1.22
> total free extents 792031
> total free blocks 25134175
> average free extent size 31.7338
>from  to extents  blockspct
>   1   1   43895   43895   0.22
>   2   3   59312  141693   0.70
>   4   7   83406  443964   2.20
>   8  15  120804 1362108   6.75
>  16  31  133140 2824317  14.00
>  32  63  118619 5188474  25.71
>  64 127   77960 6751764  33.46
> 128 255   16383 2876626  14.26
> 256 5111763  546506   2.71
> total free extents 655282
> total free blocks 20179347
> average free extent size 30.7949
>from  to extents  blockspct
>   1   1   72034   72034   0.26
>   2   3   98158  232135   0.83
>   4   7  126228  666187   2.38
>   8  15  169602 1893007   6.77
>  16  31  180286 3818527  13.65
>  32  63  164529 7276833  26.01
>  64 127  109687 9505160  33.97
> 128 255   22113 3921162  14.02
> 256 5111901  592052   2.12
> total free extents 944538
> total free blocks 27977097
> average free extent size 29.6199
>from  to extents  blockspct
>   1   1   51462   51462   0.21
>   2   3   98993  233204   0.93
>   4   7  131578  697655   2.79
>   8  15  178151 1993062   7.97
>  16  31  175718 3680535  14.72
>  32  63  145310 6372468  25.48
>  64 127   89518 7749021  30.99
> 128 255   18926 3415768  13.66
> 256 5112640  813586   3.25
> total free extents 892296
> total free blocks 25006761
> average free extent size 28.0252

So there's *some* imbalance in AG usage, but not a lot. Of course that's
as of this moment, and as you say below, you expire old partitions on a
regular basis...

My understanding of XFS's space allocation is that by default it continues to
use the same AG for allocations within one directory, until that AG is full.
For a write heavy postgres workload that's of course not optimal, as all
activity will focus on one AG.

I'd try monitoring the per-ag free space over time and see if the ENOSPC
issue is correlated with one AG getting full.  'freesp' is probably too
expensive for that, but it looks like
   xfs_db -r -c agresv /dev/nvme6n1
should work?

Actually that output might be interesting to see, ev

Re: Assert failure on running a completed portal again

2024-12-10 Thread Tom Lane
Robert Haas  writes:
> Thanks for the research, and +1 if you feel like adding more commentary.

I'm thinking about something like this:

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 13f3fcdaee..7ebb17fc2e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -348,7 +348,15 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rStartup(dest, operation, queryDesc->tupDesc);
 
/*
-* run plan
+* Run plan, unless direction is NoMovement.
+*
+* Note: pquery.c selects NoMovement if a prior call already reached
+* end-of-data in the user-specified fetch direction.  This is important
+* because various parts of the executor can misbehave if called again
+* after reporting EOF.  For example, heapam.c would actually restart a
+* heapscan and return all its data afresh.  There is also reason to be
+* concerned about whether parallel query mode would operate properly if a
+* fresh execution request arrives after completing the query.
 */
if (!ScanDirectionIsNoMovement(direction))
ExecutePlan(queryDesc,

It's slightly annoying that the executor is depending on the Portal
machinery to guarantee that it won't do something wrong.  However,
that was true in the previous formulation that relied on
Portal.run_once, too.   I don't see another way that wouldn't
amount to duplicating the Portal-level state in the executor,
and I don't think it's worth doing that.

regards, tom lane




Re: [PATCH] Add sortsupport for range types and btree_gist

2024-12-10 Thread Bernd Helmle
Am Montag, dem 09.12.2024 um 23:26 +0500 schrieb Andrey M. Borodin:
> Besides unnecessary indentation changes in
> contrib/btree_gist/Makefile, the patch seems good to me.

Oh yes, fixed.


-- 
Thanks,
Bernd

From 2d2c3b5940a8ec15b1c6bd6b829c8d44c0e615e0 Mon Sep 17 00:00:00 2001
From: Bernd Helmle 
Date: Tue, 10 Dec 2024 11:17:06 +0100
Subject: [PATCH] Add support for sorted gist index builds to btree_gist

This enables sortsupport in the btree_gist extension for faster
builds of gist indexes. Additionally sortsupport is also added for range types
to the backend so that gist indexes on range types also benefit from the
speed up.

Sorted gist index build strategy is the new default now. Regression tests are
unchanged and are using the sorted build strategy instead.
---
 contrib/btree_gist/Makefile |   2 +-
 contrib/btree_gist/btree_bit.c  |  46 +
 contrib/btree_gist/btree_bool.c |  22 +++
 contrib/btree_gist/btree_bytea.c|  35 
 contrib/btree_gist/btree_cash.c |  27 +++
 contrib/btree_gist/btree_date.c |  25 +++
 contrib/btree_gist/btree_enum.c |  33 
 contrib/btree_gist/btree_float4.c   |  28 +++
 contrib/btree_gist/btree_float8.c   |  27 +++
 contrib/btree_gist/btree_gist--1.8--1.9.sql | 192 
 contrib/btree_gist/btree_gist.control   |   2 +-
 contrib/btree_gist/btree_inet.c |  27 +++
 contrib/btree_gist/btree_int2.c |  26 +++
 contrib/btree_gist/btree_int4.c |  33 +++-
 contrib/btree_gist/btree_int8.c |  26 +++
 contrib/btree_gist/btree_interval.c |  25 +++
 contrib/btree_gist/btree_macaddr.c  |  29 +++
 contrib/btree_gist/btree_macaddr8.c |  24 +++
 contrib/btree_gist/btree_numeric.c  |  33 
 contrib/btree_gist/btree_oid.c  |  28 +++
 contrib/btree_gist/btree_text.c |  34 
 contrib/btree_gist/btree_time.c |  26 +++
 contrib/btree_gist/btree_ts.c   |  25 +++
 contrib/btree_gist/btree_utils_var.h|  12 +-
 contrib/btree_gist/btree_uuid.c |  25 +++
 contrib/btree_gist/meson.build  |   1 +
 doc/src/sgml/btree-gist.sgml|   7 +
 src/backend/utils/adt/rangetypes_gist.c |  70 +++
 src/include/catalog/pg_amproc.dat   |   3 +
 src/include/catalog/pg_proc.dat |   3 +
 30 files changed, 888 insertions(+), 8 deletions(-)
 create mode 100644 contrib/btree_gist/btree_gist--1.8--1.9.sql

diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index 7ac2df26c10..68190ac5e46 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -34,7 +34,7 @@ DATA = btree_gist--1.0--1.1.sql \
btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \
btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \
btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \
-   btree_gist--1.7--1.8.sql
+   btree_gist--1.7--1.8.sql btree_gist--1.8--1.9.sql
 PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes"
 
 REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
diff --git a/contrib/btree_gist/btree_bit.c b/contrib/btree_gist/btree_bit.c
index f346b956fa9..35aa5578f83 100644
--- a/contrib/btree_gist/btree_bit.c
+++ b/contrib/btree_gist/btree_bit.c
@@ -6,6 +6,7 @@
 #include "btree_gist.h"
 #include "btree_utils_var.h"
 #include "utils/fmgrprotos.h"
+#include "utils/sortsupport.h"
 #include "utils/varbit.h"
 
 
@@ -18,10 +19,33 @@ PG_FUNCTION_INFO_V1(gbt_bit_picksplit);
 PG_FUNCTION_INFO_V1(gbt_bit_consistent);
 PG_FUNCTION_INFO_V1(gbt_bit_penalty);
 PG_FUNCTION_INFO_V1(gbt_bit_same);
+PG_FUNCTION_INFO_V1(gbt_bit_sortsupport);
+PG_FUNCTION_INFO_V1(gbt_varbit_sortsupport);
 
 
 /* define for comparison */
 
+static int
+gbt_bit_ssup_cmp(Datum x, Datum y, SortSupport ssup)
+{
+	GBT_VARKEY *key1 = PG_DETOAST_DATUM(x);
+	GBT_VARKEY *key2 = PG_DETOAST_DATUM(y);
+
+	GBT_VARKEY_R arg1 = gbt_var_key_readable(key1);
+	GBT_VARKEY_R arg2 = gbt_var_key_readable(key2);
+	Datum result;
+
+	/* lower is always equal to upper, so just compare those fields */
+	result = DirectFunctionCall2(byteacmp,
+ PointerGetDatum(arg1.lower),
+ PointerGetDatum(arg2.lower));
+
+	GBT_FREE_IF_COPY(key1, x);
+	GBT_FREE_IF_COPY(key2, y);
+
+	return DatumGetInt32(result);
+}
+
 static bool
 gbt_bitgt(const void *a, const void *b, Oid collation, FmgrInfo *flinfo)
 {
@@ -207,3 +231,25 @@ gbt_bit_penalty(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(gbt_var_penalty(result, o, n, PG_GET_COLLATION(),
 	  &tinfo, fcinfo->flinfo));
 }
+
+Datum
+gbt_bit_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+	ssup->comparator = gbt_bit_ssup_cmp;
+	ssup->ssup_extra = NULL;
+
+	PG_RETURN_VOID();
+}
+
+Datum
+gbt_varbit_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSuppo

Re: Extensible storage manager API - SMGR hook Redux

2024-12-10 Thread Xun Gong
Thank you for your detailed review and insights. I share your view that a
registration-based approach for custom storage managers (smgr) is more
versatile. This method allows for the implementation of custom table access
methods, facilitating the use of various storage services (such as file
services or object storage), different file organization formats (files in
one directory  or many sub directories), and flexible deletion logic
(direct deletion or mark-and-sweep).

While I acknowledge that the registration-based approach requires more
modifications, I believe the benefits in terms of extensibility and
functionality are significant. I have seen similar explorations in the
implementation of AO tables in Greenplum, where a dedicated smgr was
created due to its distinct file organization compared to heap tables.

I look forward to further discussions on this topic.

Nitin Jadhav  于2024年12月10日周二 11:25写道:

> Hi,
>
> I reviewed the discussion and took a look at the patch sets. It seems
> like many things are combined here. Based on the subject, I initially
> thought it aimed to provide the infrastructure to easily extend
> storage managers. This would allow anyone to create their own storage
> managers using this infrastructure. While it addresses this, it also
> includes additional features like fsync_checker, which I believe
> should be a separate feature. Even though it might use the same
> infrastructure, it appears to be a different functionality. I think we
> should focus solely on providing the infrastructure here.
>
> We need to decide on our approach—whether to use a hook-based method
> or a registration-based method—and I believe this requires further
> discussion.
>
> The hook-based approach is simple and works well for anyone writing
> their own storage manager. However, it has its limitations as we can
> either use the default storage manager or a custom-built one for all
> the work load, but we cannot choose between multiple storage managers.
> On the other hand, the registration-based approach allows choosing
> between multiple storage managers based on the workload, though it
> requires a lot of changes.
>
> Are we planning to support other storage managers in PostgreSQL in the
> near future? If not, it is better to go with the hook-based approach.
> Otherwise, the registration-based approach is preferable as it offers
> more flexibility to users and enhances PostgreSQL’s functionality.
>
> Could you please share your thoughts on this? Also, let me know if
> this topic has already been discussed and if any conclusions were
> reached.
>
> Best Regards,
> Nitin Jadhav
> Azure Database for PostgreSQL
> Microsoft
>
> On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin  wrote:
> >
> > Thought I would show off what is possible with this patchset.
> >
> > Heikki, a couple of months ago in our internal Slack, said:
> >
> > > [I would like] a debugging tool that checks that we're not missing any
> > > fsyncs. I bumped into a few missing fsync bugs with unlogged tables
> > > lately and a tool like that would've been very helpful.
> >
> > My task was to create such a tool, and off I went. I started with the
> > storage manager extension patch that Matthias sent to the list last
> > year[0].
> >
> > Andres, in another thread[1], said:
> >
> > > I've been thinking that we need a validation layer for fsyncs, it's
> too hard
> > > to get right without testing, and crash testing is not likel enough to
> catch
> > > problems quickly / resource intensive.
> > >
> > > My thought was that we could keep a shared hash table of all files
> created /
> > > dirtied at the fd layer, with the filename as key and the value the
> current
> > > LSN. We'd delete files from it when they're fsynced. At checkpoints we
> go
> > > through the list and see if there's any files from before the redo
> that aren't
> > > yet fsynced.  All of this in assert builds only, of course.
> >
> > I took this idea and ran with it. I call it the fsync_checker™️. It is an
> > extension that prints relations that haven't been fsynced prior to
> > a CHECKPOINT. Note that this idea doesn't work in practice because
> > relations might not be fsynced, but they might be WAL-logged, like in
> > the case of createdb. See log_smgrcreate(). I can't think of an easy way
> > to solve this problem looking at the codebase as it stands.
> >
> > Here is a description of the patches:
> >
> > 0001:
> >
> > This is essentially just the patch that Matthias posted earlier, but
> > rebased and adjusted a little bit so storage managers can "inherit" from
> > other storage managers.
> >
> > 0002:
> >
> > This is an extension of 0001, which allows for extensions to set
> > a global storage manager. This is pretty hacky, and if it was going to
> > be pulled into mainline, it would need some better protection. For
> > instance, only one extension should be able to set the global storage
> > manager. We wouldn't want extensions stepping over each other, etc.
> >
> > 0003:
> >
>

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-10 Thread Shubham Khanna
On Mon, Dec 9, 2024 at 7:43 PM vignesh C  wrote:
>
> On Mon, 9 Dec 2024 at 16:25, Shubham Khanna  
> wrote:
> >
> > Hi all,
> >
> > I am writing to propose the addition of the two_phase option in
> > pg_createsubscriber. As discussed in [1], supporting this feature
> > during the development of pg_createsubscriber was planned for this
> > version.
>
> Yes this will be useful.
>
> > Currently, pg_createsubscriber creates subscriptions with the
> > two_phase option set to false. Enabling the two_phase option after a
> > subscription has been created is not straightforward, as it requires
> > the subscription to be disabled first. This patch aims to address this
> > limitation by incorporating the two_phase option into
> > pg_createsubscriber which will help create subscription with two_phase
> > option and make use of the advantages of creating subscription with
> > two_phase option.
> > The attached patch has the changes for the same.
>
> Few comments:
> 1) You can keep it with no argument similar to how dry-run is handled:
> @@ -1872,6 +1881,7 @@ main(int argc, char **argv)
> {"publisher-server", required_argument, NULL, 'P'},
> {"socketdir", required_argument, NULL, 's'},
> {"recovery-timeout", required_argument, NULL, 't'},
> +   {"two_phase", optional_argument, NULL, 'T'},
> {"subscriber-username", required_argument, NULL, 'U'},
> {"verbose", no_argument, NULL, 'v'},
> {"version", no_argument, NULL, 'V'},
>

Changed the argument to 'no_argument'.

> 2) After making it to no argument option, if this option is set, just
> set the value, strcmp's are not required:
> +   case 'T':
> +   if (optarg != NULL)
> +   {
> +   if (strcmp(optarg, "on") == 0)
> +   opt.two_phase = true;
> +   else if (strcmp(optarg, "off") == 0)
> +   opt.two_phase = false;
> +   else
> +   pg_fatal("invalid
> value for --two-phase: must be 'on' or 'off'");
> +   }
> +   else
> +   opt.two_phase = true;   /*
> Default to true if no argument
> +
>   * is provided */
> +   break;
>

Updated the switch case according to the modified argument.
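
For reference, with a no-argument option the handling reduces to roughly the
following (a minimal sketch of the suggested shape, not the exact patch hunk):

			case 'T':
				opt.two_phase = true;	/* no optarg left to validate */
				break;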

> 3) You can add a link to create subscription documentation page of
> two_phase option here:
> +   Enables or disables two-phase commit for the subscription.
> +   When the option is provided without a value, it defaults to
> +   on. Specify on to enable or
> +   off to disable.
> +   Two-phase commit ensures atomicity in logical replication for prepared
> +   transactions. By default, this option is enabled unless explicitly set
> +   to off.
>

Added the link to create subscription in the documentation of
pg_createsubscriber.

> 4) Instead of adding new tests, can we include the prepare transaction
> and prepare transaction verification to the existing tests itself?
> +# Set up node A as primary
> +my $node_a = PostgreSQL::Test::Cluster->new('node_a');
> +my $aconnstr = $node_a->connstr;
> +$node_a->init(allows_streaming => 'logical');
> +$node_a->append_conf(
> +   'postgresql.conf', qq[
> +   autovacuum = off
> +   wal_level = logical
> +   max_wal_senders = 10
> +   max_worker_processes = 8
> +   max_prepared_transactions = 10
> +]);
> +
> +$node_a->start;
>

Removed the separate test file and folded its checks into the existing test cases.

The attached Patch contains the required changes.

Thanks and regards,
Shubham Khanna.


v2-0001-Add-support-for-enabling-disabling-two-phase-comm.patch
Description: Binary data


Re: Refactoring postmaster's code to cleanup after child exit

2024-12-10 Thread Tomas Vondra



On 12/10/24 11:00, Heikki Linnakangas wrote:
> On 09/12/2024 22:55, Heikki Linnakangas wrote:
>> Not sure how to fix this. A small sleep in the test would work, but in
>> principle there's no delay that's guaranteed to be enough. A more
>> robust solution would be to run a "select count(*) from
>> pg_stat_activity" and wait until the number of connections are what's
>> expected. I'll try that and see how complicated that gets..
> 
> Checking pg_stat_activity doesn't help, because the backend doesn't
> register itself in pg_stat_activity until later. A connection that's
> rejected due to connection limits never shows up in pg_stat_activity.
> 
> Some options:
> 
> 0. Do nothing
> 
> 1. Add a small sleep to the test
> 

I'd just add a short sleep. Yeah, sleeps are not great, but everything
else seems like a lot of effort just to make this one test pass under
valgrind, and I don't think it's worth it.

Can we make the sleep conditional on valgrind, so that regular builds
are not affected? I guess regular builds could fail too, but I don't
think we've seen such failures until now.


regards

-- 
Tomas Vondra





Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-10 Thread Shubham Khanna
On Tue, Dec 10, 2024 at 7:42 AM Hayato Kuroda (Fujitsu) wrote:
>
> Dear Shubham,
>
> Thanks for the proposal!
>
> > I am writing to propose the addition of the two_phase option in
> > pg_createsubscriber. As discussed in [1], supporting this feature
> > during the development of pg_createsubscriber was planned for this
> > version.
>
> Yes, that was the pending item.
>
> > The attached patch has the changes for the same.
>
> There are API related comments.
>
> 01.
> I think two-phase should be off by default, because the default value for the
> CREATE SUBSCRIPTION command is off, and your patch breaks the pre-existing
> style.
>
> 02.
> The API style should be changed: there is no need to require an argument. I
> think it is enough to provide an "enable-twophase" / "T" option which sets
> two_phase to true.
>

I have fixed the given comments. The v2 version patch attached at [1]
has the changes for the same.
[1] - 
https://www.postgresql.org/message-id/CAHv8RjLcdmz%3D_RMwveuDdr8i7r%3D09TAwtOnFmXeaia_v2RmnYA%40mail.gmail.com

Thanks and Regards,
Shubham Khanna.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-12-10 Thread Nisha Moond
On Fri, Dec 6, 2024 at 11:04 AM vignesh C  wrote:
>
>
> Determining the correct time may be challenging for users, as it
> depends on when the active_since value is set, as well as when the
> checkpoint_timeout occurs and the subsequent checkpoint is triggered.
> Even if the user sets it to an appropriate value, there is still a
> possibility of delayed identification due to the timing of when the
> slot's active_timeout is being set. Including this information in the
> documentation should be sufficient.
>

+1
v54 documents this information as suggested.

Attached the v54 patch-set addressing all the comments till now in
[1], [2] and [3].

[1] 
https://www.postgresql.org/message-id/CALDaNm0mTWwg0z4v-sorq08S2CdZmL2s%2Brh4nHpWeJaBQ2F%2Bmg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CALDaNm1STyk%3DS_EAihWP9SowBkS5dJ32JfEqmG5tTeC2Ct39yg%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CAHut%2BPtHbYNxPvtMfs7jARbsVcFXL1%3DC9SO3Q93NgVDgbKN7LQ%40mail.gmail.com

--
Thanks,
Nisha
From 713871d8cda02f2b70c63983fc49dede3097f016 Mon Sep 17 00:00:00 2001
From: Nisha Moond 
Date: Mon, 18 Nov 2024 16:13:26 +0530
Subject: [PATCH v54 1/2] Enhance replication slot error handling, slot
 invalidation, and inactive_since setting logic

In ReplicationSlotAcquire(), raise an error for invalid slots if the
caller specifies error_if_invalid=true.

Add check if the slot is already acquired, then mark it invalidated directly.

Ensure same inactive_since time for all slots in update_synced_slots_inactive_since()
and RestoreSlotFromDisk().
---
 .../replication/logical/logicalfuncs.c|  2 +-
 src/backend/replication/logical/slotsync.c| 13 ++--
 src/backend/replication/slot.c| 61 ---
 src/backend/replication/slotfuncs.c   |  2 +-
 src/backend/replication/walsender.c   |  4 +-
 src/backend/utils/adt/pg_upgrade_support.c|  2 +-
 src/include/replication/slot.h|  3 +-
 src/test/recovery/t/019_replslot_limit.pl |  2 +-
 8 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b4dd5cce75..56fc1a45a9 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -197,7 +197,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	else
 		end_of_wal = GetXLogReplayRecPtr(NULL);
 
-	ReplicationSlotAcquire(NameStr(*name), true);
+	ReplicationSlotAcquire(NameStr(*name), true, true);
 
 	PG_TRY();
 	{
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index f4f80b2312..e3645aea53 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -446,7 +446,7 @@ drop_local_obsolete_slots(List *remote_slot_list)
 
 			if (synced_slot)
 			{
-ReplicationSlotAcquire(NameStr(local_slot->data.name), true);
+ReplicationSlotAcquire(NameStr(local_slot->data.name), true, false);
 ReplicationSlotDropAcquired();
 			}
 
@@ -665,7 +665,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		 * pre-check to ensure that at least one of the slot properties is
 		 * changed before acquiring the slot.
 		 */
-		ReplicationSlotAcquire(remote_slot->name, true);
+		ReplicationSlotAcquire(remote_slot->name, true, false);
 
 		Assert(slot == MyReplicationSlot);
 
@@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 static void
 update_synced_slots_inactive_since(void)
 {
-	TimestampTz now = 0;
+	TimestampTz now;
 
 	/*
 	 * We need to update inactive_since only when we are promoting standby to
@@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void)
 	/* The slot sync worker or SQL function mustn't be running by now */
 	Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
 
+	/* Use same inactive_since time for all slots */
+	now = GetCurrentTimestamp();
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (int i = 0; i < max_replication_slots; i++)
@@ -1537,10 +1540,6 @@ update_synced_slots_inactive_since(void)
 			/* The slot must not be acquired by any process */
 			Assert(s->active_pid == 0);
 
-			/* Use the same inactive_since time for all the slots. */
-			if (now == 0)
-now = GetCurrentTimestamp();
-
 			SpinLockAcquire(&s->mutex);
 			s->inactive_since = now;
 			SpinLockRelease(&s->mutex);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 4a206f9527..db94cec5c3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -535,9 +535,12 @@ ReplicationSlotName(int index, Name name)
  *
  * An error is raised if nowait is true and the slot is currently in use. If
  * nowait is false, we sleep until the slot is released by the owning process.
+ *
+ * An error is raised if error_if_invalid is true and the slot is found to

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-10 Thread Shubham Khanna
On Tue, Dec 10, 2024 at 8:05 AM Peter Smith  wrote:
>
> Hi Shubham,
>
> Here are some review comments for the patch v1-0001.
>
> Note -- I think Kuroda-san's idea to use a new switch like
> '--enable-two-phase' would eliminate lots of my review comments below
> about the inconsistencies with CREATE SUBSCRIBER, and the current
> confusion about defaults and argument values etc.
>
> ==
> Commit message
>
> 1.
> By default, two-phase commit is enabled if the option is provided without
> arguments. Users can explicitly set it to 'on' or 'off' using '--two-phase=on'
> or '--two-phase=off'.
>
> ~
>
> But you don't say what is the default if the option is omitted entirely.
>
>
> ~~~
>
> 2.
> Notably, the replication for prepared transactions functions regardless of the
> initial two-phase setting on the replication slot. However, the user cannot
> change the setting after the subscription is created unless a future command,
> such as 'ALTER SUBSCRIPTION ... SET (two-phase = on)', is supported.
> This provides flexibility for future enhancements.
>
> ~
>
> Typo in ALTER example with the option name /two-phase/two_phase/
>
> ~~~
>
> 3.
> Documentation has been updated to reflect the new option, and test cases have
> been added to validate various scenarios, including proper validation of the
> '--two-phase' flag and its combinations with other options.
>
> /flag/option/ ??
>

Modified the commit message according to the suggested changes.

> ==
> doc/src/sgml/ref/pg_createsubscriber.sgml
>
> 4.
> +
> + -T
> + --two_phase
> + 
> +  
> +   Enables or disables two-phase commit for the subscription.
> +   When the option is provided without a value, it defaults to
> +   on. Specify on to enable or
> +   off to disable.
> +   Two-phase commit ensures atomicity in logical replication for prepared
> +   transactions. By default, this option is enabled unless explicitly set
> +   to off.
> +  
> + 
> +
> +
>
> 4a.
> Typo -- the option you made is 'two-phase', not 'two_phase'
>
> ~

Fixed.

> 4b.
> When you say "By default, this option is enabled unless explicitly set
> to off." I take that as meaning it is default *enabled* even when the
> option is entirely omitted.
>
> But, that would be contrary to the behaviour of a normal CREATE
> SUBSCRIPTION 'two_phase' option, so I wonder why should
> pg_createsubscriber have a different default. Is this intentional? IMO
> if no switch is specified then I thought it should be the same as the
> CREATE SUBSCRIPTION default (i.e. false).
>
> ~

Modified the documentation on the basis of the latest changes added to
the patch.

> 4c.
> This "-T" option should be moved to be adjacent (below) "-t" to keep
> everything consistently in A-Z order.
>

Fixed.

> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> 5.
>   bool made_replslot; /* replication slot was created */
>   bool made_publication; /* publication was created */
> + bool two_phase; /* two-phase option was created */
>
> I am not sure what "option was created" even means. Cut/paste error?
>
> Also, perhaps this field belongs with the 1st group of fields in this
> struct, instead of with those made_xxx fields.
>
> ~~~
>
> store_pub_sub_info:
>
> 6.
> + /* Store two-phase option */
> + dbinfo[i].two_phase = opt->two_phase;
> +
>
> The comment says the same as what the code is doing which seems
> unhelpful/redundant. And if you rearrange the struct fields as
> previously suggested, this assignment should also move.
>

Fixed.

>
> 7.
>   pg_log_debug("subscriber(%d): subscription: %s ; connection string: %s", i,
>   dbinfo[i].subname ? dbinfo[i].subname : "(auto)",
>   dbinfo[i].subconninfo);
> + pg_log_debug("publisher(%d): two-phase: %s", i,
> + dbinfo[i].two_phase ? "true" : "false");
>
> "two_phase" is a subscription option. So shouldn't this added
> pg_log_debug info be part of the preceding pg_log_debug about the
> subscription?
>

Fixed.

>
> main:
>
> 8.
>   {"recovery-timeout", required_argument, NULL, 't'},
> + {"two_phase", optional_argument, NULL, 'T'},
>   {"subscriber-username", required_argument, NULL, 'U'},
>
> AFAIK this option was supposed to be 'two-phase' (dash instead of
> underscore  is consistent with the other pg_createsubscriber options)
>

Fixed.

>
> 9.
>   opt.sub_username = NULL;
> + opt.two_phase = true;
>
> As previously mentioned this default is not the same as the CREATE
> SUBSCRIPTION default for 'two_phase', and I find that inconsistency to
> be confusing.
>

Changed the default option to false as suggested.

>
> 10.
> + case 'T':
> + if (optarg != NULL)
> + {
> + if (strcmp(optarg, "on") == 0)
> + opt.two_phase = true;
> + else if (strcmp(optarg, "off") == 0)
> + opt.two_phase = false;
> + else
> + pg_fatal("invalid value for --two-phase: must be 'on' or 'off'");
> + }
> + else
> + opt.two_phase = true; /* Default to true if no argument
> + * is provided */
> + break;
>
> I wonder if users familiar with the CREATE SUBSCRIPT

Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Jakub Wartak
 On Tue, Dec 10, 2024 at 7:34 AM Michael Harris  wrote:

Hi Michael,

1. Well, it doesn't look like XFS AG fragmentation to me. We had a customer
with a huge number of AGs with little free space in them reporting such errors
after upgrading to 16, but not for earlier versions (somehow
posix_fallocate() had to be the culprit).

2.

> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
> =                                 sectsz=512   attr=2, projid32bit=1
> =                                 crc=1        finobt=0, sparse=0, rmapbt=0
> =                                 reflink=0    bigtime=0 inobtcount=0 nrext64=0

Yay, reflink=0, that's a pretty old fs?!

> ERROR:  could not extend file "pg_tblspc/16401/PG_16_202307071/17643/1249.1"
> with FileFallocate(): No space left on device

2. This indicates it was allocating 1GB for such a table (".1"), on a
tablespace that was created more than a year ago. Could you maybe also get us
the output of the commands below? (or from any other directory exhibiting such
errors)

stat pg_tblspc/16401/PG_16_202307071/17643/
ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess the
timing of the getdents() call, as that may indirectly tell us something about
that directory

3. Maybe somehow there is a bigger interaction between posix_fallocate()
and XFS's delayed dynamic speculative preallocation when many processes are all
writing into different partitions? Maybe try the "allocsize=1m" mount option
for that /fs and see if that helps.  I'm going to speculate about XFS
speculative :) preallocations, but if we have an fd cache and are *not*
closing fds, how would XFS know to abort its own speculation about a
streaming write? (multiply that up to potentially the number of opened fds
to get an avalanche of "preallocations")

4. You can also try compiling with the patch from Alvaro from [2],
"0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up
having more clarity on the offsets involved. If not, then you could use 'strace
-e fallocate -p ' to get the exact syscall.

5. Another idea could be catching the kernel-side stacktrace of fallocate()
when it is hitting ENOSPC. E.g. with an XFS fs and the attached bpftrace eBPF
tracer I could get to the source of the problem in my artificial reproducer,
e.g.

# bpftrace ./track_enospc2.bt # wait for "START" and then start reproducing
in the second session (sess2), but try to minimize the time period, as eBPF
might make things really slow

$ dd if=/dev/zero of=/fs/test1 bs=1M count=200
$ fallocate /fs/test -l 3000
fallocate: fallocate failed: No space left on device
$ df -h /fs
Filesystem  Size  Used Avail Use% Mounted on
/dev/loop0  236M  217M   20M  92% /fs

# in bpftrace CTRL+C, will get:
@errors[-28, kretprobe:xfs_file_fallocate,
xfs_alloc_file_space+665
xfs_alloc_file_space+665
xfs_file_fallocate+869
vfs_fallocate+319
__x64_sys_fallocate+68
do_syscall_64+130
entry_SYSCALL_64_after_hwframe+118
]: 1

-28 = ENOSPC, and xfs_alloc_file_space() was the routine that was the root
cause; the stack shows the full logic behind it. That ABI might be different on
your side due to kernel variations. The tracer could be enhanced, and it might
print too much (so you need to look for that -28 in the outputs). If you do get
any sensible output from it, you could possibly also involve OS support
(because if posix_fallocate() fails while there is space, then it's pretty odd
anyway).

-J.

[1] -
https://www.postgresql.org/message-id/50a117b6.5030...@optionshouse.com
[2] -
https://www.postgresql.org/message-id/202409110955.6njbwzm4ocus%40alvherre.pgsql


track_enospc2.bt
Description: Binary data


Re: WARNING: missing lock on database "postgres" (OID 5) @ TID (0,4)

2024-12-10 Thread Kirill Reshke
On Tue, 10 Dec 2024 at 10:47, Kirill Reshke  wrote:
>
> On Tue, 10 Dec 2024 at 10:45, Kirill Reshke  wrote:
>
> > PFA v2.
> Also CF entry https://commitfest.postgresql.org/51/5430/ to get CI feedback.

CI fails due to bad naming in the regression test.
The change is deptestdb1 -> regressdeptestdb1
PFA v3.



-- 
Best regards,
Kirill Reshke


v3-0001-When-making-dependency-changes-lock-the-tuple-for.patch
Description: Binary data


Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-10 Thread Amit Kapila
On Tue, Dec 10, 2024 at 11:24 AM Amit Kapila  wrote:
>
> On Tue, Dec 10, 2024 at 8:54 AM vignesh C  wrote:
> >
> > On Tue, 10 Dec 2024 at 04:56, Michael Paquier  wrote:
> > >
> > > On Mon, Dec 09, 2024 at 03:36:15PM +0530, Amit Kapila wrote:
> > > > It couldn't solve the problem completely even in back-branches. The
> > > > SQL API case I mentioned and tested by Hou-San in the email [1] won't
> > > > be solved.
> > > >
> > > > [1] - 
> > > > https://www.postgresql.org/message-id/OS0PR01MB57166A4DA0ABBB94F2FBB28694362%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> > >
> > > Yeah, exactly (wanted to reply exactly that yesterday but lacked time,
> > > thanks!).
> >
> > Yes, that makes sense. How about something like the attached patch.
> >
>
> - oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> - if (data->publications)
> - {
> - list_free_deep(data->publications);
> - data->publications = NIL;
> - }
> + static MemoryContext pubctx = NULL;
> +
> + if (pubctx == NULL)
> + pubctx = AllocSetContextCreate(CacheMemoryContext,
> +"logical replication publication list context",
> +ALLOCSET_SMALL_SIZES);
> + else
> + MemoryContextReset(pubctx);
> +
> + oldctx = MemoryContextSwitchTo(pubctx);
>
> Considering the SQL API case, why is it okay to allocate this context
> under CacheMemoryContext?
>

On further thinking, we can't allocate it under
LogicalDecodingContext->context because once that is freed at the end
of the SQL API pg_logical_slot_get_changes(), pubctx will be pointing to
dangling memory. One idea is to use
MemoryContextRegisterResetCallback() to invoke a reset callback
function where we can reset pubctx, but I am not sure we want to go there
in back branches. OTOH, the currently proposed fix won't leak memory
on repeated calls to pg_logical_slot_get_changes(), so that might be
okay as well.
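
For illustration, a minimal sketch of that callback idea (the helper names and
the exact placement are assumptions for discussion, not the proposed patch):

#include "postgres.h"
#include "replication/logical.h"
#include "utils/memutils.h"

/* File-scope cache of the publication list context (sketch only). */
static MemoryContext pubctx = NULL;

/*
 * Reset callback: the owning decoding context is being reset or deleted, and
 * pubctx (its child) goes away with it, so forget the stale pointer.
 */
static void
pubctx_reset_cb(void *arg)
{
	pubctx = NULL;
}

/* Hypothetical helper: (re)create or reset pubctx under the decoding context. */
static void
ensure_pubctx(LogicalDecodingContext *ctx)
{
	if (pubctx == NULL)
	{
		MemoryContextCallback *cb;

		pubctx = AllocSetContextCreate(ctx->context,
									   "logical replication publication list context",
									   ALLOCSET_SMALL_SIZES);
		cb = MemoryContextAlloc(ctx->context, sizeof(MemoryContextCallback));
		cb->func = pubctx_reset_cb;
		cb->arg = NULL;
		MemoryContextRegisterResetCallback(ctx->context, cb);
	}
	else
		MemoryContextReset(pubctx);
}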

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: Pass ParseState as down to utility functions.

2024-12-10 Thread Kirill Reshke
On Tue, 10 Dec 2024 at 08:28, Michael Paquier  wrote:

> I would suggest to split the patch into two pieces for clarity, based
> on the fact that your v7 patch is doing more than one thing at the
> same time:
> - Introduce new tests for the new coverage (domain, CREATE TABLE OF,
> ALTER TABLE flavors) in a first patch.
> - Introduce the ParseStates in these new code paths in a second patch.
>
> By structuring things this way, it is possible to see what kind of
> difference related to the new ParseStates is introduced, based on the
> new test coverage introduced in the first patch.
>
> This makes also the whole review easier.

Ok. Sure.

> Typo here: s/exists/exist/.

Fixed, Thank you


PFA v8.

-- 
Best regards,
Kirill Reshke


v8-0001-Add-more-regression-tests-to-various-DDL-patterns.patch
Description: Binary data


v8-0002-Print-out-error-position-for-number-of-DDL-comman.patch
Description: Binary data


Re: bt_index_parent_check and concurrently build indexes

2024-12-10 Thread Michail Nikolaev
Hello, everyone and Peter.

I simplified the patch to reproduce the issue more effectively. Now,
bt_index_parent_check fails on a valid index containing just two tuples.

Peter, I included you in this message because you are the author of
bt_index_parent_check, so I thought you might find this relevant.

Best regards,
Mikhail.

>
From 6de9549360b7d92b0184c50016aec1c41534e127 Mon Sep 17 00:00:00 2001
From: nkey 
Date: Sun, 8 Dec 2024 17:32:15 +0100
Subject: [PATCH v2] test to reproduce issue with bt_index_parent_check and
 CREATE INDEX CONCURRENTLY

---
 contrib/amcheck/meson.build   |  1 +
 .../t/006_cic_bt_index_parent_check.pl| 51 +++
 2 files changed, 52 insertions(+)
 create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index fc08e32539..2eb7ff11bd 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -45,6 +45,7 @@ tests += {
   't/003_cic_2pc.pl',
   't/004_verify_nbtree_unique.pl',
   't/005_pitr.pl',
+  't/006_cic_bt_index_parent_check.pl',
 ],
   },
 }
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00..1d8bb84f22
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,51 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(
+	q(
+BEGIN;
+SELECT pg_current_xact_id();
+));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+# it fails, because it is expect to find the deleted row in index
+is($result, '0', 'bt_index_parent_check after removed rows');
+
+$in_progress_h->quit;
+done_testing();
-- 
2.43.0



Re: bt_index_parent_check and concurrently build indexes

2024-12-10 Thread Andrey M. Borodin



> On 9 Dec 2024, at 23:51, Michail Nikolaev  wrote:
> 
> * Modify bt_index_parent_check to use an MVCC snapshot for the heap scan

Hi!

Interesting bug. It's amazing how long it has stood, given that it would be
triggered by almost any check after updating a table.

From my POV the correct fix direction is to use an approach similar to index
building, i.e. remove the "if (!state->readonly)" check. Are there any known
downsides to this?


Best regards, Andrey Borodin.



Re: Fix bank selection logic in SLRU

2024-12-10 Thread Andrey M. Borodin



> On 10 Dec 2024, at 15:39, Yura Sokolov  wrote:
> 
> It is not a critical bug, since it doesn't hurt correctness, just performance.
> In the worst case only one bank will be used.

Ugh... yeah. IMO the problem is that we do not have protection that rejects
values that are not a power of 2.
If other values are given, the system operates as if there are
2^(popcount(n)-1) banks. So if we just round the value down to the nearest
power of 2, we will help incorrectly configured systems use the proper amount
of memory while keeping the performance of properly configured systems.

IMO doing a modulo is not necessary, and a hash function is a pure waste of
CPU cycles.
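
For illustration, a minimal sketch of rounding a configured bank count down to
the nearest power of two and then selecting a bank with a plain bitmask (the
names here are illustrative, not the actual SLRU code):

#include <stdint.h>

/* Round n down to the nearest power of two (assumes n > 0). */
static inline uint32_t
round_down_pow2(uint32_t n)
{
	while (n & (n - 1))
		n &= n - 1;			/* clear the lowest set bit until one remains */
	return n;
}

/* With a power-of-two bank count, bank selection needs neither hash nor modulo. */
static inline uint32_t
bank_for_page(uint32_t pageno, uint32_t nbanks_pow2)
{
	return pageno & (nbanks_pow2 - 1);
}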


Best regards, Andrey Borodin.



Re: Unmark gen_random_uuid() function leakproof

2024-12-10 Thread Masahiko Sawada
On Mon, Dec 9, 2024 at 2:48 PM Masahiko Sawada  wrote:
>
> On Mon, Dec 9, 2024 at 2:23 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2024-12-09 14:10:30 -0800, Masahiko Sawada wrote:
> > > While reviewing UUIDv7 patch[1], I realized gen_random_uuid() is
> > > marked leakproof even though it doesn't take arguments. The functions
> > > without arguments don't need to be marked leakproof in principle. This
> > > is the sole function that has no arguments and is listed in the "List
> > > of built-in leakproof functions" in opr_sanity.sql. I've attached the
> > > patch for fixing it and for better consistency with new UUID
> > > generation functions discussed on that thread.
> >
> > Seems like it'd make sense to add a test to opr_sanity.sql so we don't
> > reintroduce such cases?
> >
>
> Thank you for the comment. It's a good idea. I've updated the patch.
>

I'm going to push the updated patch tomorrow, barring objections and
further comments.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-10 Thread Peter Smith
Hi Shubham,

Here are some review comments for patch v2-0001.

==
GENERAL - the new option name

1.
I am not sure if this new switch needs to be called
'--enable-two-phase'; probably just calling it '--two-phase' would
mean the same, because to me specifying the switch already implies
"enabling" it.

Perhaps either name is fine. What do others think?

==
Commit message

2.
This patch introduces the '--enable-two-phase' option to the
'pg_createsubscriber' utility, allowing users to enable or disable two-phase
commit for subscriptions during their creation.

~

It seems a bit strange IMO to say "enable or disable", because this option
only enables, and the default is disabled, as described in the
following sentence.

~~~

3.
By default, two-phase commit is disabled if the option is provided without
arguments.

~

But, this option now has no arguments so the part "if the option is
provided without arguments" is not applicable anymore and should be
removed. Or, if you want to say something you could say "if the option
is not provided."

==
doc/src/sgml/ref/pg_createsubscriber.sgml

4.
+
+ -T
+ --enable-two-phase
+ 
+  
+   Enables two-phase commit for the subscription. When the option is
+   provided, it is explicitly enabled. By default, two-phase commit is
+   off.
+   Two-phase commit ensures atomicity in logical replication for prepared
+   transactions. See the
+   two_phase
+   documentation for more details.
+  
+ 
+
+

I felt this was more verbose than necessary. IMO you only needed to
say something very simple; the user can chase the link to learn more
about two_phase if they want to.

SUGGESTION
Enables two_phase
commit for the subscription. The default is false.

==
src/bin/pg_basebackup/pg_createsubscriber.c

usage:

5.
  printf(_("  -t, --recovery-timeout=SECS seconds to wait for recovery to end\n"));
+ printf(_("  -T, --enable-two-phase enable two-phase commit for the subscription\n"));
  printf(_("  -U, --subscriber-username=NAME  user name for subscriber connection\n"));

AFAICT the patch is wrongly using tabs here instead of spaces.

~~~

store_pub_sub_info:

6.
+ dbinfo[i].two_phase = opt->two_phase; /* Set two-phase option */

I still think this comment only states the obvious so it is not
helpful. Can remove it.

~~~

7.
  dbinfo[i].made_publication = false;
+
  /* Fill subscriber attributes */
This whitespace change is unrelated to this patch.

~~~

8.
- pg_log_debug("subscriber(%d): subscription: %s ; connection string: %s", i,
+ pg_log_debug("subscriber(%d): subscription: %s ; connection string:
%s, --enable-two-phase: %s", i,
  dbinfo[i].subname ? dbinfo[i].subname : "(auto)",
- dbinfo[i].subconninfo);
+ dbinfo[i].subconninfo,
+ dbinfo[i].two_phase ? "true" : "false");

IMO say "two_phase" here, not "--enable-two-phase".

==
.../t/040_pg_createsubscriber.pl

9.
 max_worker_processes = 8
+max_prepared_transactions = 10
 });

AFAICT the comment for this test code said:

# Restore default settings here but only apply it after testing standby. Some
# standby settings should not be a lower setting than on the primary.

But max_prepared_transactions = 10 is not the default setting for this GUC.

~~~

10.
 is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'),
  't', 'standby is in recovery');
+
 $node_s->stop;

This whitespace change has nothing to do with the patch.

~~~

11.
- 'replslot1'
+ 'replslot1', '--enable-two-phase'

The comment for this test only says
# pg_createsubscriber can run without --databases option

Now it is doing more, so maybe it should give more details like "In
passing, also test the --enable-two-phase option."

~~~

12.
+# Verify that the prepared transaction is replicated to the subscriber
+my $count_prepared_s =
+  $node_s->safe_psql($db1, "SELECT count(*) FROM pg_prepared_xacts;");
+
+is($count_prepared_s, qq(0), 'Prepared transaction replicated to subscriber');
+

Is this test OK? It says it verifies the transaction is replicated, but how
does checking that the subscriber has zero pg_prepared_xacts ensure
replication occurred?

==
Please see the attached NITPICKS diffs which includes some (not all)
of the above suggestions.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/doc/src/sgml/ref/pg_createsubscriber.sgml 
b/doc/src/sgml/ref/pg_createsubscriber.sgml
index 0fcd30d..d2bbef3 100644
--- a/doc/src/sgml/ref/pg_createsubscriber.sgml
+++ b/doc/src/sgml/ref/pg_createsubscriber.sgml
@@ -166,13 +166,8 @@ PostgreSQL documentation
  --enable-two-phase
  
   
-   Enables two-phase commit for the subscription. When the option is
-   provided, it is explicitly enabled. By default, two-phase commit is
-   off.
-   Two-phase commit ensures atomicity in logical replication for prepared
-   transactions. See the
-   two_phase
-   documentation for more details.
+   Enables two_phase
+   commit for the subscription. The default is fa

Re: FileFallocate misbehaving on XFS

2024-12-10 Thread Michael Harris
Hi Andres

On Wed, 11 Dec 2024 at 03:09, Andres Freund  wrote:
> I think it's implied, but I just want to be sure: This was one of the affected
> systems?

Yes, correct.

> Any chance to get df output? I'm mainly curious about the number of used
> inodes.

Sorry, I could swear I had included that already! Here it is:

# df /var/opt
Filesystem   1K-blocks   Used Available Use% Mounted on
/dev/mapper/ippvg-ipplv 4197492228 3803866716 393625512  91% /var/opt

# df -i /var/opt
Filesystem                 Inodes   IUsed     IFree IUse% Mounted on
/dev/mapper/ippvg-ipplv 419954240 1568137 418386103    1% /var/opt

> Could you show the mount options that end up being used?
>grep /var/opt /proc/mounts

/dev/mapper/ippvg-ipplv /var/opt xfs
rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

These seem to be the defaults.

> I assume you have never set XFS options for the PG directory or files within
> it?

Correct.

>  Could you show
>   xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc

-p--X pg_tblspc/16402/PG_16_202307071/49163/1132925906.4
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
fd.flags = non-sync,non-direct,read-only
stat.ino = 4320612794
stat.type = regular file
stat.size = 201211904
stat.blocks = 393000
fsxattr.xflags = 0x8002 [-p--X]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.cowextsize = 0
fsxattr.nextents = 165
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
statfs.f_bsize = 4096
statfs.f_blocks = 1049373057
statfs.f_bavail = 98406378
statfs.f_files = 419954240
statfs.f_ffree = 418386103
statfs.f_flags = 0x1020
geom.bsize = 4096
geom.agcount = 4
geom.agblocks = 262471424
geom.datablocks = 1049885696
geom.rtblocks = 0
geom.rtextents = 0
geom.rtextsize = 1
geom.sunit = 0
geom.swidth = 0
counts.freedata = 98406378
counts.freertx = 0
counts.freeino = 864183
counts.allocino = 2432320

> I'd try monitoring the per-ag free space over time and see if the the ENOSPC
> issue is correlated with one AG getting full.  'freesp' is probably too
> expensive for that, but it looks like
>xfs_db -r -c agresv /dev/nvme6n1
> should work?
>
> Actually that output might be interesting to see, even when you don't hit the
> issue.

I will see if I can set that up.

> How many partitions are there for each of the tables? Mainly wondering because
> of the number of inodes being used.

It is configurable and varies from site to site. It could range from 7
up to maybe 60.

> Are all of the active tables within one database? That could be relevant due
> to per-directory behaviour of free space allocation.

Each pg instance may have one or more application databases. Typically
data is being written into all of them (although sometimes a database
will be archived, with no new data going into it).

You might be onto something though. The system I got the above output
from is only experiencing this issue in one directory - that might not
mean very much though, as it only has 2 databases and one of them looks
like it is not receiving imports.
But another system I can access has multiple databases with ongoing
imports, yet all the errors bar one relate to one directory.
I will collect some data from that system and post it shortly.

Cheers
Mike




Re: Skip collecting decoded changes of already-aborted transactions

2024-12-10 Thread Masahiko Sawada
On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar  wrote:
>
> On Tue, Dec 10, 2024 at 11:09 AM Amit Kapila  wrote:
> >
> > On Tue, Dec 10, 2024 at 10:59 AM Dilip Kumar  wrote:
> > >
> > > On Tue, Nov 26, 2024 at 3:02 AM Masahiko Sawada  
> > > wrote:
> > >
> > > >
> > > > I've attached a new version patch that incorporates all comments I got 
> > > > so far.
> > > >
> > > > I think the patch is in good shape but I'm considering whether we
> > > > might want to call ReorderBufferToastReset() after truncating all
> > > > changes, in ReorderBufferTruncateTXNIfAborted() just in case. Will
> > > > investigate further.
> > > >
> > >
> > > There’s something that seems a bit odd to me. Consider the case where
> > > the largest transaction(s) are aborted. If
> > > ReorderBufferCanStartStreaming() returns true, the changes from this
> > > transaction will only be discarded if it's a streamable transaction.
> > > However, if ReorderBufferCanStartStreaming() is false, the changes
> > > will be discarded regardless.
> > >
> > > What seems strange to me in this patch is truncating the changes of a
> > > large aborted transaction depending on whether we need to stream or
> > > spill but actually that should be completely independent IMHO. My
> > > concern is that if the largest transaction is aborted but isn’t yet
> > > streamable, we might end up picking the next transaction, which could
> > > be much smaller. This smaller transaction might not help us stay
> > > within the memory limit, and we could repeat this process for a few
> > > more transactions. In contrast, it might be more efficient to simply
> > > discard the large aborted transaction, even if it’s not streamable, to
> > > avoid this issue.
> > >
> >
> > If the largest transaction is non-streamable, won't the transaction
> > returned by ReorderBufferLargestTXN() in the other case already
> > suffice the need?
>
> I see your point, but I don’t think it’s quite the same. When
> ReorderBufferCanStartStreaming() is true, the function
> ReorderBufferLargestStreamableTopTXN() looks for the largest
> transaction among those that have a base_snapshot. So, if the largest
> transaction is aborted but hasn’t yet received a base_snapshot, it
> will instead select the largest transaction that does have a
> base_snapshot, which could be significantly smaller than the largest
> aborted transaction.

IIUC the transaction entries in the reorderbuffer get the base snapshot
before decoding the first change (see SnapBuildProcessChange()). In
which case does a transaction not have the base snapshot and yet have the
largest amount of changes? Subtransaction entries could transfer their
base snapshot to their parent transaction entry, but such subtransactions
will be picked by ReorderBufferLargestTXN().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



