Re: Simplify create_merge_append_path a bit for clarity
On Tue, Oct 24, 2023 at 6:00 PM Alena Rybakina wrote: > I agree with you, and we can indeed directly set the param_info value to > NULL, and there are enough comments here to explain. > > I didn't find anything else to add in your patch. Thanks for reviewing this patch! Thanks Richard
Re: Bug: RLS policy FOR SELECT is used to check new rows
On Tue, 2023-10-24 at 15:05 -0400, Stephen Frost wrote: > On Tue, Oct 24, 2023 at 14:42 Robert Haas wrote: > > On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis wrote: > > > Perhaps the idea is that if there are constraints involved, the failure > > > or success of an INSERT/UPDATE/DELETE could leak information that you > > > don't have privileges to read. > > > > My recollection of this topic is pretty hazy, but like Tom, I seem to > > remember it being intentional, and I think the reason had something to > > do with wanting the slice of a RLS-protect table that you can see to > > feel like a complete table. When you update a row in a table all of > > which is visible to you, the updated row can never vanish as a result > > of that update, so it was thought, if I remember correctly, that this > > should also be true here. It's also similar to what happens if an > > updatable view has WITH CHECK OPTION, and I think that was part of the > > precedent as well. I don't know whether or not the constraint issue > > that you mention here was also part of the concern, but it may have > > been. This was all quite a while ago... > > Yes, having it be similar to a view WITH CHECK OPTION was intentional, > also on not wishing for things to be able to disappear or to not get saved. > The risk of a constraint possibly causing the leak of information is better > than either having data just thrown away or having the constraint not > provide the guarantee it’s supposed to … Thanks everybody for looking and remembering. I can accept that the error is intentional, even though it violated the POLA for me. I can buy into the argument that an UPDATE should not make a row seem to vanish. I cannot buy into the constraint argument. If the table owner wanted to prevent you from causing a constraint violation error with a row you cannot see, she wouldn't have given you a FOR UPDATE policy that allows you to perform such an UPDATE. Anyway, it is probably too late to change a behavior that has been like that for a while and is not manifestly buggy. Yours, Laurenz Albe
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Amit, Based on your advice, I revised the patch again. > > > > I spent some time on the v57 patch and it looks good to me - tests are > > passing, no complaints from pgindent and pgperltidy. I turned the CF > > entry https://commitfest.postgresql.org/45/4273/ to RfC. > > > > Thanks, the patch looks mostly good to me but I am not convinced of > keeping the tests across versions in this form. I don't think they are > tested in BF, only one can manually create a setup to test. I analyzed and agreed that current BF client does not use TAP test framework for cross-version checks. > Shall we > remove it for now and then consider it separately? OK, some parts for cross-checks were removed. > Apart from that, I have made minor modifications in the docs to adjust > the order of various prerequisites. Thanks, included. Best Regards, Hayato Kuroda FUJITSU LIMITED v59-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch Description: v59-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Oct 25, 2023 at 11:39 AM Amit Kapila wrote: > > On Tue, Oct 24, 2023 at 1:20 PM Bharath Rupireddy > wrote: > > > > > > I spent some time on the v57 patch and it looks good to me - tests are > > passing, no complaints from pgindent and pgperltidy. I turned the CF > > entry https://commitfest.postgresql.org/45/4273/ to RfC. > > > > Thanks, the patch looks mostly good to me but I am not convinced of > keeping the tests across versions in this form. I don't think they are > tested in BF, only one can manually create a setup to test. Shall we > remove it for now and then consider it separately? I think we can retain the test_upgrade_from_pre_PG17 because it is not only possible to trigger it manually but also one can write a CI workflow to trigger it. > Apart from that, I have made minor modifications in the docs to adjust > the order of various prerequisites. + + pg_upgrade attempts to migrate logical + replication slots. This helps avoid the need for manually defining the + same replication slots on the new publisher. Migration of logical + replication slots is only supported when the old cluster is version 17.0 + or later. Logical replication slots on clusters before version 17.0 will + silently be ignored. + + The new cluster must not have permanent logical replication slots, i.e., How about using "logical slots" in place of "logical replication slots" to be more generic? We agreed and changed the function name to -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Oct 25, 2023 at 1:39 PM Bharath Rupireddy wrote: > > On Wed, Oct 25, 2023 at 11:39 AM Amit Kapila wrote: > > > > On Tue, Oct 24, 2023 at 1:20 PM Bharath Rupireddy > > wrote: > > > > > > > > > I spent some time on the v57 patch and it looks good to me - tests are > > > passing, no complaints from pgindent and pgperltidy. I turned the CF > > > entry https://commitfest.postgresql.org/45/4273/ to RfC. > > > > > > > Thanks, the patch looks mostly good to me but I am not convinced of > > keeping the tests across versions in this form. I don't think they are > > tested in BF, only one can manually create a setup to test. Shall we > > remove it for now and then consider it separately? > > I think we can retain the test_upgrade_from_pre_PG17 because it is not > only possible to trigger it manually but also one can write a CI > workflow to trigger it. > It would be better to gauge its value separately and add it once the main patch is committed. I am slightly unhappy even with the hack used for pre-version testing in previous patch which is as follows: +# XXX: Older PG version had different rules for the inter-dependency of +# 'max_wal_senders' and 'max_connections', so assign values which will work for +# all PG versions. If Cluster.pm is fixed this code is not needed. +$old_publisher->append_conf( + 'postgresql.conf', qq[ +max_wal_senders = 5 +max_connections = 10 +]); There should be a way to avoid this but we can decide it afterwards. I don't want to hold the main patch for this point. What do you think? > > Apart from that, I have made minor modifications in the docs to adjust > > the order of various prerequisites. > > + > + pg_upgrade attempts to migrate logical > + replication slots. This helps avoid the need for manually defining the > + same replication slots on the new publisher. Migration of logical > + replication slots is only supported when the old cluster is version 17.0 > + or later. Logical replication slots on clusters before version 17.0 will > + silently be ignored. > + > > + The new cluster must not have permanent logical replication slots, > i.e., > > How about using "logical slots" in place of "logical replication > slots" to be more generic? We agreed and changed the function name to > Yeah, I am fine with that and I can take care of it before committing unless there is more to change. -- With Regards, Amit Kapila.
doc: a small improvement about pg_am description
Hi, When reading the documentation about operator class, I found the following description: The pg_am table contains one row for every index access method. Support for access to regular tables is built into PostgreSQL, but all index access methods are described in pg_am. It seems to me that this description says pg_am contains only index access methods but not table methods. I wonder if this was missed when table AM support was added and other documentation was changed in b73c3a11963c8bb783993cfffabb09f558f86e37. Attached is a patch to remove the sentence that starts with "Support for access to regular tables is ". Regards, Yugo Nagata -- Yugo NAGATA diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml index c753d8005a..ff73233818 100644 --- a/doc/src/sgml/xindex.sgml +++ b/doc/src/sgml/xindex.sgml @@ -30,11 +30,8 @@ The pg_am table contains one row for every - index method (internally known as access method). Support for - regular access to tables is built into - PostgreSQL, but all index methods are - described in pg_am. It is possible to add a - new index access method by writing the necessary code and + index and table method (internally known as access method). It is possible + to add a new index access method by writing the necessary code and then creating an entry in pg_am — but that is beyond the scope of this chapter (see ).
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Oct 25, 2023 at 1:50 PM Amit Kapila wrote: > > It would be better to gauge its value separately and add it once the > main patch is committed. > There should be a way to avoid this but we can decide it afterwards. I > don't want to hold the main patch for this point. What do you think? +1 to go with the main patch first. We also have another thing to take care of - pg_upgrade option to not migrate logical slots. > > How about using "logical slots" in place of "logical replication > > slots" to be more generic? We agreed and changed the function name to > > > > Yeah, I am fine with that and I can take care of it before committing > unless there is more to change. +1. I have no other comments. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: walwriter interacts quite badly with synchronous_commit=off
On 25/10/2023 02:09, Andres Freund wrote: Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping and Latch->is_set, we also sometimes end up with multiple processes signalling walwriter, which can be bad, because it increases the likelihood that some of the signals may be received when we are already holding WALWriteLock, delaying its release... That can only happen when walwriter has just come out of "hibernation", i.e. when the system has been idle for a while. So probably not a big deal in practice. I think the most important optimization we need is to have XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed WAL. Unless walwriter is hibernating, walwriter will wake up on its own after wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this. E.g. a condition like if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ) return; drastically cuts down on the amount of wakeups, without - I think - losing guarantees around synchronous_commit=off. In the patch, you actually did: + if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ) + return; It means that you never wake up the walwriter to merely *write* the WAL. You only wake it up if it's going to also fsync() it. I think that's correct and appropriate, but it took me a while to reach that conclusion: It might be beneficial to wake up the walwriter just to perform a write(), to offload that work from the backend. And walwriter will actually also perform an fsync() after finishing the current segment, so it would make sense to also wake it up when 'asyncXactLSN' crosses a segment boundary. However, if those extra wakeups make sense, they would also make sense when there are no asynchronous commits involved. Therefore those extra wakeups should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer(). The condition you have in the patch is appropriate for XLogSetAsyncXactLSN(). Another reason to write the WAL aggressively, even if you don't flush it, would be to reduce the number of lost transactions on a segfault. But we don't give any guarantees on that, and even with the current aggressive logic, we only write when a page is full so you're anyway going to lose the last partial page. It also took me a while to convince myself that this calculation matches the calculation that XLogBackgroundFlush() uses to determine whether it needs to flush or not. XLogBackgroundFlush() first divides the request and result with XLOG_BLCKSZ and then compares the number of blocks, whereas here you perform the calculation in bytes. I think the result is the same, but to make it more clear, let's do it the same way in both places. See attached. It's the same logic as in your patch, just formulated more clearly IMHO. Because of the frequent wakeups, we do something else that's not smart: We write out 8k blocks individually, many times a second. Often thousands of 8k pwrites a second.
Even with this patch, when I bumped up wal_writer_delay to 2 so that the wal writer gets woken up by the async commits rather than the timeout, the write pattern is a bit silly: $ strace -p 1099926 # walwriter strace: Process 1099926 attached epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 1991) = 1 read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., 1007616, 49152) = 1007616 fdatasync(5)= 0 pwrite64(5, "\24\321\5\0\1\0\0\0\0 \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) = 16384 epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1 read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 pwrite64(5, "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., 1040384, 1073152) = 1040384 fdatasync(5)= 0 pwrite64(5, "\24\321\4\0\1\0\0\0\0@ \373\5\0\0\0\0\0\0\0\0\0\0\0;\0\0\0\264'\246\3"..., 16384, 2113536) = 16384 epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1 read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 pwrite64(5, "\24\321\5\0\1\0\0\0\0\200 \373\5\0\0\0003\0\0\0\0\0\0\0\320\177 \373\5\0\0\0"..., 1040384, 2129920) = 1040384 fdatasync(5)= 0 In each cycle, the wal writer writes a full 1 MB chunk (wal_writer_flush_after = '1MB'), flushes it, and then perform a smaller write before going to sleep. Those smaller writes seem a bit silly. But I think it's fine. More throughput for less CPU, seems neat :) Indeed, impressive speedup from such a small patch! I'm not add
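To make the point about matching XLogBackgroundFlush() concrete, here is a minimal sketch of the early-return check in XLogSetAsyncXactLSN() expressed in whole blocks rather than bytes. This is only an illustration built from the variables quoted above; the actual attached patch may differ in detail:

    /* Sketch only: compare completed blocks, as XLogBackgroundFlush() does. */
    if (WriteRqstPtr / XLOG_BLCKSZ <=
        LogwrtResult.Flush / XLOG_BLCKSZ + WalWriterFlushAfter)
        return;     /* not enough unflushed WAL yet; don't wake the walwriter */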
Re: run pgindent on a regular basis / scripted manner
On Tue, Oct 24, 2023 at 6:21 AM Jeff Davis wrote: > > On Wed, 2023-10-18 at 22:34 +1300, David Rowley wrote: > > It would be good to learn how many of the committers out of the ones > > you listed that --enable-indent-checks would have saved from breaking > > koel. > > I'd find that a useful option. > +1. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
Hi, On 10/25/23 5:00 AM, shveta malik wrote: On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand wrote: Hi, On 10/23/23 2:56 PM, shveta malik wrote: On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand wrote: We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there is new synced slot(s) to be created on the standby. Do we want to keep this behavior for V1? I think for the slotsync workers case, we should reduce the naptime in the launcher to say 30sec and retain the default one of 3mins for subscription apply workers. Thoughts? Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new API on the standby that would refresh the list of sync slot at wish, thoughts? Do you mean API to refresh list of DBIDs rather than sync-slots? As per current design, launcher gets DBID lists for all the failover slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. I mean an API to get a newly created slot on the primary being created/synced on the standby at wish. Also let's imagine this scenario: - create logical_slot1 on the primary (and don't start using it) Then on the standby we'll get things like: 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot LSN (0/C0049530) and and catalog xmin (754) That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot restart_lsn to a value < at the corresponding restart_lsn slot on the primary. - create logical_slot2 on the primary (and start using it) Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary that would produce things like: 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) With this new dedicated API, it will be: - clear that the API call is "hanging" until there is some activity on the newly created slot (currently there is "waiting for remote slot " message in the logfile as mentioned above but I'm not sure that's enough) - be possible to create/sync logical_slot2 in the example above without waiting for activity on logical_slot1. Maybe we should change our current algorithm during slot creation so that a newly created inactive slot on the primary does not block other newly created "active" slots on the primary to be created on the standby? Depending on how we implement that, the new API may not be needed at all. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use virtual tuple slot for Unique node
On Tue, Oct 24, 2023 at 4:30 AM David Rowley wrote: > > On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat > wrote: > > I ran my experiments again. It seems on my machine the execution times > > do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took > > average of execution times. I did this three times. For each run the > > standard deviation was within 2%. Here are the numbers > > master: 13548.33, 13878.88, 14572.52 > > master + 0001: 13734.58, 14193.83, 14574.73 > > > > So for me, I would say, this particular query performs the same with > > or without patch. > > I'm not really sure which of the 7 queries you're referring to here. > The times you're quoting seem to align best to Q7 from your previous > results, so I'll assume you mean Q7. > > I'm not really concerned with Q7 as both patched and unpatched use > TTSOpsMinimalTuple. It's Q7. Yes. I was responding to your statement: " However, Q7 does seem to be above noise level faster and I'm not sure why.". Anyway, we can set that aside. > > I also think you need to shrink the size of your benchmark down. With > 1 million tuples, you're more likely to be also measuring the time it > takes to get cache lines from memory into the CPU. A smaller scale > test will make this less likely. Also, you'd be better timing SELECT > rather than the time it takes to EXPLAIN ANALYZE. They're not the same > thing. EXPLAIN ANALYZE has additional timing going on and we may end > up not de-toasting toasted Datums. I ran experiments with 10K rows and measured timing using \timing in psql. The measurements are much more flaky than a larger set of rows and EXPLAIN ANALYZE. But I think your observations are good enough. > > > On Thu, Oct 19, 2023 at 4:26 PM David Rowley wrote: > > > We can see that Q2 and Q3 become a bit slower. This makes sense as > > > tts_virtual_materialize() is quite a bit more complex than > > > heap_copy_minimal_tuple() which is a simple palloc/memcpy. > > > > > > > If the source slot is a materialized virtual slot, > > tts_virtual_copyslot() could perform a memcpy of the materialized data > > itself rather than materialising from datums. That might be more > > efficient. > > I think you're talking about just performing a memcpy() of the > VirtualTupleTableSlot->data field. Unfortunately, you'd not be able > to do just that as you'd also need to repoint the non-byval Datums in > tts_values at the newly memcpy'd memory. If you skipped that part, > those would remain pointing to the original memory. If that memory > goes away, then bad things will happen. I think you'd still need to do > the 2nd loop in tts_virtual_materialize() Yes, we will need repoint non-byval Datums ofc. > > > May be we should fix the above said inefficiency in > > tt_virtual_copyslot()? > > I don't have any bright ideas on how to make tts_virtual_materialize() > itself faster. If there were some way to remember !attbyval > attributes for the 2nd loop, that might be good, but creating > somewhere to store that might result in further overheads. We may save the size of data in VirtualTupleTableSlot, thus avoiding the first loop. I assume that when allocating VirtualTupleTableSlot->data, we always know what size we are allocating so it should be just a matter of saving it in VirtualTupleTableSlot->size. This should avoid the first loop in tts_virtual_materialize() and give some speed up. We will need a loop to repoint non-byval Datums. 
I imagine that the pointers to non-byval Datums can be computed as dest_slot->tts_values[natts] = dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data). This would work as long as all the non-byval datums in the source slot are all stored flattened in source slot's data. I am assuming that that would be true in a materialized virtual slot. The byval datums are copied as is. I think, this will avoid multiple memcpy calls, one per non-byval attribute and hence some speedup. I may be wrong though. -- Best Wishes, Ashutosh Bapat
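A rough sketch of what Ashutosh describes, assuming a hypothetical VirtualTupleTableSlot->size field that records the length of the already-materialized data area (this is not code from any posted patch, just an illustration of the memcpy-and-repoint idea):

    VirtualTupleTableSlot *src_vslot = (VirtualTupleTableSlot *) srcslot;
    VirtualTupleTableSlot *dst_vslot = (VirtualTupleTableSlot *) dstslot;
    TupleDesc   desc = srcslot->tts_tupleDescriptor;

    /* copy the flattened data area wholesale, then repoint non-byval Datums */
    dst_vslot->data = MemoryContextAlloc(dstslot->tts_mcxt, src_vslot->size);
    memcpy(dst_vslot->data, src_vslot->data, src_vslot->size);

    for (int natt = 0; natt < desc->natts; natt++)
    {
        Form_pg_attribute att = TupleDescAttr(desc, natt);

        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
        if (att->attbyval || srcslot->tts_isnull[natt])
            dstslot->tts_values[natt] = srcslot->tts_values[natt];
        else
            dstslot->tts_values[natt] = PointerGetDatum(dst_vslot->data +
                (DatumGetPointer(srcslot->tts_values[natt]) - src_vslot->data));
    }
    /* plus the usual tts_nvalid/flag bookkeeping, omitted here */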
Re: Synchronizing slots from primary to standby
Hi, On 10/25/23 6:57 AM, Amit Kapila wrote: On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand wrote: On 10/24/23 7:44 AM, Ajin Cherian wrote: On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand wrote: 2) When we create a subscription, another slot is created during the subscription synchronization, namely like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). This extra slot appears to have failover also set to true. So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like on the standby: LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and and catalog xmin (756) LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756) LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog xmin (756) WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover during the subscription creation, better to re-launch the subscription instead? But note that the subscription doesn't wait for the completion of tablesync. Right. So, how will we deal with that? Also, this situation is the same for non-tablesync slots as well. I have given another option in the email [1] which is to enable failover even for the main slot after all tables are in ready state, something similar to what we do for two_phase. Oh right that looks like a better option (enable failover even for the main slot after all tables are in ready state). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Wed, 25 Oct 2023 at 04:55, David Rowley wrote: > With foreach(), we commonly do "if (lc == NULL)" at the end of loops > as a way of checking if we did "break" to terminate the loop early. Afaict it's done pretty infrequently. The following crude attempt at an estimate estimates it's only done about ~1.5% of the time a foreach is used: $ rg 'lc == NULL' | wc -l 13 $ rg '\bforeach\(lc,' -S | wc -l 899 > Doing the equivalent with the new macros won't be safe as the list > element's value we broke on may be set to NULL. I think it might be a > good idea to document the fact that this wouldn't be safe with the new > macros, or better yet, document the correct way to determine if we > broke out the loop early. I imagine someone will want to do some > conversion work at some future date and it would be good if we could > avoid introducing bugs during that process. > > I wonder if we should even bother setting the variable to NULL at the > end of the loop. It feels like someone might just end up mistakenly > checking for NULLs even if we document that it's not safe. If we left > the variable pointing to the last list element then the user of the > macro is more likely to notice their broken code. It'd also save a bit > of instruction space. Makes sense. Addressed this now by mentioning this limitation and possible workarounds in the comments of the new macros and by not setting the loop variable to NULL/0. I don't think there's an easy way to add this feature to these new macros natively, it's a limitation of not having a second variable. This seems fine to me, since these new macros are meant as an addition to foreach() instead of a complete replacement. v3-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v3-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data
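As a concrete illustration of the documented workaround (treat this as a sketch, not patch documentation; "relid_list" and "target_relid" are hypothetical names, and the patch's for_each_oid variant is assumed):

    bool    found = false;
    Oid     relid;

    for_each_oid(relid, relid_list)
    {
        if (relid == target_relid)
        {
            found = true;
            break;
        }
    }

    /* the loop variable is not reset on normal exit, so test the flag */
    if (!found)
        elog(ERROR, "relation not found in list");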
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
Attached is a slightly updated version, with a bit simpler implementation of foreach_delete_current. Instead of decrementing i and then adding 1 to it when indexing the list, it now indexes the list using a postfix decrement. From 9073777a6d82e2b2db8b1ed9aef200550234d89a Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Tue, 24 Oct 2023 17:13:20 +0200 Subject: [PATCH v4 2/2] Use new for_each_xyz macros in a few places This starts using each of the newly added for_each_xyz macros in at least one place. This shows how they improve readability by reducing the amount of boilerplate code. --- src/backend/executor/execExpr.c | 11 src/backend/executor/nodeSubplan.c | 28 + src/backend/replication/logical/relation.c | 5 ++-- src/backend/replication/logical/tablesync.c | 6 ++--- src/backend/replication/pgoutput/pgoutput.c | 8 +++--- 5 files changed, 25 insertions(+), 33 deletions(-) diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c index 2c62b0c9c84..e73261ec445 100644 --- a/src/backend/executor/execExpr.c +++ b/src/backend/executor/execExpr.c @@ -216,7 +216,8 @@ ExecInitQual(List *qual, PlanState *parent) ExprState *state; ExprEvalStep scratch = {0}; List *adjust_jumps = NIL; - ListCell *lc; + Expr *node; + int jump; /* short-circuit (here and in ExecQual) for empty restriction list */ if (qual == NIL) @@ -250,10 +251,8 @@ ExecInitQual(List *qual, PlanState *parent) scratch.resvalue = &state->resvalue; scratch.resnull = &state->resnull; - foreach(lc, qual) + for_each_ptr(node, qual) { - Expr *node = (Expr *) lfirst(lc); - /* first evaluate expression */ ExecInitExprRec(node, state, &state->resvalue, &state->resnull); @@ -265,9 +264,9 @@ ExecInitQual(List *qual, PlanState *parent) } /* adjust jump targets */ - foreach(lc, adjust_jumps) + for_each_int(jump, adjust_jumps) { - ExprEvalStep *as = &state->steps[lfirst_int(lc)]; + ExprEvalStep *as = &state->steps[jump]; Assert(as->opcode == EEOP_QUAL); Assert(as->d.qualexpr.jumpdone == -1); diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c index c136f75ac24..8f88f289269 100644 --- a/src/backend/executor/nodeSubplan.c +++ b/src/backend/executor/nodeSubplan.c @@ -307,7 +307,7 @@ ExecScanSubPlan(SubPlanState *node, Datum rowresult; bool rownull; int col; - ListCell *plst; + int paramid; if (subLinkType == EXISTS_SUBLINK) { @@ -367,9 +367,8 @@ ExecScanSubPlan(SubPlanState *node, * Now set all the setParam params from the columns of the tuple */ col = 1; - foreach(plst, subplan->setParam) + for_each_int(paramid, subplan->setParam) { -int paramid = lfirst_int(plst); ParamExecData *prmdata; prmdata = &(econtext->ecxt_param_exec_vals[paramid]); @@ -412,9 +411,8 @@ ExecScanSubPlan(SubPlanState *node, * combining expression. 
*/ col = 1; - foreach(plst, subplan->paramIds) + for_each_int(paramid, subplan->paramIds) { - int paramid = lfirst_int(plst); ParamExecData *prmdata; prmdata = &(econtext->ecxt_param_exec_vals[paramid]); @@ -480,10 +478,11 @@ ExecScanSubPlan(SubPlanState *node, } else if (subLinkType == MULTIEXPR_SUBLINK) { + int paramid; + /* We don't care about function result, but set the setParams */ - foreach(l, subplan->setParam) + for_each_int(paramid, subplan->setParam) { -int paramid = lfirst_int(l); ParamExecData *prmdata; prmdata = &(econtext->ecxt_param_exec_vals[paramid]); @@ -604,16 +603,15 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext) slot = ExecProcNode(planstate)) { int col = 1; - ListCell *plst; bool isnew; + int paramid; /* * Load up the Params representing the raw sub-select outputs, then * form the projection tuple to store in the hashtable. */ - foreach(plst, subplan->paramIds) + for_each_int(paramid, subplan->paramIds) { - int paramid = lfirst_int(plst); ParamExecData *prmdata; prmdata = &(innerecontext->ecxt_param_exec_vals[paramid]); @@ -880,11 +878,10 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent) if (subplan->setParam != NIL && subplan->parParam == NIL && subplan->subLinkType != CTE_SUBLINK) { - ListCell *lst; + int paramid; - foreach(lst, subplan->setParam) + for_each_int(paramid, subplan->setParam) { - int paramid = lfirst_int(lst); ParamExecData *prm = &(estate->es_param_exec_vals[paramid]); prm->execPlan = sstate; @@ -906,7 +903,7 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent) List *oplist, *lefttlist, *righttlist; - ListCell *l; + OpExpr *opexpr; /* We need a memory context to hold the hash table(s) */ sstate->hashtablecxt = @@ -966,9 +963,8 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent) cross_eq_funcoids = (Oid *) palloc(ncols * sizeof(Oid));
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Tue, Oct 24, 2023 at 8:53 PM José Neves wrote: > > Hi there, hope to find you well. > > I have a follow-up question to this already long thread. > > Upon deploying my PostgreSQL logical replication fed application on a stale > database, I ended up running out of space, as the replication slot is being > held back till the next time that we receive a data-changing event, and we > advance to that new LSN offset. > I think that the solution for this is to advance our LSN offset every time a > keep-alive message is received ('k' // 107). > My doubt is, can the keep-alive messages be received in between open > transaction events? I think not, but I would like to get your input to be > extra sure as if this happens, and I commit that offset, I may introduce > again faulty logic leading to data loss. > > In sum, something like this wouldn't happen: > BEGIN LSN001 > INSERT LSN002 > KEEP LIVE LSN003 > UPDATE LSN004 > COMMIT LSN005 > If the downstream acknowledges receipt of LSN003 and saves it locally and crashes, upon restart the upstream will resend all the transactions that committed after LSN003 including the one ended at LSN005. So this is safe. -- Best Wishes, Ashutosh
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Ok, I see. In that situation it is safe indeed, as the offset is lower than the current transaction commit. But I think that I asked the wrong question. I guess that the right question is: Can we receive a keep-alive message with an LSN offset bigger than the commit of the open or following transactions? Something like: BEGIN LSN001 INSERT LSN002 KEEP LIVE LSN006 UPDATE LSN004 COMMIT LSN005 Or: KEEP LIVE LSN006 BEGIN LSN001 INSERT LSN002 UPDATE LSN004 COMMIT LSN005 KEEP LIVE LSN007 Or is the sequence ensured not only between commits but also with keep-alive messaging? From: Ashutosh Bapat Sent: 25 October 2023 11:42 To: José Neves Cc: Amit Kapila ; Andres Freund ; pgsql-hack...@postgresql.org Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client On Tue, Oct 24, 2023 at 8:53 PM José Neves wrote: > > Hi there, hope to find you well. > > I have a follow-up question to this already long thread. > > Upon deploying my PostgreSQL logical replication fed application on a stale > database, I ended up running out of space, as the replication slot is being > held back till the next time that we receive a data-changing event, and we > advance to that new LSN offset. > I think that the solution for this is to advance our LSN offset every time a > keep-alive message is received ('k' // 107). > My doubt is, can the keep-alive messages be received in between open > transaction events? I think not, but I would like to get your input to be > extra sure as if this happens, and I commit that offset, I may introduce > again faulty logic leading to data loss. > > In sum, something like this wouldn't happen: > BEGIN LSN001 > INSERT LSN002 > KEEP LIVE LSN003 > UPDATE LSN004 > COMMIT LSN005 > If the downstream acknowledges receipt of LSN003 and saves it locally and crashes, upon restart the upstream will resend all the transactions that committed after LSN003 including the one ended at LSN005. So this is safe.
Re: POC, WIP: OR-clause support for indexes
Hi! On 15.10.2023 01:34, Alexander Korotkov wrote: Hi, Alena! Thank you for your work on the subject. On Wed, Oct 4, 2023 at 10:21 PM a.rybakina wrote: I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I added my own regression tests. Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff as expected and real results: diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300 +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out 2023-10-04 21:55:41.665422459 +0300 @@ -1925,17 +1925,20 @@ EXPLAIN (COSTS OFF) SELECT count(*) FROM tenk1 WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41; - QUERY PLAN - +QUERY PLAN +--- Aggregate -> Bitmap Heap Scan on tenk1 - Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41)) + Recheck Cond: thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand = 41)) -> BitmapOr - -> Bitmap Index Scan on tenk1_thous_tenthous - Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) + -> BitmapOr + -> Bitmap Index Scan on tenk1_thous_tenthous + Index Cond: ((thousand = 42) AND (tenthous = 1)) + -> Bitmap Index Scan on tenk1_thous_tenthous + Index Cond: ((thousand = 42) AND (tenthous = 3)) -> Bitmap Index Scan on tenk1_thous_tenthous Index Cond: (thousand = 41) -(8 rows) +(11 rows) I think this query is not converted, because you only convert top-level ORs in the transform_ors() function. But in the example given, the target OR lays under AND, which in turn lays under another OR. I think you need to make transform_ors() recursive to handle cases like this. I wonder about the default value of the parameter or_transform_limit of 500. In [1] and [2] you show the execution time degradation from 0 to ~500 OR clauses. I made a simple SQL script with the query "SELECT * FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid = 100;". The pgbench results for a single connection in prepared mode are the following. master: 936 tps patched (or_transform_limit == 0) :1414 tps So, transformation to ANY obviously accelerates the execution. I think it's important to identify the cases where this patch causes the degradation. Generally, I don't see why ANY could be executed slower than the equivalent OR clause. So, the possible degradation cases are slower plan generation and worse plans. I managed to find both. As you stated before, currently the OR transformation has a quadratic complexity depending on the number of or-clause-groups. I made a simple test to evaluate this. containing 1 or-clause-groups. SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 * bid = 1 OR ... OR aid + 1 * bid = 1; master: 316ms patched: 7142ms Note, that the current or_transform_limit GUC parameter is not capable of cutting such cases, because it cuts cases lower than the limit not higher than the limit. In the comment, you mention that we could invent something like hash to handle this. Hash should be nice, but the problem is that we currently don't have a generic facility to hash nodes (or even order them). It would be nice to add this facility, that would be quite a piece of work. 
I would propose to limit this patch for now to handle just a single Var node as a non-const side of the clause and implement a simple hash for Vars. Another problem is the possible generation of worse plans. I made an example table with two partial indexes. create table test as (select (random()*10)::int x, (random()*1000) y from generate_series(1,100) i); create index test_x_1_y on test (y) where x = 1; create index test_x_2_y on test (y) where x = 2; vacuum analyze test; Without the transformation of ORs to ANY, our planner manages to use both indexes with a Bitmap scan. # explain select * from test where (x = 1 or x = 2) and y = 100; QUERY PLAN -- Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Wed, Oct 25, 2023 at 4:23 PM José Neves wrote: > > Ok, I see. In that situation is safe indeed, as the offset is lower than the > current transaction commit. > But I think that I asked the wrong question. I guess that the right question > is: Can we receive a keep-alive message with an LSN offset bigger than the > commit of the open or following transactions? > Something like: > > BEGIN LSN001 > INSERT LSN002 > KEEP LIVE LSN006 > UPDATE LSN004 > COMMIT LSN005 > > Or: > > KEEP LIVE LSN006 > BEGIN LSN001 > INSERT LSN002 > UPDATE LSN004 > COMMIT LSN005 > KEEP LIVE LSN007 > AFAIU the code in walsender this isn't possible. Keep alive sends the LSN of the last WAL record it read (sentPtr). Upon reading a commit WAL record, the whole transaction is decoded. Till that point sentPtr is not updated. Please take a look at XLogSendLogical(void) and the places where WalSndKeepaliveIfNecessary() is called. -- Best Wishes, Ashutosh Bapat
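Putting the explanation above into a small client-side sketch (hypothetical names, not code from any posted client): because a keepalive carries sentPtr, which does not move past a commit record until that transaction has been fully decoded and sent, acknowledging keepalive LSNs cannot skip over a pending transaction; a conservative client can still restrict advancing to the gaps between transactions:

    /* msg_type, commit_end_lsn and keepalive_wal_end are hypothetical */
    uint64  acked_lsn = 0;
    bool    in_xact = false;

    switch (msg_type)
    {
        case 'B':               /* Begin */
            in_xact = true;
            break;
        case 'C':               /* Commit */
            in_xact = false;
            acked_lsn = commit_end_lsn;     /* end LSN from the Commit message */
            break;
        case 'k':               /* keepalive */
            if (!in_xact)
                acked_lsn = keepalive_wal_end;  /* safe to acknowledge */
            break;
    }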
Re: ResourceOwner refactoring
On 10/07/2023 15:37, Peter Eisentraut wrote: A few suggestions on the API: > +static ResourceOwnerFuncs tupdesc_resowner_funcs = These aren't all "functions", so maybe another word like "info" or "description" would be more appropriate? > + .release_phase = RESOURCE_RELEASE_AFTER_LOCKS, > + .release_priority = RELEASE_PRIO_TUPDESC_REFS, I suggest assigning explicit non-zero numbers to the RESOURCE_RELEASE_* constants, as well as changing RELEASE_PRIO_FIRST to 1 and adding an assertion that the priority is not zero. Otherwise, someone might forget to assign these fields and would implicitly get phase 0 and priority 0, which would probably work in most cases, but wouldn't be the explicit intention. Done. Then again, you are using RELEASE_PRIO_FIRST for the pgcrypto patch. Is it the recommendation that if there are no other requirements, external users should use that? If we are going to open this up to external users, we should probably have some documented guidance around this. I added a brief comment about that in the ResourceReleasePhase typedef. I also added a section to the README on how to define your own resource kind. (The README doesn't go into any detail on how to choose the release priority though). > + .PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning I think this can be refactored further. We really just need a function to describe a resource, like maybe static char * ResOwnerDescribeTupleDesc(Datum res) { TupleDesc tupdesc = (TupleDesc) DatumGetPointer(res); return psprintf("TupleDesc %p (%u,%d)", tupdesc, tupdesc->tdtypeid, tupdesc->tdtypmod); } and then the common code can generate the warnings from that like elog(WARNING, "%s leak: %s still referenced", kind->name, kind->describe(value)); That way, we could more easily develop further options, such as turning this warning into an error (useful for TAP tests). Also, maybe, you could supply a default description that just prints "%p", which appears to be applicable in about half the places. Refactored it that way. Possible bonus project: I'm not sure these descriptions are so useful anyway. It doesn't help me much in debugging if "File 55 was not released" or some such. If would be cool if ResourceOwnerRemember() in some debug mode could remember the source code location, and then print that together with the leak warning. Yeah, that would be useful. I remember I've hacked something like that as a one-off thing in the past, when debugging a leak. > +#define ResourceOwnerRememberTupleDesc(owner, tupdesc) \ > + ResourceOwnerRemember(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs) > +#define ResourceOwnerForgetTupleDesc(owner, tupdesc) \ > + ResourceOwnerForget(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs) I would prefer that these kinds of things be made inline functions, so that we maintain the type safety of the previous interface. And it's also easier when reading code to see type decorations. (But I suspect the above bonus project would require these to be macros?) > + if (context->resowner) > + ResourceOwnerForgetJIT(context->resowner, context); It seems most ResourceOwnerForget*() calls have this kind of check before them. Could this be moved inside the function? Many do, but I don't know if it's the majority. We could make ResurceOwnerForget(NULL) do nothing, but I think it's better to have the check in the callers. You wouldn't want to silently do nothing when the resource owner is not expected to be NULL. 
(I'm attaching new patch version in my reply to Andres shortly) -- Heikki Linnakangas Neon (https://neon.tech)
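For illustration, the inline-function form Peter asks for could look roughly like this (a sketch based on the macros quoted above; the struct and function names may well end up different in the final patch):

    static inline void
    ResourceOwnerRememberTupleDesc(ResourceOwner owner, TupleDesc tupdesc)
    {
        ResourceOwnerRemember(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs);
    }

    static inline void
    ResourceOwnerForgetTupleDesc(ResourceOwner owner, TupleDesc tupdesc)
    {
        ResourceOwnerForget(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs);
    }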
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
Hi, Is there any conclusion to this issue? Jehan-Guillaume de Rorthais wrote on Thu, Aug 10, 2023 at 23:03: > On Thu, 3 Aug 2023 11:02:43 +0200 > Alvaro Herrera wrote: > > > On 2023-Aug-03, tender wang wrote: > > > > > I think old "sub-FK" should not be dropped, that would violate the > foreign > > > key constraint. > > > > Yeah, I've been playing more with the patch and it is definitely not > > doing the right things. Just eyeballing the contents of pg_trigger and > > pg_constraint for partitions added by ALTER...ATTACH shows that the > > catalog contents are inconsistent with those added by CREATE TABLE > > PARTITION OF. > > Well, as stated in my original message, the patch helps understanding the > problem and sketches a possible solution. It definitely is not complete. > > After DETACHing the table, we surely need to check everything again and > recreate what is needed to keep the FK consistent. > > But should we keep the FK after DETACH? Did you check the two other > discussions > related to FK, self-FK & partition? Unfortunately, as Tender experienced, > the > more we dig the more we find bugs. Moreover, the second one might seem > unsolvable and deserves a closer look. See: > > * FK broken after DETACHing referencing part > https://www.postgresql.org/message-id/20230420144344.40744130%40karst > * Issue attaching a table to a partitioned table with an auto-referenced > foreign key > https://www.postgresql.org/message-id/20230707175859.17c91538%40karst > >
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On 2023-Oct-24, Jelte Fennema wrote: > Many usages of the foreach macro in the Postgres codebase only use the > ListCell variable to then get its value. This adds macros that > simplify iteration code for that very common use case. Instead of > passing a ListCell you can pass a variable of the type of its > contents. This IMHO improves readability of the code by reducing the > total amount of code while also essentially forcing the use of useful > variable names. +1 for getting rid of useless "lc" variables. Looking at for_each_ptr() I think it may be cleaner to follow palloc_object()'s precedent and make it foreach_object() instead (I have no love for the extra underscore, but I won't object to it either). And like foreach_node, have it receive a type name to add a cast to. I'd imagine something like SubscriptionRelState *rstate; foreach_object(SubscriptionRelState *, rstate, table_states_not_ready) { -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: trying again to get incremental backup
On 2023-10-24 Tu 12:08, Robert Haas wrote: It looks like each file entry in the manifest takes about 150 bytes, so 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? I suspect a few people have more files than that. They'll just have to wait to use this feature until we get incremental JSON parsing (or undo the decision to use JSON for the manifest). Robert asked me to work on this quite some time ago, and most of this work was done last year. Here's my WIP for an incremental JSON parser. It works and passes all the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. The reason I haven't posted it before is that it's about 50% slower in pure parsing speed than the current recursive descent parser in my testing. I've tried various things to make it faster, but haven't made much impact. One of my colleagues is going to take a fresh look at it, but maybe someone on the list can see where we can save some cycles. If we can't make it faster, I guess we could use the RD parser for non-incremental cases and only use the non-RD parser for incremental, although that would be a bit sad. However, I don't think we can make the RD parser suitable for incremental parsing - there's too much state involved in the call stack. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com json_incremental_parser-2023-09-25.patch.gz Description: application/gzip
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
On 2023-Oct-25, tender wang wrote: > Hi >Is there any conclusion to this issue? None yet. I intend to work on this at some point, hopefully soon. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar wrote: > > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila wrote: > > > > This and other results shared by you look promising. Will there be any > > improvement in workloads related to clog buffer usage? > > I did not understand this question can you explain this a bit? > I meant to ask about the impact of this patch on accessing transaction status via TransactionIdGetStatus(). Shouldn't we expect some improvement in accessing CLOG buffers? -- With Regards, Amit Kapila.
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Wed, 25 Oct 2023 at 13:52, Alvaro Herrera wrote: > Looking at for_each_ptr() I think it may be cleaner to follow > palloc_object()'s precedent and make it foreach_object() instead (I have > no love for the extra underscore, but I won't object to it either). And > like foreach_node, have it receive a type name to add a cast to. > > I'd imagine something like > > SubscriptionRelState *rstate; > > foreach_object(SubscriptionRelState *, rstate, table_states_not_ready) > { Could you clarify why you think it may be cleaner? I don't see much benefit to passing the type in there if all we use it for is adding a cast. It seems like extra things to type for little benefit. palloc_object uses the passed in type to not only do the cast, but also to determine the size of the the allocation. If foreach_object would allow us to remove the declaration further up in the function I do see a benefit though. I attached a new patchset which includes a 3rd patch that does this (the other patches are equivalent to v4). I quite like that it moves the type declaration to the loop itself, limiting its scope. But I'm not fully convinced it's worth the hackiness of introducing a second for loop that does a single iteration, just to be able to declare a variable of a different type though. But I don't know another way of achieving this. If this hack/trick is deemed acceptable, we can do the same for the other newly introduced macros. The type would not even need to be specified for oid/xid/int because it's already known to be Oid/TransactionId/int respectively. v5-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v5-0003-Introduce-for_each_object.patch Description: Binary data v5-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data
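For readers trying to picture the trick, here is a rough illustration of a single-iteration outer loop used only to declare the typed variable. This is not Jelte's v5 patch, just a sketch built on the existing ForEachState machinery from pg_list.h; a break in the body leaves the inner loop, after which the outer loop's condition fails, so break and continue behave as expected:

    #define for_each_object(type, var, lst) \
        for (ForEachState var##__state = {(lst), 0}; \
             var##__state.l != NIL; \
             var##__state.l = NIL) \
            for (type var = NULL; \
                 var##__state.i < var##__state.l->length && \
                 (var = (type) lfirst(&var##__state.l->elements[var##__state.i]), true); \
                 var##__state.i++)

    /* usage, per Alvaro's example upthread */
    for_each_object(SubscriptionRelState *, rstate, table_states_not_ready)
    {
        /* ... use rstate ... */
    }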
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila wrote: > > On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar wrote: > > > > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila wrote: > > > > > > This and other results shared by you look promising. Will there be any > > > improvement in workloads related to clog buffer usage? > > > > I did not understand this question can you explain this a bit? > > > > I meant to ask about the impact of this patch on accessing transaction > status via TransactionIdGetStatus(). Shouldn't we expect some > improvement in accessing CLOG buffers? Yes, there should be, because 1) there is now no common lock, so contention on a centralized control lock is reduced when we access transaction status from pages falling in different SLRU banks; and 2) the buffer pool size is configurable, so if the workload accesses transaction status across a wide range of pages it helps with frequent buffer eviction, though this might not be the most common case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
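A hypothetical illustration of the bank partitioning described above (names invented here, not taken from the patch): each SLRU page maps to one bank of buffers and each bank gets its own lock, so status lookups on pages in different banks no longer contend on a single control lock:

    int     bankno = pageno % slru_num_banks;              /* pick this page's bank */
    LWLock *banklock = &shared->bank_locks[bankno].lock;   /* hypothetical field */

    LWLockAcquire(banklock, LW_EXCLUSIVE);
    /* search or evict only within this bank's slots ... */
    LWLockRelease(banklock);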
Re: RFC: Pluggable TOAST
Hi Nikita, > We need community feedback on a previously discussed topic [1]. > There are some long-lived issues in Postgres related to the TOAST mechanics, > like [2]. > Some time ago we already proposed a set of patches with an API allowing to > plug in > different TOAST implementations into a live database. The patch set > introduced a lot > of code and was quite crude in some places, so after several implementations > we decided > to try to implement it in the production environment for further check-up. > > The main idea behind pluggable TOAST is to make it possible to easily plug in > and use different > implementations of large values storage, preserving existing mechanics to > keep backward > compatibility, and to provide an easy Postgres-way to give users alternative mechanics > for storing large > column values in a more effective way - we already have custom and very > effective (up to tens > and even hundreds of times faster) TOAST implementations for bytea and JSONb > data types. > > As we see it - Pluggable TOAST proposes > 1) changes in TOAST pointer itself, extending it to store custom data - > current limitations > of TOAST pointer were discussed in [1] and [4]; > 2) API which allows calls of custom TOAST implementations for certain table > columns and > (a topic for discussion) certain datatypes. > > Custom TOAST could also be used in a not so trivial way - for example, > limited columnar storage could be easily implemented and plugged in without > heavy core modifications > of implementation of Pluggable Storage (Table Access Methods), preserving > existing data > and database structure, be upgraded, replicated and so on. > > Any thoughts and proposals are welcome. It seems to me that discarding the previous discussion and starting a new thread where you ask the community for *another* round of feedback is not going to be productive. Pretty sure it's not going to change. -- Best regards, Aleksander Alekseev
Re: race condition in pg_class
> On 25 Oct 2023, at 13:39, Smolkin Grigory wrote: > > We are running PG13.10 and recently we have encountered what appears to be a > bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and > some other catalog-writer, possibly ANALYZE > I've tried to reproduce this scenario with CREATE INDEX and various > concurrent statements, but no luck. Maybe it would be possible to reproduce by modifying the tests for concurrent index creation. For example, add “ANALYZE” here [0]. Keep in mind that for easier reproduction it would make sense to increase the transaction count radically. Best regards, Andrey Borodin. [0] https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Hi Andrei, On Wed, 2023-10-25 at 13:59 +0700, Andrei Lepikhov wrote: > But minmax_stats_since and changes in the UI of the reset routine > look like syntactic sugar here. > I can't convince myself that it is really needed. Do you have some > set of cases that can enforce the changes proposed? Yes, it looks strange, but it is needed in some way. The main purpose of this patch is to provide the means for sampling solutions to collect statistics data in a series of samples. The first goal here is to be sure that the statement was not evicted and did not come back between samples (especially between rare samples). This is the target of the stats_since field. The second goal is the ability to get all statistics increments for the interval between samples. All statistics provided by pg_stat_statements, with the exception of min/max values, can be calculated for the interval since the last sample knowing the values in the last and current samples. We need a way to get min/max values too. This is achieved by a min/max reset performed by the sampling solution. The minmax_stats_since field is here for the same purpose - the sampling solution needs to be sure that no one has done a min/max reset between samples. And if such a reset was performed, at least we know when it was performed. regards, Andrei Zubkov Postgres Professional
Re: trying again to get incremental backup
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan wrote: > Robert asked me to work on this quite some time ago, and most of this > work was done last year. > > Here's my WIP for an incremental JSON parser. It works and passes all > the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. > The reason I haven't posted it before is that it's about 50% slower in > pure parsing speed than the current recursive descent parser in my > testing. I've tried various things to make it faster, but haven't made > much impact. One of my colleagues is going to take a fresh look at it, > but maybe someone on the list can see where we can save some cycles. > > If we can't make it faster, I guess we could use the RD parser for > non-incremental cases and only use the non-RD parser for incremental, > although that would be a bit sad. However, I don't think we can make the > RD parser suitable for incremental parsing - there's too much state > involved in the call stack. Yeah, this is exactly why I didn't want to use JSON for the backup manifest in the first place. Parsing such a manifest incrementally is complicated. If we'd gone with my original design where the manifest consisted of a bunch of lines each of which could be parsed separately, we'd already have incremental parsing and wouldn't be faced with these difficult trade-offs. Unfortunately, I'm not in a good position either to figure out how to make your prototype faster, or to evaluate how painful it is to keep both in the source tree. It's probably worth considering how likely it is that we'd be interested in incremental JSON parsing in other cases. Maintaining two JSON parsers is probably not a lot of fun regardless, but if each of them gets used for a bunch of things, that feels less bad than if one of them gets used for a bunch of things and the other one only ever gets used for backup manifests. Would we be interested in JSON-format database dumps? Incrementally parsing JSON LOBs? Either seems tenuous, but those are examples of the kind of thing that could make us happy to have incremental JSON parsing as a general facility. If nobody's very excited by those kinds of use cases, then this just boils down to whether we want to (a) accept that users with very large numbers of relation files won't be able to use pg_verifybackup or incremental backup, (b) accept that we're going to maintain a second JSON parser just to enable that use cas and with no other benefit, or (c) undertake to change the manifest format to something that is straightforward to parse incrementally. I think (a) is reasonable short term, but at some point I think we should do better. I'm not really that enthused about (c) because it means more work for me and possibly more arguing, but if (b) is going to cause a lot of hassle then we might need to consider it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add connection active, idle time to pg_stat_activity
Hi Sergey,

I've done a review of this patch. I find the patch idea very useful, thank you for the patch. I noted a few things while looking at it:

1. The patch can't be applied to the current master. My review is based
   on applying this patch over ac68323a878.
2. Applied over ac68323a878, the patch works as expected.
3. The field names seem quite long to me (and they should be named
   consistently with the same statistics in other views; for example,
   what is called "running" here is called "active" in pg_stat_database).
4. Meaningless spaces at the end of a line:
   - backend_status.c:586
   - monitoring.sgml:5857
5. The patch adds usecs_diff = secs * 1000000 + usecs; in
   backend_status.c:pgstat_report_activity() to optimize calculations,
   but pgstat_count_conn_active_time((PgStat_Counter) secs * 1000000 + usecs);
   and pgstat_count_conn_txn_idle_time((PgStat_Counter) secs * 1000000 + usecs);
   are left in place after that (a possible cleanup is sketched below).
6. I'm not sure I understand the comment
   /* Keep statistics for pg_stat_database intact */
   at backend_status.c:600 correctly. Can you please explain it a little?
7. The tests seem incomplete. It looks like we could check increments of
   all fields by playing with transactions in the tests.

Also, I have a thought about another possible improvement that fits this patch. The pg_stat_session view is really needed in Postgres, but I think it should have many more statistics - I mean all resource statistics related to sessions. Every backend has instrumentation that tracks resource consumption. The data from this instrumentation goes to the cumulative statistics system and is used in monitoring extensions (like pg_stat_statements). I think the pg_stat_session view could add one more dimension of monitoring - a dimension of sessions. In my opinion, this view should provide resource consumption statistics of current sessions as two cumulative sets of statistics - since backend start and since transaction start. Such a view would be really useful for monitoring long-running sessions and transactions, providing resource consumption information besides timing statistics.

regards, Andrei Zubkov
Postgres Professional
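[A minimal sketch of the cleanup suggested in item 5, reusing the precomputed value. The surrounding condition and variable names are assumptions based on the snippets quoted above, not the actual patch.]

	TimestampDifference(beentry->st_state_start_timestamp,
						current_timestamp, &secs, &usecs);
	usecs_diff = secs * 1000000 + usecs;

	if (beentry->st_state == STATE_RUNNING ||
		beentry->st_state == STATE_FASTPATH)
		pgstat_count_conn_active_time((PgStat_Counter) usecs_diff);
	else
		pgstat_count_conn_txn_idle_time((PgStat_Counter) usecs_diff);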
Re: Add connection active, idle time to pg_stat_activity
Hi, > I've done a review of this patch. I found the patch idea very useful, > thank you for the patch. I've noted something observing this patch: > 1. Patch can't be applied on the current master. My review is based on >application of this patch over ac68323a878 On top of that not sure if I see the patch on the November commitfest [1]. Please make sure it's there so that cfbot will check the patch. [1]: https://commitfest.postgresql.org/45/ -- Best regards, Aleksander Alekseev
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 19.10.2023 15:40, Andrei Zubkov wrote:
> Hi hackers,
>
> New version 23 attached. It contains a rebase onto the current master.
> Noted that v1.11 adds new fields to the pg_stat_statements view, but
> leaves the PGSS_FILE_HEADER constant unchanged. Is this correct?

Hi! Thank you for your work on the subject.

1. I didn't understand why we first try to find the pgssEntry with a false top_level value, and later find it with a true top_level value.

	/*
	 * Remove the entry if it exists, starting with the non-top-level entry.
	 */
	key.toplevel = false;
	entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);

	SINGLE_ENTRY_RESET(entry);

	/* Also reset the top-level entry if it exists. */
	key.toplevel = true;
	entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);

	SINGLE_ENTRY_RESET(entry);

I looked through this topic and found some explanation in this email [0], but I didn't understand it. Can you explain it to me?

2. And honestly, I think you should change "Remove the entry if it exists, starting with the non-top-level entry." to "Remove the entry or reset min/max time statistic information and the timestamp if it exists, starting with the non-top-level entry." And the same for the comment below: "Also reset the top-level entry if it exists." should become "Also remove the entry or reset min/max time statistic information and the timestamp if it exists."

In my opinion, this is necessary because the minmax_only parameter is set by the user, so both ways are possible.

0 - https://www.postgresql.org/message-id/62d16845-e74e-a6f9-9661-022e44f48922%40inbox.ru

--
Regards, Alena Rybakina
Re: Add connection active, idle time to pg_stat_activity
Hi Aleksander, On Wed, 2023-10-25 at 16:17 +0300, Aleksander Alekseev wrote: > On top of that not sure if I see the patch on the November commitfest > [1]. Please make sure it's there so that cfbot will check the patch. Yes, this patch is listed on the November commitfest. cfbot says rebase needed since 2023-08-21. regards, Andrei Zubkov
Re: Add connection active, idle time to pg_stat_activity
Hi, > On Wed, 2023-10-25 at 16:17 +0300, Aleksander Alekseev wrote: > > On top of that not sure if I see the patch on the November commitfest > > [1]. Please make sure it's there so that cfbot will check the patch. > > Yes, this patch is listed on the November commitfest. cfbot says rebase > needed since 2023-08-21. You are right, I missed the corresponding entry [1]. [1]: https://commitfest.postgresql.org/45/3405/ -- Best regards, Aleksander Alekseev
Re: trying again to get incremental backup
On 2023-10-25 We 09:05, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan wrote:
>> Robert asked me to work on this quite some time ago, and most of this
>> work was done last year.
>>
>> Here's my WIP for an incremental JSON parser. It works and passes all
>> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
>> The reason I haven't posted it before is that it's about 50% slower in
>> pure parsing speed than the current recursive descent parser in my
>> testing. I've tried various things to make it faster, but haven't made
>> much impact. One of my colleagues is going to take a fresh look at it,
>> but maybe someone on the list can see where we can save some cycles.
>>
>> If we can't make it faster, I guess we could use the RD parser for
>> non-incremental cases and only use the non-RD parser for incremental,
>> although that would be a bit sad. However, I don't think we can make the
>> RD parser suitable for incremental parsing - there's too much state
>> involved in the call stack.
>
> Yeah, this is exactly why I didn't want to use JSON for the backup
> manifest in the first place. Parsing such a manifest incrementally is
> complicated. If we'd gone with my original design where the manifest
> consisted of a bunch of lines each of which could be parsed separately,
> we'd already have incremental parsing and wouldn't be faced with these
> difficult trade-offs.
>
> Unfortunately, I'm not in a good position either to figure out how to
> make your prototype faster, or to evaluate how painful it is to keep
> both in the source tree. It's probably worth considering how likely it
> is that we'd be interested in incremental JSON parsing in other cases.
> Maintaining two JSON parsers is probably not a lot of fun regardless,
> but if each of them gets used for a bunch of things, that feels less bad
> than if one of them gets used for a bunch of things and the other one
> only ever gets used for backup manifests. Would we be interested in
> JSON-format database dumps? Incrementally parsing JSON LOBs? Either
> seems tenuous, but those are examples of the kind of thing that could
> make us happy to have incremental JSON parsing as a general facility.
>
> If nobody's very excited by those kinds of use cases, then this just
> boils down to whether we want to (a) accept that users with very large
> numbers of relation files won't be able to use pg_verifybackup or
> incremental backup, (b) accept that we're going to maintain a second
> JSON parser just to enable that use case and with no other benefit, or
> (c) undertake to change the manifest format to something that is
> straightforward to parse incrementally. I think (a) is reasonable short
> term, but at some point I think we should do better. I'm not really
> that enthused about (c) because it means more work for me and possibly
> more arguing, but if (b) is going to cause a lot of hassle then we might
> need to consider it.

I'm not too worried about the maintenance burden. The RD routines were added in March 2013 (commit a570c98d7fa) and have hardly changed since then. The new code is not ground-breaking - it's just a different (and fairly well known) way of doing the same thing. I'd be happier if we could make it faster, but maybe it's just a fact that keeping an explicit stack, which is how this works, is slower.

I wouldn't at all be surprised if there were other good uses for incremental JSON parsing, including some you've identified.

That said, I agree that JSON might not be the best format for backup manifests, but maybe that ship has sailed.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: [patch] pg_basebackup: mention that spread checkpoints are the default in --help
On 19.10.23 11:39, Michael Banck wrote:
> Hi,
>
> I believed that spread (not fast) checkpoints are the default in
> pg_basebackup, but noticed that --help does not specify which is which -
> contrary to the reference documentation. So I propose the small attached
> patch to clarify that.

> 	printf(_("  -c, --checkpoint=fast|spread\n"
> -			 "                         set fast or spread checkpointing\n"));
> +			 "                         set fast or spread (default) checkpointing\n"));

Could we do it like this instead:

  -c, --checkpoint=fast|spread
                         set fast or spread checkpointing (default: spread)

This seems easier to read.
Re: [dynahash] do not refill the hashkey after hash_search
On Wed, Oct 25, 2023 at 12:48:52PM +0700, John Naylor wrote: > On Wed, Oct 25, 2023 at 12:21 PM Tom Lane wrote: >> >> John Naylor writes: >> > I'd prefer just adding "Assert(hentry->event == oldn);" and declaring >> > hentry PG_USED_FOR_ASSERTS_ONLY. >> >> I'm not aware of any other places where we have Asserts checking >> that hash_search() honored its contract. Why do we need one here? > > [removing old CC] > The author pointed out here that we're not consistent in this regard: > > https://www.postgresql.org/message-id/CAEG8a3KEO_Kdt2Y5hFNWMEX3DpCXi9jtZOJY-GFUEE9QLgF%2Bbw%40mail.gmail.com > > ...but I didn't try seeing where the balance lay. We can certainly > just remove redundant assignments. While it probably doesn't hurt anything, IMHO it's unnecessary to verify that hash_search() works every time it is called. This behavior seems unlikely to change anytime soon, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
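[To make the trade-off concrete, here is a small, hypothetical sketch of the pattern being discussed; the struct and names are made up for illustration and are not taken from the actual patch.]

typedef struct DemoCacheEntry
{
	char		name[NAMEDATALEN];	/* hash key, filled in by hash_search() */
	int			hits;
} DemoCacheEntry;

static void
demo_cache_bump(HTAB *demo_hash, const char *name)
{
	DemoCacheEntry *entry;
	bool		found;

	entry = (DemoCacheEntry *) hash_search(demo_hash, name,
										   HASH_ENTER, &found);

	/*
	 * hash_search() has already copied the key into a newly entered
	 * element, so re-filling entry->name here would be redundant.  An
	 * Assert (with the variable marked PG_USED_FOR_ASSERTS_ONLY when it
	 * has no other use) is the alternative way to document that contract.
	 */
	Assert(strcmp(entry->name, name) == 0);

	if (!found)
		entry->hits = 0;
	entry->hits++;
}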
libpq async connection and multiple hosts
Hello,

We are aware that, using async connection functions (`PQconnectStart`, `PQconnectPoll`), the `connect_timeout` parameter is not supported; this is documented at https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PQCONNECTSTARTPARAMS

"""
The connect_timeout connection parameter is ignored when using PQconnectPoll; it is the application's responsibility to decide whether an excessive amount of time has elapsed. Otherwise, PQconnectStart followed by a PQconnectPoll loop is equivalent to PQconnectdb.
"""

However, ISTM that connecting to multiple hosts is not supported either. I have a couple of issues I am looking into in psycopg 3:

- https://github.com/psycopg/psycopg/issues/602
- https://github.com/psycopg/psycopg/issues/674

Do we have to reimplement the connection attempts loop too? Are there other policies that we would need to reimplement? Is `target_session_attrs` taken care of by PQconnectPoll?

On my box (testing with psql and libpq itself), PQconnect("host=8.8.8.8") fails after 2m10s. Is this the result of some unspecified socket connection timeout on my Ubuntu machine?

If we need to reimplement async connection to "host=X,Y", we will need to use a timeout even if the user didn't specify one, otherwise we will never stop the connection attempt to X and move to Y. What timeout can we specify that will not upset anyone?

Thank you very much

-- Daniele
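[For reference, a bare-bones sketch of one connection attempt driven by PQconnectStart()/PQconnectPoll() with an application-side deadline - the sort of loop the quoted documentation says the application must provide itself. The helper name and the (minimal) error handling are illustrative only.]

#include <poll.h>
#include <time.h>
#include <libpq-fe.h>

static PGconn *
connect_with_deadline(const char *conninfo, int timeout_sec)
{
	PGconn	   *conn = PQconnectStart(conninfo);
	time_t		deadline = time(NULL) + timeout_sec;
	PostgresPollingStatusType status = PGRES_POLLING_WRITING;

	if (conn == NULL || PQstatus(conn) == CONNECTION_BAD)
		goto fail;

	while (status != PGRES_POLLING_OK)
	{
		struct pollfd pfd;
		int			remaining = (int) (deadline - time(NULL));

		if (status == PGRES_POLLING_FAILED || remaining <= 0)
			goto fail;

		/* wait for the direction PQconnectPoll asked for, up to the deadline */
		pfd.fd = PQsocket(conn);
		pfd.events = (status == PGRES_POLLING_READING) ? POLLIN : POLLOUT;
		if (poll(&pfd, 1, remaining * 1000) <= 0)
			goto fail;			/* timed out or poll() failed */

		status = PQconnectPoll(conn);
	}

	return conn;				/* connection established */

fail:
	if (conn)
		PQfinish(conn);
	return NULL;
}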
Re: Synchronizing slots from primary to standby
Hi,

On 10/9/23 12:30 PM, shveta malik wrote:
> PFA v22 patch-set. It has below changes:
> patch 001:
> 1) Now physical walsender wakes up logical walsender(s) by using a new
> CV as suggested in [1]

Thanks!

I think that works fine as long as the standby is up and running and catching up. The problem I see with the current WalSndWaitForStandbyConfirmation() implementation is that if the standby is not running, then:

+	for (;;)
+	{
+		ListCell   *l;
+		long		sleeptime = -1;

will loop until we reach the "terminating walsender process due to replication timeout" if we explicitly want to end with SIGINT or friends.

For example, in a scenario like:

- standby down
- pg_recvlogical running

then CTRL-C on pg_recvlogical would not "respond" immediately, but only when we reach the replication timeout.

So it seems that we should use something like WalSndWait() instead of ConditionVariableTimedSleep() here:

+		/*
+		 * Sleep until other physical walsenders awaken us or until a timeout
+		 * occurs.
+		 */
+		sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+
+		ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, sleeptime,
+									WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);

In that case I think that WalSndWait() should take care of the new CV WalSndCtl->wal_confirm_rcv_cv too. The wait on the socket should allow us to stop waiting when, for example, CTRL-C on pg_recvlogical is triggered.

Then we would need to deal with this scenario: the standby is down or not catching up, and we exited WalSndWait() due to the socket - we'd have to break the loop or shut down the walsender.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: trying again to get incremental backup
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan wrote: > I'm not too worried about the maintenance burden. > > That said, I agree that JSON might not be the best format for backup > manifests, but maybe that ship has sailed. I think it's a decision we could walk back if we had a good enough reason, but it would be nicer if we didn't have to, because what we have right now is working. If we change it for no real reason, we might introduce new bugs, and at least in theory, incompatibility with third-party tools that parse the existing format. If you think we can live with the additional complexity in the JSON parsing stuff, I'd rather go that way. -- Robert Haas EDB: http://www.enterprisedb.com
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 17:03, Daniele Varrazzo wrote:
> However, ISTM that connecting to multiple hosts is not supported
> either. I have a couple of issues I am looking into in psycopg 3:
>
> - https://github.com/psycopg/psycopg/issues/602
> - https://github.com/psycopg/psycopg/issues/674

Another approach is to use tcp_user_timeout instead of connect_timeout to skip non-responsive hosts. It's not completely equivalent to connect_timeout though, since it also applies when the connection is actually being used. Also, it only works on Linux afaik. It could be nice to add support for BSD's TCP_CONNECTIONTIMEOUT socket option.

> Do we have to reimplement the connection attempts loop too?

If you want to support connect_timeout, it seems yes.

> Are there other policies that we would need to reimplement? Is
> `target_session_attrs` taken care of by PQconnectPoll?

Afaict from the code, target_session_attrs are handled inside PQconnectPoll, so you would not have to re-implement that. PQconnectPoll would simply fail if target_session_attrs don't match for the server.

You should implement load_balance_hosts=random though, by randomizing your hosts list.
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Hi Alena,

On Wed, 2023-10-25 at 16:25 +0300, Alena Rybakina wrote:
> Hi! Thank you for your work on the subject.
> 1. I didn't understand why we first try to find the pgssEntry with a
> false top_level value, and later find it with a true top_level value.

In the case of pg_stat_statements, the top_level field is a bool field of the pgssHashKey struct used as the key for the pgss_hash hashtable. When we are performing a reset, only userid, dbid and queryid values are provided. We need to reset both top-level and non-top-level entries. We have only one way to get them all from the hashtable - search for entries having top_level=true and top_level=false in their keys.

> 2. And honestly, I think you should change
> "Remove the entry if it exists, starting with the non-top-level
> entry." to
> "Remove the entry or reset min/max time statistic information and
> the timestamp if it exists, starting with the non-top-level entry."
> And the same for the comment below:
> "Also reset the top-level entry if it exists."
> "Also remove the entry or reset min/max time statistic information
> and the timestamp if it exists."

There are actually four such places - every time the SINGLE_ENTRY_RESET macro is used. The reset logic is described a bit in the macro definition, so it seems quite redundant to repeat this description four times. But I've noted that the "remove" wording should be replaced by "reset". I've attached v24 with the commits updated.

regards, Andrei Zubkov
Postgres Professional

From 4fadc88d5e6c1afe7558393ba99c28d070ac7244 Mon Sep 17 00:00:00 2001
From: Andrei Zubkov
Date: Wed, 25 Oct 2023 17:58:57 +0300
Subject: [PATCH 1/2] pg_stat_statements tests: Add NOT NULL checking of
 pg_stat_statements_reset

This is preliminary patch. It adds NOT NULL checking for the result of
pg_stat_statements_reset() function. It is needed for upcoming patch
"Track statement entry timestamp" that will change the result type of
this function to the timestamp of a reset performed.
Author: Andrei Zubkov (zubkov) Reviewed by: Julien Rouhaud (rjuju), Hayato Kuroda (ha-kun), Yuki Seino (seinoyu), Chengxi Sun (martin-sun), Anton Melnikov (antonmel), Darren Rush (darrenr), Michael Paquier (michael-kun), Sergei Kornilov (melkij), Alena Rybakina (a.rybakina), Andrei Lepikhov (lepikhov) Discussion: https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru --- .../pg_stat_statements/expected/cursors.out | 28 +-- contrib/pg_stat_statements/expected/dml.out | 28 +-- .../expected/level_tracking.out | 80 .../pg_stat_statements/expected/planning.out | 18 +- .../pg_stat_statements/expected/select.out| 44 ++-- .../expected/user_activity.out| 120 +-- .../pg_stat_statements/expected/utility.out | 192 +- contrib/pg_stat_statements/expected/wal.out | 10 +- contrib/pg_stat_statements/sql/cursors.sql| 6 +- contrib/pg_stat_statements/sql/dml.sql| 6 +- .../pg_stat_statements/sql/level_tracking.sql | 12 +- contrib/pg_stat_statements/sql/planning.sql | 4 +- contrib/pg_stat_statements/sql/select.sql | 10 +- .../pg_stat_statements/sql/user_activity.sql | 15 +- contrib/pg_stat_statements/sql/utility.sql| 34 ++-- contrib/pg_stat_statements/sql/wal.sql| 2 +- 16 files changed, 307 insertions(+), 302 deletions(-) diff --git a/contrib/pg_stat_statements/expected/cursors.out b/contrib/pg_stat_statements/expected/cursors.out index 46375ea9051..0fc4b2c098d 100644 --- a/contrib/pg_stat_statements/expected/cursors.out +++ b/contrib/pg_stat_statements/expected/cursors.out @@ -3,10 +3,10 @@ -- -- These tests require track_utility to be enabled. SET pg_stat_statements.track_utility = TRUE; -SELECT pg_stat_statements_reset(); - pg_stat_statements_reset --- - +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t (1 row) -- DECLARE @@ -20,13 +20,13 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; ---+--+--- 2 |0 | CLOSE cursor_stats_1 2 |0 | DECLARE cursor_stats_1 CURSOR WITH HOLD FOR SELECT $1 - 1 |1 | SELECT pg_stat_statements_reset() + 1 |1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t (3 rows) -SELECT pg_stat_statements_reset(); - pg_stat_statements_reset --- - +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t (1 row) -- FETCH @@ -59,12 +59,12 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; 1 |0 | DECLARE cursor_stats_2 CURSOR WITH HOLD FOR SELECT $1 1 |1 | FETCH 1 IN cursor_stats_1 1 |1 | FETCH 1 IN cursor_stats_2 - 1 |1 | SELECT pg_stat_statements_reset() + 1 |1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t (9 rows) -SELECT pg_stat_statements_re
Re: Document aggregate functions better w.r.t. ORDER BY
On Tue, Oct 24, 2023 at 06:45:48PM -0700, David G. Johnston wrote: > I'd prefer to keep pointing out that the ones documented are those whose > outputs will vary due to ordering. Okay, I re-added it in the attached patch, and tightened up the text. > I've been sympathetic to the user comments that we don't have enough > examples. Good point. > Just using array_agg for that purpose, showing both DISTINCT and ORDER BY > seems > like a fair compromise (removes two from my original proposal). The examples > in the section we tell them to go see aren't of that great quality. If you > strongly dislike having the function table contain the examples we should at > least improve the page we are sending them to. (As an aside to this, I've > personally always found the syntax block with the 5 syntaxes shown there to be > intimidating/hard-to-read). I think you are right that it belongs in the syntax section; we cover ordering extensively there. We already have queries there, but not output, so I moved the relevant examples to there and replaced the example that had no output. > I'd at least suggest you reconsider the commentary and examples surrounding > jsonb_object_agg. I moved that as well, and tightened the example. > The same goes for the special knowledge of floating point behavior for why > we've chosen to document avg/sum, something that typically doesn't care about > order, as having an optional order by. The floating example seems too obscure to mention in our function docs. I can put a sentence in the syntax docs, but is there value in explaining that to users? How it that helpful? Example? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..d60c65f8fc 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ... aggregation. + + While all aggregates below accept an optional + ORDER BY clause (as outlined in ), the clause has only been added to + aggregates whose output is affected by ordering. + + General-Purpose Aggregate Functions @@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ... array_agg -array_agg ( anynonarray ) +array_agg ( anynonarray ORDER BY input_sort_columns ) anyarray @@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ... -array_agg ( anyarray ) +array_agg ( anyarray ORDER BY input_sort_columns ) anyarray @@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ... json_agg -json_agg ( anyelement ) +json_agg ( anyelement ORDER BY input_sort_columns ) json jsonb_agg -jsonb_agg ( anyelement ) +jsonb_agg ( anyelement ORDER BY input_sort_columns ) jsonb @@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ... json_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) json @@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ... jsonb_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) jsonb @@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ... string_agg ( value - bytea, delimiter bytea ) + bytea, delimiter bytea + ORDER BY input_sort_columns ) bytea @@ -20851,11 +20861,11 @@ SELECT NULLIF(value, '(none)') ... numeric -sum ( real ) +sum ( real ORDER BY input_sort_columns ) real -sum ( double precision ) +sum ( double precision ORDER BY input_sort_columns ) double precision @@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ... 
xmlagg -xmlagg ( xml ) +xmlagg ( xml ORDER BY input_sort_columns ) xml diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index 3ba844057f..b64733d8aa 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -1647,7 +1647,26 @@ sqrt(2) are always just expressions and cannot be output-column names or numbers. For example: -SELECT array_agg(a ORDER BY b DESC) FROM table; +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(v ORDER BY v DESC) FROM vals; + array_agg +- + {4,3,3,2,1} + +Since jsonb only keeps the last matching key, ordering +of its keys can be significant: + +WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1
Re: CRC32C Parallel Computation Optimization on ARM
+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)

It looks like most of this function is duplicated from pg_comp_crc32c_armv8(). I understand that we probably need a separate function because of the runtime check, but perhaps we could create a common static inline helper function with a branch for when vmull_p64() can be used. Its callers would then just provide a boolean to indicate which branch to take.

+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+    USE_ARMV8_VMULL=1
+  fi
+fi

Hm. I wonder if we need to switch to a runtime check in some cases. For example, what happens if the ARMv8 intrinsics used today are found with the default compiler flags, but vmull_p64() is only available if -march=armv8-a+crypto is added? It looks like the precedent is to use a runtime check if we need extra CFLAGS to produce code that uses the intrinsics.

Separately, I wonder if we should just always do runtime checks for the CRC stuff whenever we can produce code with the intrinsics, regardless of whether we need extra CFLAGS. The check doesn't look terribly expensive, and it might allow us to simplify the code a bit (especially now that we support a few different architectures).

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 17:35, Jelte Fennema wrote:
> You should implement load_balance_hosts=random though,
> by randomizing your hosts list.

Good catch.

So it seems that, if someone wants to build an equivalent async version of PQconnectdb, they need to handle on their own:

- connect_timeout
- multiple host, hostaddr, port
- load_balance_hosts=random

Does this list sound complete?

-- Daniele
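[As an illustration of the multiple-host part, a rough sketch of the outer loop such a reimplementation might need: shuffle the per-host parameters to emulate load_balance_hosts=random, then give each attempt its own deadline. connect_with_deadline() is the hypothetical helper sketched earlier in the thread, and representing each host as a separate conninfo string is an assumption of the example.]

#include <stdlib.h>

static PGconn *
connect_any_host(char **conninfos, int nhosts, int timeout_sec)
{
	/* Fisher-Yates shuffle to emulate load_balance_hosts=random */
	for (int i = nhosts - 1; i > 0; i--)
	{
		int			j = rand() % (i + 1);
		char	   *tmp = conninfos[i];

		conninfos[i] = conninfos[j];
		conninfos[j] = tmp;
	}

	/* try each host in turn, each with its own timeout */
	for (int i = 0; i < nhosts; i++)
	{
		PGconn	   *conn = connect_with_deadline(conninfos[i], timeout_sec);

		if (conn != NULL)
			return conn;
	}

	return NULL;
}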
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 17:35, Jelte Fennema wrote:
> Another approach is to use tcp_user_timeout instead of connect_timeout
> to skip non-responsive hosts. It's not completely equivalent to
> connect_timeout though, since it also applies when the connection
> is actually being used. Also, it only works on Linux afaik. It could be
> nice to add support for BSD's TCP_CONNECTIONTIMEOUT socket option.

This seems brittle and platform-dependent enough that we would surely receive less grief by hardcoding a default two-minute timeout.

-- Daniele
Re: remaining sql/json patches
Hi!

Amit, regarding the previous email, patch #2 - I agree that it is not the best idea to introduce a new type of logic into the parser, so this logic could be moved to the executor, or removed altogether. What do you think of these options?

On Wed, Oct 18, 2023 at 5:19 AM jian he wrote:
> Hi.
> based on v22.
>
> I added some tests against json_value for the sake of coverage testing.
>
> A previous email thread mentioned needing to check *empty in
> ExecEvalJsonExpr.
> Since JSON_VALUE_OP, JSON_QUERY_OP, JSON_EXISTS_OP all need to have
> *empty cases, I refactored a little bit.
> Might be helpful. Maybe we can also refactor *error cases.
>
> The following part is not easy to understand.
> res = ExecPrepareJsonItemCoercion(jbv,
> +			jsestate->item_jcstates,
> +			&post_eval->jcstate);
> + if (post_eval->jcstate &&
> +	post_eval->jcstate->coercion &&
> +	(post_eval->jcstate->coercion->via_io ||
> +	post_eval->jcstate->coercion->via_populate))

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Should we represent temp files as unsigned long int instead of signed long int type?
Hi All,

At present, we represent temp files as a signed long int number, and depending on the system architecture (32-bit or 64-bit), the range of signed long int varies; for example, on a 32-bit system it ranges from -2,147,483,648 to 2,147,483,647. AFAIU, this will not allow a session to create more than 2 billion temporary files, and that is not a small number at all, but still, what if we make it an unsigned long int, which would allow a session to create 4 billion temporary files if needed?

I might be sounding a little stupid here, because 2 billion temporary files is like 2000 petabytes (2 billion * 1GB), considering each temp file is 1GB in size, which is not a small data size at all - it is a huge amount of data storage. However, since the variable we use to name temporary files is a static long int (static long tempFileCounter = 0;), there is a possibility that this number will get exhausted if the same session is trying to create too many temp files via multiple queries.

Just adding a few lines of code related to this from fd.c:

/*
 * Number of temporary files opened during the current session;
 * this is used in generation of tempfile names.
 */
static long tempFileCounter = 0;

/*
 * Generate a tempfile name that should be unique within the current
 * database instance.
 */
snprintf(tempfilepath, sizeof(tempfilepath), "%s/%s%d.%ld",
		 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, tempFileCounter++);

--
With Regards,
Ashutosh Sharma.
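[For what it's worth, a sketch of what the proposed change would look like, mirroring the snippet above - an illustration only, not a patch.]

static unsigned long tempFileCounter = 0;

	snprintf(tempfilepath, sizeof(tempfilepath), "%s/%s%d.%lu",
			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid,
			 tempFileCounter++);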
Re: race condition in pg_class
Smolkin Grigory writes: > We are running PG13.10 and recently we have encountered what appears to be > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and > some other catalog-writer, possibly ANALYZE. > The problem is that after successfully creating index on relation (which > previosly didnt have any indexes), its pg_class.relhasindex remains set to > "false", which is illegal, I think. > Index was built using the following statement: > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id); ALTER TABLE ADD CONSTRAINT would certainly have taken AccessExclusiveLock on the "example" table, which should be sufficient to prevent anything else from touching its pg_class row. The only mechanism I can think of that might bypass that is a manual UPDATE on pg_class, which would just manipulate the row as a row without concern for associated relation-level locks. Any chance that somebody was doing something like that? regards, tom lane
Re: ResourceOwner refactoring
Hi, On 2023-10-25 15:43:36 +0300, Heikki Linnakangas wrote: > On 10/07/2023 22:14, Andres Freund wrote: > > > /* > > > - * Initially allocated size of a ResourceArray. Must be power of two > > > since > > > - * we'll use (arraysize - 1) as mask for hashing. > > > + * Size of the fixed-size array to hold most-recently remembered > > > resources. > > >*/ > > > -#define RESARRAY_INIT_SIZE 16 > > > +#define RESOWNER_ARRAY_SIZE 32 > > > > That's 512 bytes, pretty large to be searched sequentially. It's somewhat > > sad > > that the array needs to include 8 byte of ResourceOwnerFuncs... > > Yeah. Early on I considered having a RegisterResourceKind() function that > you need to call first, and then each resource kind could be assigned a > smaller ID. If we wanted to keep the old mechanism of separate arrays for > each different resource kind, that'd be the way to go. But overall this > seems simpler, and the performance is decent despite that linear scan. Realistically, on a 64bit platform, the compiler / we want 64bit alignment for ResourceElem->item, so making "kind" smaller isn't going to do much... > > Due to 'narr' being before the large 'arr', 'nhash' is always on a separate > > cacheline. I.e. the number of cachelines accessed in happy paths is higher > > than necessary, also relevant because there's more dependencies on narr, > > nhash > > than on the contents of those. > > > > I'd probably order it so that both of the large elements (arr, locks) are at > > the end. > > > > It's hard to know with changes this small, but it does seem to yield a small > > and repeatable performance benefit (pipelined pgbench -S). > > Swapped the 'hash' and 'arr' parts of the struct, so that 'arr' and 'locks' > are at the end. I don't remember what the layout was then, but now there are 10 bytes of holes on x86-64 (and I think most other 64bit architectures). struct ResourceOwnerData { ResourceOwner parent; /* 0 8 */ ResourceOwner firstchild; /* 8 8 */ ResourceOwner nextchild;/*16 8 */ const char * name; /*24 8 */ _Bool releasing;/*32 1 */ _Bool sorted; /*33 1 */ /* XXX 6 bytes hole, try to pack */ ResourceElem * hash; /*40 8 */ uint32 nhash;/*48 4 */ uint32 capacity; /*52 4 */ uint32 grow_at; /*56 4 */ uint32 narr; /*60 4 */ /* --- cacheline 1 boundary (64 bytes) --- */ ResourceElem arr[32]; /*64 512 */ /* --- cacheline 9 boundary (576 bytes) --- */ uint32 nlocks; /* 576 4 */ /* XXX 4 bytes hole, try to pack */ LOCALLOCK *locks[15];/* 584 120 */ /* size: 704, cachelines: 11, members: 14 */ /* sum members: 694, holes: 2, sum holes: 10 */ }; While somewhat annoying from a human pov, it seems like it would make sense to rearrange a bit? At least moving nlocks into the first cacheline would be good (not that we guarantee cacheline alignment rn, though it sometimes works out to be aligned). We also can make narr, nlocks uint8s. With that narrowing and reordering, we end up with 688 bytes. Not worth for most structs, but here perhaps worth doing? Moving the lock length to the start of the struct would make sense to keep things that are likely to be used in a "stalling" manner at the start of the struct / in one cacheline. 
It's not too awful to have it be in this order: struct ResourceOwnerData { ResourceOwner parent; /* 0 8 */ ResourceOwner firstchild; /* 8 8 */ ResourceOwner nextchild;/*16 8 */ const char * name; /*24 8 */ _Bool releasing;/*32 1 */ _Bool sorted; /*33 1 */ uint8 narr; /*34 1 */ uint8 nlocks; /*35 1 */ uint32 nhash;/*36 4 */ uint32 capacity; /*40 4 */ uint32 grow_at; /*44 4 */ ResourceElem arr[32]; /*48 512 */ /* --- cacheline 8 boundary (512 bytes) was 48 bytes ago --- */ ResourceElem * hash; /* 560 8 */
Does UCS_BASIC have the right CTYPE?
UCS_BASIC is defined in the standard as a collation based on comparing the code point values, and in UTF8 that is satisfied with memcmp(), so the collation locale for UCS_BASIC in Postgres is simply "C". But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In Postgres, the answer is 'á', but intuitively, one could reasonably expect the answer to be 'Á'. Reading the standard, it seems that LOWER()/UPPER() are defined in terms of the Unicode General Category (Section 4.2, " is a pair of functions..."). It is somewhat ambiguous about the case mappings, but I could guess that it means the Default Case Algorithm[1]. That seems to suggest the standard answer should be 'Á' regardless of any COLLATE clause (though I could be misreading). I'm a bit confused by that... what's the standard-compatible way to specify the locale for UPPER()/LOWER()? If there is none, then it makes sense that Postgres overloads the COLLATE clause for that purpose so that users can use a different locale if they want. But given that UCS_BASIC is a collation specified in the standard, shouldn't it have ctype behavior that's as close to the standard as possible? Regards, Jeff Davis [1] https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992
Remove dead code in pg_ctl.c
Hackers,

It looks like this code was missed in 39969e2a when exclusive backup was removed.

Regards,
-David

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 807d7023a99..4099d240e03 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -96,7 +96,6 @@ static time_t start_time;
 static char postopts_file[MAXPGPATH];
 static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
-static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
@@ -2447,7 +2446,6 @@ main(int argc, char **argv)
 	snprintf(postopts_file, MAXPGPATH, "%s/postmaster.opts", pg_data);
 	snprintf(version_file, MAXPGPATH, "%s/PG_VERSION", pg_data);
 	snprintf(pid_file, MAXPGPATH, "%s/postmaster.pid", pg_data);
-	snprintf(backup_file, MAXPGPATH, "%s/backup_label", pg_data);
 
 	/*
 	 * Set mask based on PGDATA permissions,
Re: walwriter interacts quite badly with synchronous_commit=off
Hi, On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote: > On 25/10/2023 02:09, Andres Freund wrote: > > Because of the inherent delay between the checks of > > XLogCtl->WalWriterSleeping > > and Latch->is_set, we also sometimes end up with multiple processes > > signalling > > walwriter, which can be bad, because it increases the likelihood that some > > of > > the signals may be received when we are already holding WALWriteLock, > > delaying > > its release... > > That can only happen when walwriter has just come out of "hibernation", ie. > when the system has been idle for a while. So probably not a big deal in > practice. Maybe I am missing something here - why can this only happen when hibernating? Even outside of that two backends can decide that that they need to wake up walwriter? We could prevent that, by updating state when requesting walwriter to be woken up. But with the changes we're discussing below, that should be rare. > > I think the most important optimization we need is to have > > XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed > > WAL. Unless walsender is hibernating, walsender will wake up on its own > > after > > wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this. > > > > E.g. a condition like > > if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * > > XLOG_BLCKSZ) > > return; > > drastically cuts down on the amount of wakeups, without - I think - loosing > > guarantees around synchronous_commit=off. > > In the patch, you actually did: > > > + if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * > > XLOG_BLCKSZ) > > + return; > > It means that you never wake up the walwriter to merely *write* the WAL. You > only wake it up if it's going to also fsync() it. I think that's correct and > appropriate, but it took me a while to reach that conclusion: Yea, after writing the email I got worried that just looking at Write would perhaps lead to not flushing data soon enough... > It might be beneficial to wake up the walwriter just to perform a write(), > to offload that work from the backend. And walwriter will actually also > perform an fsync() after finishing the current segment, so it would make > sense to also wake it up when 'asyncXactLSN' crosses a segment boundary. > However, if those extra wakeups make sense, they would also make sense when > there are no asynchronous commits involved. Therefore those extra wakeups > should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer(). > The condition you have in the patch is appropriate for > XLogSetAsyncXactLSN(). Yea. I agree we should wake up walsender in other situations too... > Another reason to write the WAL aggressively, even if you don't flush it, > would be to reduce the number of lost transactions on a segfault. But we > don't give any guarantees on that, and even with the current aggressive > logic, we only write when a page is full so you're anyway going to lose the > last partial page. Wal writer does end up writing the trailing partially filled page during the next wal_writer_delay cycle. > It also took me a while to convince myself that this calculation matches the > calculation that XLogBackgroundFlush() uses to determine whether it needs to > flush or not. XLogBackgroundFlush() first divides the request and result > with XLOG_BLCKSZ and then compares the number of blocks, whereas here you > perform the calculation in bytes. 
I think the result is the same, but to > make it more clear, let's do it the same way in both places. > > See attached. It's the same logic as in your patch, just formulatd more > clearly IMHO. Yep, makes sense! > > Because of the frequent wakeups, we do something else that's not smart: We > > write out 8k blocks individually, many times a second. Often thousands of > > 8k pwrites a second. > > Even with this patch, when I bumped up wal_writer_delay to 2 so that the wal > writer gets woken up by the async commits rather than the timeout, the write > pattern is a bit silly: > > $ strace -p 1099926 # walwriter > strace: Process 1099926 attached > epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, > u64=94261056289248}}], 1, 1991) = 1 > read(3, > "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 128 > pwrite64(5, > "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., > 1007616, 49152) = 1007616 > fdatasync(5)= 0 > pwrite64(5, "\24\321\5\0\1\0\0\0\0 > \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) > = 16384 > epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, > u64=94261056289248}}], 1, 2000) = 1 > read(3, > "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 128 > pwrite64(5, > "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., > 1040384, 1073152) =
Re: Custom tstzrange with importance factored in
On Fri, 2023-10-06 at 16:55 +0300, Rares Pop (Treelet) wrote:
> I essentially want to be able to aggregate multiple tstzranges - each
> range with its own importance. The aggregation would be like a
> join/intersect where ranges with higher importance override the ones
> with lower importance.

It may be possible without a new data type. Can you describe the semantics more precisely?

Regards,
	Jeff Davis
Re: Should we represent temp files as unsigned long int instead of signed long int type?
Ashutosh Sharma writes: > At present, we represent temp files as a signed long int number. And > depending on the system architecture (32 bit or 64 bit), the range of > signed long int varies, for example on a 32-bit system it will range > from -2,147,483,648 to 2,147,483,647. AFAIU, this will not allow a > session to create more than 2 billion temporary files and that is not > a small number at all, but still what if we make it an unsigned long > int which will allow a session to create 4 billion temporary files if > needed. AFAIK, nothing particularly awful will happen if that counter wraps around. Perhaps if you gamed the system really hard, you could cause a collision with a still-extant temp file from the previous cycle, but I seriously doubt that could happen by accident. So I don't think there's anything to worry about here. Maybe we could make that filename pattern %lu not %ld, but minus sign is a perfectly acceptable filename character, so such a change would be cosmetic. regards, tom lane
Re: Should we represent temp files as unsigned long int instead of signed long int type?
On Wed, Oct 25, 2023 at 1:28 PM Ashutosh Sharma wrote: > At present, we represent temp files as a signed long int number. And > depending on the system architecture (32 bit or 64 bit), the range of > signed long int varies, for example on a 32-bit system it will range > from -2,147,483,648 to 2,147,483,647. AFAIU, this will not allow a > session to create more than 2 billion temporary files and that is not > a small number at all, but still what if we make it an unsigned long > int which will allow a session to create 4 billion temporary files if > needed. I might be sounding a little stupid here because 2 billion > temporary files is like 2000 peta bytes (2 billion * 1GB), considering > each temp file is 1GB in size which is not a small data size at all, > it is a huge amount of data storage. However, since the variable we > use to name temporary files is a static long int (static long > tempFileCounter = 0;), there is a possibility that this number will > get exhausted soon if the same session is trying to create too many > temp files via multiple queries. I think we use signed integer types in a bunch of places where an unsigned integer type would be straight-up better, and this is one of them. I don't know whether it really matters, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: trying again to get incremental backup
On 2023-10-25 We 11:24, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan wrote:
>> I'm not too worried about the maintenance burden.
>>
>> That said, I agree that JSON might not be the best format for backup
>> manifests, but maybe that ship has sailed.
>
> I think it's a decision we could walk back if we had a good enough
> reason, but it would be nicer if we didn't have to, because what we have
> right now is working. If we change it for no real reason, we might
> introduce new bugs, and at least in theory, incompatibility with
> third-party tools that parse the existing format. If you think we can
> live with the additional complexity in the JSON parsing stuff, I'd
> rather go that way.

OK, I'll go with that. It will actually be a bit less invasive than the patch I posted.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: trying again to get incremental backup
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan wrote: > OK, I'll go with that. It will actually be a bit less invasive than the > patch I posted. Why's that? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
David Rowley writes: > I've attached a patch which builds on the previous patch and relaxes > the rule that the StringInfo must be NUL-terminated. The rule is > only relaxed for StringInfos that are initialized with > initReadOnlyStringInfo. Yeah, that's probably a reasonable way to frame it. > There's also an existing confusing comment in logicalrep_read_tuple() > which seems to think we're just setting the NUL terminator to conform > to StringInfo's practises. This is misleading as the NUL is required > for LOGICALREP_COLUMN_TEXT mode as we use the type's input function > instead of the receive function. You don't have to look very hard to > find an input function that needs a NUL terminator. Right, input functions are likely to expect this. > I'm a bit less confident that the type's receive function will never > need to be NUL terminated. cstring_recv() came to mind as one I should > look at, but on looking I see it's not required as it just reads the > remaining bytes from the input StringInfo. Is it safe to assume this? I think that we can make that assumption starting with v17. Back-patching it would be hazardous perhaps; but if there's some function out there that depends on NUL termination, testing should expose it before too long. Wouldn't hurt to mention this explicitly as a possible incompatibility in the commit message. Looking over the v5 patch, I have some nits: * In logicalrep_read_tuple, s/input function require that/input functions require that/ (or fix the grammatical disagreement some other way) * In exec_bind_message, you removed the comment pointing out that we are scribbling directly on the message buffer, even though we still are. This patch does nothing to make that any safer, so I object to removing the comment. * In stringinfo.h, I'd suggest adding text more or less like this within or at the end of the "As a special case, ..." para in the first large comment block: * Also, it is caller's option whether a read-only string buffer has * a terminating '\0' or not. This depends on the intended usage. That's partially redundant with some other comments, but this para is defining the API for read-only buffers, so I think it would be good to include it here. regards, tom lane
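[As a small illustration of that API rule, a sketch of the intended caller pattern, assuming the initReadOnlyStringInfo() signature from the patch under review; the surrounding variables (data, len, typreceive, typioparam, typmod) are hypothetical.]

	StringInfoData buf;
	Datum		value;

	/* 'data'/'len' point at bytes inside a received message; no trailing '\0' required */
	initReadOnlyStringInfo(&buf, data, len);
	value = OidReceiveFunctionCall(typreceive, &buf, typioparam, typmod);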
Re: POC, WIP: OR-clause support for indexes
On Sat, Oct 14, 2023 at 6:37 PM Alexander Korotkov wrote:
> Regarding the GUC parameter, I don't see we need a limit. It's not
> yet clear whether a small number or a large number of OR clauses are
> more favorable for transformation. I propose to have just a boolean
> enable_or_transformation GUC.

That's a poor solution. So is the GUC the patch currently has (or_transform_limit). What you need is a heuristic that figures out fairly reliably whether the transformation is going to be better or worse. Or else, do the whole thing in some other way that is always same-or-better.

In general, adding GUCs makes sense when the user knows something that we can't know. For example, shared_buffers makes some sense because, even if we discovered how much memory the machine has, we can't know how much of it the user wants to devote to PostgreSQL as opposed to anything else. And track_io_timing makes sense because we can't know whether the user wants to pay the price of gathering that additional data. But GUCs are a poor way of handling cases where the real problem is that we don't know what code to write. In this case, some queries will be better with enable_or_transformation=on, and some will be better with enable_or_transformation=off. Since we don't know which will work out better, we make the user figure it out and set the GUC, possibly differently for each query. That's terrible. It's the query optimizer's whole job to figure out which transformations will speed up the query. It shouldn't turn around and punt the decision back to the user.

Notice that superficially-similar GUCs like enable_seqscan aren't really the same thing at all. That's just for developer testing and debugging. Nobody expects that you have to adjust that GUC on a production system - ever.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Remove dead code in pg_ctl.c
On Wed, Oct 25, 2023 at 02:53:31PM -0400, David Steele wrote: > It looks like this code was missed in 39969e2a when exclusive backup was > removed. Indeed. I'll plan on committing this shortly. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: walwriter interacts quite badly with synchronous_commit=off
On 25/10/2023 21:59, Andres Freund wrote:
> On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote:
>> On 25/10/2023 02:09, Andres Freund wrote:
>>> Because of the inherent delay between the checks of
>>> XLogCtl->WalWriterSleeping and Latch->is_set, we also sometimes end up
>>> with multiple processes signalling walwriter, which can be bad, because
>>> it increases the likelihood that some of the signals may be received
>>> when we are already holding WALWriteLock, delaying its release...
>>
>> That can only happen when walwriter has just come out of "hibernation",
>> ie. when the system has been idle for a while. So probably not a big
>> deal in practice.
>
> Maybe I am missing something here - why can this only happen when
> hibernating? Even outside of that, two backends can decide that they
> need to wake up walwriter?

Ah sure, multiple backends can decide to wake up walwriter at the same time. I thought you meant that the window for that was somehow wider when XLogCtl->WalWriterSleeping.

> We could prevent that, by updating state when requesting walwriter to be
> woken up. But with the changes we're discussing below, that should be
> rare.

One small easy thing we could do to reduce the redundant wakeups: only wake up walwriter if asyncXactLSN points to a different page than prevAsyncXactLSN.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: doc: a small improvement about pg_am description
On Wed, 2023-10-25 at 17:25 +0900, Yugo NAGATA wrote: > It seems to me that this description says pg_am contains only > index access methods but not table methods. I wonder it is missed > to fix this when tableam was supported and other documentation > was changed in b73c3a11963c8bb783993cfffabb09f558f86e37. Thank you for the report. That section should not refer to pg_am directly now that there's CREATE ACCESS METHOD. I committed a fix for that which also fixes the problem you describe. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Guiding principle for dropping LLVM versions?
On Wed, Oct 25, 2023 at 7:12 PM Tom Lane wrote: > Thomas Munro writes: > > 3. We exclude OSes that don't bless an LLVM release (eg macOS running > > an arbitrarily picked version), and animals running only to cover > > ancient LLVM compiled from source for coverage (Andres's sid > > menagerie). > > Seems generally reasonable. Maybe rephrase 3 as "We consider only > an OS release's default LLVM version"? Or a bit more forgivingly, > "... only LLVM versions available from the OS vendor"? Also, > what's an OS vendor? You rejected macOS which is fine, but > I think the packages available from MacPorts or Homebrew should > be considered. OK. For me the key differences are that they are independent of OS releases and time lines, they promptly add new releases, they have a wide back-catalogue of the old releases and they let the user decide which to use. So I don't think they constrain us and it makes no sense to try to apply 'end of support' logic to them. https://formulae.brew.sh/formula/llvm https://ports.macports.org/search/?q=llvm&name=on (Frustratingly, the ancient releases of clang don't actually seem to work on MacPorts at least on aarch64, and they tell you so when you try to install them.) The BSDs may be closer to macOS in that respect too, since they have ports separate from OS releases and they offer a rolling and wide range of LLVMs and generally default to a very new one, so I don't think they constrain us either. It's really Big Linux that is drawing the lines in the sand here, though (unusually) not RHEL-and-frenemies as they've opted for rolling to the latest in every minor release as Devrim explained.
Re: Remove dead code in pg_ctl.c
On Wed, Oct 25, 2023 at 03:02:01PM -0500, Nathan Bossart wrote: > On Wed, Oct 25, 2023 at 02:53:31PM -0400, David Steele wrote: >> It looks like this code was missed in 39969e2a when exclusive backup was >> removed. > > Indeed. I'll plan on committing this shortly. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Container Types
On Tue, 2022-12-20 at 10:24 +0100, Vik Fearing wrote: > Obviously there would have to be an actual type in order to store it > in > a table, but what I am most interested in here is being able to > create > them on the fly. I do not think it is feasible to create N new types > for every type like we do for arrays on the off-chance you would want > to > put it in a PERIOD for example. By "on the fly" do you mean when creating real objects, like a table? In that case it might not be so hard, because we can just create an ordinary entry in pg_type. But for this to be a complete feature, I think we need the container types to be useful when constructed within a query, too. E.g. SELECT two_things(v1, v2) FROM foo; where the result of two_things is some new type two_things_int_text which is based on the types of v1 and v2 and has never been used before. I don't think it's reasonable to create a permanent pg_type entry on the fly to answer a read-only query. But we could introduce some notion of an ephemeral in-memory pg_type entry with its own OID, and create that on the fly. One way to do that might be to reserve some of the system OID space (e.g. 15000-16000) for OIDs for temporary catalog entries, and then have some in-memory structure that holds those temporary entries. Any lookups in that range would search the in-memory structure instead of the real catalog. All of this is easier said than done, but I think it could work. We'd also need to think about how to infer types through a container type, e.g. SELECT second_thing(two_things(v1,v2)) FROM foo; should infer that the return type of second_thing() is the type of v2. To do that, perhaps pg_proc entries can include some kind of type sublanguage to do this inference, e.g. "a, b -> b" for second_thing(), or "a, b -> a" for first_thing(). Regards, Jeff Davis
Re: Partial aggregates pushdown
On Tue, Oct 24, 2023 at 12:12:41AM +, fujii.y...@df.mitsubishielectric.co.jp wrote: > Hi Mr.Momjian. > > > Fujii-san, to get this patch closer to finished, can I modify this version > > of the patch to improve some wording and post an > > updated version you can use for future changes? > Yes, I greatly appreciate your offer. > I would very much appreciate your modifications. I am almost done updating the patch, but I got stuck on how the feature is supposed to work. This documentation sentence is where I got confused: check_partial_aggregate_support (boolean) If this option is false, postgres_fdw assumes that for each built-in aggregate function, the partial aggregate function is defined on the remote server without checking the remote server version. If this option is true, during query planning, postgres_fdw connects to the remote server and checks if the remote server version is older than the local server version. If so, postgres_fdw -->assumes that for each built-in aggregate function, the partial aggregate function is not defined -->on the remote server unless the partial aggregate function and the aggregate -->function match. Otherwise postgres_fdw assumes that for each built-in aggregate function, the partial aggregate function is defined on the remote server. The default is false. What does that marked sentence mean? What is match? Are one or both of these remote? It sounds like you are checking the local aggregate against the remote partial aggregate, but I don't see any code that does this in the patch. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 18:54, Daniele Varrazzo wrote: > - connect_timeout > - multiple host, hostaddr, port > - load_balance_hosts=random > > Does this list sound complete? I think you'd also want to resolve the hostnames to IPs yourself and iterate over those one-by-one. Otherwise if the first IP returned for the hostname times out, you will never connect to the others.
Re: Improving btree performance through specializing by key shape, take 2
On Mon, Sep 25, 2023 at 9:13 AM Matthias van de Meent wrote: > I think it's fairly complete, and mostly waiting for review. > > > I don't have time to do a comprehensive (or even a fairly > > cursory) analysis of which parts of the patch are helping, and which > > are marginal or even add no value. > > It is a shame that you don't have the time to review this patch. I didn't say that. Just that I don't have the time (or more like the inclination) to do most or all of the analysis that might allow us to arrive at a commitable patch. Most of the work with something like this is the analysis of the trade-offs, not writing code. There are all kinds of trade-offs that you could make with something like this, and the prospect of doing that myself is kind of daunting. Ideally you'd have made a significant start on that at this point. > > I have long understood that you gave up on the idea of keeping the > > bounds across levels of the tree (which does make sense to me), but > > yesterday the issue became totally muddled by this high key business. > > That's why I rehashed the earlier discussion, which I had previously > > understood to be settled. > > Understood. I'll see if I can improve the wording to something that is > more clear about what the optimization entails. Cool. -- Peter Geoghegan
Re: post-recovery amcheck expectations
On Tue, Oct 24, 2023 at 8:05 PM Noah Misch wrote: > Can't it still happen if the sequence of unfortunately timed crashes causes > deletions from left to right? Take this example, expanding the one above. > Half-kill 4, crash, half-kill 3, crash, half-kill 2 in: > > * 1 > * // | \\ > * 4 <-> 3 <-> 2 <-> 1 > > (That's not to say it has ever happened outside of a test.) Hmm. Perhaps you're right. I thought that this wasn't possible in part due to the fact that you'd have to access all of these leaf pages in the same order each time, without ever passing over a previous half-dead page. But I suppose that there's nothing stopping the index tuples from being deleted from each page in an order that leaves open the possibility of something like this. (It's extremely unlikely, of course, but that wasn't ever in question.) I withdraw my suggestion about the wording from your patch. It seems committable. Thanks -- Peter Geoghegan
Re: Container Types
Hi, On 2023-10-25 15:03:04 -0700, Jeff Davis wrote: > On Tue, 2022-12-20 at 10:24 +0100, Vik Fearing wrote: > > Obviously there would have to be an actual type in order to store it > > in > > a table, but what I am most interested in here is being able to > > create > > them on the fly. I do not think it is feasible to create N new types > > for every type like we do for arrays on the off-chance you would want > > to > > put it in a PERIOD for example. > > By "on the fly" do you mean when creating real objects, like a table? > In that case it might not be so hard, because we can just create an > ordinary entry in pg_type. > > But for this to be a complete feature, I think we need the container > types to be useful when constructed within a query, too. E.g. > > SELECT two_things(v1, v2) FROM foo; > > where the result of two_things is some new type two_things_int_text > which is based on the types of v1 and v2 and has never been used > before. > > I don't think it's reasonable to create a permanent pg_type entry on > the fly to answer a read-only query. But we could introduce some notion > of an ephemeral in-memory pg_type entry with its own OID, and create > that on the fly. I don't particularly like the idea of an in-memory pg_type entry. But I'm not sure we need that anyway - we already have this problem with record types. We support both named record types (tables and explicitly created composite types) and ad-hoc ones (created if you write ROW(foo, bar) or something like that). If a record's typmod is negative, it refers to an anonymous row type, if positive it's a named typmod. We even have support for sharing such ad-hoc rowtypes across backends for parallel query... I'd look whether you can generalize that infrastructure. Greetings, Andres Freund
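For anyone less familiar with that machinery, the existing ad-hoc record path boils down to something like the following (real APIs, heavily abbreviated; shown only to make the "generalize that infrastructure" suggestion concrete):

	/* Build a row shape on the fly ... */
	TupleDesc	tupdesc = CreateTemplateTupleDesc(2);

	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "a", INT4OID, -1, 0);
	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "b", TEXTOID, -1, 0);

	/* ... and "bless" it, which assigns a positive record typmod */
	tupdesc = BlessTupleDesc(tupdesc);

	/*
	 * heap_form_tuple(tupdesc, values, isnull) now yields a tuple whose
	 * header carries RECORDOID plus that typmod, and
	 * lookup_rowtype_tupdesc(RECORDOID, typmod) recovers the shape later;
	 * a container type could conceivably piggyback on the same registry.
	 */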
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 8:36 AM Bruce Momjian wrote: > On Tue, Oct 24, 2023 at 06:45:48PM -0700, David G. Johnston wrote: > > I'd prefer to keep pointing out that the ones documented are those whose > > outputs will vary due to ordering. > > Okay, I re-added it in the attached patch, and tightened up the text. > Thanks > I think you are right that it belongs in the syntax section; we cover > ordering extensively there. We already have queries there, but not > output, so I moved the relevant examples to there and replaced the > example that had no output. > Thanks > > The same goes for the special knowledge of floating point behavior for > why > > we've chosen to document avg/sum, something that typically doesn't care > about > > order, as having an optional order by. > > The floating example seems too obscure to mention in our function docs. > I can put a sentence in the syntax docs, but is there value in > explaining that to users? How it that helpful? Example? > > Yeah, we punt on the entire concept in the data type section: "Managing these errors and how they propagate through calculations is the subject of an entire branch of mathematics and computer science and will not be discussed here," ... Also, I'm now led to believe that the relevant IEEE 754 floating point addition is indeed commutative. Given that, I am inclined to simply not add the order by clause at all to those four functions. (actually, you already got rid of the avg()s but the sum()s are still present, so just those two). David J.
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 04:14:11PM -0700, David G. Johnston wrote: > Yeah, we punt on the entire concept in the data type section: > > "Managing these errors and how they propagate through calculations is the > subject of an entire branch of mathematics and computer science and will not > be > discussed here," ... > > Also, I'm now led to believe that the relevant IEEE 754 floating point > addition > is indeed commutative. Given that, I am inclined to simply not add the order > by clause at all to those four functions. (actually, you already got rid of > the > avg()s but the sum()s are still present, so just those two). Ah, yes, sum() removed. Updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..3b49e63987 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ... aggregation. + + While all aggregates below accept an optional + ORDER BY clause (as outlined in ), the clause has only been added to + aggregates whose output is affected by ordering. + + General-Purpose Aggregate Functions @@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ... array_agg -array_agg ( anynonarray ) +array_agg ( anynonarray ORDER BY input_sort_columns ) anyarray @@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ... -array_agg ( anyarray ) +array_agg ( anyarray ORDER BY input_sort_columns ) anyarray @@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ... json_agg -json_agg ( anyelement ) +json_agg ( anyelement ORDER BY input_sort_columns ) json jsonb_agg -jsonb_agg ( anyelement ) +jsonb_agg ( anyelement ORDER BY input_sort_columns ) jsonb @@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ... json_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) json @@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ... jsonb_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) jsonb @@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ... string_agg ( value - bytea, delimiter bytea ) + bytea, delimiter bytea + ORDER BY input_sort_columns ) bytea @@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ... xmlagg -xmlagg ( xml ) +xmlagg ( xml ORDER BY input_sort_columns ) xml diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index 3ba844057f..b64733d8aa 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -1647,7 +1647,26 @@ sqrt(2) are always just expressions and cannot be output-column names or numbers. For example: -SELECT array_agg(a ORDER BY b DESC) FROM table; +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(v ORDER BY v DESC) FROM vals; + array_agg +- + {4,3,3,2,1} + +Since jsonb only keeps the last matching key, ordering +of its keys can be significant: + +WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') ) +SELECT jsonb_object_agg(k, v) FROM vals; + jsonb_object_agg + + {"key0": "1", "key1": "2"} + +WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') ) +SELECT jsonb_object_agg(k, v ORDER BY v) FROM vals; + jsonb_object_agg + + {"key0": "1", "key1": "3"} @@ -1673,6 +1692,14 @@ SELECT string_agg(a ORDER BY a, ',') FROM table; -- incorrect expressions must match regular arguments of the aggregate; that is, you cannot sort on an expression that is not included in the DISTINCT list. 
+For example: + +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(DISTINCT v) FROM vals; + array_agg +--- + {1,2,3,4} +
Re: pg_stat_statements and "IN" conditions
On Tue, Oct 17, 2023 at 10:15:41AM +0200, Dmitry Dolgov wrote: > In the current patch version I didn't add anything yet to address the > question of having more parameters to tune constants merging. The main > obstacle as I see it is that the information for that has to be > collected when jumbling various query nodes. Anything except information > about the ArrayExpr itself would have to be acquired when jumbling some > other parts of the query, not directly related to the ArrayExpr. It > seems to me this interdependency between otherwise unrelated nodes > outweigh the value it brings, and would require some more elaborate (and > more invasive for the purpose of this patch) mechanism to implement. typedef struct ArrayExpr { + pg_node_attr(custom_query_jumble) + Hmm. I am not sure that this is the best approach implementation-wise. Wouldn't it be better to invent a new pg_node_attr (these can include parameters as well!), say query_jumble_merge or query_jumble_agg_location that aggregates all the parameters of a list to be considered as a single element. To put it short, we could also apply the same property to other parts of a parsed tree, and not only an ArrayExpr's list. /* GUC parameters */ extern PGDLLIMPORT int compute_query_id; - +extern PGDLLIMPORT bool query_id_const_merge; Not much a fan of this addition as well for an in-core GUC. I would suggest pushing the GUC layer to pg_stat_statements, maintaining the computation method to use as a field of JumbleState as I suspect that this is something we should not enforce system-wide, but at extension-level instead. + /* +* Indicates the constant represents the beginning or the end of a merged +* constants interval. +*/ + boolmerged; Not sure that this is the best thing to do either. Instead of this extra boolean flag, could it be simpler if we switch LocationLen so as we track the start position and the end position of a constant in a query string, so as we'd use one LocationLen for a whole set of Const nodes in an ArrayExpr? Perhaps this could just be a refactoring piece of its own? + /* +* If the first expression is a constant, verify if the following elements +* are constants as well. If yes, the list is eligible for merging, and the +* order of magnitude need to be calculated. +*/ + if (IsA(firstExpr, Const)) + { + foreach(temp, elements) + if (!IsA(lfirst(temp), Const)) + return false; This path should be benchmarked, IMO. -- Michael signature.asc Description: PGP signature
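To spell out that LocationLen idea (a sketch only, not the structure in the tree today, which is just {location, length}): one entry would describe a span, so a whole run of Consts inside an ArrayExpr collapses into a single record instead of needing the extra boolean.

typedef struct LocationSpan
{
	int			start_location;	/* offset of the first Const in the run */
	int			end_location;	/* offset of the last Const in the run */
	int			end_length;		/* length of that last literal, or -1 */
} LocationSpan;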
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 4:22 PM Bruce Momjian wrote: > On Wed, Oct 25, 2023 at 04:14:11PM -0700, David G. Johnston wrote: > > Yeah, we punt on the entire concept in the data type section: > > > > "Managing these errors and how they propagate through calculations is the > > subject of an entire branch of mathematics and computer science and will > not be > > discussed here," ... > > > > Also, I'm now led to believe that the relevant IEEE 754 floating point > addition > > is indeed commutative. Given that, I am inclined to simply not add the > order > > by clause at all to those four functions. (actually, you already got rid > of the > > avg()s but the sum()s are still present, so just those two). > > Ah, yes, sum() removed. Updated patch attached. > > The paragraph leading into the last added example needs to be tweaked: If DISTINCT is specified within an aggregate, the data is sorted in ascending order while extracting unique values. You can add an ORDER BY clause, limited to expressions matching the regular arguments of the aggregate, to sort the output in descending order. (show existing - DISTINCT only - example here) WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) SELECT string_agg(DISTINCT v::text, ';' ORDER BY v::text DESC) FROM vals; string_agg --- 4;3;2;1 (existing note) Question: Do you know whether we for certain always sort ascending here to compute the unique values or whether if, say, there is an index on the column in descending order (or ascending and traversed backwards) that the data within the aggregate could, with an order by, be returned in descending order? If it is ascending, is that part of the SQL Standard (since it doesn't even allow an order by to give the user the ability the control the output ordering) or does the SQL Standard expect that even a random order would be fine since there are algorithms that can be used that do not involve sorting the input? It seems redundant to first say "regular arguments" then negate it in order to say "DISTINCT list". Using the positive form with "DISTINCT list" should get the point across sufficiently and succinctly. It also avoids me feeling like there should be an example of what happens when you do "sort on an expression that is not included in the DISTINCT list". Interestingly: WITH vals (v,l) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') ) SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM vals; ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list LINE 2: SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM... But both expressions in the argument list (el and semicolon) do appear in the ORDER BY... David J.
Re: Should we represent temp files as unsigned long int instead of signed long int type?
On Wed, Oct 25, 2023 at 03:07:39PM -0400, Tom Lane wrote: > AFAIK, nothing particularly awful will happen if that counter wraps > around. Perhaps if you gamed the system really hard, you could cause > a collision with a still-extant temp file from the previous cycle, > but I seriously doubt that could happen by accident. So I don't > think there's anything to worry about here. Maybe we could make > that filename pattern %lu not %ld, but minus sign is a perfectly > acceptable filename character, so such a change would be cosmetic. In the mood of removing long because it may be 4 bytes or 8 bytes depending on the environment, I'd suggest to change it to either int64 or uint64. Not that it matters much for this specific case, but that makes the code more portable. -- Michael signature.asc Description: PGP signature
Re: Should we represent temp files as unsigned long int instead of signed long int type?
Michael Paquier writes: > In the mood of removing long because it may be 4 bytes or 8 bytes > depending on the environment, I'd suggest to change it to either int64 > or uint64. Not that it matters much for this specific case, but that > makes the code more portable. Then you're going to need a not-so-portable conversion spec in the snprintf call. Not sure it's any improvement. regards, tom lane
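For reference, the tree already has a macro for exactly that: UINT64_FORMAT (and INT64_FORMAT) expand to whatever conversion spec matches the local definition of uint64. A minimal sketch, purely illustrative -- the helper name and signature are made up, and this is not the actual fd.c code:

static uint64 tempFileCounter = 0;

static void
make_temp_file_name(char *buf, size_t buflen, int pid)
{
	/*
	 * UINT64_FORMAT becomes "%lu" or "%llu" as appropriate, which is the
	 * portability wart a hard-coded %ld/%lu would reintroduce on platforms
	 * where long is 32 bits.
	 */
	snprintf(buf, buflen, "%s%d." UINT64_FORMAT,
			 PG_TEMP_FILE_PREFIX, pid, tempFileCounter++);
}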
Re: libpq async connection and multiple hosts
On Thu, 26 Oct 2023, 00:10 Jelte Fennema, wrote: > On Wed, 25 Oct 2023 at 18:54, Daniele Varrazzo > wrote: > > - connect_timeout > > - multiple host, hostaddr, port > > - load_balance_hosts=random > > > > Does this list sound complete? > > I think you'd also want to resolve the hostnames to IPs yourself and > iterate over those one-by-one. Otherwise if the first IP returned for > the hostname times out, you will never connect to the others. > For async connections we were already unpacking and processing the hosts list, in order to perform non-blocking resolution and populate the hostaddr. This already accounted for the possibility of one host resolving to more than one address. But then we would have packed everything back into a single conninfo and made a single connection attempt. https://github.com/psycopg/psycopg/blob/14740add6bb1aebf593a65245df21699daabfad5/psycopg/psycopg/conninfo.py#L278 The goal here was only non-blocking name resolution. What I now understand we should do is to split on the hosts for sync connections too, shuffle if requested, and make separate connection attempts. -- Daniele
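For what it's worth, at the libpq level the per-address loop looks roughly like the sketch below. It uses the blocking API purely to keep the example short -- psycopg would drive the same loop with PQconnectStart/PQconnectPoll -- and it would shuffle the resolved list first when load_balance_hosts=random is in play; the helper name and the hard-coded 5-second per-attempt timeout are just placeholders.

#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <libpq-fe.h>

/* Try every resolved address of "host" in turn, each with its own timeout. */
static PGconn *
connect_any_address(const char *host, const char *port, const char *dbname)
{
	struct addrinfo hints, *res, *ai;
	PGconn	   *conn = NULL;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	if (getaddrinfo(host, port, &hints, &res) != 0)
		return NULL;

	for (ai = res; ai != NULL && conn == NULL; ai = ai->ai_next)
	{
		char		addr[NI_MAXHOST];
		const char *keys[] = {"host", "hostaddr", "port", "dbname",
							  "connect_timeout", NULL};
		const char *vals[] = {host, addr, port, dbname, "5", NULL};

		if (getnameinfo(ai->ai_addr, ai->ai_addrlen, addr, sizeof(addr),
						NULL, 0, NI_NUMERICHOST) != 0)
			continue;

		/* "host" is kept for TLS/auth purposes; "hostaddr" pins the attempt */
		conn = PQconnectdbParams(keys, vals, 0);
		if (PQstatus(conn) != CONNECTION_OK)
		{
			PQfinish(conn);
			conn = NULL;		/* move on to the next address */
		}
	}

	freeaddrinfo(res);
	return conn;
}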
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 05:10:17PM -0700, David G. Johnston wrote: > The paragraph leading into the last added example needs to be tweaked: > > If DISTINCT is specified within an aggregate, the data is sorted in ascending > order while extracting unique values. You can add an ORDER BY clause, limited > to expressions matching the regular arguments of the aggregate, to sort the > output in descending order. > > (show existing - DISTINCT only - example here) > > > WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) > SELECT string_agg(DISTINCT v::text, ';' ORDER BY v::text DESC) FROM vals; > string_agg > --- > 4;3;2;1 > > > (existing note) I see what you mean. I added an example that doesn't match the existing paragraph. I have rewritten the paragraph and used a relevant example; patch attached. > Question: Do you know whether we for certain always sort ascending here to > compute the unique values or whether if, say, there is an index on the column > in descending order (or ascending and traversed backwards) that the data > within > the aggregate could, with an order by, be returned in descending order? If it > is ascending, is that part of the SQL Standard (since it doesn't even allow an > order by to give the user the ability the control the output ordering) or does > the SQL Standard expect that even a random order would be fine since there are > algorithms that can be used that do not involve sorting the input? I don't think order is ever guaranteed in the standard without an ORDER BY. > It seems redundant to first say "regular arguments" then negate it in order to > say "DISTINCT list". Using the positive form with "DISTINCT list" should get > the point across sufficiently and succinctly. It also avoids me feeling like > there should be an example of what happens when you do "sort on an expression > that is not included in the DISTINCT list". Agreed, I rewrote that. > Interestingly: > > WITH vals (v,l) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') ) > SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM vals; > > ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in > argument list > LINE 2: SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM... > > But both expressions in the argument list (el and semicolon) do appear in the > ORDER BY... I think ORDER BY has to match DISTINCT columns, while you are using ';'. I used a simpler example with array_agg() in my patch to avoid the issue. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..3b49e63987 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ... aggregation. + + While all aggregates below accept an optional + ORDER BY clause (as outlined in ), the clause has only been added to + aggregates whose output is affected by ordering. + + General-Purpose Aggregate Functions @@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ... array_agg -array_agg ( anynonarray ) +array_agg ( anynonarray ORDER BY input_sort_columns ) anyarray @@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ... -array_agg ( anyarray ) +array_agg ( anyarray ORDER BY input_sort_columns ) anyarray @@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ... 
json_agg -json_agg ( anyelement ) +json_agg ( anyelement ORDER BY input_sort_columns ) json jsonb_agg -jsonb_agg ( anyelement ) +jsonb_agg ( anyelement ORDER BY input_sort_columns ) jsonb @@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ... json_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) json @@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ... jsonb_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) jsonb @@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ... string_agg ( value - bytea, delimiter bytea ) + bytea, delimiter bytea + ORDER BY input_sort_columns ) bytea @@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ... xmlagg -xmlagg ( xml ) +xmlagg ( xml ORDER BY input_sort_columns ) xml diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index 3ba844057f..c5627001c7 100644 --- a/do
Re: Introduce a new view for checkpointer related stats
On Mon, Feb 13, 2023 at 11:31:03AM +0530, Bharath Rupireddy wrote: > Needed a rebase. Please review the attached v8 patch set. I was looking at this patch, and got a few comments. FWIW, I kind of agree with the feeling of Bertrand upthread that using "checkpoint_" in the attribute names for the new view is globally inconsistent. After 0003, we get: =# select attname from pg_attribute where attrelid = 'pg_stat_checkpointer'::regclass and attnum > 0; attname - timed_checkpoints requested_checkpoints checkpoint_write_time checkpoint_sync_time buffers_written_checkpoints stats_reset (6 rows) =# select attname from pg_attribute where attrelid = 'pg_stat_bgwriter'::regclass and attnum > 0; attname -- buffers_clean maxwritten_clean buffers_alloc stats_reset (4 rows) The view for the bgwriter does not do that. I'd suggest to use functions that are named as pg_stat_get_checkpoint_$att with shorter $atts. It is true that "timed" is a bit confusing, because it refers to a number of checkpoints, and that can be read as a time value (?). So how about num_timed? And for the others num_requested and buffers_written? + * Unlike the checkpoint fields, reqquests related fields are protected by s/reqquests/requests/. SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path) { + SlruShared shared = ctl->shared; int fd; int save_errno; int result; + /* update the stats counter of flushes */ + pgstat_count_slru_flush(shared->slru_stats_idx); Why is that in 0002? Isn't that something we should treat as a bug fix of its own, even backpatching it to make sure that the flush requests for individual commit_ts, multixact and clog files are counted in the stats? Saying that, I am OK with moving ahead with 0001 and 0002 to remove buffers_backend_fsync and buffers_backend from pg_stat_bgwriter, but it is better to merge them into a single commit. It helps with 0003 and this would encourage the use of pg_stat_io that does a better job overall with more field granularity. -- Michael signature.asc Description: PGP signature
Re: Document aggregate functions better w.r.t. ORDER BY
On Thu, 26 Oct 2023 at 13:10, David G. Johnston wrote: > Question: Do you know whether we for certain always sort ascending here to > compute the unique values or whether if, say, there is an index on the column > in descending order (or ascending and traversed backwards) that the data > within the aggregate could, with an order by, be returned in descending order? The way it's currently coded, we seem to always require ascending order. See addTargetToGroupList(). The call to get_sort_group_operators() only requests the ltOpr. A quick test creating an index on a column with DESC shows that we end up doing a backwards index scan so that we get the requested ascending order: create table b (b text); create index on b (b desc); explain select string_agg(distinct b,',') from b; QUERY PLAN -- Aggregate (cost=67.95..67.97 rows=1 width=32) -> Index Only Scan Backward using b_b_idx on b (cost=0.15..64.55 rows=1360 width=32) (2 rows) However, I think we'd best stay clear of offering any guarantees in the documents about this. If we did that it would be much harder in the future if we wanted to implement the DISTINCT aggregates by hashing. David
Re: remaining sql/json patches
Hi Nikita, On Thu, Oct 26, 2023 at 2:13 AM Nikita Malakhov wrote: > Amit, on previous email, patch #2 - I agree that it is not the best idea to > introduce > new type of logic into the parser, so this logic could be moved to the > executor, > or removed at all. What do you think of these options? Yes maybe, though I'd first like to have a good answer to why that logic is necessary at all. Maybe you think it's better to emit an error in the SQL/JSON layer of code than in the type input function if it's unsafe? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: Remove dead code in pg_ctl.c
On 10/25/23 17:30, Nathan Bossart wrote: On Wed, Oct 25, 2023 at 03:02:01PM -0500, Nathan Bossart wrote: On Wed, Oct 25, 2023 at 02:53:31PM -0400, David Steele wrote: It looks like this code was missed in 39969e2a when exclusive backup was removed. Indeed. I'll plan on committing this shortly. Committed. Thank you, Nathan! -David
Re: Add null termination to string received in parallel apply worker
On Wed, 11 Oct 2023 at 19:54, Zhijie Hou (Fujitsu) wrote: > The parallel apply worker didn't add null termination to the string received > from the leader apply worker via the shared memory queue. This action doesn't > bring bugs as it's binary data but violates the rule established in > StringInfo, > which guarantees the presence of a terminating '\0' at the end of the string. Just for anyone not following the other thread that you linked, I just pushed f0efa5aec, and, providing that sticks, I think we can drop this discussion now. That commit relaxes the requirement that the StringInfo must be NUL terminated. David
Re: Is this a problem in GenericXLogFinish()?
Hello hackers, 24.10.2023 03:45, Jeff Davis wrote: Committed. I've encountered a case with a hash index, when an assert added by the commit fails: CREATE TABLE t (i INT); CREATE INDEX hi ON t USING HASH (i); INSERT INTO t SELECT 1 FROM generate_series(1, 1000); BEGIN; INSERT INTO t SELECT 1 FROM generate_series(1, 1000); ROLLBACK; CHECKPOINT; VACUUM t; Leads to: Core was generated by `postgres: law regression [local] VACUUM '. Program terminated with signal SIGABRT, Aborted. warning: Section `.reg-xstate/211944' in core file too small. #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44 44 ./nptl/pthread_kill.c: No such file or directory. (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=139924569388864, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x7f42b99ed476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x7f42b99d37f3 in __GI_abort () at ./stdlib/abort.c:79 #5 0x55f128e83f1b in ExceptionalCondition (conditionName=0x55f128f33520 "BufferIsExclusiveLocked(buffer) && BufferIsDirty(buffer)", fileName=0x55f128f333a8 "xloginsert.c", lineNumber=262) at assert.c:66 #6 0x55f1287edce3 in XLogRegisterBuffer (block_id=1 '\001', buffer=1808, flags=8 '\b') at xloginsert.c:262 #7 0x55f128742833 in _hash_freeovflpage (rel=0x7f42adb95c88, bucketbuf=1808, ovflbuf=1825, wbuf=1808, itups=0x7ffc3f18c010, itup_offsets=0x7ffc3f18bce0, tups_size=0x7ffc3f18ccd0, nitups=0, bstrategy=0x55f12a562820) at hashovfl.c:671 #8 0x55f128743567 in _hash_squeezebucket (rel=0x7f42adb95c88, bucket=6, bucket_blkno=7, bucket_buf=1808, bstrategy=0x55f12a562820) at hashovfl.c:1064 #9 0x55f12873ca2a in hashbucketcleanup (rel=0x7f42adb95c88, cur_bucket=6, bucket_buf=1808, bucket_blkno=7, bstrategy=0x55f12a562820, maxbucket=7, highmask=15, lowmask=7, tuples_removed=0x7ffc3f18fb28, num_index_tuples=0x7ffc3f18fb30, split_cleanup=false, callback=0x55f1289ba1de , callback_state=0x55f12a566778) at hash.c:921 #10 0x55f12873bfcf in hashbulkdelete (info=0x7ffc3f18fc30, stats=0x0, callback=0x55f1289ba1de , callback_state=0x55f12a566778) at hash.c:549 #11 0x55f128776fbb in index_bulk_delete (info=0x7ffc3f18fc30, istat=0x0, callback=0x55f1289ba1de , callback_state=0x55f12a566778) at indexam.c:709 #12 0x55f1289ba03d in vac_bulkdel_one_index (ivinfo=0x7ffc3f18fc30, istat=0x0, dead_items=0x55f12a566778) at vacuum.c:2480 #13 0x55f128771286 in lazy_vacuum_one_index (indrel=0x7f42adb95c88, istat=0x0, reltuples=-1, vacrel=0x55f12a4b9c30) at vacuumlazy.c:2768 #14 0x55f1287704a3 in lazy_vacuum_all_indexes (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2338 #15 0x55f128770275 in lazy_vacuum (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2256 ... It looks like the buffer is not dirty in the problematic call. Best regards, Alexander
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
On Thu, 26 Oct 2023 at 08:43, Tom Lane wrote: > I think that we can make that assumption starting with v17. > Back-patching it would be hazardous perhaps; but if there's some > function out there that depends on NUL termination, testing should > expose it before too long. Wouldn't hurt to mention this explicitly > as a possible incompatibility in the commit message. > > Looking over the v5 patch, I have some nits: Thanks for looking at this again. I fixed up each of those and pushed the result, mentioning the incompatibility in the commit message. Now that that's done, I've attached a patch which makes use of the new initReadOnlyStringInfo initializer function for the original case mentioned when I opened this thread. I don't think there are any remaining objections to this, but I'll let it sit for a bit to see. David diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index c831a9395c..57bd7e82bc 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -723,12 +723,11 @@ array_agg_deserialize(PG_FUNCTION_ARGS) sstate = PG_GETARG_BYTEA_PP(0); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. */ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); /* element_type */ element_type = pq_getmsgint(&buf, 4); @@ -815,7 +814,6 @@ array_agg_deserialize(PG_FUNCTION_ARGS) } pq_getmsgend(&buf); - pfree(buf.data); PG_RETURN_POINTER(result); } @@ -1124,12 +1122,11 @@ array_agg_array_deserialize(PG_FUNCTION_ARGS) sstate = PG_GETARG_BYTEA_PP(0); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. */ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); /* element_type */ element_type = pq_getmsgint(&buf, 4); @@ -1187,7 +1184,6 @@ array_agg_array_deserialize(PG_FUNCTION_ARGS) memcpy(result->lbs, temp, sizeof(result->lbs)); pq_getmsgend(&buf); - pfree(buf.data); PG_RETURN_POINTER(result); } diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index 3c3184f15b..bf61fd7dbc 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -5190,12 +5190,11 @@ numeric_avg_deserialize(PG_FUNCTION_ARGS) init_var(&tmp_var); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. 
*/ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); result = makeNumericAggStateCurrentContext(false); @@ -5222,7 +5221,6 @@ numeric_avg_deserialize(PG_FUNCTION_ARGS) result->nInfcount = pq_getmsgint64(&buf); pq_getmsgend(&buf); - pfree(buf.data); free_var(&tmp_var); @@ -5306,12 +5304,11 @@ numeric_deserialize(PG_FUNCTION_ARGS) init_var(&tmp_var); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. */ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); result = makeNumericAggStateCurrentContext(false); @@ -5342,7 +5339,6 @@ numeric_deserialize(PG_FUNCTION_ARGS) result->nInfcount = pq_getmsgint64(&buf); pq_getmsgend(&buf); -
Re: A performance issue with Memoize
On 20/10/2023 17:40, Richard Guo wrote: I noticed $subject with the query below. set enable_memoize to off; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN Nested Loop Left Join (actual time=0.050..59578.053 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.027..2.703 rows=1 loops=1) -> Subquery Scan on s (actual time=0.004..4.819 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.002..3.834 rows=1 loops=1) Planning Time: 0.666 ms Execution Time: 60937.899 ms (8 rows) set enable_memoize to on; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN -- Nested Loop Left Join (actual time=0.061..122684.607 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.026..3.367 rows=1 loops=1) -> Memoize (actual time=0.011..9.821 rows=5000 loops=1) Cache Key: t1.two, t1.two Cache Mode: binary Hits: 0 Misses: 1 Evictions: Overflows: 0 Memory Usage: 1368kB -> Subquery Scan on s (actual time=0.008..5.188 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.004..4.081 rows=1 loops=1) Planning Time: 0.607 ms Execution Time: 124431.388 ms (12 rows) The execution time (best of 3) is 124431.388 VS 60937.899 with and without memoize. The Memoize runtime stats 'Hits: 0 Misses: 1 Evictions: ' seems suspicious to me, so I've looked into it a little bit, and found that the MemoizeState's keyparamids and its outerPlan's chgParam are always different, and that makes us have to purge the entire cache each time we rescan the memoize node. But why are they always different? Well, for the query above, we have two NestLoopParam nodes, one (with paramno 1) is created when we replace outer-relation Vars in the scan qual 't1.two = s.two', the other one (with paramno 0) is added from the subquery's subplan_params, which was created when we replaced uplevel vars with Param nodes for the subquery. That is to say, the chgParam would be {0, 1}. When it comes to replace outer-relation Vars in the memoize keys, the two 't1.two' Vars are both replaced with the NestLoopParam with paramno 1, because it is the first NLP we see in root->curOuterParams that is equal to the Vars in memoize keys. That is to say, the memoize node's keyparamids is {1}. ... Any thoughts? Do you've thought about the case, fixed with the commit 1db5667? As I see, that bugfix still isn't covered by regression tests. Could your approach of a PARAM_EXEC slot reusing break that case? -- regards, Andrei Lepikhov Postgres Professional
Re: Open a streamed block for transactional messages during decoding
On Tue, Oct 24, 2023 at 5:27 PM Zhijie Hou (Fujitsu) wrote: > > While reviewing the test_decoding code, I noticed that when skip_empty_xacts > option is specified, it doesn't open the streaming block( e.g. > pg_output_stream_start) before streaming the transactional MESSAGE even if > it's > the first change in a streaming block. > > It looks inconsistent with what we do when streaming DML > changes(e.g. pg_decode_stream_change()). > > Here is a small patch to open the stream block in this case. > The change looks good to me though I haven't tested it yet. BTW, can we change the comment: "Output stream start if we haven't yet, but only for the transactional case." to "Output stream start if we haven't yet for transactional messages"? I think we should backpatch this fix. What do you think? -- With Regards, Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Thu, 2023-10-26 at 07:00 +0300, Alexander Lakhin wrote: > It looks like the buffer is not dirty in the problematic call. Thank you for the report! I was able to reproduce and observe that the buffer is not marked dirty. The call (hashovfl.c:671): XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD) is followed unconditionally by: PageSetLSN(BufferGetPage(wbuf), recptr) so if the Assert were not there, it would be setting the LSN on a page that's not marked dirty. Therefore I think this is a bug, or at least an interesting/unexpected case. Perhaps the registration and the PageSetLSN don't need to happen when nitups==0? Regards, Jeff Davis
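For illustration, the guard would sit around the two calls quoted above, roughly like the fragment below -- emphatically a sketch, not a proposed fix, since the hash_xlog redo side would have to tolerate the block being absent from the record:

	/* in _hash_freeovflpage(), sketch only */
	if (nitups > 0)
		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);

	/* ... register the remaining buffers and data as today ... */

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

	if (nitups > 0)
		PageSetLSN(BufferGetPage(wbuf), recptr);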
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 7:13 PM David Rowley wrote: > On Thu, 26 Oct 2023 at 13:10, David G. Johnston > wrote: > > Question: Do you know whether we for certain always sort ascending here > to compute the unique values or whether if, say, there is an index on the > column in descending order (or ascending and traversed backwards) that the > data within the aggregate could, with an order by, be returned in > descending order? > > The way it's currently coded, we seem to always require ascending > order. See addTargetToGroupList(). The call to > get_sort_group_operators() only requests the ltOpr. > > A quick test creating an index on a column with DESC shows that we end > up doing a backwards index scan so that we get the requested ascending > order: > > create table b (b text); > create index on b (b desc); > explain select string_agg(distinct b,',') from b; > QUERY PLAN > > -- > Aggregate (cost=67.95..67.97 rows=1 width=32) >-> Index Only Scan Backward using b_b_idx on b (cost=0.15..64.55 > rows=1360 width=32) > (2 rows) > > However, I think we'd best stay clear of offering any guarantees in > the documents about this. If we did that it would be much harder in > the future if we wanted to implement the DISTINCT aggregates by > hashing. > > So, I think we are mischaracterizing the Standard here, if only in the specific case of array_agg. SQL Standard: 4.16.4 Every unary aggregate function takes an arbitrary as the argument; most unary aggregate functions can optionally be qualified with either DISTINCT or ALL. If ARRAY_AGG is specified, then an array value with one element formed from the evaluated for each row that qualifies. Neither DISTINCT nor ALL are allowed to be specified for VAR_POP, VAR_SAMP, STDDEV_POP, or STDDEV_SAMP; redundant duplicates are not removed when computing these functions. 10.9 ::= ARRAY_AGG [ ORDER BY ] I would reword the existing note to be something like: The SQL Standard defines specific aggregates and their properties, including which of DISTINCT and/or ORDER BY is allowed. Due to the extensible nature of PostgreSQL it accepts either or both clauses for any aggregate. >From the most recent patch: -If DISTINCT is specified in addition to an -order_by_clause, then all the ORDER BY -expressions must match regular arguments of the aggregate; that is, -you cannot sort on an expression that is not included in the -DISTINCT list. +If DISTINCT is specified with an +order_by_clause, ORDER +BY expressions can only reference columns in the +DISTINCT list. For example: + +WITH vals (v1, v2) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') ) +SELECT array_agg(DISTINCT v2 ORDER BY v2 DESC) FROM vals; + array_agg +- + {Z,T,R,D,A} + The change to a two-column vals was mostly to try and find corner-cases that might need to be addressed. If we don't intend to show the error case of DISTINCT v1 ORDER BY v2 then we should go back to the original example and just add ORDER BY v DESC. I'm fine with not using string_agg here. +For example: + +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(DISTINCT v ORDER BY v DESC) FROM vals; + array_agg +--- + {4,3,2,1} + We get enough complaints regarding "apparent ordering" that I would like to add: As a reminder, while some DISTINCT processing algorithms produce sorted output as a side-effect, only by specifying ORDER BY is the output order guaranteed. David J.
Re: Container Types
On Wed, 2023-10-25 at 16:01 -0700, Andres Freund wrote: > I'd look whether you can generalize that infrastructure. I had briefly looked at using the record type mechanism before, and it seemed like a challenge because it doesn't really work when passing through a function call: CREATE TABLE t(a INT, b TEXT); INSERT INTO t VALUES(1, 'one'); CREATE FUNCTION id(RECORD) RETURNS RECORD LANGUAGE plpgsql AS $$ BEGIN RETURN $1; END; $$; SELECT t.a FROM t; -- 1 SELECT (id(t)).a FROM t; -- ERROR But now that I think about it, that's really a type inference limitation, and that needs to be solved regardless. After the type inference figures out what the right type is, then I think you're right that an OID is not required to track it, and however we do track it should be able to reuse some of the existing infrastructure for dealing with record types. Regards, Jeff Davis
Re: Show WAL write and fsync stats in pg_stat_io
On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote: > Any kind of feedback would be appreciated. This was registered in the CF, so I have given it a look. Note that 0001 has a conflict with pgstat_count_io_op_time(), so it cannot be applied. +pgstat_should_track_io_time(IOObject io_object, IOContext io_context) +{ + /* +* io times of IOOBJECT_WAL IOObject needs to be tracked when +* 'track_wal_io_timing' is set regardless of 'track_io_timing'. +*/ + if (io_object == IOOBJECT_WAL) + return track_wal_io_timing; + + return track_io_timing; I can see the temptation to do that, but I have mixed feelings about the approach of mixing two GUCs in a code path dedicated to pg_stat_io where now we only rely on track_io_timing. The result brings confusion, while making pg_stat_io, which is itself only used for block-based operations, harder to read. The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF) is quite tempting, actually, creating a neat separation between the existing pg_stat_io and pg_stat_wal (not a SRF), with a third view that provides more details about the contexts and backend types for the WAL stats with its relevant fields: https://www.postgresql.org/message-id/caakru_bm55pj3pprw0nd_-pawhlrkou69r816aeztbba-n1...@mail.gmail.com And perhaps just putting that everything that calls pgstat_count_io_op_time() under track_io_timing is just natural? What's the performance regression you would expect if both WAL and block I/O are controlled by that, still one would expect only one of them? On top of that pg_stat_io is now for block-based I/O operations, so that does not fit entirely in the picture, though I guess that Melanie has thought more on the matter than me. That may be also a matter of taste. + /* Report pending statistics to the cumulative stats system */ + pgstat_report_wal(false); This is hidden in 0001, still would be better if handled as a patch on its own and optionally backpatch it as we did for the bgwriter with e64c733bb1? Side note: I think that we should spend more efforts in documenting what IOContext and IOOp mean. Not something directly related to this patch, still this patch or things similar make it a bit harder which part of it is used for what by reading pgstat.h. -- Michael signature.asc Description: PGP signature
Re: RFC: Logging plan of the running query
On 2023-10-25 12:40, Ashutosh Bapat wrote: On Wed, Oct 18, 2023 at 10:04 PM James Coleman wrote: While that decision as regards auto_explain has long since been made (and there would be work to undo it), I don't believe that we should repeat that choice here. I'm -10 on moving this into auto_explain. Right. However perhaps there is still an opportunity for moving some of the auto_explain code into core so as to enable deduplicating the code. Right. That's what I also think. Thanks for your comments. Attached patch adds a new function which assembles es->str for logging according to specified contents and format. This is called from both auto_explain and pg_log_query_plan(). On 2023-10-11 16:22, Ashutosh Bapat wrote: I am also wondering whether it's better to report the WARNING as status column in the output. Attached patch left as it was since the inconsistency with pg_terminate_backend() and pg_log_backend_memory_contexts() as you pointed out. -- Regards, -- Atsushi Torikoshi NTT DATA Group CorporationFrom 336ba3943f631dcbc0d1226ebd0a7675cf78c1f8 Mon Sep 17 00:00:00 2001 From: Atsushi Torikoshi Date: Thu, 26 Oct 2023 15:42:36 +0900 Subject: [PATCH v32] Add function to log the plan of the query Currently, we have to wait for the query execution to finish to check its plan. This is not so convenient when investigating long-running queries on production environments where we cannot use debuggers. To improve this situation, this patch adds pg_log_query_plan() function that requests to log the plan of the specified backend process. By default, only superusers are allowed to request to log the plans because allowing any users to issue this request at an unbounded rate would cause lots of log messages and which can lead to denial of service. On receipt of the request, at the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at LOG_SERVER_ONLY level, so that these plans will appear in the server log but not be sent to the client. 
Reviewed-by: Bharath Rupireddy, Fujii Masao, Dilip Kumar, Masahiro Ikeda, Ekaterina Sokolova, Justin Pryzby, Kyotaro Horiguchi, Robert Treat, Alena Rybakina, Ashutosh Bapat Co-authored-by: James Coleman --- contrib/auto_explain/auto_explain.c | 23 +- doc/src/sgml/func.sgml | 49 + src/backend/access/transam/xact.c| 13 ++ src/backend/catalog/system_functions.sql | 2 + src/backend/commands/explain.c | 209 ++- src/backend/executor/execMain.c | 14 ++ src/backend/storage/ipc/procsignal.c | 4 + src/backend/storage/lmgr/lock.c | 9 +- src/backend/tcop/postgres.c | 4 + src/backend/utils/error/elog.c | 2 + src/backend/utils/init/globals.c | 2 + src/include/catalog/pg_proc.dat | 6 + src/include/commands/explain.h | 8 + src/include/miscadmin.h | 1 + src/include/storage/lock.h | 2 - src/include/storage/procsignal.h | 1 + src/include/tcop/pquery.h| 2 +- src/include/utils/elog.h | 1 + src/test/regress/expected/misc_functions.out | 54 - src/test/regress/sql/misc_functions.sql | 41 +++- 20 files changed, 399 insertions(+), 48 deletions(-) diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index c3ac27ae99..20a73df8c4 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -399,26 +399,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc) es->format = auto_explain_log_format; es->settings = auto_explain_log_settings; - ExplainBeginOutput(es); - ExplainQueryText(es, queryDesc); - ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length); - ExplainPrintPlan(es, queryDesc); - if (es->analyze && auto_explain_log_triggers) -ExplainPrintTriggers(es, queryDesc); - if (es->costs) -ExplainPrintJITSummary(es, queryDesc); - ExplainEndOutput(es); - - /* Remove last line break */ - if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n') -es->str->data[--es->str->len] = '\0'; - - /* Fix JSON to output an object */ - if (auto_explain_log_format == EXPLAIN_FORMAT_JSON) - { -es->str->data[0] = '{'; -es->str->data[es->str->len - 1] = '}'; - } + ExplainAssembleLogOutput(es, queryDesc, auto_explain_log_format, + auto_explain_log_triggers, + auto_explain_log_parameter_max_length); /* * Note: we rely on the existing logging of context or diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..8d77fe1a5b 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26406,6 +26406,25 @@ SELECT collation for ('foo' COLLATE "de_DE"); + + + + pg_log_query_plan + +pg_log_query_plan