Re: Simplify create_merge_append_path a bit for clarity
On Tue, Oct 24, 2023 at 6:00 PM Alena Rybakina wrote: > I agree with you, and we can indeed directly set the param_info value to > NULL, and there are enough comments here to explain. > > I didn't find anything else to add in your patch. Thanks for reviewing this patch! Thanks Richard
Re: Bug: RLS policy FOR SELECT is used to check new rows
On Tue, 2023-10-24 at 15:05 -0400, Stephen Frost wrote: > On Tue, Oct 24, 2023 at 14:42 Robert Haas wrote: > > On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis wrote: > > > Perhaps the idea is that if there are constraints involved, the failure > > > or success of an INSERT/UPDATE/DELETE could leak information that you > > > don't have privileges to read. > > > > My recollection of this topic is pretty hazy, but like Tom, I seem to > > remember it being intentional, and I think the reason had something to > > do with wanting the slice of a RLS-protect table that you can see to > > feel like a complete table. When you update a row in a table all of > > which is visible to you, the updated row can never vanish as a result > > of that update, so it was thought, if I remember correctly, that this > > should also be true here. It's also similar to what happens if an > > updatable view has WITH CHECK OPTION, and I think that was part of the > > precedent as well. I don't know whether or not the constraint issue > > that you mention here was also part of the concern, but it may have > > been. This was all quite a while ago... > > Yes, having it be similar to a view WITH CHECK OPTION was intentional, > also on not wishing for things to be able to disappear or to not get saved. > The risk of a constraint possibly causing the leak of information is better > than either having data just thrown away or having the constraint not > provide the guarantee it’s supposed to … Thanks everybody for looking and remembering. I can accept that the error is intentional, even though it violated the POLA for me. I can buy into the argument that an UPDATE should not make a row seem to vanish. I cannot buy into the constraint argument. If the table owner wanted to prevent you from causing a constraint violation error with a row you cannot see, she wouldn't have given you a FOR UPDATE policy that allows you to perform such an UPDATE. Anyway, it is probably too late to change a behavior that has been like that for a while and is not manifestly buggy. Yours, Laurenz Albe
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Amit, Based on your advice, I revised the patch again. > > > > I spent some time on the v57 patch and it looks good to me - tests are > > passing, no complaints from pgindent and pgperltidy. I turned the CF > > entry https://commitfest.postgresql.org/45/4273/ to RfC. > > > > Thanks, the patch looks mostly good to me but I am not convinced of > keeping the tests across versions in this form. I don't think they are > tested in BF, only one can manually create a setup to test. I analyzed and agreed that current BF client does not use TAP test framework for cross-version checks. > Shall we > remove it for now and then consider it separately? OK, some parts for cross-checks were removed. > Apart from that, I have made minor modifications in the docs to adjust > the order of various prerequisites. Thanks, included. Best Regards, Hayato Kuroda FUJITSU LIMITED v59-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch Description: v59-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Oct 25, 2023 at 11:39 AM Amit Kapila wrote: > > On Tue, Oct 24, 2023 at 1:20 PM Bharath Rupireddy > wrote: > > > > > > I spent some time on the v57 patch and it looks good to me - tests are > > passing, no complaints from pgindent and pgperltidy. I turned the CF > > entry https://commitfest.postgresql.org/45/4273/ to RfC. > > > > Thanks, the patch looks mostly good to me but I am not convinced of > keeping the tests across versions in this form. I don't think they are > tested in BF, only one can manually create a setup to test. Shall we > remove it for now and then consider it separately? I think we can retain the test_upgrade_from_pre_PG17 because it is not only possible to trigger it manually but also one can write a CI workflow to trigger it. > Apart from that, I have made minor modifications in the docs to adjust > the order of various prerequisites. + + pg_upgrade attempts to migrate logical + replication slots. This helps avoid the need for manually defining the + same replication slots on the new publisher. Migration of logical + replication slots is only supported when the old cluster is version 17.0 + or later. Logical replication slots on clusters before version 17.0 will + silently be ignored. + + The new cluster must not have permanent logical replication slots, i.e., How about using "logical slots" in place of "logical replication slots" to be more generic? We agreed and changed the function name to -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Oct 25, 2023 at 1:39 PM Bharath Rupireddy wrote: > > On Wed, Oct 25, 2023 at 11:39 AM Amit Kapila wrote: > > > > On Tue, Oct 24, 2023 at 1:20 PM Bharath Rupireddy > > wrote: > > > > > > > > > I spent some time on the v57 patch and it looks good to me - tests are > > > passing, no complaints from pgindent and pgperltidy. I turned the CF > > > entry https://commitfest.postgresql.org/45/4273/ to RfC. > > > > > > > Thanks, the patch looks mostly good to me but I am not convinced of > > keeping the tests across versions in this form. I don't think they are > > tested in BF, only one can manually create a setup to test. Shall we > > remove it for now and then consider it separately? > > I think we can retain the test_upgrade_from_pre_PG17 because it is not > only possible to trigger it manually but also one can write a CI > workflow to trigger it. > It would be better to gauge its value separately and add it once the main patch is committed. I am slightly unhappy even with the hack used for pre-version testing in previous patch which is as follows: +# XXX: Older PG version had different rules for the inter-dependency of +# 'max_wal_senders' and 'max_connections', so assign values which will work for +# all PG versions. If Cluster.pm is fixed this code is not needed. +$old_publisher->append_conf( + 'postgresql.conf', qq[ +max_wal_senders = 5 +max_connections = 10 +]); There should be a way to avoid this but we can decide it afterwards. I don't want to hold the main patch for this point. What do you think? > > Apart from that, I have made minor modifications in the docs to adjust > > the order of various prerequisites. > > + > + pg_upgrade attempts to migrate logical > + replication slots. This helps avoid the need for manually defining the > + same replication slots on the new publisher. Migration of logical > + replication slots is only supported when the old cluster is version 17.0 > + or later. Logical replication slots on clusters before version 17.0 will > + silently be ignored. > + > > + The new cluster must not have permanent logical replication slots, > i.e., > > How about using "logical slots" in place of "logical replication > slots" to be more generic? We agreed and changed the function name to > Yeah, I am fine with that and I can take care of it before committing unless there is more to change. -- With Regards, Amit Kapila.
doc: a small improvement about pg_am description
Hi, When reading the documentation about operator class, I found the following description: The pg_am table contains one row for every index access method. Support for access to regular tables is built into PostgreSQL, but all index access methods are described in pg_am. It seems to me that this description says pg_am contains only index access methods but not table methods. I wonder if this was missed when table AM support was added and other documentation was changed in b73c3a11963c8bb783993cfffabb09f558f86e37. Attached is a patch to remove the sentence that starts with "Support for access to regular tables is ". Regards, Yugo Nagata -- Yugo NAGATA diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml index c753d8005a..ff73233818 100644 --- a/doc/src/sgml/xindex.sgml +++ b/doc/src/sgml/xindex.sgml @@ -30,11 +30,8 @@ The pg_am table contains one row for every - index method (internally known as access method). Support for - regular access to tables is built into - PostgreSQL, but all index methods are - described in pg_am. It is possible to add a - new index access method by writing the necessary code and + index and table method (internally known as access method). It is possible + to add a new index access method by writing the necessary code and then creating an entry in pg_am — but that is beyond the scope of this chapter (see ).
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Oct 25, 2023 at 1:50 PM Amit Kapila wrote: > > It would be better to gauge its value separately and add it once the > main patch is committed. > There should be a way to avoid this but we can decide it afterwards. I > don't want to hold the main patch for this point. What do you think? +1 to go with the main patch first. We also have another thing to take care of - pg_upgrade option to not migrate logical slots. > > How about using "logical slots" in place of "logical replication > > slots" to be more generic? We agreed and changed the function name to > > > > Yeah, I am fine with that and I can take care of it before committing > unless there is more to change. +1. I have no other comments. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: walwriter interacts quite badly with synchronous_commit=off
On 25/10/2023 02:09, Andres Freund wrote: Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping and Latch->is_set, we also sometimes end up with multiple processes signalling walwriter, which can be bad, because it increases the likelihood that some of the signals may be received when we are already holding WALWriteLock, delaying its release... That can only happen when walwriter has just come out of "hibernation", i.e. when the system has been idle for a while. So probably not a big deal in practice. I think the most important optimization we need is to have XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed WAL. Unless walwriter is hibernating, walwriter will wake up on its own after wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this. E.g. a condition like if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ) return; drastically cuts down on the amount of wakeups, without - I think - losing guarantees around synchronous_commit=off. In the patch, you actually did: + if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ) + return; It means that you never wake up the walwriter to merely *write* the WAL. You only wake it up if it's going to also fsync() it. I think that's correct and appropriate, but it took me a while to reach that conclusion: It might be beneficial to wake up the walwriter just to perform a write(), to offload that work from the backend. And walwriter will actually also perform an fsync() after finishing the current segment, so it would make sense to also wake it up when 'asyncXactLSN' crosses a segment boundary. However, if those extra wakeups make sense, they would also make sense when there are no asynchronous commits involved. Therefore those extra wakeups should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer(). The condition you have in the patch is appropriate for XLogSetAsyncXactLSN(). Another reason to write the WAL aggressively, even if you don't flush it, would be to reduce the number of lost transactions on a segfault. But we don't give any guarantees on that, and even with the current aggressive logic, we only write when a page is full so you're anyway going to lose the last partial page. It also took me a while to convince myself that this calculation matches the calculation that XLogBackgroundFlush() uses to determine whether it needs to flush or not. XLogBackgroundFlush() first divides the request and result with XLOG_BLCKSZ and then compares the number of blocks, whereas here you perform the calculation in bytes. I think the result is the same, but to make it more clear, let's do it the same way in both places. See attached. It's the same logic as in your patch, just formulated more clearly IMHO. Because of the frequent wakeups, we do something else that's not smart: We write out 8k blocks individually, many times a second. Often thousands of 8k pwrites a second.
Even with this patch, when I bumped up wal_writer_delay to 2 so that the wal writer gets woken up by the async commits rather than the timeout, the write pattern is a bit silly: $ strace -p 1099926 # walwriter strace: Process 1099926 attached epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 1991) = 1 read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., 1007616, 49152) = 1007616 fdatasync(5)= 0 pwrite64(5, "\24\321\5\0\1\0\0\0\0 \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) = 16384 epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1 read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 pwrite64(5, "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., 1040384, 1073152) = 1040384 fdatasync(5)= 0 pwrite64(5, "\24\321\4\0\1\0\0\0\0@ \373\5\0\0\0\0\0\0\0\0\0\0\0;\0\0\0\264'\246\3"..., 16384, 2113536) = 16384 epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1 read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 pwrite64(5, "\24\321\5\0\1\0\0\0\0\200 \373\5\0\0\0003\0\0\0\0\0\0\0\320\177 \373\5\0\0\0"..., 1040384, 2129920) = 1040384 fdatasync(5)= 0 In each cycle, the wal writer writes a full 1 MB chunk (wal_writer_flush_after = '1MB'), flushes it, and then perform a smaller write before going to sleep. Those smaller writes seem a bit silly. But I think it's fine. More throughput for less CPU, seems neat :) Indeed, impressive speedup from such a small patch! I'm not add
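To make the point about matching XLogBackgroundFlush() concrete, here is a minimal sketch of the early-return check in XLogSetAsyncXactLSN() expressed in whole blocks rather than bytes. This is only an illustration built from the variables quoted above; the actual attached patch may differ in detail:

    /* Sketch only: compare completed blocks, as XLogBackgroundFlush() does. */
    if (WriteRqstPtr / XLOG_BLCKSZ <=
        LogwrtResult.Flush / XLOG_BLCKSZ + WalWriterFlushAfter)
        return;     /* not enough unflushed WAL yet; don't wake the walwriter */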
Re: run pgindent on a regular basis / scripted manner
On Tue, Oct 24, 2023 at 6:21 AM Jeff Davis wrote: > > On Wed, 2023-10-18 at 22:34 +1300, David Rowley wrote: > > It would be good to learn how many of the committers out of the ones > > you listed that --enable-indent-checks would have saved from breaking > > koel. > > I'd find that a useful option. > +1. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
Hi, On 10/25/23 5:00 AM, shveta malik wrote: On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand wrote: Hi, On 10/23/23 2:56 PM, shveta malik wrote: On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand wrote: We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there is new synced slot(s) to be created on the standby. Do we want to keep this behavior for V1? I think for the slotsync workers case, we should reduce the naptime in the launcher to say 30sec and retain the default one of 3mins for subscription apply workers. Thoughts? Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new API on the standby that would refresh the list of sync slot at wish, thoughts? Do you mean API to refresh list of DBIDs rather than sync-slots? As per current design, launcher gets DBID lists for all the failover slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. I mean an API to get a newly created slot on the primary being created/synced on the standby at wish. Also let's imagine this scenario: - create logical_slot1 on the primary (and don't start using it) Then on the standby we'll get things like: 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot LSN (0/C0049530) and and catalog xmin (754) That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot restart_lsn to a value < at the corresponding restart_lsn slot on the primary. - create logical_slot2 on the primary (and start using it) Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary that would produce things like: 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) With this new dedicated API, it will be: - clear that the API call is "hanging" until there is some activity on the newly created slot (currently there is "waiting for remote slot " message in the logfile as mentioned above but I'm not sure that's enough) - be possible to create/sync logical_slot2 in the example above without waiting for activity on logical_slot1. Maybe we should change our current algorithm during slot creation so that a newly created inactive slot on the primary does not block other newly created "active" slots on the primary to be created on the standby? Depending on how we implement that, the new API may not be needed at all. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use virtual tuple slot for Unique node
On Tue, Oct 24, 2023 at 4:30 AM David Rowley wrote: > > On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat > wrote: > > I ran my experiments again. It seems on my machine the execution times > > do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took > > average of execution times. I did this three times. For each run the > > standard deviation was within 2%. Here are the numbers > > master: 13548.33, 13878.88, 14572.52 > > master + 0001: 13734.58, 14193.83, 14574.73 > > > > So for me, I would say, this particular query performs the same with > > or without patch. > > I'm not really sure which of the 7 queries you're referring to here. > The times you're quoting seem to align best to Q7 from your previous > results, so I'll assume you mean Q7. > > I'm not really concerned with Q7 as both patched and unpatched use > TTSOpsMinimalTuple. It's Q7. Yes. I was responding to your statement: " However, Q7 does seem to be above noise level faster and I'm not sure why.". Anyway, we can set that aside. > > I also think you need to shrink the size of your benchmark down. With > 1 million tuples, you're more likely to be also measuring the time it > takes to get cache lines from memory into the CPU. A smaller scale > test will make this less likely. Also, you'd be better timing SELECT > rather than the time it takes to EXPLAIN ANALYZE. They're not the same > thing. EXPLAIN ANALYZE has additional timing going on and we may end > up not de-toasting toasted Datums. I ran experiments with 10K rows and measured timing using \timing in psql. The measurements are much more flaky than a larger set of rows and EXPLAIN ANALYZE. But I think your observations are good enough. > > > On Thu, Oct 19, 2023 at 4:26 PM David Rowley wrote: > > > We can see that Q2 and Q3 become a bit slower. This makes sense as > > > tts_virtual_materialize() is quite a bit more complex than > > > heap_copy_minimal_tuple() which is a simple palloc/memcpy. > > > > > > > If the source slot is a materialized virtual slot, > > tts_virtual_copyslot() could perform a memcpy of the materialized data > > itself rather than materialising from datums. That might be more > > efficient. > > I think you're talking about just performing a memcpy() of the > VirtualTupleTableSlot->data field. Unfortunately, you'd not be able > to do just that as you'd also need to repoint the non-byval Datums in > tts_values at the newly memcpy'd memory. If you skipped that part, > those would remain pointing to the original memory. If that memory > goes away, then bad things will happen. I think you'd still need to do > the 2nd loop in tts_virtual_materialize() Yes, we will need repoint non-byval Datums ofc. > > > May be we should fix the above said inefficiency in > > tt_virtual_copyslot()? > > I don't have any bright ideas on how to make tts_virtual_materialize() > itself faster. If there were some way to remember !attbyval > attributes for the 2nd loop, that might be good, but creating > somewhere to store that might result in further overheads. We may save the size of data in VirtualTupleTableSlot, thus avoiding the first loop. I assume that when allocating VirtualTupleTableSlot->data, we always know what size we are allocating so it should be just a matter of saving it in VirtualTupleTableSlot->size. This should avoid the first loop in tts_virtual_materialize() and give some speed up. We will need a loop to repoint non-byval Datums. 
I imagine that the pointers to non-byval Datums can be computed as dest_slot->tts_values[natts] = dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data). This would work as long as all the non-byval datums in the source slot are all stored flattened in source slot's data. I am assuming that that would be true in a materialized virtual slot. The byval datums are copied as is. I think, this will avoid multiple memcpy calls, one per non-byval attribute and hence some speedup. I may be wrong though. -- Best Wishes, Ashutosh Bapat
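A rough sketch of what Ashutosh describes, assuming a hypothetical VirtualTupleTableSlot->size field that records the length of the already-materialized data area (this is not code from any posted patch, just an illustration of the memcpy-and-repoint idea):

    VirtualTupleTableSlot *src_vslot = (VirtualTupleTableSlot *) srcslot;
    VirtualTupleTableSlot *dst_vslot = (VirtualTupleTableSlot *) dstslot;
    TupleDesc   desc = srcslot->tts_tupleDescriptor;

    /* copy the flattened data area wholesale, then repoint non-byval Datums */
    dst_vslot->data = MemoryContextAlloc(dstslot->tts_mcxt, src_vslot->size);
    memcpy(dst_vslot->data, src_vslot->data, src_vslot->size);

    for (int natt = 0; natt < desc->natts; natt++)
    {
        Form_pg_attribute att = TupleDescAttr(desc, natt);

        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
        if (att->attbyval || srcslot->tts_isnull[natt])
            dstslot->tts_values[natt] = srcslot->tts_values[natt];
        else
            dstslot->tts_values[natt] = PointerGetDatum(dst_vslot->data +
                (DatumGetPointer(srcslot->tts_values[natt]) - src_vslot->data));
    }
    /* plus the usual tts_nvalid/flag bookkeeping, omitted here */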
Re: Synchronizing slots from primary to standby
Hi, On 10/25/23 6:57 AM, Amit Kapila wrote: On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand wrote: On 10/24/23 7:44 AM, Ajin Cherian wrote: On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand wrote: 2) When we create a subscription, another slot is created during the subscription synchronization, namely like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). This extra slot appears to have failover also set to true. So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like on the standby: LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and and catalog xmin (756) LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756) LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog xmin (756) WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover during the subscription creation, better to re-launch the subscription instead? But note that the subscription doesn't wait for the completion of tablesync. Right. So, how will we deal with that? Also, this situation is the same for non-tablesync slots as well. I have given another option in the email [1] which is to enable failover even for the main slot after all tables are in ready state, something similar to what we do for two_phase. Oh right that looks like a better option (enable failover even for the main slot after all tables are in ready state). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Wed, 25 Oct 2023 at 04:55, David Rowley wrote: > With foreach(), we commonly do "if (lc == NULL)" at the end of loops > as a way of checking if we did "break" to terminate the loop early. Afaict it's done pretty infrequently. The following crude attempt at an estimate estimates it's only done about ~1.5% of the time a foreach is used: $ rg 'lc == NULL' | wc -l 13 $ rg '\bforeach\(lc,' -S | wc -l 899 > Doing the equivalent with the new macros won't be safe as the list > element's value we broke on may be set to NULL. I think it might be a > good idea to document the fact that this wouldn't be safe with the new > macros, or better yet, document the correct way to determine if we > broke out the loop early. I imagine someone will want to do some > conversion work at some future date and it would be good if we could > avoid introducing bugs during that process. > > I wonder if we should even bother setting the variable to NULL at the > end of the loop. It feels like someone might just end up mistakenly > checking for NULLs even if we document that it's not safe. If we left > the variable pointing to the last list element then the user of the > macro is more likely to notice their broken code. It'd also save a bit > of instruction space. Makes sense. Addressed this now by mentioning this limitation and possible workarounds in the comments of the new macros and by not setting the loop variable to NULL/0. I don't think there's an easy way to add this feature to these new macros natively, it's a limitation of not having a second variable. This seems fine to me, since these new macros are meant as an addition to foreach() instead of a complete replacement. v3-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v3-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data
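As a concrete illustration of the documented workaround (treat this as a sketch, not patch documentation; "relid_list" and "target_relid" are hypothetical names, and the patch's for_each_oid variant is assumed):

    bool    found = false;
    Oid     relid;

    for_each_oid(relid, relid_list)
    {
        if (relid == target_relid)
        {
            found = true;
            break;
        }
    }

    /* the loop variable is not reset on normal exit, so test the flag */
    if (!found)
        elog(ERROR, "relation not found in list");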
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
Attached is a slightly updated version, with a bit simpler implementation of foreach_delete_current. Instead of decrementing i and then adding 1 to it when indexing the list, it now indexes the list using a postfix decrement. From 9073777a6d82e2b2db8b1ed9aef200550234d89a Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Tue, 24 Oct 2023 17:13:20 +0200 Subject: [PATCH v4 2/2] Use new for_each_xyz macros in a few places This starts using each of the newly added for_each_xyz macros in at least one place. This shows how they improve readability by reducing the amount of boilerplate code. --- src/backend/executor/execExpr.c | 11 src/backend/executor/nodeSubplan.c | 28 + src/backend/replication/logical/relation.c | 5 ++-- src/backend/replication/logical/tablesync.c | 6 ++--- src/backend/replication/pgoutput/pgoutput.c | 8 +++--- 5 files changed, 25 insertions(+), 33 deletions(-) diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c index 2c62b0c9c84..e73261ec445 100644 --- a/src/backend/executor/execExpr.c +++ b/src/backend/executor/execExpr.c @@ -216,7 +216,8 @@ ExecInitQual(List *qual, PlanState *parent) ExprState *state; ExprEvalStep scratch = {0}; List *adjust_jumps = NIL; - ListCell *lc; + Expr *node; + int jump; /* short-circuit (here and in ExecQual) for empty restriction list */ if (qual == NIL) @@ -250,10 +251,8 @@ ExecInitQual(List *qual, PlanState *parent) scratch.resvalue = &state->resvalue; scratch.resnull = &state->resnull; - foreach(lc, qual) + for_each_ptr(node, qual) { - Expr *node = (Expr *) lfirst(lc); - /* first evaluate expression */ ExecInitExprRec(node, state, &state->resvalue, &state->resnull); @@ -265,9 +264,9 @@ ExecInitQual(List *qual, PlanState *parent) } /* adjust jump targets */ - foreach(lc, adjust_jumps) + for_each_int(jump, adjust_jumps) { - ExprEvalStep *as = &state->steps[lfirst_int(lc)]; + ExprEvalStep *as = &state->steps[jump]; Assert(as->opcode == EEOP_QUAL); Assert(as->d.qualexpr.jumpdone == -1); diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c index c136f75ac24..8f88f289269 100644 --- a/src/backend/executor/nodeSubplan.c +++ b/src/backend/executor/nodeSubplan.c @@ -307,7 +307,7 @@ ExecScanSubPlan(SubPlanState *node, Datum rowresult; bool rownull; int col; - ListCell *plst; + int paramid; if (subLinkType == EXISTS_SUBLINK) { @@ -367,9 +367,8 @@ ExecScanSubPlan(SubPlanState *node, * Now set all the setParam params from the columns of the tuple */ col = 1; - foreach(plst, subplan->setParam) + for_each_int(paramid, subplan->setParam) { -int paramid = lfirst_int(plst); ParamExecData *prmdata; prmdata = &(econtext->ecxt_param_exec_vals[paramid]); @@ -412,9 +411,8 @@ ExecScanSubPlan(SubPlanState *node, * combining expression. 
*/ col = 1; - foreach(plst, subplan->paramIds) + for_each_int(paramid, subplan->paramIds) { - int paramid = lfirst_int(plst); ParamExecData *prmdata; prmdata = &(econtext->ecxt_param_exec_vals[paramid]); @@ -480,10 +478,11 @@ ExecScanSubPlan(SubPlanState *node, } else if (subLinkType == MULTIEXPR_SUBLINK) { + int paramid; + /* We don't care about function result, but set the setParams */ - foreach(l, subplan->setParam) + for_each_int(paramid, subplan->setParam) { -int paramid = lfirst_int(l); ParamExecData *prmdata; prmdata = &(econtext->ecxt_param_exec_vals[paramid]); @@ -604,16 +603,15 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext) slot = ExecProcNode(planstate)) { int col = 1; - ListCell *plst; bool isnew; + int paramid; /* * Load up the Params representing the raw sub-select outputs, then * form the projection tuple to store in the hashtable. */ - foreach(plst, subplan->paramIds) + for_each_int(paramid, subplan->paramIds) { - int paramid = lfirst_int(plst); ParamExecData *prmdata; prmdata = &(innerecontext->ecxt_param_exec_vals[paramid]); @@ -880,11 +878,10 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent) if (subplan->setParam != NIL && subplan->parParam == NIL && subplan->subLinkType != CTE_SUBLINK) { - ListCell *lst; + int paramid; - foreach(lst, subplan->setParam) + for_each_int(paramid, subplan->setParam) { - int paramid = lfirst_int(lst); ParamExecData *prm = &(estate->es_param_exec_vals[paramid]); prm->execPlan = sstate; @@ -906,7 +903,7 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent) List *oplist, *lefttlist, *righttlist; - ListCell *l; + OpExpr *opexpr; /* We need a memory context to hold the hash table(s) */ sstate->hashtablecxt = @@ -966,9 +963,8 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent) cross_eq_funcoids = (Oid *) palloc(ncols * sizeof(Oid));
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Tue, Oct 24, 2023 at 8:53 PM José Neves wrote: > > Hi there, hope to find you well. > > I have a follow-up question to this already long thread. > > Upon deploying my PostgreSQL logical replication fed application on a stale > database, I ended up running out of space, as the replication slot is being > held back till the next time that we receive a data-changing event, and we > advance to that new LSN offset. > I think that the solution for this is to advance our LSN offset every time a > keep-alive message is received ('k' // 107). > My doubt is, can the keep-alive messages be received in between open > transaction events? I think not, but I would like to get your input to be > extra sure as if this happens, and I commit that offset, I may introduce > again faulty logic leading to data loss. > > In sum, something like this wouldn't happen: > BEGIN LSN001 > INSERT LSN002 > KEEP LIVE LSN003 > UPDATE LSN004 > COMMIT LSN005 > If the downstream acknowledges receipt of LSN003 and saves it locally and crashes, upon restart the upstream will resend all the transactions that committed after LSN003 including the one ended at LSN005. So this is safe. -- Best Wishes, Ashutosh
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Ok, I see. In that situation it is safe indeed, as the offset is lower than the current transaction commit. But I think that I asked the wrong question. I guess that the right question is: Can we receive a keep-alive message with an LSN offset bigger than the commit of the open or following transactions? Something like: BEGIN LSN001 INSERT LSN002 KEEP LIVE LSN006 UPDATE LSN004 COMMIT LSN005 Or: KEEP LIVE LSN006 BEGIN LSN001 INSERT LSN002 UPDATE LSN004 COMMIT LSN005 KEEP LIVE LSN007 Or is the sequence ensured not only between commits but also with keep-alive messaging? From: Ashutosh Bapat Sent: 25 October 2023 11:42 To: José Neves Cc: Amit Kapila ; Andres Freund ; pgsql-hack...@postgresql.org Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client On Tue, Oct 24, 2023 at 8:53 PM José Neves wrote: > > Hi there, hope to find you well. > > I have a follow-up question to this already long thread. > > Upon deploying my PostgreSQL logical replication fed application on a stale > database, I ended up running out of space, as the replication slot is being > held back till the next time that we receive a data-changing event, and we > advance to that new LSN offset. > I think that the solution for this is to advance our LSN offset every time a > keep-alive message is received ('k' // 107). > My doubt is, can the keep-alive messages be received in between open > transaction events? I think not, but I would like to get your input to be > extra sure as if this happens, and I commit that offset, I may introduce > again faulty logic leading to data loss. > > In sum, something like this wouldn't happen: > BEGIN LSN001 > INSERT LSN002 > KEEP LIVE LSN003 > UPDATE LSN004 > COMMIT LSN005 > If the downstream acknowledges receipt of LSN003 and saves it locally and crashes, upon restart the upstream will resend all the transactions that committed after LSN003 including the one ended at LSN005. So this is safe.
Re: POC, WIP: OR-clause support for indexes
Hi! On 15.10.2023 01:34, Alexander Korotkov wrote: Hi, Alena! Thank you for your work on the subject. On Wed, Oct 4, 2023 at 10:21 PM a.rybakina wrote: I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I added my own regression tests. Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff as expected and real results: diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300 +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out 2023-10-04 21:55:41.665422459 +0300 @@ -1925,17 +1925,20 @@ EXPLAIN (COSTS OFF) SELECT count(*) FROM tenk1 WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41; - QUERY PLAN - +QUERY PLAN +--- Aggregate -> Bitmap Heap Scan on tenk1 - Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41)) + Recheck Cond: thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand = 41)) -> BitmapOr - -> Bitmap Index Scan on tenk1_thous_tenthous - Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) + -> BitmapOr + -> Bitmap Index Scan on tenk1_thous_tenthous + Index Cond: ((thousand = 42) AND (tenthous = 1)) + -> Bitmap Index Scan on tenk1_thous_tenthous + Index Cond: ((thousand = 42) AND (tenthous = 3)) -> Bitmap Index Scan on tenk1_thous_tenthous Index Cond: (thousand = 41) -(8 rows) +(11 rows) I think this query is not converted, because you only convert top-level ORs in the transform_ors() function. But in the example given, the target OR lays under AND, which in turn lays under another OR. I think you need to make transform_ors() recursive to handle cases like this. I wonder about the default value of the parameter or_transform_limit of 500. In [1] and [2] you show the execution time degradation from 0 to ~500 OR clauses. I made a simple SQL script with the query "SELECT * FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid = 100;". The pgbench results for a single connection in prepared mode are the following. master: 936 tps patched (or_transform_limit == 0) :1414 tps So, transformation to ANY obviously accelerates the execution. I think it's important to identify the cases where this patch causes the degradation. Generally, I don't see why ANY could be executed slower than the equivalent OR clause. So, the possible degradation cases are slower plan generation and worse plans. I managed to find both. As you stated before, currently the OR transformation has a quadratic complexity depending on the number of or-clause-groups. I made a simple test to evaluate this. containing 1 or-clause-groups. SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 * bid = 1 OR ... OR aid + 1 * bid = 1; master: 316ms patched: 7142ms Note, that the current or_transform_limit GUC parameter is not capable of cutting such cases, because it cuts cases lower than the limit not higher than the limit. In the comment, you mention that we could invent something like hash to handle this. Hash should be nice, but the problem is that we currently don't have a generic facility to hash nodes (or even order them). It would be nice to add this facility, that would be quite a piece of work. 
I would propose to limit this patch for now to handle just a single Var node as a non-const side of the clause and implement a simple hash for Vars. Another problem is the possible generation of worse plans. I made an example table with two partial indexes. create table test as (select (random()*10)::int x, (random()*1000) y from generate_series(1,100) i); create index test_x_1_y on test (y) where x = 1; create index test_x_2_y on test (y) where x = 2; vacuum analyze test; Without the transformation of ORs to ANY, our planner manages to use both indexes with a Bitmap scan. # explain select * from test where (x = 1 or x = 2) and y = 100; QUERY PLAN -- Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Wed, Oct 25, 2023 at 4:23 PM José Neves wrote: > > Ok, I see. In that situation is safe indeed, as the offset is lower than the > current transaction commit. > But I think that I asked the wrong question. I guess that the right question > is: Can we receive a keep-alive message with an LSN offset bigger than the > commit of the open or following transactions? > Something like: > > BEGIN LSN001 > INSERT LSN002 > KEEP LIVE LSN006 > UPDATE LSN004 > COMMIT LSN005 > > Or: > > KEEP LIVE LSN006 > BEGIN LSN001 > INSERT LSN002 > UPDATE LSN004 > COMMIT LSN005 > KEEP LIVE LSN007 > AFAIU the code in walsender this isn't possible. Keep alive sends the LSN of the last WAL record it read (sentPtr). Upon reading a commit WAL record, the whole transaction is decoded. Till that point sentPtr is not updated. Please take a look at XLogSendLogical(void) and the places where WalSndKeepaliveIfNecessary() is called. -- Best Wishes, Ashutosh Bapat
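Putting the explanation above into a small client-side sketch (hypothetical names, not code from any posted client): because a keepalive carries sentPtr, which does not move past a commit record until that transaction has been fully decoded and sent, acknowledging keepalive LSNs cannot skip over a pending transaction; a conservative client can still restrict advancing to the gaps between transactions:

    /* msg_type, commit_end_lsn and keepalive_wal_end are hypothetical */
    uint64  acked_lsn = 0;
    bool    in_xact = false;

    switch (msg_type)
    {
        case 'B':               /* Begin */
            in_xact = true;
            break;
        case 'C':               /* Commit */
            in_xact = false;
            acked_lsn = commit_end_lsn;     /* end LSN from the Commit message */
            break;
        case 'k':               /* keepalive */
            if (!in_xact)
                acked_lsn = keepalive_wal_end;  /* safe to acknowledge */
            break;
    }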
Re: ResourceOwner refactoring
On 10/07/2023 15:37, Peter Eisentraut wrote: A few suggestions on the API: > +static ResourceOwnerFuncs tupdesc_resowner_funcs = These aren't all "functions", so maybe another word like "info" or "description" would be more appropriate? > + .release_phase = RESOURCE_RELEASE_AFTER_LOCKS, > + .release_priority = RELEASE_PRIO_TUPDESC_REFS, I suggest assigning explicit non-zero numbers to the RESOURCE_RELEASE_* constants, as well as changing RELEASE_PRIO_FIRST to 1 and adding an assertion that the priority is not zero. Otherwise, someone might forget to assign these fields and would implicitly get phase 0 and priority 0, which would probably work in most cases, but wouldn't be the explicit intention. Done. Then again, you are using RELEASE_PRIO_FIRST for the pgcrypto patch. Is it the recommendation that if there are no other requirements, external users should use that? If we are going to open this up to external users, we should probably have some documented guidance around this. I added a brief comment about that in the ResourceReleasePhase typedef. I also added a section to the README on how to define your own resource kind. (The README doesn't go into any detail on how to choose the release priority though). > + .PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning I think this can be refactored further. We really just need a function to describe a resource, like maybe static char * ResOwnerDescribeTupleDesc(Datum res) { TupleDesc tupdesc = (TupleDesc) DatumGetPointer(res); return psprintf("TupleDesc %p (%u,%d)", tupdesc, tupdesc->tdtypeid, tupdesc->tdtypmod); } and then the common code can generate the warnings from that like elog(WARNING, "%s leak: %s still referenced", kind->name, kind->describe(value)); That way, we could more easily develop further options, such as turning this warning into an error (useful for TAP tests). Also, maybe, you could supply a default description that just prints "%p", which appears to be applicable in about half the places. Refactored it that way. Possible bonus project: I'm not sure these descriptions are so useful anyway. It doesn't help me much in debugging if "File 55 was not released" or some such. If would be cool if ResourceOwnerRemember() in some debug mode could remember the source code location, and then print that together with the leak warning. Yeah, that would be useful. I remember I've hacked something like that as a one-off thing in the past, when debugging a leak. > +#define ResourceOwnerRememberTupleDesc(owner, tupdesc) \ > + ResourceOwnerRemember(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs) > +#define ResourceOwnerForgetTupleDesc(owner, tupdesc) \ > + ResourceOwnerForget(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs) I would prefer that these kinds of things be made inline functions, so that we maintain the type safety of the previous interface. And it's also easier when reading code to see type decorations. (But I suspect the above bonus project would require these to be macros?) > + if (context->resowner) > + ResourceOwnerForgetJIT(context->resowner, context); It seems most ResourceOwnerForget*() calls have this kind of check before them. Could this be moved inside the function? Many do, but I don't know if it's the majority. We could make ResurceOwnerForget(NULL) do nothing, but I think it's better to have the check in the callers. You wouldn't want to silently do nothing when the resource owner is not expected to be NULL. 
(I'm attaching new patch version in my reply to Andres shortly) -- Heikki Linnakangas Neon (https://neon.tech)
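For illustration, the inline-function form Peter asks for could look roughly like this (a sketch based on the macros quoted above; the struct and function names may well end up different in the final patch):

    static inline void
    ResourceOwnerRememberTupleDesc(ResourceOwner owner, TupleDesc tupdesc)
    {
        ResourceOwnerRemember(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs);
    }

    static inline void
    ResourceOwnerForgetTupleDesc(ResourceOwner owner, TupleDesc tupdesc)
    {
        ResourceOwnerForget(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs);
    }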
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
Hi, Is there any conclusion to this issue? Jehan-Guillaume de Rorthais wrote on Thu, Aug 10, 2023 at 23:03: > On Thu, 3 Aug 2023 11:02:43 +0200 > Alvaro Herrera wrote: > > > On 2023-Aug-03, tender wang wrote: > > > > > I think old "sub-FK" should not be dropped, that would violate the > foreign > > > key constraint. > > > > Yeah, I've been playing more with the patch and it is definitely not > > doing the right things. Just eyeballing the contents of pg_trigger and > > pg_constraint for partitions added by ALTER...ATTACH shows that the > > catalog contents are inconsistent with those added by CREATE TABLE > > PARTITION OF. > > Well, as stated in my original message, the patch helps understanding the > problem and sketches a possible solution. It definitely is not complete. > > After DETACHing the table, we surely need to check everything again and > recreate what is needed to keep the FK consistent. > > But should we keep the FK after DETACH? Did you check the two other > discussions > related to FK, self-FK & partition? Unfortunately, as Tender experienced, > the > more we dig the more we find bugs. Moreover, the second one might seem > unsolvable and deserves a closer look. See: > > * FK broken after DETACHing referencing part > https://www.postgresql.org/message-id/20230420144344.40744130%40karst > * Issue attaching a table to a partitioned table with an auto-referenced > foreign key > https://www.postgresql.org/message-id/20230707175859.17c91538%40karst > >
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On 2023-Oct-24, Jelte Fennema wrote: > Many usages of the foreach macro in the Postgres codebase only use the > ListCell variable to then get its value. This adds macros that > simplify iteration code for that very common use case. Instead of > passing a ListCell you can pass a variable of the type of its > contents. This IMHO improves readability of the code by reducing the > total amount of code while also essentially forcing the use of useful > variable names. +1 for getting rid of useless "lc" variables. Looking at for_each_ptr() I think it may be cleaner to follow palloc_object()'s precedent and make it foreach_object() instead (I have no love for the extra underscore, but I won't object to it either). And like foreach_node, have it receive a type name to add a cast to. I'd imagine something like SubscriptionRelState *rstate; foreach_object(SubscriptionRelState *, rstate, table_states_not_ready) { -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: trying again to get incremental backup
On 2023-10-24 Tu 12:08, Robert Haas wrote: It looks like each file entry in the manifest takes about 150 bytes, so 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? I suspect a few people have more files than that. They'll just have to wait to use this feature until we get incremental JSON parsing (or undo the decision to use JSON for the manifest). Robert asked me to work on this quite some time ago, and most of this work was done last year. Here's my WIP for an incremental JSON parser. It works and passes all the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. The reason I haven't posted it before is that it's about 50% slower in pure parsing speed than the current recursive descent parser in my testing. I've tried various things to make it faster, but haven't made much impact. One of my colleagues is going to take a fresh look at it, but maybe someone on the list can see where we can save some cycles. If we can't make it faster, I guess we could use the RD parser for non-incremental cases and only use the non-RD parser for incremental, although that would be a bit sad. However, I don't think we can make the RD parser suitable for incremental parsing - there's too much state involved in the call stack. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com json_incremental_parser-2023-09-25.patch.gz Description: application/gzip
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
On 2023-Oct-25, tender wang wrote: > Hi >Is there any conclusion to this issue? None yet. I intend to work on this at some point, hopefully soon. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar wrote: > > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila wrote: > > > > This and other results shared by you look promising. Will there be any > > improvement in workloads related to clog buffer usage? > > I did not understand this question can you explain this a bit? > I meant to ask about the impact of this patch on accessing transaction status via TransactionIdGetStatus(). Shouldn't we expect some improvement in accessing CLOG buffers? -- With Regards, Amit Kapila.
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Wed, 25 Oct 2023 at 13:52, Alvaro Herrera wrote: > Looking at for_each_ptr() I think it may be cleaner to follow > palloc_object()'s precedent and make it foreach_object() instead (I have > no love for the extra underscore, but I won't object to it either). And > like foreach_node, have it receive a type name to add a cast to. > > I'd imagine something like > > SubscriptionRelState *rstate; > > foreach_object(SubscriptionRelState *, rstate, table_states_not_ready) > { Could you clarify why you think it may be cleaner? I don't see much benefit to passing the type in there if all we use it for is adding a cast. It seems like extra things to type for little benefit. palloc_object uses the passed in type to not only do the cast, but also to determine the size of the the allocation. If foreach_object would allow us to remove the declaration further up in the function I do see a benefit though. I attached a new patchset which includes a 3rd patch that does this (the other patches are equivalent to v4). I quite like that it moves the type declaration to the loop itself, limiting its scope. But I'm not fully convinced it's worth the hackiness of introducing a second for loop that does a single iteration, just to be able to declare a variable of a different type though. But I don't know another way of achieving this. If this hack/trick is deemed acceptable, we can do the same for the other newly introduced macros. The type would not even need to be specified for oid/xid/int because it's already known to be Oid/TransactionId/int respectively. v5-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v5-0003-Introduce-for_each_object.patch Description: Binary data v5-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data
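For readers trying to picture the trick, here is a rough illustration of a single-iteration outer loop used only to declare the typed variable. This is not Jelte's v5 patch, just a sketch built on the existing ForEachState machinery from pg_list.h; a break in the body leaves the inner loop, after which the outer loop's condition fails, so break and continue behave as expected:

    #define for_each_object(type, var, lst) \
        for (ForEachState var##__state = {(lst), 0}; \
             var##__state.l != NIL; \
             var##__state.l = NIL) \
            for (type var = NULL; \
                 var##__state.i < var##__state.l->length && \
                 (var = (type) lfirst(&var##__state.l->elements[var##__state.i]), true); \
                 var##__state.i++)

    /* usage, per Alvaro's example upthread */
    for_each_object(SubscriptionRelState *, rstate, table_states_not_ready)
    {
        /* ... use rstate ... */
    }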
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila wrote: > > On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar wrote: > > > > On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila wrote: > > > > > > This and other results shared by you look promising. Will there be any > > > improvement in workloads related to clog buffer usage? > > > > I did not understand this question can you explain this a bit? > > > > I meant to ask about the impact of this patch on accessing transaction > status via TransactionIdGetStatus(). Shouldn't we expect some > improvement in accessing CLOG buffers? Yes, there should be, because 1) there is now no common lock, so contention on a centralized control lock is reduced when we access transaction status from pages falling in different SLRU banks; and 2) the buffer pool size is configurable, so if the workload accesses transaction status across a wide range of pages it helps with frequent buffer eviction, though this might not be the most common case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
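A hypothetical illustration of the bank partitioning described above (names invented here, not taken from the patch): each SLRU page maps to one bank of buffers and each bank gets its own lock, so status lookups on pages in different banks no longer contend on a single control lock:

    int     bankno = pageno % slru_num_banks;              /* pick this page's bank */
    LWLock *banklock = &shared->bank_locks[bankno].lock;   /* hypothetical field */

    LWLockAcquire(banklock, LW_EXCLUSIVE);
    /* search or evict only within this bank's slots ... */
    LWLockRelease(banklock);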
Re: RFC: Pluggable TOAST
Hi Nikita, > We need community feedback on a previously discussed topic [1]. > There are some long-lived issues in Postgres related to the TOAST mechanics, > like [2]. > Some time ago we already proposed a set of patches with an API allowing to > plug in > different TOAST implementations into a live database. The patch set > introduced a lot > of code and was quite crude in some places, so after several implementations > we decided > to try to implement it in the production environment for further check-up. > > The main idea behind pluggable TOAST is to make it possible to easily plug in > and use different > implementations of large values storage, preserving existing mechanics to > keep backward > compatibility, and to provide an easy Postgres-way to give users alternative mechanics > for storing large > column values in a more effective way - we already have custom and very > effective (up to tens > and even hundreds of times faster) TOAST implementations for bytea and JSONb > data types. > > As we see it - Pluggable TOAST proposes > 1) changes in TOAST pointer itself, extending it to store custom data - > current limitations > of TOAST pointer were discussed in [1] and [4]; > 2) API which allows calls of custom TOAST implementations for certain table > columns and > (a topic for discussion) certain datatypes. > > Custom TOAST could also be used in a not so trivial way - for example, > limited columnar storage could be easily implemented and plugged in without > heavy core modifications > of implementation of Pluggable Storage (Table Access Methods), preserving > existing data > and database structure, be upgraded, replicated and so on. > > Any thoughts and proposals are welcome. It seems to me that discarding the previous discussion and starting a new thread where you ask the community for *another* round of feedback is not going to be productive. Pretty sure it's not going to change. -- Best regards, Aleksander Alekseev
Re: race condition in pg_class
> On 25 Oct 2023, at 13:39, Smolkin Grigory wrote: > > We are running PG13.10 and recently we have encountered what appears to be a > bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and > some other catalog-writer, possibly ANALYZE > I've tried to reproduce this scenario with CREATE INDEX and various > concurrent statements, but no luck. Maybe it would be possible to reproduce by modifying the tests for concurrent index creation. For example, add “ANALYZE” here [0]. Keep in mind that for easier reproduction it would make sense to increase the transaction count radically. Best regards, Andrey Borodin. [0] https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Hi Andrei, On Wed, 2023-10-25 at 13:59 +0700, Andrei Lepikhov wrote: > But minmax_stats_since and changes in the UI of the reset routine > look like syntactic sugar here. > I can't convince myself that it is really needed. Do you have some > set of cases that can enforce the changes proposed? Yes, it looks strange, but it is needed in some way. The main purpose of this patch is to provide the means for sampling solutions to collect statistics data in a series of samples. The first goal here is to be sure that the statement was not evicted and did not come back between samples (especially between rare samples). This is the target of the stats_since field. The second goal is the ability to get all statistics increments for the interval between samples. All statistics provided by pg_stat_statements, with the exception of min/max values, can be calculated for the interval since the last sample knowing the values in the last and current samples. We need a way to get min/max values too. This is achieved by a min/max reset performed by the sampling solution. The minmax_stats_since field is here for the same purpose - the sampling solution needs to be sure that no one has done a min/max reset between samples. And if such a reset was performed, at least we know when it was performed. regards, Andrei Zubkov Postgres Professional
Re: trying again to get incremental backup
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan wrote: > Robert asked me to work on this quite some time ago, and most of this > work was done last year. > > Here's my WIP for an incremental JSON parser. It works and passes all > the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. > The reason I haven't posted it before is that it's about 50% slower in > pure parsing speed than the current recursive descent parser in my > testing. I've tried various things to make it faster, but haven't made > much impact. One of my colleagues is going to take a fresh look at it, > but maybe someone on the list can see where we can save some cycles. > > If we can't make it faster, I guess we could use the RD parser for > non-incremental cases and only use the non-RD parser for incremental, > although that would be a bit sad. However, I don't think we can make the > RD parser suitable for incremental parsing - there's too much state > involved in the call stack. Yeah, this is exactly why I didn't want to use JSON for the backup manifest in the first place. Parsing such a manifest incrementally is complicated. If we'd gone with my original design where the manifest consisted of a bunch of lines each of which could be parsed separately, we'd already have incremental parsing and wouldn't be faced with these difficult trade-offs. Unfortunately, I'm not in a good position either to figure out how to make your prototype faster, or to evaluate how painful it is to keep both in the source tree. It's probably worth considering how likely it is that we'd be interested in incremental JSON parsing in other cases. Maintaining two JSON parsers is probably not a lot of fun regardless, but if each of them gets used for a bunch of things, that feels less bad than if one of them gets used for a bunch of things and the other one only ever gets used for backup manifests. Would we be interested in JSON-format database dumps? Incrementally parsing JSON LOBs? Either seems tenuous, but those are examples of the kind of thing that could make us happy to have incremental JSON parsing as a general facility. If nobody's very excited by those kinds of use cases, then this just boils down to whether we want to (a) accept that users with very large numbers of relation files won't be able to use pg_verifybackup or incremental backup, (b) accept that we're going to maintain a second JSON parser just to enable that use cas and with no other benefit, or (c) undertake to change the manifest format to something that is straightforward to parse incrementally. I think (a) is reasonable short term, but at some point I think we should do better. I'm not really that enthused about (c) because it means more work for me and possibly more arguing, but if (b) is going to cause a lot of hassle then we might need to consider it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add connection active, idle time to pg_stat_activity
Hi Sergey,

I've done a review of this patch. I find the patch idea very useful, thank you for the patch. I noted a few things while looking at it:

1. The patch can't be applied to the current master. My review is based
   on applying this patch over ac68323a878.
2. Applied over ac68323a878, the patch works as expected.
3. The field names seem quite long to me (and they should be named
   consistently with the same statistics in other views; for example,
   what is called "running" here is called "active" in pg_stat_database).
4. Meaningless spaces at the end of a line:
   - backend_status.c:586
   - monitoring.sgml:5857
5. The patch adds usecs_diff = secs * 1000000 + usecs; in
   backend_status.c:pgstat_report_activity() to optimize calculations,
   but pgstat_count_conn_active_time((PgStat_Counter) secs * 1000000 + usecs);
   and pgstat_count_conn_txn_idle_time((PgStat_Counter) secs * 1000000 + usecs);
   are left in place after that (a possible cleanup is sketched below).
6. I'm not sure I understand the comment
   /* Keep statistics for pg_stat_database intact */
   at backend_status.c:600 correctly. Can you please explain it a little?
7. The tests seem incomplete. It looks like we could check increments of
   all fields by playing with transactions in the tests.

Also, I have a thought about another possible improvement that fits this patch. The pg_stat_session view is really needed in Postgres, but I think it should have many more statistics - I mean all resource statistics related to sessions. Every backend has instrumentation that tracks resource consumption. The data from this instrumentation goes to the cumulative statistics system and is used in monitoring extensions (like pg_stat_statements). I think the pg_stat_session view could add one more dimension of monitoring - a dimension of sessions. In my opinion, this view should provide resource consumption statistics of current sessions as two cumulative sets of statistics - since backend start and since transaction start. Such a view would be really useful for monitoring long-running sessions and transactions, providing resource consumption information besides timing statistics.

regards, Andrei Zubkov
Postgres Professional
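[A minimal sketch of the cleanup suggested in item 5, reusing the precomputed value. The surrounding condition and variable names are assumptions based on the snippets quoted above, not the actual patch.]

	TimestampDifference(beentry->st_state_start_timestamp,
						current_timestamp, &secs, &usecs);
	usecs_diff = secs * 1000000 + usecs;

	if (beentry->st_state == STATE_RUNNING ||
		beentry->st_state == STATE_FASTPATH)
		pgstat_count_conn_active_time((PgStat_Counter) usecs_diff);
	else
		pgstat_count_conn_txn_idle_time((PgStat_Counter) usecs_diff);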
Re: Add connection active, idle time to pg_stat_activity
Hi, > I've done a review of this patch. I found the patch idea very useful, > thank you for the patch. I've noted something observing this patch: > 1. Patch can't be applied on the current master. My review is based on >application of this patch over ac68323a878 On top of that not sure if I see the patch on the November commitfest [1]. Please make sure it's there so that cfbot will check the patch. [1]: https://commitfest.postgresql.org/45/ -- Best regards, Aleksander Alekseev
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 19.10.2023 15:40, Andrei Zubkov wrote:
> Hi hackers,
>
> New version 23 attached. It contains a rebase onto the current master.
> Noted that v1.11 adds new fields to the pg_stat_statements view, but
> leaves the PGSS_FILE_HEADER constant unchanged. Is this correct?

Hi! Thank you for your work on the subject.

1. I didn't understand why we first try to find the pgssEntry with a false top_level value, and later find it with a true top_level value.

	/*
	 * Remove the entry if it exists, starting with the non-top-level entry.
	 */
	key.toplevel = false;
	entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);

	SINGLE_ENTRY_RESET(entry);

	/* Also reset the top-level entry if it exists. */
	key.toplevel = true;
	entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);

	SINGLE_ENTRY_RESET(entry);

I looked through this topic and found some explanation in this email [0], but I didn't understand it. Can you explain it to me?

2. And honestly, I think you should change "Remove the entry if it exists, starting with the non-top-level entry." to "Remove the entry or reset min/max time statistic information and the timestamp if it exists, starting with the non-top-level entry." And the same for the comment below: "Also reset the top-level entry if it exists." should become "Also remove the entry or reset min/max time statistic information and the timestamp if it exists."

In my opinion, this is necessary because the minmax_only parameter is set by the user, so both ways are possible.

0 - https://www.postgresql.org/message-id/62d16845-e74e-a6f9-9661-022e44f48922%40inbox.ru

--
Regards, Alena Rybakina
Re: Add connection active, idle time to pg_stat_activity
Hi Aleksander, On Wed, 2023-10-25 at 16:17 +0300, Aleksander Alekseev wrote: > On top of that not sure if I see the patch on the November commitfest > [1]. Please make sure it's there so that cfbot will check the patch. Yes, this patch is listed on the November commitfest. cfbot says rebase needed since 2023-08-21. regards, Andrei Zubkov
Re: Add connection active, idle time to pg_stat_activity
Hi, > On Wed, 2023-10-25 at 16:17 +0300, Aleksander Alekseev wrote: > > On top of that not sure if I see the patch on the November commitfest > > [1]. Please make sure it's there so that cfbot will check the patch. > > Yes, this patch is listed on the November commitfest. cfbot says rebase > needed since 2023-08-21. You are right, I missed the corresponding entry [1]. [1]: https://commitfest.postgresql.org/45/3405/ -- Best regards, Aleksander Alekseev
Re: trying again to get incremental backup
On 2023-10-25 We 09:05, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan wrote:
>> Robert asked me to work on this quite some time ago, and most of this
>> work was done last year.
>>
>> Here's my WIP for an incremental JSON parser. It works and passes all
>> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
>> The reason I haven't posted it before is that it's about 50% slower in
>> pure parsing speed than the current recursive descent parser in my
>> testing. I've tried various things to make it faster, but haven't made
>> much impact. One of my colleagues is going to take a fresh look at it,
>> but maybe someone on the list can see where we can save some cycles.
>>
>> If we can't make it faster, I guess we could use the RD parser for
>> non-incremental cases and only use the non-RD parser for incremental,
>> although that would be a bit sad. However, I don't think we can make the
>> RD parser suitable for incremental parsing - there's too much state
>> involved in the call stack.
>
> Yeah, this is exactly why I didn't want to use JSON for the backup
> manifest in the first place. Parsing such a manifest incrementally is
> complicated. If we'd gone with my original design where the manifest
> consisted of a bunch of lines each of which could be parsed separately,
> we'd already have incremental parsing and wouldn't be faced with these
> difficult trade-offs.
>
> Unfortunately, I'm not in a good position either to figure out how to
> make your prototype faster, or to evaluate how painful it is to keep
> both in the source tree. It's probably worth considering how likely it
> is that we'd be interested in incremental JSON parsing in other cases.
> Maintaining two JSON parsers is probably not a lot of fun regardless,
> but if each of them gets used for a bunch of things, that feels less bad
> than if one of them gets used for a bunch of things and the other one
> only ever gets used for backup manifests. Would we be interested in
> JSON-format database dumps? Incrementally parsing JSON LOBs? Either
> seems tenuous, but those are examples of the kind of thing that could
> make us happy to have incremental JSON parsing as a general facility.
>
> If nobody's very excited by those kinds of use cases, then this just
> boils down to whether we want to (a) accept that users with very large
> numbers of relation files won't be able to use pg_verifybackup or
> incremental backup, (b) accept that we're going to maintain a second
> JSON parser just to enable that use case and with no other benefit, or
> (c) undertake to change the manifest format to something that is
> straightforward to parse incrementally. I think (a) is reasonable short
> term, but at some point I think we should do better. I'm not really
> that enthused about (c) because it means more work for me and possibly
> more arguing, but if (b) is going to cause a lot of hassle then we might
> need to consider it.

I'm not too worried about the maintenance burden. The RD routines were added in March 2013 (commit a570c98d7fa) and have hardly changed since then. The new code is not ground-breaking - it's just a different (and fairly well known) way of doing the same thing. I'd be happier if we could make it faster, but maybe it's just a fact that keeping an explicit stack, which is how this works, is slower.

I wouldn't at all be surprised if there were other good uses for incremental JSON parsing, including some you've identified.

That said, I agree that JSON might not be the best format for backup manifests, but maybe that ship has sailed.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: [patch] pg_basebackup: mention that spread checkpoints are the default in --help
On 19.10.23 11:39, Michael Banck wrote:
> Hi,
>
> I believed that spread (not fast) checkpoints are the default in
> pg_basebackup, but noticed that --help does not specify which is which -
> contrary to the reference documentation. So I propose the small attached
> patch to clarify that.

> 	printf(_("  -c, --checkpoint=fast|spread\n"
> -			 "                         set fast or spread checkpointing\n"));
> +			 "                         set fast or spread (default) checkpointing\n"));

Could we do it like this instead:

  -c, --checkpoint=fast|spread
                         set fast or spread checkpointing (default: spread)

This seems easier to read.
Re: [dynahash] do not refill the hashkey after hash_search
On Wed, Oct 25, 2023 at 12:48:52PM +0700, John Naylor wrote: > On Wed, Oct 25, 2023 at 12:21 PM Tom Lane wrote: >> >> John Naylor writes: >> > I'd prefer just adding "Assert(hentry->event == oldn);" and declaring >> > hentry PG_USED_FOR_ASSERTS_ONLY. >> >> I'm not aware of any other places where we have Asserts checking >> that hash_search() honored its contract. Why do we need one here? > > [removing old CC] > The author pointed out here that we're not consistent in this regard: > > https://www.postgresql.org/message-id/CAEG8a3KEO_Kdt2Y5hFNWMEX3DpCXi9jtZOJY-GFUEE9QLgF%2Bbw%40mail.gmail.com > > ...but I didn't try seeing where the balance lay. We can certainly > just remove redundant assignments. While it probably doesn't hurt anything, IMHO it's unnecessary to verify that hash_search() works every time it is called. This behavior seems unlikely to change anytime soon, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
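[To make the trade-off concrete, here is a small, hypothetical sketch of the pattern being discussed; the struct and names are made up for illustration and are not taken from the actual patch.]

typedef struct DemoCacheEntry
{
	char		name[NAMEDATALEN];	/* hash key, filled in by hash_search() */
	int			hits;
} DemoCacheEntry;

static void
demo_cache_bump(HTAB *demo_hash, const char *name)
{
	DemoCacheEntry *entry;
	bool		found;

	entry = (DemoCacheEntry *) hash_search(demo_hash, name,
										   HASH_ENTER, &found);

	/*
	 * hash_search() has already copied the key into a newly entered
	 * element, so re-filling entry->name here would be redundant.  An
	 * Assert (with the variable marked PG_USED_FOR_ASSERTS_ONLY when it
	 * has no other use) is the alternative way to document that contract.
	 */
	Assert(strcmp(entry->name, name) == 0);

	if (!found)
		entry->hits = 0;
	entry->hits++;
}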
libpq async connection and multiple hosts
Hello,

We are aware that, using async connection functions (`PQconnectStart`, `PQconnectPoll`), the `connect_timeout` parameter is not supported; this is documented at https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PQCONNECTSTARTPARAMS

"""
The connect_timeout connection parameter is ignored when using PQconnectPoll; it is the application's responsibility to decide whether an excessive amount of time has elapsed. Otherwise, PQconnectStart followed by a PQconnectPoll loop is equivalent to PQconnectdb.
"""

However, ISTM that connecting to multiple hosts is not supported either. I have a couple of issues I am looking into in psycopg 3:

- https://github.com/psycopg/psycopg/issues/602
- https://github.com/psycopg/psycopg/issues/674

Do we have to reimplement the connection attempts loop too? Are there other policies that we would need to reimplement? Is `target_session_attrs` taken care of by PQconnectPoll?

On my box (testing with psql and libpq itself), PQconnect("host=8.8.8.8") fails after 2m10s. Is this the result of some unspecified socket connection timeout on my Ubuntu machine?

If we need to reimplement async connection to "host=X,Y", we will need to use a timeout even if the user didn't specify one, otherwise we will never stop the connection attempt to X and move to Y. What timeout can we specify that will not upset anyone?

Thank you very much

-- Daniele
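[For reference, a bare-bones sketch of one connection attempt driven by PQconnectStart()/PQconnectPoll() with an application-side deadline - the sort of loop the quoted documentation says the application must provide itself. The helper name and the (minimal) error handling are illustrative only.]

#include <poll.h>
#include <time.h>
#include <libpq-fe.h>

static PGconn *
connect_with_deadline(const char *conninfo, int timeout_sec)
{
	PGconn	   *conn = PQconnectStart(conninfo);
	time_t		deadline = time(NULL) + timeout_sec;
	PostgresPollingStatusType status = PGRES_POLLING_WRITING;

	if (conn == NULL || PQstatus(conn) == CONNECTION_BAD)
		goto fail;

	while (status != PGRES_POLLING_OK)
	{
		struct pollfd pfd;
		int			remaining = (int) (deadline - time(NULL));

		if (status == PGRES_POLLING_FAILED || remaining <= 0)
			goto fail;

		/* wait for the direction PQconnectPoll asked for, up to the deadline */
		pfd.fd = PQsocket(conn);
		pfd.events = (status == PGRES_POLLING_READING) ? POLLIN : POLLOUT;
		if (poll(&pfd, 1, remaining * 1000) <= 0)
			goto fail;			/* timed out or poll() failed */

		status = PQconnectPoll(conn);
	}

	return conn;				/* connection established */

fail:
	if (conn)
		PQfinish(conn);
	return NULL;
}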
Re: Synchronizing slots from primary to standby
Hi,

On 10/9/23 12:30 PM, shveta malik wrote:
> PFA v22 patch-set. It has below changes:
> patch 001:
> 1) Now physical walsender wakes up logical walsender(s) by using a new
> CV as suggested in [1]

Thanks!

I think that works fine as long as the standby is up and running and catching up. The problem I see with the current WalSndWaitForStandbyConfirmation() implementation is that if the standby is not running, then:

+	for (;;)
+	{
+		ListCell   *l;
+		long		sleeptime = -1;

will loop until we reach the "terminating walsender process due to replication timeout" if we explicitly want to end with SIGINT or friends.

For example, in a scenario like:

- standby down
- pg_recvlogical running

then CTRL-C on pg_recvlogical would not "respond" immediately, but only when we reach the replication timeout.

So it seems that we should use something like WalSndWait() instead of ConditionVariableTimedSleep() here:

+		/*
+		 * Sleep until other physical walsenders awaken us or until a timeout
+		 * occurs.
+		 */
+		sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+
+		ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, sleeptime,
+									WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);

In that case I think that WalSndWait() should take care of the new CV WalSndCtl->wal_confirm_rcv_cv too. The wait on the socket should allow us to stop waiting when, for example, CTRL-C on pg_recvlogical is triggered.

Then we would need to deal with this scenario: the standby is down or not catching up, and we exited WalSndWait() due to the socket - we'd have to break the loop or shut down the walsender.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: trying again to get incremental backup
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan wrote: > I'm not too worried about the maintenance burden. > > That said, I agree that JSON might not be the best format for backup > manifests, but maybe that ship has sailed. I think it's a decision we could walk back if we had a good enough reason, but it would be nicer if we didn't have to, because what we have right now is working. If we change it for no real reason, we might introduce new bugs, and at least in theory, incompatibility with third-party tools that parse the existing format. If you think we can live with the additional complexity in the JSON parsing stuff, I'd rather go that way. -- Robert Haas EDB: http://www.enterprisedb.com
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 17:03, Daniele Varrazzo wrote:
> However, ISTM that connecting to multiple hosts is not supported
> either. I have a couple of issues I am looking into in psycopg 3:
>
> - https://github.com/psycopg/psycopg/issues/602
> - https://github.com/psycopg/psycopg/issues/674

Another approach is to use tcp_user_timeout instead of connect_timeout to skip non-responsive hosts. It's not completely equivalent to connect_timeout though, since it also applies when the connection is actually being used. Also, it only works on Linux afaik. It could be nice to add support for BSD's TCP_CONNECTIONTIMEOUT socket option.

> Do we have to reimplement the connection attempts loop too?

If you want to support connect_timeout, it seems yes.

> Are there other policies that we would need to reimplement? Is
> `target_session_attrs` taken care of by PQconnectPoll?

Afaict from the code, target_session_attrs are handled inside PQconnectPoll, so you would not have to re-implement that. PQconnectPoll would simply fail if target_session_attrs don't match for the server.

You should implement load_balance_hosts=random though, by randomizing your hosts list.
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Hi Alena,

On Wed, 2023-10-25 at 16:25 +0300, Alena Rybakina wrote:
> Hi! Thank you for your work on the subject.
> 1. I didn't understand why we first try to find the pgssEntry with a
> false top_level value, and later find it with a true top_level value.

In the case of pg_stat_statements, the top_level field is a bool field of the pgssHashKey struct used as the key for the pgss_hash hashtable. When we are performing a reset, only userid, dbid and queryid values are provided. We need to reset both top-level and non-top-level entries. We have only one way to get them all from the hashtable - search for entries having top_level=true and top_level=false in their keys.

> 2. And honestly, I think you should change
> "Remove the entry if it exists, starting with the non-top-level
> entry." to
> "Remove the entry or reset min/max time statistic information and
> the timestamp if it exists, starting with the non-top-level entry."
> And the same for the comment below:
> "Also reset the top-level entry if it exists."
> "Also remove the entry or reset min/max time statistic information
> and the timestamp if it exists."

There are actually four such places - every time the SINGLE_ENTRY_RESET macro is used. The reset logic is described a bit in the macro definition, so it seems quite redundant to repeat this description four times. But I've noted that the "remove" wording should be replaced by "reset". I've attached v24 with the commits updated.

regards, Andrei Zubkov
Postgres Professional

From 4fadc88d5e6c1afe7558393ba99c28d070ac7244 Mon Sep 17 00:00:00 2001
From: Andrei Zubkov
Date: Wed, 25 Oct 2023 17:58:57 +0300
Subject: [PATCH 1/2] pg_stat_statements tests: Add NOT NULL checking of
 pg_stat_statements_reset

This is preliminary patch. It adds NOT NULL checking for the result of
pg_stat_statements_reset() function. It is needed for upcoming patch
"Track statement entry timestamp" that will change the result type of
this function to the timestamp of a reset performed.
Author: Andrei Zubkov (zubkov) Reviewed by: Julien Rouhaud (rjuju), Hayato Kuroda (ha-kun), Yuki Seino (seinoyu), Chengxi Sun (martin-sun), Anton Melnikov (antonmel), Darren Rush (darrenr), Michael Paquier (michael-kun), Sergei Kornilov (melkij), Alena Rybakina (a.rybakina), Andrei Lepikhov (lepikhov) Discussion: https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru --- .../pg_stat_statements/expected/cursors.out | 28 +-- contrib/pg_stat_statements/expected/dml.out | 28 +-- .../expected/level_tracking.out | 80 .../pg_stat_statements/expected/planning.out | 18 +- .../pg_stat_statements/expected/select.out| 44 ++-- .../expected/user_activity.out| 120 +-- .../pg_stat_statements/expected/utility.out | 192 +- contrib/pg_stat_statements/expected/wal.out | 10 +- contrib/pg_stat_statements/sql/cursors.sql| 6 +- contrib/pg_stat_statements/sql/dml.sql| 6 +- .../pg_stat_statements/sql/level_tracking.sql | 12 +- contrib/pg_stat_statements/sql/planning.sql | 4 +- contrib/pg_stat_statements/sql/select.sql | 10 +- .../pg_stat_statements/sql/user_activity.sql | 15 +- contrib/pg_stat_statements/sql/utility.sql| 34 ++-- contrib/pg_stat_statements/sql/wal.sql| 2 +- 16 files changed, 307 insertions(+), 302 deletions(-) diff --git a/contrib/pg_stat_statements/expected/cursors.out b/contrib/pg_stat_statements/expected/cursors.out index 46375ea9051..0fc4b2c098d 100644 --- a/contrib/pg_stat_statements/expected/cursors.out +++ b/contrib/pg_stat_statements/expected/cursors.out @@ -3,10 +3,10 @@ -- -- These tests require track_utility to be enabled. SET pg_stat_statements.track_utility = TRUE; -SELECT pg_stat_statements_reset(); - pg_stat_statements_reset --- - +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t (1 row) -- DECLARE @@ -20,13 +20,13 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; ---+--+--- 2 |0 | CLOSE cursor_stats_1 2 |0 | DECLARE cursor_stats_1 CURSOR WITH HOLD FOR SELECT $1 - 1 |1 | SELECT pg_stat_statements_reset() + 1 |1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t (3 rows) -SELECT pg_stat_statements_reset(); - pg_stat_statements_reset --- - +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t (1 row) -- FETCH @@ -59,12 +59,12 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; 1 |0 | DECLARE cursor_stats_2 CURSOR WITH HOLD FOR SELECT $1 1 |1 | FETCH 1 IN cursor_stats_1 1 |1 | FETCH 1 IN cursor_stats_2 - 1 |1 | SELECT pg_stat_statements_reset() + 1 |1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t (9 rows) -SELECT pg_stat_statements_re
Re: Document aggregate functions better w.r.t. ORDER BY
On Tue, Oct 24, 2023 at 06:45:48PM -0700, David G. Johnston wrote: > I'd prefer to keep pointing out that the ones documented are those whose > outputs will vary due to ordering. Okay, I re-added it in the attached patch, and tightened up the text. > I've been sympathetic to the user comments that we don't have enough > examples. Good point. > Just using array_agg for that purpose, showing both DISTINCT and ORDER BY > seems > like a fair compromise (removes two from my original proposal). The examples > in the section we tell them to go see aren't of that great quality. If you > strongly dislike having the function table contain the examples we should at > least improve the page we are sending them to. (As an aside to this, I've > personally always found the syntax block with the 5 syntaxes shown there to be > intimidating/hard-to-read). I think you are right that it belongs in the syntax section; we cover ordering extensively there. We already have queries there, but not output, so I moved the relevant examples to there and replaced the example that had no output. > I'd at least suggest you reconsider the commentary and examples surrounding > jsonb_object_agg. I moved that as well, and tightened the example. > The same goes for the special knowledge of floating point behavior for why > we've chosen to document avg/sum, something that typically doesn't care about > order, as having an optional order by. The floating example seems too obscure to mention in our function docs. I can put a sentence in the syntax docs, but is there value in explaining that to users? How it that helpful? Example? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..d60c65f8fc 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ... aggregation. + + While all aggregates below accept an optional + ORDER BY clause (as outlined in ), the clause has only been added to + aggregates whose output is affected by ordering. + + General-Purpose Aggregate Functions @@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ... array_agg -array_agg ( anynonarray ) +array_agg ( anynonarray ORDER BY input_sort_columns ) anyarray @@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ... -array_agg ( anyarray ) +array_agg ( anyarray ORDER BY input_sort_columns ) anyarray @@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ... json_agg -json_agg ( anyelement ) +json_agg ( anyelement ORDER BY input_sort_columns ) json jsonb_agg -jsonb_agg ( anyelement ) +jsonb_agg ( anyelement ORDER BY input_sort_columns ) jsonb @@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ... json_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) json @@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ... jsonb_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) jsonb @@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ... string_agg ( value - bytea, delimiter bytea ) + bytea, delimiter bytea + ORDER BY input_sort_columns ) bytea @@ -20851,11 +20861,11 @@ SELECT NULLIF(value, '(none)') ... numeric -sum ( real ) +sum ( real ORDER BY input_sort_columns ) real -sum ( double precision ) +sum ( double precision ORDER BY input_sort_columns ) double precision @@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ... 
xmlagg -xmlagg ( xml ) +xmlagg ( xml ORDER BY input_sort_columns ) xml diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index 3ba844057f..b64733d8aa 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -1647,7 +1647,26 @@ sqrt(2) are always just expressions and cannot be output-column names or numbers. For example: -SELECT array_agg(a ORDER BY b DESC) FROM table; +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(v ORDER BY v DESC) FROM vals; + array_agg +- + {4,3,3,2,1} + +Since jsonb only keeps the last matching key, ordering +of its keys can be significant: + +WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1
Re: CRC32C Parallel Computation Optimization on ARM
+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)

It looks like most of this function is duplicated from pg_comp_crc32c_armv8(). I understand that we probably need a separate function because of the runtime check, but perhaps we could create a common static inline helper function with a branch for when vmull_p64() can be used. Its callers would then just provide a boolean to indicate which branch to take.

+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+    USE_ARMV8_VMULL=1
+  fi
+fi

Hm. I wonder if we need to switch to a runtime check in some cases. For example, what happens if the ARMv8 intrinsics used today are found with the default compiler flags, but vmull_p64() is only available if -march=armv8-a+crypto is added? It looks like the precedent is to use a runtime check if we need extra CFLAGS to produce code that uses the intrinsics.

Separately, I wonder if we should just always do runtime checks for the CRC stuff whenever we can produce code with the intrinsics, regardless of whether we need extra CFLAGS. The check doesn't look terribly expensive, and it might allow us to simplify the code a bit (especially now that we support a few different architectures).

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 17:35, Jelte Fennema wrote:
> You should implement load_balance_hosts=random though,
> by randomizing your hosts list.

Good catch.

So it seems that, if someone wants to build an equivalent async version of PQconnectdb, they need to handle on their own:

- connect_timeout
- multiple host, hostaddr, port
- load_balance_hosts=random

Does this list sound complete?

-- Daniele
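[As an illustration of the multiple-host part, a rough sketch of the outer loop such a reimplementation might need: shuffle the per-host parameters to emulate load_balance_hosts=random, then give each attempt its own deadline. connect_with_deadline() is the hypothetical helper sketched earlier in the thread, and representing each host as a separate conninfo string is an assumption of the example.]

#include <stdlib.h>

static PGconn *
connect_any_host(char **conninfos, int nhosts, int timeout_sec)
{
	/* Fisher-Yates shuffle to emulate load_balance_hosts=random */
	for (int i = nhosts - 1; i > 0; i--)
	{
		int			j = rand() % (i + 1);
		char	   *tmp = conninfos[i];

		conninfos[i] = conninfos[j];
		conninfos[j] = tmp;
	}

	/* try each host in turn, each with its own timeout */
	for (int i = 0; i < nhosts; i++)
	{
		PGconn	   *conn = connect_with_deadline(conninfos[i], timeout_sec);

		if (conn != NULL)
			return conn;
	}

	return NULL;
}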
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 17:35, Jelte Fennema wrote:
> Another approach is to use tcp_user_timeout instead of connect_timeout
> to skip non-responsive hosts. It's not completely equivalent to
> connect_timeout though, since it also applies when the connection
> is actually being used. Also, it only works on Linux afaik. It could be
> nice to add support for BSD's TCP_CONNECTIONTIMEOUT socket option.

This seems brittle and platform-dependent enough that we would surely receive less grief by hardcoding a default two-minute timeout.

-- Daniele
Re: remaining sql/json patches
Hi!

Amit, regarding the previous email, patch #2 - I agree that it is not the best idea to introduce a new type of logic into the parser, so this logic could be moved to the executor, or removed altogether. What do you think of these options?

On Wed, Oct 18, 2023 at 5:19 AM jian he wrote:
> Hi.
> based on v22.
>
> I added some tests against json_value for the sake of coverage testing.
>
> A previous email thread mentioned needing to check *empty in
> ExecEvalJsonExpr.
> Since JSON_VALUE_OP, JSON_QUERY_OP, JSON_EXISTS_OP all need to have
> *empty cases, I refactored a little bit.
> Might be helpful. Maybe we can also refactor *error cases.
>
> The following part is not easy to understand.
> res = ExecPrepareJsonItemCoercion(jbv,
> +			jsestate->item_jcstates,
> +			&post_eval->jcstate);
> + if (post_eval->jcstate &&
> +	post_eval->jcstate->coercion &&
> +	(post_eval->jcstate->coercion->via_io ||
> +	post_eval->jcstate->coercion->via_populate))

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Should we represent temp files as unsigned long int instead of signed long int type?
Hi All,

At present, we represent temp files as a signed long int number, and depending on the system architecture (32-bit or 64-bit), the range of signed long int varies; for example, on a 32-bit system it ranges from -2,147,483,648 to 2,147,483,647. AFAIU, this will not allow a session to create more than 2 billion temporary files, and that is not a small number at all, but still, what if we make it an unsigned long int, which would allow a session to create 4 billion temporary files if needed?

I might be sounding a little stupid here, because 2 billion temporary files is like 2000 petabytes (2 billion * 1GB), considering each temp file is 1GB in size, which is not a small data size at all - it is a huge amount of data storage. However, since the variable we use to name temporary files is a static long int (static long tempFileCounter = 0;), there is a possibility that this number will get exhausted if the same session is trying to create too many temp files via multiple queries.

Just adding a few lines of code related to this from fd.c:

/*
 * Number of temporary files opened during the current session;
 * this is used in generation of tempfile names.
 */
static long tempFileCounter = 0;

/*
 * Generate a tempfile name that should be unique within the current
 * database instance.
 */
snprintf(tempfilepath, sizeof(tempfilepath), "%s/%s%d.%ld",
		 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, tempFileCounter++);

--
With Regards,
Ashutosh Sharma.
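[For what it's worth, a sketch of what the proposed change would look like, mirroring the snippet above - an illustration only, not a patch.]

static unsigned long tempFileCounter = 0;

	snprintf(tempfilepath, sizeof(tempfilepath), "%s/%s%d.%lu",
			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid,
			 tempFileCounter++);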
Re: race condition in pg_class
Smolkin Grigory writes: > We are running PG13.10 and recently we have encountered what appears to be > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and > some other catalog-writer, possibly ANALYZE. > The problem is that after successfully creating index on relation (which > previosly didnt have any indexes), its pg_class.relhasindex remains set to > "false", which is illegal, I think. > Index was built using the following statement: > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id); ALTER TABLE ADD CONSTRAINT would certainly have taken AccessExclusiveLock on the "example" table, which should be sufficient to prevent anything else from touching its pg_class row. The only mechanism I can think of that might bypass that is a manual UPDATE on pg_class, which would just manipulate the row as a row without concern for associated relation-level locks. Any chance that somebody was doing something like that? regards, tom lane
Re: ResourceOwner refactoring
Hi, On 2023-10-25 15:43:36 +0300, Heikki Linnakangas wrote: > On 10/07/2023 22:14, Andres Freund wrote: > > > /* > > > - * Initially allocated size of a ResourceArray. Must be power of two > > > since > > > - * we'll use (arraysize - 1) as mask for hashing. > > > + * Size of the fixed-size array to hold most-recently remembered > > > resources. > > >*/ > > > -#define RESARRAY_INIT_SIZE 16 > > > +#define RESOWNER_ARRAY_SIZE 32 > > > > That's 512 bytes, pretty large to be searched sequentially. It's somewhat > > sad > > that the array needs to include 8 byte of ResourceOwnerFuncs... > > Yeah. Early on I considered having a RegisterResourceKind() function that > you need to call first, and then each resource kind could be assigned a > smaller ID. If we wanted to keep the old mechanism of separate arrays for > each different resource kind, that'd be the way to go. But overall this > seems simpler, and the performance is decent despite that linear scan. Realistically, on a 64bit platform, the compiler / we want 64bit alignment for ResourceElem->item, so making "kind" smaller isn't going to do much... > > Due to 'narr' being before the large 'arr', 'nhash' is always on a separate > > cacheline. I.e. the number of cachelines accessed in happy paths is higher > > than necessary, also relevant because there's more dependencies on narr, > > nhash > > than on the contents of those. > > > > I'd probably order it so that both of the large elements (arr, locks) are at > > the end. > > > > It's hard to know with changes this small, but it does seem to yield a small > > and repeatable performance benefit (pipelined pgbench -S). > > Swapped the 'hash' and 'arr' parts of the struct, so that 'arr' and 'locks' > are at the end. I don't remember what the layout was then, but now there are 10 bytes of holes on x86-64 (and I think most other 64bit architectures). struct ResourceOwnerData { ResourceOwner parent; /* 0 8 */ ResourceOwner firstchild; /* 8 8 */ ResourceOwner nextchild;/*16 8 */ const char * name; /*24 8 */ _Bool releasing;/*32 1 */ _Bool sorted; /*33 1 */ /* XXX 6 bytes hole, try to pack */ ResourceElem * hash; /*40 8 */ uint32 nhash;/*48 4 */ uint32 capacity; /*52 4 */ uint32 grow_at; /*56 4 */ uint32 narr; /*60 4 */ /* --- cacheline 1 boundary (64 bytes) --- */ ResourceElem arr[32]; /*64 512 */ /* --- cacheline 9 boundary (576 bytes) --- */ uint32 nlocks; /* 576 4 */ /* XXX 4 bytes hole, try to pack */ LOCALLOCK *locks[15];/* 584 120 */ /* size: 704, cachelines: 11, members: 14 */ /* sum members: 694, holes: 2, sum holes: 10 */ }; While somewhat annoying from a human pov, it seems like it would make sense to rearrange a bit? At least moving nlocks into the first cacheline would be good (not that we guarantee cacheline alignment rn, though it sometimes works out to be aligned). We also can make narr, nlocks uint8s. With that narrowing and reordering, we end up with 688 bytes. Not worth for most structs, but here perhaps worth doing? Moving the lock length to the start of the struct would make sense to keep things that are likely to be used in a "stalling" manner at the start of the struct / in one cacheline. 
It's not too awful to have it be in this order: struct ResourceOwnerData { ResourceOwner parent; /* 0 8 */ ResourceOwner firstchild; /* 8 8 */ ResourceOwner nextchild;/*16 8 */ const char * name; /*24 8 */ _Bool releasing;/*32 1 */ _Bool sorted; /*33 1 */ uint8 narr; /*34 1 */ uint8 nlocks; /*35 1 */ uint32 nhash;/*36 4 */ uint32 capacity; /*40 4 */ uint32 grow_at; /*44 4 */ ResourceElem arr[32]; /*48 512 */ /* --- cacheline 8 boundary (512 bytes) was 48 bytes ago --- */ ResourceElem * hash; /* 560 8 */
Does UCS_BASIC have the right CTYPE?
UCS_BASIC is defined in the standard as a collation based on comparing the code point values, and in UTF8 that is satisfied with memcmp(), so the collation locale for UCS_BASIC in Postgres is simply "C". But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In Postgres, the answer is 'á', but intuitively, one could reasonably expect the answer to be 'Á'. Reading the standard, it seems that LOWER()/UPPER() are defined in terms of the Unicode General Category (Section 4.2, " is a pair of functions..."). It is somewhat ambiguous about the case mappings, but I could guess that it means the Default Case Algorithm[1]. That seems to suggest the standard answer should be 'Á' regardless of any COLLATE clause (though I could be misreading). I'm a bit confused by that... what's the standard-compatible way to specify the locale for UPPER()/LOWER()? If there is none, then it makes sense that Postgres overloads the COLLATE clause for that purpose so that users can use a different locale if they want. But given that UCS_BASIC is a collation specified in the standard, shouldn't it have ctype behavior that's as close to the standard as possible? Regards, Jeff Davis [1] https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992
Remove dead code in pg_ctl.c
Hackers,

It looks like this code was missed in 39969e2a when exclusive backup was removed.

Regards,
-David

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 807d7023a99..4099d240e03 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -96,7 +96,6 @@ static time_t start_time;
 static char postopts_file[MAXPGPATH];
 static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
-static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
@@ -2447,7 +2446,6 @@ main(int argc, char **argv)
 	snprintf(postopts_file, MAXPGPATH, "%s/postmaster.opts", pg_data);
 	snprintf(version_file, MAXPGPATH, "%s/PG_VERSION", pg_data);
 	snprintf(pid_file, MAXPGPATH, "%s/postmaster.pid", pg_data);
-	snprintf(backup_file, MAXPGPATH, "%s/backup_label", pg_data);
 
 	/*
 	 * Set mask based on PGDATA permissions,
Re: walwriter interacts quite badly with synchronous_commit=off
Hi, On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote: > On 25/10/2023 02:09, Andres Freund wrote: > > Because of the inherent delay between the checks of > > XLogCtl->WalWriterSleeping > > and Latch->is_set, we also sometimes end up with multiple processes > > signalling > > walwriter, which can be bad, because it increases the likelihood that some > > of > > the signals may be received when we are already holding WALWriteLock, > > delaying > > its release... > > That can only happen when walwriter has just come out of "hibernation", ie. > when the system has been idle for a while. So probably not a big deal in > practice. Maybe I am missing something here - why can this only happen when hibernating? Even outside of that two backends can decide that that they need to wake up walwriter? We could prevent that, by updating state when requesting walwriter to be woken up. But with the changes we're discussing below, that should be rare. > > I think the most important optimization we need is to have > > XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed > > WAL. Unless walsender is hibernating, walsender will wake up on its own > > after > > wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this. > > > > E.g. a condition like > > if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * > > XLOG_BLCKSZ) > > return; > > drastically cuts down on the amount of wakeups, without - I think - loosing > > guarantees around synchronous_commit=off. > > In the patch, you actually did: > > > + if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * > > XLOG_BLCKSZ) > > + return; > > It means that you never wake up the walwriter to merely *write* the WAL. You > only wake it up if it's going to also fsync() it. I think that's correct and > appropriate, but it took me a while to reach that conclusion: Yea, after writing the email I got worried that just looking at Write would perhaps lead to not flushing data soon enough... > It might be beneficial to wake up the walwriter just to perform a write(), > to offload that work from the backend. And walwriter will actually also > perform an fsync() after finishing the current segment, so it would make > sense to also wake it up when 'asyncXactLSN' crosses a segment boundary. > However, if those extra wakeups make sense, they would also make sense when > there are no asynchronous commits involved. Therefore those extra wakeups > should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer(). > The condition you have in the patch is appropriate for > XLogSetAsyncXactLSN(). Yea. I agree we should wake up walsender in other situations too... > Another reason to write the WAL aggressively, even if you don't flush it, > would be to reduce the number of lost transactions on a segfault. But we > don't give any guarantees on that, and even with the current aggressive > logic, we only write when a page is full so you're anyway going to lose the > last partial page. Wal writer does end up writing the trailing partially filled page during the next wal_writer_delay cycle. > It also took me a while to convince myself that this calculation matches the > calculation that XLogBackgroundFlush() uses to determine whether it needs to > flush or not. XLogBackgroundFlush() first divides the request and result > with XLOG_BLCKSZ and then compares the number of blocks, whereas here you > perform the calculation in bytes. 
I think the result is the same, but to > make it more clear, let's do it the same way in both places. > > See attached. It's the same logic as in your patch, just formulatd more > clearly IMHO. Yep, makes sense! > > Because of the frequent wakeups, we do something else that's not smart: We > > write out 8k blocks individually, many times a second. Often thousands of > > 8k pwrites a second. > > Even with this patch, when I bumped up wal_writer_delay to 2 so that the wal > writer gets woken up by the async commits rather than the timeout, the write > pattern is a bit silly: > > $ strace -p 1099926 # walwriter > strace: Process 1099926 attached > epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, > u64=94261056289248}}], 1, 1991) = 1 > read(3, > "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 128 > pwrite64(5, > "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., > 1007616, 49152) = 1007616 > fdatasync(5)= 0 > pwrite64(5, "\24\321\5\0\1\0\0\0\0 > \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) > = 16384 > epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, > u64=94261056289248}}], 1, 2000) = 1 > read(3, > "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 128 > pwrite64(5, > "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., > 1040384, 1073152) =
Re: Custom tstzrange with importance factored in
On Fri, 2023-10-06 at 16:55 +0300, Rares Pop (Treelet) wrote:
> I essentially want to be able to aggregate multiple tstzranges - each
> range with its own importance. The aggregation would be like a
> join/intersect where ranges with higher importance override the ones
> with lower importance.

It may be possible without a new data type. Can you describe the semantics more precisely?

Regards,
	Jeff Davis
Re: Should we represent temp files as unsigned long int instead of signed long int type?
Ashutosh Sharma writes: > At present, we represent temp files as a signed long int number. And > depending on the system architecture (32 bit or 64 bit), the range of > signed long int varies, for example on a 32-bit system it will range > from -2,147,483,648 to 2,147,483,647. AFAIU, this will not allow a > session to create more than 2 billion temporary files and that is not > a small number at all, but still what if we make it an unsigned long > int which will allow a session to create 4 billion temporary files if > needed. AFAIK, nothing particularly awful will happen if that counter wraps around. Perhaps if you gamed the system really hard, you could cause a collision with a still-extant temp file from the previous cycle, but I seriously doubt that could happen by accident. So I don't think there's anything to worry about here. Maybe we could make that filename pattern %lu not %ld, but minus sign is a perfectly acceptable filename character, so such a change would be cosmetic. regards, tom lane
Re: Should we represent temp files as unsigned long int instead of signed long int type?
On Wed, Oct 25, 2023 at 1:28 PM Ashutosh Sharma wrote: > At present, we represent temp files as a signed long int number. And > depending on the system architecture (32 bit or 64 bit), the range of > signed long int varies, for example on a 32-bit system it will range > from -2,147,483,648 to 2,147,483,647. AFAIU, this will not allow a > session to create more than 2 billion temporary files and that is not > a small number at all, but still what if we make it an unsigned long > int which will allow a session to create 4 billion temporary files if > needed. I might be sounding a little stupid here because 2 billion > temporary files is like 2000 peta bytes (2 billion * 1GB), considering > each temp file is 1GB in size which is not a small data size at all, > it is a huge amount of data storage. However, since the variable we > use to name temporary files is a static long int (static long > tempFileCounter = 0;), there is a possibility that this number will > get exhausted soon if the same session is trying to create too many > temp files via multiple queries. I think we use signed integer types in a bunch of places where an unsigned integer type would be straight-up better, and this is one of them. I don't know whether it really matters, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: trying again to get incremental backup
On 2023-10-25 We 11:24, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan wrote:
>> I'm not too worried about the maintenance burden.
>>
>> That said, I agree that JSON might not be the best format for backup
>> manifests, but maybe that ship has sailed.
>
> I think it's a decision we could walk back if we had a good enough
> reason, but it would be nicer if we didn't have to, because what we have
> right now is working. If we change it for no real reason, we might
> introduce new bugs, and at least in theory, incompatibility with
> third-party tools that parse the existing format. If you think we can
> live with the additional complexity in the JSON parsing stuff, I'd
> rather go that way.

OK, I'll go with that. It will actually be a bit less invasive than the patch I posted.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: trying again to get incremental backup
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan wrote: > OK, I'll go with that. It will actually be a bit less invasive than the > patch I posted. Why's that? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
David Rowley writes: > I've attached a patch which builds on the previous patch and relaxes > the rule that the StringInfo must be NUL-terminated. The rule is > only relaxed for StringInfos that are initialized with > initReadOnlyStringInfo. Yeah, that's probably a reasonable way to frame it. > There's also an existing confusing comment in logicalrep_read_tuple() > which seems to think we're just setting the NUL terminator to conform > to StringInfo's practises. This is misleading as the NUL is required > for LOGICALREP_COLUMN_TEXT mode as we use the type's input function > instead of the receive function. You don't have to look very hard to > find an input function that needs a NUL terminator. Right, input functions are likely to expect this. > I'm a bit less confident that the type's receive function will never > need to be NUL terminated. cstring_recv() came to mind as one I should > look at, but on looking I see it's not required as it just reads the > remaining bytes from the input StringInfo. Is it safe to assume this? I think that we can make that assumption starting with v17. Back-patching it would be hazardous perhaps; but if there's some function out there that depends on NUL termination, testing should expose it before too long. Wouldn't hurt to mention this explicitly as a possible incompatibility in the commit message. Looking over the v5 patch, I have some nits: * In logicalrep_read_tuple, s/input function require that/input functions require that/ (or fix the grammatical disagreement some other way) * In exec_bind_message, you removed the comment pointing out that we are scribbling directly on the message buffer, even though we still are. This patch does nothing to make that any safer, so I object to removing the comment. * In stringinfo.h, I'd suggest adding text more or less like this within or at the end of the "As a special case, ..." para in the first large comment block: * Also, it is caller's option whether a read-only string buffer has * a terminating '\0' or not. This depends on the intended usage. That's partially redundant with some other comments, but this para is defining the API for read-only buffers, so I think it would be good to include it here. regards, tom lane
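[As a small illustration of that API rule, a sketch of the intended caller pattern, assuming the initReadOnlyStringInfo() signature from the patch under review; the surrounding variables (data, len, typreceive, typioparam, typmod) are hypothetical.]

	StringInfoData buf;
	Datum		value;

	/* 'data'/'len' point at bytes inside a received message; no trailing '\0' required */
	initReadOnlyStringInfo(&buf, data, len);
	value = OidReceiveFunctionCall(typreceive, &buf, typioparam, typmod);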
Re: POC, WIP: OR-clause support for indexes
On Sat, Oct 14, 2023 at 6:37 PM Alexander Korotkov wrote:
> Regarding the GUC parameter, I don't see we need a limit. It's not
> yet clear whether a small number or a large number of OR clauses are
> more favorable for transformation. I propose to have just a boolean
> enable_or_transformation GUC.

That's a poor solution. So is the GUC the patch currently has (or_transform_limit). What you need is a heuristic that figures out fairly reliably whether the transformation is going to be better or worse. Or else, do the whole thing in some other way that is always same-or-better.

In general, adding GUCs makes sense when the user knows something that we can't know. For example, shared_buffers makes some sense because, even if we discovered how much memory the machine has, we can't know how much of it the user wants to devote to PostgreSQL as opposed to anything else. And track_io_timing makes sense because we can't know whether the user wants to pay the price of gathering that additional data. But GUCs are a poor way of handling cases where the real problem is that we don't know what code to write. In this case, some queries will be better with enable_or_transformation=on, and some will be better with enable_or_transformation=off. Since we don't know which will work out better, we make the user figure it out and set the GUC, possibly differently for each query. That's terrible. It's the query optimizer's whole job to figure out which transformations will speed up the query. It shouldn't turn around and punt the decision back to the user.

Notice that superficially-similar GUCs like enable_seqscan aren't really the same thing at all. That's just for developer testing and debugging. Nobody expects that you have to adjust that GUC on a production system - ever.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Remove dead code in pg_ctl.c
On Wed, Oct 25, 2023 at 02:53:31PM -0400, David Steele wrote: > It looks like this code was missed in 39969e2a when exclusive backup was > removed. Indeed. I'll plan on committing this shortly. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: walwriter interacts quite badly with synchronous_commit=off
On 25/10/2023 21:59, Andres Freund wrote:
> On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote:
>> On 25/10/2023 02:09, Andres Freund wrote:
>>> Because of the inherent delay between the checks of
>>> XLogCtl->WalWriterSleeping and Latch->is_set, we also sometimes end up
>>> with multiple processes signalling walwriter, which can be bad, because
>>> it increases the likelihood that some of the signals may be received
>>> when we are already holding WALWriteLock, delaying its release...
>>
>> That can only happen when walwriter has just come out of "hibernation",
>> ie. when the system has been idle for a while. So probably not a big
>> deal in practice.
>
> Maybe I am missing something here - why can this only happen when
> hibernating? Even outside of that, two backends can decide that they
> need to wake up walwriter?

Ah sure, multiple backends can decide to wake up walwriter at the same time. I thought you meant that the window for that was somehow wider when XLogCtl->WalWriterSleeping.

> We could prevent that, by updating state when requesting walwriter to be
> woken up. But with the changes we're discussing below, that should be
> rare.

One small easy thing we could do to reduce the redundant wakeups: only wake up walwriter if asyncXactLSN points to a different page than prevAsyncXactLSN.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: doc: a small improvement about pg_am description
On Wed, 2023-10-25 at 17:25 +0900, Yugo NAGATA wrote: > It seems to me that this description says pg_am contains only > index access methods but not table methods. I wonder it is missed > to fix this when tableam was supported and other documentation > was changed in b73c3a11963c8bb783993cfffabb09f558f86e37. Thank you for the report. That section should not refer to pg_am directly now that there's CREATE ACCESS METHOD. I committed a fix for that which also fixes the problem you describe. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Guiding principle for dropping LLVM versions?
On Wed, Oct 25, 2023 at 7:12 PM Tom Lane wrote: > Thomas Munro writes: > > 3. We exclude OSes that don't bless an LLVM release (eg macOS running > > an arbitrarily picked version), and animals running only to cover > > ancient LLVM compiled from source for coverage (Andres's sid > > menagerie). > > Seems generally reasonable. Maybe rephrase 3 as "We consider only > an OS release's default LLVM version"? Or a bit more forgivingly, > "... only LLVM versions available from the OS vendor"? Also, > what's an OS vendor? You rejected macOS which is fine, but > I think the packages available from MacPorts or Homebrew should > be considered. OK. For me the key differences are that they are independent of OS releases and time lines, they promptly add new releases, they have a wide back-catalogue of the old releases and they let the user decide which to use. So I don't think they constrain us and it makes no sense to try to apply 'end of support' logic to them. https://formulae.brew.sh/formula/llvm https://ports.macports.org/search/?q=llvm&name=on (Frustratingly, the ancient releases of clang don't actually seem to work on MacPorts at least on aarch64, and they tell you so when you try to install them.) The BSDs may be closer to macOS in that respect too, since they have ports separate from OS releases and they offer a rolling and wide range of LLVMs and generally default to a very new one, so I don't think they constrain us either. It's really Big Linux that is drawing the lines in the sand here, though (unusually) not RHEL-and-frenemies as they've opted for rolling to the latest in every minor release as Devrim explained.
Re: Remove dead code in pg_ctl.c
On Wed, Oct 25, 2023 at 03:02:01PM -0500, Nathan Bossart wrote: > On Wed, Oct 25, 2023 at 02:53:31PM -0400, David Steele wrote: >> It looks like this code was missed in 39969e2a when exclusive backup was >> removed. > > Indeed. I'll plan on committing this shortly. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Container Types
On Tue, 2022-12-20 at 10:24 +0100, Vik Fearing wrote: > Obviously there would have to be an actual type in order to store it > in > a table, but what I am most interested in here is being able to > create > them on the fly. I do not think it is feasible to create N new types > for every type like we do for arrays on the off-chance you would want > to > put it in a PERIOD for example. By "on the fly" do you mean when creating real objects, like a table? In that case it might not be so hard, because we can just create an ordinary entry in pg_type. But for this to be a complete feature, I think we need the container types to be useful when constructed within a query, too. E.g. SELECT two_things(v1, v2) FROM foo; where the result of two_things is some new type two_things_int_text which is based on the types of v1 and v2 and has never been used before. I don't think it's reasonable to create a permanent pg_type entry on the fly to answer a read-only query. But we could introduce some notion of an ephemeral in-memory pg_type entry with its own OID, and create that on the fly. One way to do that might be to reserve some of the system OID space (e.g. 15000-16000) for OIDs for temporary catalog entries, and then have some in-memory structure that holds those temporary entries. Any lookups in that range would search the in-memory structure instead of the real catalog. All of this is easier said than done, but I think it could work. We'd also need to think about how to infer types through a container type, e.g. SELECT second_thing(two_things(v1,v2)) FROM foo; should infer that the return type of second_thing() is the type of v2. To do that, perhaps pg_proc entries can include some kind of type sublanguage to do this inference, e.g. "a, b -> b" for second_thing(), or "a, b -> a" for first_thing(). Regards, Jeff Davis
Re: Partial aggregates pushdown
On Tue, Oct 24, 2023 at 12:12:41AM +, fujii.y...@df.mitsubishielectric.co.jp wrote: > Hi Mr.Momjian. > > > Fujii-san, to get this patch closer to finished, can I modify this version > > of the patch to improve some wording and post an > > updated version you can use for future changes? > Yes, I greatly appreciate your offer. > I would very much appreciate your modifications. I am almost done updating the patch, but I got stuck on how the feature is supposed to work. This documentation sentence is where I got confused: check_partial_aggregate_support (boolean) If this option is false, postgres_fdw assumes that for each built-in aggregate function, the partial aggregate function is defined on the remote server without checking the remote server version. If this option is true, during query planning, postgres_fdw connects to the remote server and checks if the remote server version is older than the local server version. If so, postgres_fdw -->assumes that for each built-in aggregate function, the partial aggregate function is not defined -->on the remote server unless the partial aggregate function and the aggregate -->function match. Otherwise postgres_fdw assumes that for each built-in aggregate function, the partial aggregate function is defined on the remote server. The default is false. What does that marked sentence mean? What is match? Are one or both of these remote? It sounds like you are checking the local aggregate against the remote partial aggregate, but I don't see any code that does this in the patch. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: libpq async connection and multiple hosts
On Wed, 25 Oct 2023 at 18:54, Daniele Varrazzo wrote: > - connect_timeout > - multiple host, hostaddr, port > - load_balance_hosts=random > > Does this list sound complete? I think you'd also want to resolve the hostnames to IPs yourself and iterate over those one-by-one. Otherwise if the first IP returned for the hostname times out, you will never connect to the others.
Re: Improving btree performance through specializing by key shape, take 2
On Mon, Sep 25, 2023 at 9:13 AM Matthias van de Meent wrote: > I think it's fairly complete, and mostly waiting for review. > > > I don't have time to do a comprehensive (or even a fairly > > cursory) analysis of which parts of the patch are helping, and which > > are marginal or even add no value. > > It is a shame that you don't have the time to review this patch. I didn't say that. Just that I don't have the time (or more like the inclination) to do most or all of the analysis that might allow us to arrive at a commitable patch. Most of the work with something like this is the analysis of the trade-offs, not writing code. There are all kinds of trade-offs that you could make with something like this, and the prospect of doing that myself is kind of daunting. Ideally you'd have made a significant start on that at this point. > > I have long understood that you gave up on the idea of keeping the > > bounds across levels of the tree (which does make sense to me), but > > yesterday the issue became totally muddled by this high key business. > > That's why I rehashed the earlier discussion, which I had previously > > understood to be settled. > > Understood. I'll see if I can improve the wording to something that is > more clear about what the optimization entails. Cool. -- Peter Geoghegan
Re: post-recovery amcheck expectations
On Tue, Oct 24, 2023 at 8:05 PM Noah Misch wrote: > Can't it still happen if the sequence of unfortunately timed crashes causes > deletions from left to right? Take this example, expanding the one above. > Half-kill 4, crash, half-kill 3, crash, half-kill 2 in: > > * 1 > * // | \\ > * 4 <-> 3 <-> 2 <-> 1 > > (That's not to say it has ever happened outside of a test.) Hmm. Perhaps you're right. I thought that this wasn't possible in part due to the fact that you'd have to access all of these leaf pages in the same order each time, without ever passing over a previous half-dead page. But I suppose that there's nothing stopping the index tuples from being deleted from each page in an order that leaves open the possibility of something like this. (It's extremely unlikely, of course, but that wasn't ever in question.) I withdraw my suggestion about the wording from your patch. It seems committable. Thanks -- Peter Geoghegan
Re: Container Types
Hi, On 2023-10-25 15:03:04 -0700, Jeff Davis wrote: > On Tue, 2022-12-20 at 10:24 +0100, Vik Fearing wrote: > > Obviously there would have to be an actual type in order to store it > > in > > a table, but what I am most interested in here is being able to > > create > > them on the fly. I do not think it is feasible to create N new types > > for every type like we do for arrays on the off-chance you would want > > to > > put it in a PERIOD for example. > > By "on the fly" do you mean when creating real objects, like a table? > In that case it might not be so hard, because we can just create an > ordinary entry in pg_type. > > But for this to be a complete feature, I think we need the container > types to be useful when constructed within a query, too. E.g. > > SELECT two_things(v1, v2) FROM foo; > > where the result of two_things is some new type two_things_int_text > which is based on the types of v1 and v2 and has never been used > before. > > I don't think it's reasonable to create a permanent pg_type entry on > the fly to answer a read-only query. But we could introduce some notion > of an ephemeral in-memory pg_type entry with its own OID, and create > that on the fly. I don't particularly like the idea of an in-memory pg_type entry. But I'm not sure we need that anyway - we already have this problem with record types. We support both named record types (tables and explicitly created composite types) and ad-hoc ones (created if you write ROW(foo, bar) or something like that). If a record's typmod is negative, it refers to an anonymous row type, if positive it's a named typmod. We even have support for sharing such ad-hoc rowtypes across backends for parallel query... I'd look whether you can generalize that infrastructure. Greetings, Andres Freund
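For anyone less familiar with that machinery, the existing ad-hoc record path boils down to something like the following (real APIs, heavily abbreviated; shown only to make the "generalize that infrastructure" suggestion concrete):

	/* Build a row shape on the fly ... */
	TupleDesc	tupdesc = CreateTemplateTupleDesc(2);

	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "a", INT4OID, -1, 0);
	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "b", TEXTOID, -1, 0);

	/* ... and "bless" it, which assigns a positive record typmod */
	tupdesc = BlessTupleDesc(tupdesc);

	/*
	 * heap_form_tuple(tupdesc, values, isnull) now yields a tuple whose
	 * header carries RECORDOID plus that typmod, and
	 * lookup_rowtype_tupdesc(RECORDOID, typmod) recovers the shape later;
	 * a container type could conceivably piggyback on the same registry.
	 */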
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 8:36 AM Bruce Momjian wrote: > On Tue, Oct 24, 2023 at 06:45:48PM -0700, David G. Johnston wrote: > > I'd prefer to keep pointing out that the ones documented are those whose > > outputs will vary due to ordering. > > Okay, I re-added it in the attached patch, and tightened up the text. > Thanks > I think you are right that it belongs in the syntax section; we cover > ordering extensively there. We already have queries there, but not > output, so I moved the relevant examples to there and replaced the > example that had no output. > Thanks > > The same goes for the special knowledge of floating point behavior for > why > > we've chosen to document avg/sum, something that typically doesn't care > about > > order, as having an optional order by. > > The floating example seems too obscure to mention in our function docs. > I can put a sentence in the syntax docs, but is there value in > explaining that to users? How it that helpful? Example? > > Yeah, we punt on the entire concept in the data type section: "Managing these errors and how they propagate through calculations is the subject of an entire branch of mathematics and computer science and will not be discussed here," ... Also, I'm now led to believe that the relevant IEEE 754 floating point addition is indeed commutative. Given that, I am inclined to simply not add the order by clause at all to those four functions. (actually, you already got rid of the avg()s but the sum()s are still present, so just those two). David J.
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 04:14:11PM -0700, David G. Johnston wrote: > Yeah, we punt on the entire concept in the data type section: > > "Managing these errors and how they propagate through calculations is the > subject of an entire branch of mathematics and computer science and will not > be > discussed here," ... > > Also, I'm now led to believe that the relevant IEEE 754 floating point > addition > is indeed commutative. Given that, I am inclined to simply not add the order > by clause at all to those four functions. (actually, you already got rid of > the > avg()s but the sum()s are still present, so just those two). Ah, yes, sum() removed. Updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..3b49e63987 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ... aggregation. + + While all aggregates below accept an optional + ORDER BY clause (as outlined in ), the clause has only been added to + aggregates whose output is affected by ordering. + + General-Purpose Aggregate Functions @@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ... array_agg -array_agg ( anynonarray ) +array_agg ( anynonarray ORDER BY input_sort_columns ) anyarray @@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ... -array_agg ( anyarray ) +array_agg ( anyarray ORDER BY input_sort_columns ) anyarray @@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ... json_agg -json_agg ( anyelement ) +json_agg ( anyelement ORDER BY input_sort_columns ) json jsonb_agg -jsonb_agg ( anyelement ) +jsonb_agg ( anyelement ORDER BY input_sort_columns ) jsonb @@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ... json_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) json @@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ... jsonb_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) jsonb @@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ... string_agg ( value - bytea, delimiter bytea ) + bytea, delimiter bytea + ORDER BY input_sort_columns ) bytea @@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ... xmlagg -xmlagg ( xml ) +xmlagg ( xml ORDER BY input_sort_columns ) xml diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index 3ba844057f..b64733d8aa 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -1647,7 +1647,26 @@ sqrt(2) are always just expressions and cannot be output-column names or numbers. For example: -SELECT array_agg(a ORDER BY b DESC) FROM table; +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(v ORDER BY v DESC) FROM vals; + array_agg +- + {4,3,3,2,1} + +Since jsonb only keeps the last matching key, ordering +of its keys can be significant: + +WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') ) +SELECT jsonb_object_agg(k, v) FROM vals; + jsonb_object_agg + + {"key0": "1", "key1": "2"} + +WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') ) +SELECT jsonb_object_agg(k, v ORDER BY v) FROM vals; + jsonb_object_agg + + {"key0": "1", "key1": "3"} @@ -1673,6 +1692,14 @@ SELECT string_agg(a ORDER BY a, ',') FROM table; -- incorrect expressions must match regular arguments of the aggregate; that is, you cannot sort on an expression that is not included in the DISTINCT list. 
+For example: + +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(DISTINCT v) FROM vals; + array_agg +--- + {1,2,3,4} +
Re: pg_stat_statements and "IN" conditions
On Tue, Oct 17, 2023 at 10:15:41AM +0200, Dmitry Dolgov wrote: > In the current patch version I didn't add anything yet to address the > question of having more parameters to tune constants merging. The main > obstacle as I see it is that the information for that has to be > collected when jumbling various query nodes. Anything except information > about the ArrayExpr itself would have to be acquired when jumbling some > other parts of the query, not directly related to the ArrayExpr. It > seems to me this interdependency between otherwise unrelated nodes > outweigh the value it brings, and would require some more elaborate (and > more invasive for the purpose of this patch) mechanism to implement. typedef struct ArrayExpr { + pg_node_attr(custom_query_jumble) + Hmm. I am not sure that this is the best approach implementation-wise. Wouldn't it be better to invent a new pg_node_attr (these can include parameters as well!), say query_jumble_merge or query_jumble_agg_location that aggregates all the parameters of a list to be considered as a single element. To put it short, we could also apply the same property to other parts of a parsed tree, and not only an ArrayExpr's list. /* GUC parameters */ extern PGDLLIMPORT int compute_query_id; - +extern PGDLLIMPORT bool query_id_const_merge; Not much a fan of this addition as well for an in-core GUC. I would suggest pushing the GUC layer to pg_stat_statements, maintaining the computation method to use as a field of JumbleState as I suspect that this is something we should not enforce system-wide, but at extension-level instead. + /* +* Indicates the constant represents the beginning or the end of a merged +* constants interval. +*/ + boolmerged; Not sure that this is the best thing to do either. Instead of this extra boolean flag, could it be simpler if we switch LocationLen so as we track the start position and the end position of a constant in a query string, so as we'd use one LocationLen for a whole set of Const nodes in an ArrayExpr? Perhaps this could just be a refactoring piece of its own? + /* +* If the first expression is a constant, verify if the following elements +* are constants as well. If yes, the list is eligible for merging, and the +* order of magnitude need to be calculated. +*/ + if (IsA(firstExpr, Const)) + { + foreach(temp, elements) + if (!IsA(lfirst(temp), Const)) + return false; This path should be benchmarked, IMO. -- Michael signature.asc Description: PGP signature
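To spell out that LocationLen idea (a sketch only, not the structure in the tree today, which is just {location, length}): one entry would describe a span, so a whole run of Consts inside an ArrayExpr collapses into a single record instead of needing the extra boolean.

typedef struct LocationSpan
{
	int			start_location;	/* offset of the first Const in the run */
	int			end_location;	/* offset of the last Const in the run */
	int			end_length;		/* length of that last literal, or -1 */
} LocationSpan;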
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 4:22 PM Bruce Momjian wrote: > On Wed, Oct 25, 2023 at 04:14:11PM -0700, David G. Johnston wrote: > > Yeah, we punt on the entire concept in the data type section: > > > > "Managing these errors and how they propagate through calculations is the > > subject of an entire branch of mathematics and computer science and will > not be > > discussed here," ... > > > > Also, I'm now led to believe that the relevant IEEE 754 floating point > addition > > is indeed commutative. Given that, I am inclined to simply not add the > order > > by clause at all to those four functions. (actually, you already got rid > of the > > avg()s but the sum()s are still present, so just those two). > > Ah, yes, sum() removed. Updated patch attached. > > The paragraph leading into the last added example needs to be tweaked: If DISTINCT is specified within an aggregate, the data is sorted in ascending order while extracting unique values. You can add an ORDER BY clause, limited to expressions matching the regular arguments of the aggregate, to sort the output in descending order. (show existing - DISTINCT only - example here) WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) SELECT string_agg(DISTINCT v::text, ';' ORDER BY v::text DESC) FROM vals; string_agg --- 4;3;2;1 (existing note) Question: Do you know whether we for certain always sort ascending here to compute the unique values or whether if, say, there is an index on the column in descending order (or ascending and traversed backwards) that the data within the aggregate could, with an order by, be returned in descending order? If it is ascending, is that part of the SQL Standard (since it doesn't even allow an order by to give the user the ability the control the output ordering) or does the SQL Standard expect that even a random order would be fine since there are algorithms that can be used that do not involve sorting the input? It seems redundant to first say "regular arguments" then negate it in order to say "DISTINCT list". Using the positive form with "DISTINCT list" should get the point across sufficiently and succinctly. It also avoids me feeling like there should be an example of what happens when you do "sort on an expression that is not included in the DISTINCT list". Interestingly: WITH vals (v,l) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') ) SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM vals; ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list LINE 2: SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM... But both expressions in the argument list (el and semicolon) do appear in the ORDER BY... David J.
Re: Should we represent temp files as unsigned long int instead of signed long int type?
On Wed, Oct 25, 2023 at 03:07:39PM -0400, Tom Lane wrote: > AFAIK, nothing particularly awful will happen if that counter wraps > around. Perhaps if you gamed the system really hard, you could cause > a collision with a still-extant temp file from the previous cycle, > but I seriously doubt that could happen by accident. So I don't > think there's anything to worry about here. Maybe we could make > that filename pattern %lu not %ld, but minus sign is a perfectly > acceptable filename character, so such a change would be cosmetic. In the mood of removing long because it may be 4 bytes or 8 bytes depending on the environment, I'd suggest to change it to either int64 or uint64. Not that it matters much for this specific case, but that makes the code more portable. -- Michael signature.asc Description: PGP signature
Re: Should we represent temp files as unsigned long int instead of signed long int type?
Michael Paquier writes: > In the mood of removing long because it may be 4 bytes or 8 bytes > depending on the environment, I'd suggest to change it to either int64 > or uint64. Not that it matters much for this specific case, but that > makes the code more portable. Then you're going to need a not-so-portable conversion spec in the snprintf call. Not sure it's any improvement. regards, tom lane
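For reference, the tree already has a macro for exactly that: UINT64_FORMAT (and INT64_FORMAT) expand to whatever conversion spec matches the local definition of uint64. A minimal sketch, purely illustrative -- the helper name and signature are made up, and this is not the actual fd.c code:

static uint64 tempFileCounter = 0;

static void
make_temp_file_name(char *buf, size_t buflen, int pid)
{
	/*
	 * UINT64_FORMAT becomes "%lu" or "%llu" as appropriate, which is the
	 * portability wart a hard-coded %ld/%lu would reintroduce on platforms
	 * where long is 32 bits.
	 */
	snprintf(buf, buflen, "%s%d." UINT64_FORMAT,
			 PG_TEMP_FILE_PREFIX, pid, tempFileCounter++);
}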
Re: libpq async connection and multiple hosts
On Thu, 26 Oct 2023, 00:10 Jelte Fennema, wrote: > On Wed, 25 Oct 2023 at 18:54, Daniele Varrazzo > wrote: > > - connect_timeout > > - multiple host, hostaddr, port > > - load_balance_hosts=random > > > > Does this list sound complete? > > I think you'd also want to resolve the hostnames to IPs yourself and > iterate over those one-by-one. Otherwise if the first IP returned for > the hostname times out, you will never connect to the others. > For async connections we were already unpacking and processing the hosts list, in order to perform non-blocking resolution and populate the hostaddr. This already accounted for the possibility of one host resolving to more than one address. But then we would have packed everything back into a single conninfo and made a single connection attempt. https://github.com/psycopg/psycopg/blob/14740add6bb1aebf593a65245df21699daabfad5/psycopg/psycopg/conninfo.py#L278 The goal here was only non-blocking name resolution. What I now understand we should do is to split on the hosts for sync connections too, shuffle if requested, and make separate connection attempts. -- Daniele
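For what it's worth, at the libpq level the per-address loop looks roughly like the sketch below. It uses the blocking API purely to keep the example short -- psycopg would drive the same loop with PQconnectStart/PQconnectPoll -- and it would shuffle the resolved list first when load_balance_hosts=random is in play; the helper name and the hard-coded 5-second per-attempt timeout are just placeholders.

#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <libpq-fe.h>

/* Try every resolved address of "host" in turn, each with its own timeout. */
static PGconn *
connect_any_address(const char *host, const char *port, const char *dbname)
{
	struct addrinfo hints, *res, *ai;
	PGconn	   *conn = NULL;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	if (getaddrinfo(host, port, &hints, &res) != 0)
		return NULL;

	for (ai = res; ai != NULL && conn == NULL; ai = ai->ai_next)
	{
		char		addr[NI_MAXHOST];
		const char *keys[] = {"host", "hostaddr", "port", "dbname",
							  "connect_timeout", NULL};
		const char *vals[] = {host, addr, port, dbname, "5", NULL};

		if (getnameinfo(ai->ai_addr, ai->ai_addrlen, addr, sizeof(addr),
						NULL, 0, NI_NUMERICHOST) != 0)
			continue;

		/* "host" is kept for TLS/auth purposes; "hostaddr" pins the attempt */
		conn = PQconnectdbParams(keys, vals, 0);
		if (PQstatus(conn) != CONNECTION_OK)
		{
			PQfinish(conn);
			conn = NULL;		/* move on to the next address */
		}
	}

	freeaddrinfo(res);
	return conn;
}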
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 05:10:17PM -0700, David G. Johnston wrote: > The paragraph leading into the last added example needs to be tweaked: > > If DISTINCT is specified within an aggregate, the data is sorted in ascending > order while extracting unique values. You can add an ORDER BY clause, limited > to expressions matching the regular arguments of the aggregate, to sort the > output in descending order. > > (show existing - DISTINCT only - example here) > > > WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) > SELECT string_agg(DISTINCT v::text, ';' ORDER BY v::text DESC) FROM vals; > string_agg > --- > 4;3;2;1 > > > (existing note) I see what you mean. I added an example that doesn't match the existing paragraph. I have rewritten the paragraph and used a relevant example; patch attached. > Question: Do you know whether we for certain always sort ascending here to > compute the unique values or whether if, say, there is an index on the column > in descending order (or ascending and traversed backwards) that the data > within > the aggregate could, with an order by, be returned in descending order? If it > is ascending, is that part of the SQL Standard (since it doesn't even allow an > order by to give the user the ability the control the output ordering) or does > the SQL Standard expect that even a random order would be fine since there are > algorithms that can be used that do not involve sorting the input? I don't think order is ever guaranteed in the standard without an ORDER BY. > It seems redundant to first say "regular arguments" then negate it in order to > say "DISTINCT list". Using the positive form with "DISTINCT list" should get > the point across sufficiently and succinctly. It also avoids me feeling like > there should be an example of what happens when you do "sort on an expression > that is not included in the DISTINCT list". Agreed, I rewrote that. > Interestingly: > > WITH vals (v,l) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') ) > SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM vals; > > ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in > argument list > LINE 2: SELECT string_agg(DISTINCT l, ';' ORDER BY l, ';' DESC) FROM... > > But both expressions in the argument list (el and semicolon) do appear in the > ORDER BY... I think ORDER BY has to match DISTINCT columns, while you are using ';'. I used a simpler example with array_agg() in my patch to avoid the issue. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..3b49e63987 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ... aggregation. + + While all aggregates below accept an optional + ORDER BY clause (as outlined in ), the clause has only been added to + aggregates whose output is affected by ordering. + + General-Purpose Aggregate Functions @@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ... array_agg -array_agg ( anynonarray ) +array_agg ( anynonarray ORDER BY input_sort_columns ) anyarray @@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ... -array_agg ( anyarray ) +array_agg ( anyarray ORDER BY input_sort_columns ) anyarray @@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ... 
json_agg -json_agg ( anyelement ) +json_agg ( anyelement ORDER BY input_sort_columns ) json jsonb_agg -jsonb_agg ( anyelement ) +jsonb_agg ( anyelement ORDER BY input_sort_columns ) jsonb @@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ... json_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) json @@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ... jsonb_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_columns ) jsonb @@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ... string_agg ( value - bytea, delimiter bytea ) + bytea, delimiter bytea + ORDER BY input_sort_columns ) bytea @@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ... xmlagg -xmlagg ( xml ) +xmlagg ( xml ORDER BY input_sort_columns ) xml diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index 3ba844057f..c5627001c7 100644 --- a/do
Re: Introduce a new view for checkpointer related stats
On Mon, Feb 13, 2023 at 11:31:03AM +0530, Bharath Rupireddy wrote: > Needed a rebase. Please review the attached v8 patch set. I was looking at this patch, and got a few comments. FWIW, I kind of agree with the feeling of Bertrand upthread that using "checkpoint_" in the attribute names for the new view is globally inconsistent. After 0003, we get: =# select attname from pg_attribute where attrelid = 'pg_stat_checkpointer'::regclass and attnum > 0; attname - timed_checkpoints requested_checkpoints checkpoint_write_time checkpoint_sync_time buffers_written_checkpoints stats_reset (6 rows) =# select attname from pg_attribute where attrelid = 'pg_stat_bgwriter'::regclass and attnum > 0; attname -- buffers_clean maxwritten_clean buffers_alloc stats_reset (4 rows) The view for the bgwriter does not do that. I'd suggest to use functions that are named as pg_stat_get_checkpoint_$att with shorter $atts. It is true that "timed" is a bit confusing, because it refers to a number of checkpoints, and that can be read as a time value (?). So how about num_timed? And for the others num_requested and buffers_written? + * Unlike the checkpoint fields, reqquests related fields are protected by s/reqquests/requests/. SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path) { + SlruShared shared = ctl->shared; int fd; int save_errno; int result; + /* update the stats counter of flushes */ + pgstat_count_slru_flush(shared->slru_stats_idx); Why is that in 0002? Isn't that something we should treat as a bug fix of its own, even backpatching it to make sure that the flush requests for individual commit_ts, multixact and clog files are counted in the stats? Saying that, I am OK with moving ahead with 0001 and 0002 to remove buffers_backend_fsync and buffers_backend from pg_stat_bgwriter, but it is better to merge them into a single commit. It helps with 0003 and this would encourage the use of pg_stat_io that does a better job overall with more field granularity. -- Michael signature.asc Description: PGP signature
Re: Document aggregate functions better w.r.t. ORDER BY
On Thu, 26 Oct 2023 at 13:10, David G. Johnston wrote: > Question: Do you know whether we for certain always sort ascending here to > compute the unique values or whether if, say, there is an index on the column > in descending order (or ascending and traversed backwards) that the data > within the aggregate could, with an order by, be returned in descending order? The way it's currently coded, we seem to always require ascending order. See addTargetToGroupList(). The call to get_sort_group_operators() only requests the ltOpr. A quick test creating an index on a column with DESC shows that we end up doing a backwards index scan so that we get the requested ascending order: create table b (b text); create index on b (b desc); explain select string_agg(distinct b,',') from b; QUERY PLAN -- Aggregate (cost=67.95..67.97 rows=1 width=32) -> Index Only Scan Backward using b_b_idx on b (cost=0.15..64.55 rows=1360 width=32) (2 rows) However, I think we'd best stay clear of offering any guarantees in the documents about this. If we did that it would be much harder in the future if we wanted to implement the DISTINCT aggregates by hashing. David
Re: remaining sql/json patches
Hi Nikita, On Thu, Oct 26, 2023 at 2:13 AM Nikita Malakhov wrote: > Amit, on previous email, patch #2 - I agree that it is not the best idea to > introduce > new type of logic into the parser, so this logic could be moved to the > executor, > or removed at all. What do you think of these options? Yes maybe, though I'd first like to have a good answer to why that logic is necessary at all. Maybe you think it's better to emit an error in the SQL/JSON layer of code than in the type input function if it's unsafe? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: Remove dead code in pg_ctl.c
On 10/25/23 17:30, Nathan Bossart wrote: On Wed, Oct 25, 2023 at 03:02:01PM -0500, Nathan Bossart wrote: On Wed, Oct 25, 2023 at 02:53:31PM -0400, David Steele wrote: It looks like this code was missed in 39969e2a when exclusive backup was removed. Indeed. I'll plan on committing this shortly. Committed. Thank you, Nathan! -David
Re: Add null termination to string received in parallel apply worker
On Wed, 11 Oct 2023 at 19:54, Zhijie Hou (Fujitsu) wrote: > The parallel apply worker didn't add null termination to the string received > from the leader apply worker via the shared memory queue. This action doesn't > bring bugs as it's binary data but violates the rule established in > StringInfo, > which guarantees the presence of a terminating '\0' at the end of the string. Just for anyone not following the other thread that you linked, I just pushed f0efa5aec, and, providing that sticks, I think we can drop this discussion now. That commit relaxes the requirement that the StringInfo must be NUL terminated. David
Re: Is this a problem in GenericXLogFinish()?
Hello hackers, 24.10.2023 03:45, Jeff Davis wrote: Committed. I've encountered a case with a hash index, when an assert added by the commit fails: CREATE TABLE t (i INT); CREATE INDEX hi ON t USING HASH (i); INSERT INTO t SELECT 1 FROM generate_series(1, 1000); BEGIN; INSERT INTO t SELECT 1 FROM generate_series(1, 1000); ROLLBACK; CHECKPOINT; VACUUM t; Leads to: Core was generated by `postgres: law regression [local] VACUUM '. Program terminated with signal SIGABRT, Aborted. warning: Section `.reg-xstate/211944' in core file too small. #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44 44 ./nptl/pthread_kill.c: No such file or directory. (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=139924569388864, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x7f42b99ed476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x7f42b99d37f3 in __GI_abort () at ./stdlib/abort.c:79 #5 0x55f128e83f1b in ExceptionalCondition (conditionName=0x55f128f33520 "BufferIsExclusiveLocked(buffer) && BufferIsDirty(buffer)", fileName=0x55f128f333a8 "xloginsert.c", lineNumber=262) at assert.c:66 #6 0x55f1287edce3 in XLogRegisterBuffer (block_id=1 '\001', buffer=1808, flags=8 '\b') at xloginsert.c:262 #7 0x55f128742833 in _hash_freeovflpage (rel=0x7f42adb95c88, bucketbuf=1808, ovflbuf=1825, wbuf=1808, itups=0x7ffc3f18c010, itup_offsets=0x7ffc3f18bce0, tups_size=0x7ffc3f18ccd0, nitups=0, bstrategy=0x55f12a562820) at hashovfl.c:671 #8 0x55f128743567 in _hash_squeezebucket (rel=0x7f42adb95c88, bucket=6, bucket_blkno=7, bucket_buf=1808, bstrategy=0x55f12a562820) at hashovfl.c:1064 #9 0x55f12873ca2a in hashbucketcleanup (rel=0x7f42adb95c88, cur_bucket=6, bucket_buf=1808, bucket_blkno=7, bstrategy=0x55f12a562820, maxbucket=7, highmask=15, lowmask=7, tuples_removed=0x7ffc3f18fb28, num_index_tuples=0x7ffc3f18fb30, split_cleanup=false, callback=0x55f1289ba1de , callback_state=0x55f12a566778) at hash.c:921 #10 0x55f12873bfcf in hashbulkdelete (info=0x7ffc3f18fc30, stats=0x0, callback=0x55f1289ba1de , callback_state=0x55f12a566778) at hash.c:549 #11 0x55f128776fbb in index_bulk_delete (info=0x7ffc3f18fc30, istat=0x0, callback=0x55f1289ba1de , callback_state=0x55f12a566778) at indexam.c:709 #12 0x55f1289ba03d in vac_bulkdel_one_index (ivinfo=0x7ffc3f18fc30, istat=0x0, dead_items=0x55f12a566778) at vacuum.c:2480 #13 0x55f128771286 in lazy_vacuum_one_index (indrel=0x7f42adb95c88, istat=0x0, reltuples=-1, vacrel=0x55f12a4b9c30) at vacuumlazy.c:2768 #14 0x55f1287704a3 in lazy_vacuum_all_indexes (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2338 #15 0x55f128770275 in lazy_vacuum (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2256 ... It looks like the buffer is not dirty in the problematic call. Best regards, Alexander
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
On Thu, 26 Oct 2023 at 08:43, Tom Lane wrote: > I think that we can make that assumption starting with v17. > Back-patching it would be hazardous perhaps; but if there's some > function out there that depends on NUL termination, testing should > expose it before too long. Wouldn't hurt to mention this explicitly > as a possible incompatibility in the commit message. > > Looking over the v5 patch, I have some nits: Thanks for looking at this again. I fixed up each of those and pushed the result, mentioning the incompatibility in the commit message. Now that that's done, I've attached a patch which makes use of the new initReadOnlyStringInfo initializer function for the original case mentioned when I opened this thread. I don't think there are any remaining objections to this, but I'll let it sit for a bit to see. David diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index c831a9395c..57bd7e82bc 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -723,12 +723,11 @@ array_agg_deserialize(PG_FUNCTION_ARGS) sstate = PG_GETARG_BYTEA_PP(0); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. */ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); /* element_type */ element_type = pq_getmsgint(&buf, 4); @@ -815,7 +814,6 @@ array_agg_deserialize(PG_FUNCTION_ARGS) } pq_getmsgend(&buf); - pfree(buf.data); PG_RETURN_POINTER(result); } @@ -1124,12 +1122,11 @@ array_agg_array_deserialize(PG_FUNCTION_ARGS) sstate = PG_GETARG_BYTEA_PP(0); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. */ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); /* element_type */ element_type = pq_getmsgint(&buf, 4); @@ -1187,7 +1184,6 @@ array_agg_array_deserialize(PG_FUNCTION_ARGS) memcpy(result->lbs, temp, sizeof(result->lbs)); pq_getmsgend(&buf); - pfree(buf.data); PG_RETURN_POINTER(result); } diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index 3c3184f15b..bf61fd7dbc 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -5190,12 +5190,11 @@ numeric_avg_deserialize(PG_FUNCTION_ARGS) init_var(&tmp_var); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. 
*/ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); result = makeNumericAggStateCurrentContext(false); @@ -5222,7 +5221,6 @@ numeric_avg_deserialize(PG_FUNCTION_ARGS) result->nInfcount = pq_getmsgint64(&buf); pq_getmsgend(&buf); - pfree(buf.data); free_var(&tmp_var); @@ -5306,12 +5304,11 @@ numeric_deserialize(PG_FUNCTION_ARGS) init_var(&tmp_var); /* -* Copy the bytea into a StringInfo so that we can "receive" it using the -* standard recv-function infrastructure. +* Initialize a StringInfo so that we can "receive" it using the standard +* recv-function infrastructure. */ - initStringInfo(&buf); - appendBinaryStringInfo(&buf, - VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + initReadOnlyStringInfo(&buf, VARDATA_ANY(sstate), + VARSIZE_ANY_EXHDR(sstate)); result = makeNumericAggStateCurrentContext(false); @@ -5342,7 +5339,6 @@ numeric_deserialize(PG_FUNCTION_ARGS) result->nInfcount = pq_getmsgint64(&buf); pq_getmsgend(&buf); -
Re: A performance issue with Memoize
On 20/10/2023 17:40, Richard Guo wrote: I noticed $subject with the query below. set enable_memoize to off; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN Nested Loop Left Join (actual time=0.050..59578.053 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.027..2.703 rows=1 loops=1) -> Subquery Scan on s (actual time=0.004..4.819 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.002..3.834 rows=1 loops=1) Planning Time: 0.666 ms Execution Time: 60937.899 ms (8 rows) set enable_memoize to on; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN -- Nested Loop Left Join (actual time=0.061..122684.607 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.026..3.367 rows=1 loops=1) -> Memoize (actual time=0.011..9.821 rows=5000 loops=1) Cache Key: t1.two, t1.two Cache Mode: binary Hits: 0 Misses: 1 Evictions: Overflows: 0 Memory Usage: 1368kB -> Subquery Scan on s (actual time=0.008..5.188 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.004..4.081 rows=1 loops=1) Planning Time: 0.607 ms Execution Time: 124431.388 ms (12 rows) The execution time (best of 3) is 124431.388 VS 60937.899 with and without memoize. The Memoize runtime stats 'Hits: 0 Misses: 1 Evictions: ' seems suspicious to me, so I've looked into it a little bit, and found that the MemoizeState's keyparamids and its outerPlan's chgParam are always different, and that makes us have to purge the entire cache each time we rescan the memoize node. But why are they always different? Well, for the query above, we have two NestLoopParam nodes, one (with paramno 1) is created when we replace outer-relation Vars in the scan qual 't1.two = s.two', the other one (with paramno 0) is added from the subquery's subplan_params, which was created when we replaced uplevel vars with Param nodes for the subquery. That is to say, the chgParam would be {0, 1}. When it comes to replace outer-relation Vars in the memoize keys, the two 't1.two' Vars are both replaced with the NestLoopParam with paramno 1, because it is the first NLP we see in root->curOuterParams that is equal to the Vars in memoize keys. That is to say, the memoize node's keyparamids is {1}. ... Any thoughts? Do you've thought about the case, fixed with the commit 1db5667? As I see, that bugfix still isn't covered by regression tests. Could your approach of a PARAM_EXEC slot reusing break that case? -- regards, Andrei Lepikhov Postgres Professional
Re: Open a streamed block for transactional messages during decoding
On Tue, Oct 24, 2023 at 5:27 PM Zhijie Hou (Fujitsu) wrote: > > While reviewing the test_decoding code, I noticed that when skip_empty_xacts > option is specified, it doesn't open the streaming block( e.g. > pg_output_stream_start) before streaming the transactional MESSAGE even if > it's > the first change in a streaming block. > > It looks inconsistent with what we do when streaming DML > changes(e.g. pg_decode_stream_change()). > > Here is a small patch to open the stream block in this case. > The change looks good to me though I haven't tested it yet. BTW, can we change the comment: "Output stream start if we haven't yet, but only for the transactional case." to "Output stream start if we haven't yet for transactional messages"? I think we should backpatch this fix. What do you think? -- With Regards, Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Thu, 2023-10-26 at 07:00 +0300, Alexander Lakhin wrote: > It looks like the buffer is not dirty in the problematic call. Thank you for the report! I was able to reproduce and observe that the buffer is not marked dirty. The call (hashovfl.c:671): XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD) is followed unconditionally by: PageSetLSN(BufferGetPage(wbuf), recptr) so if the Assert were not there, it would be setting the LSN on a page that's not marked dirty. Therefore I think this is a bug, or at least an interesting/unexpected case. Perhaps the registration and the PageSetLSN don't need to happen when nitups==0? Regards, Jeff Davis
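For illustration, the guard would sit around the two calls quoted above, roughly like the fragment below -- emphatically a sketch, not a proposed fix, since the hash_xlog redo side would have to tolerate the block being absent from the record:

	/* in _hash_freeovflpage(), sketch only */
	if (nitups > 0)
		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);

	/* ... register the remaining buffers and data as today ... */

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

	if (nitups > 0)
		PageSetLSN(BufferGetPage(wbuf), recptr);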
Re: Document aggregate functions better w.r.t. ORDER BY
On Wed, Oct 25, 2023 at 7:13 PM David Rowley wrote: > On Thu, 26 Oct 2023 at 13:10, David G. Johnston > wrote: > > Question: Do you know whether we for certain always sort ascending here > to compute the unique values or whether if, say, there is an index on the > column in descending order (or ascending and traversed backwards) that the > data within the aggregate could, with an order by, be returned in > descending order? > > The way it's currently coded, we seem to always require ascending > order. See addTargetToGroupList(). The call to > get_sort_group_operators() only requests the ltOpr. > > A quick test creating an index on a column with DESC shows that we end > up doing a backwards index scan so that we get the requested ascending > order: > > create table b (b text); > create index on b (b desc); > explain select string_agg(distinct b,',') from b; > QUERY PLAN > > -- > Aggregate (cost=67.95..67.97 rows=1 width=32) >-> Index Only Scan Backward using b_b_idx on b (cost=0.15..64.55 > rows=1360 width=32) > (2 rows) > > However, I think we'd best stay clear of offering any guarantees in > the documents about this. If we did that it would be much harder in > the future if we wanted to implement the DISTINCT aggregates by > hashing. > > So, I think we are mischaracterizing the Standard here, if only in the specific case of array_agg. SQL Standard: 4.16.4 Every unary aggregate function takes an arbitrary as the argument; most unary aggregate functions can optionally be qualified with either DISTINCT or ALL. If ARRAY_AGG is specified, then an array value with one element formed from the evaluated for each row that qualifies. Neither DISTINCT nor ALL are allowed to be specified for VAR_POP, VAR_SAMP, STDDEV_POP, or STDDEV_SAMP; redundant duplicates are not removed when computing these functions. 10.9 ::= ARRAY_AGG [ ORDER BY ] I would reword the existing note to be something like: The SQL Standard defines specific aggregates and their properties, including which of DISTINCT and/or ORDER BY is allowed. Due to the extensible nature of PostgreSQL it accepts either or both clauses for any aggregate. >From the most recent patch: -If DISTINCT is specified in addition to an -order_by_clause, then all the ORDER BY -expressions must match regular arguments of the aggregate; that is, -you cannot sort on an expression that is not included in the -DISTINCT list. +If DISTINCT is specified with an +order_by_clause, ORDER +BY expressions can only reference columns in the +DISTINCT list. For example: + +WITH vals (v1, v2) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') ) +SELECT array_agg(DISTINCT v2 ORDER BY v2 DESC) FROM vals; + array_agg +- + {Z,T,R,D,A} + The change to a two-column vals was mostly to try and find corner-cases that might need to be addressed. If we don't intend to show the error case of DISTINCT v1 ORDER BY v2 then we should go back to the original example and just add ORDER BY v DESC. I'm fine with not using string_agg here. +For example: + +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) ) +SELECT array_agg(DISTINCT v ORDER BY v DESC) FROM vals; + array_agg +--- + {4,3,2,1} + We get enough complaints regarding "apparent ordering" that I would like to add: As a reminder, while some DISTINCT processing algorithms produce sorted output as a side-effect, only by specifying ORDER BY is the output order guaranteed. David J.
Re: Container Types
On Wed, 2023-10-25 at 16:01 -0700, Andres Freund wrote: > I'd look whether you can generalize that infrastructure. I had briefly looked at using the record type mechanism before, and it seemed like a challenge because it doesn't really work when passing through a function call: CREATE TABLE t(a INT, b TEXT); INSERT INTO t VALUES(1, 'one'); CREATE FUNCTION id(RECORD) RETURNS RECORD LANGUAGE plpgsql AS $$ BEGIN RETURN $1; END; $$; SELECT t.a FROM t; -- 1 SELECT (id(t)).a FROM t; -- ERROR But now that I think about it, that's really a type inference limitation, and that needs to be solved regardless. After the type inference figures out what the right type is, then I think you're right that an OID is not required to track it, and however we do track it should be able to reuse some of the existing infrastructure for dealing with record types. Regards, Jeff Davis
Re: Show WAL write and fsync stats in pg_stat_io
On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote: > Any kind of feedback would be appreciated. This was registered in the CF, so I have given it a look. Note that 0001 has a conflict with pgstat_count_io_op_time(), so it cannot be applied. +pgstat_should_track_io_time(IOObject io_object, IOContext io_context) +{ + /* +* io times of IOOBJECT_WAL IOObject needs to be tracked when +* 'track_wal_io_timing' is set regardless of 'track_io_timing'. +*/ + if (io_object == IOOBJECT_WAL) + return track_wal_io_timing; + + return track_io_timing; I can see the temptation to do that, but I have mixed feelings about the approach of mixing two GUCs in a code path dedicated to pg_stat_io where now we only rely on track_io_timing. The result brings confusion, while making pg_stat_io, which is itself only used for block-based operations, harder to read. The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF) is quite tempting, actually, creating a neat separation between the existing pg_stat_io and pg_stat_wal (not a SRF), with a third view that provides more details about the contexts and backend types for the WAL stats with its relevant fields: https://www.postgresql.org/message-id/caakru_bm55pj3pprw0nd_-pawhlrkou69r816aeztbba-n1...@mail.gmail.com And perhaps just putting that everything that calls pgstat_count_io_op_time() under track_io_timing is just natural? What's the performance regression you would expect if both WAL and block I/O are controlled by that, still one would expect only one of them? On top of that pg_stat_io is now for block-based I/O operations, so that does not fit entirely in the picture, though I guess that Melanie has thought more on the matter than me. That may be also a matter of taste. + /* Report pending statistics to the cumulative stats system */ + pgstat_report_wal(false); This is hidden in 0001, still would be better if handled as a patch on its own and optionally backpatch it as we did for the bgwriter with e64c733bb1? Side note: I think that we should spend more efforts in documenting what IOContext and IOOp mean. Not something directly related to this patch, still this patch or things similar make it a bit harder which part of it is used for what by reading pgstat.h. -- Michael signature.asc Description: PGP signature
Re: RFC: Logging plan of the running query
On 2023-10-25 12:40, Ashutosh Bapat wrote: On Wed, Oct 18, 2023 at 10:04 PM James Coleman wrote: While that decision as regards auto_explain has long since been made (and there would be work to undo it), I don't believe that we should repeat that choice here. I'm -10 on moving this into auto_explain. Right. However perhaps there is still an opportunity for moving some of the auto_explain code into core so as to enable deduplicating the code. Right. That's what I also think. Thanks for your comments. Attached patch adds a new function which assembles es->str for logging according to specified contents and format. This is called from both auto_explain and pg_log_query_plan(). On 2023-10-11 16:22, Ashutosh Bapat wrote: I am also wondering whether it's better to report the WARNING as status column in the output. Attached patch left as it was since the inconsistency with pg_terminate_backend() and pg_log_backend_memory_contexts() as you pointed out. -- Regards, -- Atsushi Torikoshi NTT DATA Group CorporationFrom 336ba3943f631dcbc0d1226ebd0a7675cf78c1f8 Mon Sep 17 00:00:00 2001 From: Atsushi Torikoshi Date: Thu, 26 Oct 2023 15:42:36 +0900 Subject: [PATCH v32] Add function to log the plan of the query Currently, we have to wait for the query execution to finish to check its plan. This is not so convenient when investigating long-running queries on production environments where we cannot use debuggers. To improve this situation, this patch adds pg_log_query_plan() function that requests to log the plan of the specified backend process. By default, only superusers are allowed to request to log the plans because allowing any users to issue this request at an unbounded rate would cause lots of log messages and which can lead to denial of service. On receipt of the request, at the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at LOG_SERVER_ONLY level, so that these plans will appear in the server log but not be sent to the client. 
Reviewed-by: Bharath Rupireddy, Fujii Masao, Dilip Kumar, Masahiro Ikeda, Ekaterina Sokolova, Justin Pryzby, Kyotaro Horiguchi, Robert Treat, Alena Rybakina, Ashutosh Bapat Co-authored-by: James Coleman --- contrib/auto_explain/auto_explain.c | 23 +- doc/src/sgml/func.sgml | 49 + src/backend/access/transam/xact.c| 13 ++ src/backend/catalog/system_functions.sql | 2 + src/backend/commands/explain.c | 209 ++- src/backend/executor/execMain.c | 14 ++ src/backend/storage/ipc/procsignal.c | 4 + src/backend/storage/lmgr/lock.c | 9 +- src/backend/tcop/postgres.c | 4 + src/backend/utils/error/elog.c | 2 + src/backend/utils/init/globals.c | 2 + src/include/catalog/pg_proc.dat | 6 + src/include/commands/explain.h | 8 + src/include/miscadmin.h | 1 + src/include/storage/lock.h | 2 - src/include/storage/procsignal.h | 1 + src/include/tcop/pquery.h| 2 +- src/include/utils/elog.h | 1 + src/test/regress/expected/misc_functions.out | 54 - src/test/regress/sql/misc_functions.sql | 41 +++- 20 files changed, 399 insertions(+), 48 deletions(-) diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index c3ac27ae99..20a73df8c4 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -399,26 +399,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc) es->format = auto_explain_log_format; es->settings = auto_explain_log_settings; - ExplainBeginOutput(es); - ExplainQueryText(es, queryDesc); - ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length); - ExplainPrintPlan(es, queryDesc); - if (es->analyze && auto_explain_log_triggers) -ExplainPrintTriggers(es, queryDesc); - if (es->costs) -ExplainPrintJITSummary(es, queryDesc); - ExplainEndOutput(es); - - /* Remove last line break */ - if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n') -es->str->data[--es->str->len] = '\0'; - - /* Fix JSON to output an object */ - if (auto_explain_log_format == EXPLAIN_FORMAT_JSON) - { -es->str->data[0] = '{'; -es->str->data[es->str->len - 1] = '}'; - } + ExplainAssembleLogOutput(es, queryDesc, auto_explain_log_format, + auto_explain_log_triggers, + auto_explain_log_parameter_max_length); /* * Note: we rely on the existing logging of context or diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..8d77fe1a5b 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26406,6 +26406,25 @@ SELECT collation for ('foo' COLLATE "de_DE"); + + + + pg_log_query_plan + +pg_log_query_plan