Re: amcheck support for BRIN indexes
Hi,

I would like to share some thoughts about the 'heap all consistent' part and one of the open questions.

The idea behind 'heap all consistent' is to use heap data to validate the index. BRIN doesn't store heap tuples, so there is no straightforward way to check whether every tuple was indexed or not. What we have is range data, so for every heap tuple we need to do something with the corresponding range, something that will tell us whether the range data covers the heap tuple or not. The options I see here:

1) We can use the addValue() function. It returns FALSE if the range data was not changed (in other words, the range data already covers the heap tuple data that we pass to the function). It's very easy to do, and we can use heap tuples directly. But the problem I see here is that addValue() can return TRUE even if the heap tuple data is already covered by the range, because the range data itself changed for some other reason (a new algorithm was applied, for instance). So I think we can get false positives here that we can do nothing about.

2) We can use the consistent() function. It requires a ScanKey and returns true if the range data satisfies the ScanKey's condition, so we need to convert every heap tuple into a ScanKey somehow. This is the approach the patch currently implements, and I tried to describe all details of the heap-tuple-to-ScanKey conversion in the comment:

/*
 * Prepare ScanKey for index attribute.
 *
 * ConsistentFn requires ScanKey, so we need to generate ScanKey for every
 * attribute somehow. We want ScanKey that would result in TRUE for every heap
 * tuple within the range when we use its indexed value as sk_argument.
 * To generate such a ScanKey we need to define the right operand type and the
 * strategy number. Right operand type is a type of data that index is built
 * on, so it's 'opcintype'. There is no strategy number that we can always
 * use, because every opclass defines its own set of operators it supports and
 * strategy number for the same operator can differ from opclass to opclass.
 * So to get strategy number we look up an operator that gives us desired
 * behavior and which both operand types are 'opcintype' and then retrieve the
 * strategy number for it. Most of the time we can use '='. We let user define
 * operator name in case opclass doesn't support '=' operator. Also, if such
 * operator doesn't exist, we can't proceed with the check.
 *
 * Generated once, and will be reused for all heap tuples.
 * Argument field will be filled for every heap tuple before
 * consistent function invocation, so leave it NULL for a while.
 */

With this approach the brin_check() function has an optional parameter 'consistent_operator_names', which it seems to me could be very confusing for users. In general I think this is the most complex approach, both in terms of use and implementation.

3) The approach that seems to me the clearest and most straightforward: add a new optional function to the BRIN opclass API, a function that says whether the passed value is covered by the current range data. That is exactly what we want to know, so we can use heap data directly here. Something like this:

bool withinRange(BrinDesc *bdesc, BrinValues *column, Datum val, bool isnull)

It could be an optional function implemented for all core BRIN opclasses. If somebody wants to use the check for some custom opclass they would need to implement it too, but it's not required. I understand that adding something to the index opclass API requires very strong arguments.
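To make this a bit more concrete, here is a rough sketch of how the heap-all-consistent pass could consume such a callback (illustrative only; the callback typedef, the helper name range_covers_heap_tuple, and the corruption handling are made up for the example and are not part of the patch):

#include "postgres.h"

#include "access/brin_internal.h"
#include "access/brin_tuple.h"

/*
 * Illustrative only: assume the proposed optional opclass member is
 * available here as a plain C callback; in reality it would be a BRIN
 * support procedure looked up through the opclass.
 */
typedef bool (*withinRange_fn) (BrinDesc *bdesc, BrinValues *column,
                                Datum val, bool isnull);

/*
 * Check one heap tuple against the range data that is supposed to cover it.
 * 'values'/'nulls' hold the tuple's indexed attribute values, 'dtup' is the
 * deformed BRIN tuple of the corresponding range.
 */
static bool
range_covers_heap_tuple(BrinDesc *bdesc, BrinMemTuple *dtup,
                        withinRange_fn withinRange,
                        Datum *values, bool *nulls)
{
    for (int keyno = 0; keyno < bdesc->bd_tupdesc->natts; keyno++)
    {
        BrinValues *bval = &dtup->bt_columns[keyno];

        /* ask the opclass whether the range covers this value */
        if (!withinRange(bdesc, bval, values[keyno], nulls[keyno]))
            return false;       /* not covered: report as corruption */
    }

    return true;
}

With something like this the check loop stays trivial, and no ScanKey or operator lookup is needed at all.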
So the argument here is that it would make the brin check very robust (without the possible false positives of the first approach) and easy to use (no additional parameters in the check function). Also, the withinRange() function could be written in such a way that it would be more efficient for this task than addValue() or consistent(). I'd be glad to hear your thoughts!

Best regards,
Arseniy Mukhin
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi, Vignesh!

On Fri, Jun 20, 2025 at 3:42 PM vignesh C wrote:
> On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov wrote:
> > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu) wrote:
> > > > > Regarding assertion failure, I've found that assert in
> > > > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > > > > previously set by ReplicationSlotReserveWal(). As I can see,
> > > > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > > > > So, it doesn't seems there is a guarantee that restart_lsn never goes
> > > > > backward. The commit in ReplicationSlotReserveWal() even states there
> > > > > is a "chance that we have to retry".
> > > >
> > > > I don't see how this theory can lead to a restart_lsn of a slot going
> > > > backwards. The retry mentioned there is just a retry to reserve the
> > > > slot's position again if the required WAL is already removed. Such a
> > > > retry can only get the position later than the previous restart_lsn.
> > >
> > > We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal,
> > > and confirmed that restart_lsn can go backward. This meant that Assert() added
> > > by the ca307d5 can cause crash.
> > >
> > > Background
> > > ===
> > > When pg_receivewal starts the replication and it uses the replication slot, it
> > > sets as the beginning of the segment where restart_lsn exists, as the startpoint.
> > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs
> > > from 0/B0.
> > > More detail of this behavior, see f61e1dd2 and d9bae531.
> > >
> > > What happened here
> > > ==
> > > Based on above theory, walsender sent from the beginning of segment (0/B0).
> > > When walreceiver receives, it tried to send reply. At that time the flushed
> > > location of WAL would be 0/B0. walsender sets the received lsn as restart_lsn
> > > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward
> > > (0/B000D0->0/B0).
> > >
> > > The assertion failure could happen if CHECKPOINT happened at that time.
> > > Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn
> > > was 0/B0. It could not satisfy the assertion added in
> > > InvalidatePossiblyObsoleteSlot().
> >
> > Thank you for your detailed explanation!
> >
> > > Note
> > >
> > > 1.
> > > In this case, starting from the beginning of the segment is not a problem, because
> > > the checkpoint process only removes WAL files from segments that precede the
> > > restart_lsn's wal segment. The current segment (0/B0) will not be removed,
> > > so there is no risk of data loss or inconsistency.
> > >
> > > 2.
> > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> > > requested streaming position to the start of the segment, and it replies the
> > > received LSN as flushed.
> > >
> > > 3.
> > > I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync
> > > because it is a timing issue. Have someone else reproduced?
> > >
> > > We are still investigating failure caused at 040_standby_failover_slots_sync.
> >
> > I didn't manage to reproduce this. But as I see from the logs [1] on
> > mamba that START_REPLICATION command was issued just before assert
> > trap. Could it be something similar to what I described in [2].
> > Namely:
> > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> > 2. Concurrent checkpoint flushes that restart_lsn to the disk.
> > 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to
> > the beginning of the segment.
>
> Here is my analysis for the 040_standby_failover_slots_sync test
> failure where in physical replication slot can point to backward lsn:
> In certain rare cases the restart LSN can go backwards. This scenario
> can be reproduced by performing checkpoint continuously and slowing
> the WAL applying. I have a patch with changes to handle this.
>
> Here's a summary of the sequence of events:
> 1) Standby confirms a new LSN (0/40369C8) when primary sends some WAL
> contents:
> After standby writes the received WAL contents in XLogWalRcvWrite, the
> standby sends this lsn 0/40369C8 as the confirmed flush location. The
> primary acknowledges this and updates the replication slot's
> restart_lsn accordingly:
> 2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
> PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
> restart_lsn to 0/40369C8
> 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
> START_REPLICATION SLOT "sb1_slot" 0/300 TIMELINE 1
>
> Checkpoint persists the new restart_lsn:
>
> 2) Shortly after receiving the new LSN, a checkpoint occurs which
> saves this restart_lsn:
> 2025-06-20 14:33:21.780 IST [134913] LOG: checkpoint complete: wrote
> 0 buffers (0.0%), wrote 0 SLRU buffers;
Re: pgv18: simple table scan take more time than pgv14
Hi,

On Fri, 2025-06-20 at 09:46 +0800, James Pang wrote:
> same OS RHEL8, install pgv14.11 and pgv18.beta1_3, both installation
> by RPM from pgdg, and use similar postgresql.conf.
> 18.beta1
> postgres=# \timing on
> Timing is on.
> postgres=# select * from tt where b ~~ 'a%';
>  a | b
> ---+---
> (0 rows)
>
> Time: 78.532 ms
> postgres=# select * from tt where b ~~ 'a%'
> postgres-# ;
>  a | b
> ---+---
> (0 rows)
>
> Time: 83.516 ms
> postgres=# select * from tt where b ~~ 'a%';
>  a | b
> ---+---
> (0 rows)

It's likely because v18 beta RPMs are built with --enable-debug and
--enable-cassert. See the note on yum.postgresql.org:
https://yum.postgresql.org/rpmchart/#pg18

Regards,

--
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
BlueSky: @devrim.gunduz.org , @gunduz.org

signature.asc
Description: This is a digitally signed message part
Allow the "operand" input of width_bucket() to be NaN
The attached patch does what was discussed in the pgsql-docs thread at [1], namely change the four-argument variants of width_bucket() to allow their first argument to be NaN, treating that value as larger than any non-NaN. While these functions are defined by the SQL standard, it doesn't appear to have anything to say about NaN values. So it's up to us to decide what's the most consistent behavior. The arguments for changing this are:

1. It's consistent with the array variant of width_bucket(), which treats NaN as a valid input larger than any non-NaN.

2. The first argument is probably coming from a table, so it's more likely for it to be NaN than is the case for the histogram endpoints. It'd be better not to throw an error in such cases.

Of course, #2 is a bit of a value judgment. One could alternatively argue that accepting NaN risks "garbage in, garbage out" results, since the result will not be visibly distinct from the results for ordinary values.

Thoughts? (I'm envisioning this as a v19 change.)

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/2BD74F86-5B89-4AC1-8F13-23CED3546AC1%40gmail.com

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index ba66a9c4ce6..7b97d2be6ca 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -4067,8 +4067,9 @@ float84ge(PG_FUNCTION_ARGS)
  * with the specified characteristics.  An operand smaller than the
  * lower bound is assigned to bucket 0.  An operand greater than or equal
  * to the upper bound is assigned to an additional bucket (with number
- * count+1).  We don't allow "NaN" for any of the float8 inputs, and we
- * don't allow either of the histogram bounds to be +/- infinity.
+ * count+1).  We don't allow the histogram bounds to be NaN or +/- infinity,
+ * but we do allow those values for the operand (taking NaN to be larger
+ * than any other value, as we do in comparisons).
  */
 Datum
 width_bucket_float8(PG_FUNCTION_ARGS)
@@ -4084,12 +4085,11 @@ width_bucket_float8(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_ARGUMENT_FOR_WIDTH_BUCKET_FUNCTION),
 				 errmsg("count must be greater than zero")));
 
-	if (isnan(operand) || isnan(bound1) || isnan(bound2))
+	if (isnan(bound1) || isnan(bound2))
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_ARGUMENT_FOR_WIDTH_BUCKET_FUNCTION),
-				 errmsg("operand, lower bound, and upper bound cannot be NaN")));
+				 errmsg("lower and upper bounds cannot be NaN")));
 
-	/* Note that we allow "operand" to be infinite */
 	if (isinf(bound1) || isinf(bound2))
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_ARGUMENT_FOR_WIDTH_BUCKET_FUNCTION),
@@ -4097,15 +4097,15 @@ width_bucket_float8(PG_FUNCTION_ARGS)
 
 	if (bound1 < bound2)
 	{
-		if (operand < bound1)
-			result = 0;
-		else if (operand >= bound2)
+		if (isnan(operand) || operand >= bound2)
 		{
 			if (pg_add_s32_overflow(count, 1, &result))
 				ereport(ERROR,
 						(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
 						 errmsg("integer out of range")));
 		}
+		else if (operand < bound1)
+			result = 0;
 		else
 		{
 			if (!isinf(bound2 - bound1))
@@ -4135,7 +4135,7 @@ width_bucket_float8(PG_FUNCTION_ARGS)
 	}
 	else if (bound1 > bound2)
 	{
-		if (operand > bound1)
+		if (isnan(operand) || operand > bound1)
 			result = 0;
 		else if (operand <= bound2)
 		{
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 58ad1a65ef7..c9233565d57 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -1960,8 +1960,9 @@ generate_series_numeric_support(PG_FUNCTION_ARGS)
  * with the specified characteristics.  An operand smaller than the
  * lower bound is assigned to bucket 0.  An operand greater than or equal
  * to the upper bound is assigned to an additional bucket (with number
- * count+1).  We don't allow "NaN" for any of the numeric inputs, and we
- * don't allow either of the histogram bounds to be +/- infinity.
+ * count+1).  We don't allow the histogram bounds to be NaN or +/- infinity,
+ * but we do allow those values for the operand (taking NaN to be larger
+ * than any other value, as we do in comparisons).
  */
 Datum
 width_bucket_numeric(PG_FUNCTION_ARGS)
@@ -1979,17 +1980,13 @@ width_bucket_numeric(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_ARGUMENT_FOR_WIDTH_BUCKET_FUNCTION),
 				 errmsg("count must be greater than zero")));
 
-	if (NUMERIC_IS_SPECIAL(operand) ||
-		NUMERIC_IS_SPECIAL(bound1) ||
-		NUMERIC_IS_SPECIAL(bound2))
+	if (NUMERIC_IS_SPECIAL(bound1) || NUMERIC_IS_SPECIAL(bound2))
 	{
-		if (NUMERIC_IS_NAN(operand) ||
-			NUMERIC_IS_NAN(bound1) ||
-			NUMERIC_IS_NAN(bound2))
+		if (NUMERIC_IS_NAN(bound1) || NUMERIC_IS_NAN(bound2))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_ARGUMENT_FOR_WIDTH_BUCKET_FUNCTION),
-					 errmsg("operand, lower bound, and upper bound cannot be NaN")));
-		/* We allow "operand" to be infinite; cmp_numerics will cope */
+					 errmsg("lower and upper bo
Re: [PATCH] pg_bsd_indent: improve formatting of multiline comments
On Fri, Jun 20, 2025 at 5:01 PM Aleksander Alekseev wrote:
>
> Hi,
>
> > Given the quality of BSD indent code, I have _always_ found it easier to
> > modify pgindent. ;-
>
> :D Initially I thought that the problem was simple enough to solve it
> in C, but this turned out not to be true.
>
> > It's going to be simpler to modify pgindent then. PFA the updated patch.
>
> I noticed a mistake in v2. Here is the corrected patch. Changes
> comparing to the previous version:
>

Thanks! Now it affects 4 times more files (380). What I noticed:

1) Most of the comments are bordered comments like this:

-/*
+/*
+ *
 * Public IO related functions operating on IO Handles
 *
 */

Do we want to skip such comments? I have also seen comments with '' border.

2) Some comments like this:

before:
/* Author: Linus Tolke (actually most if the code is "borrowed" from the
distribution and just slightly modified) */

after:
/*
 * Author: Linus Tolke (actually most if the code is "borrowed" from the
 * distribution and just slightly modified)
 */

I guess the closing */ on a separate line is the trigger? If I'm not wrong
there are only 3 such comments, maybe it is easier to fix them by hand?

3) It seems all geqo-related files contain such a comment:

-/* contributed by:
+/*
+ * contributed by:
 =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
 * Martin Utesch * Institute of Automatic Control *
 =

Best regards, Arseniy Mukhin
Re: bt_index_parent_check and concurrently build indexes
> On 21 Jun 2025, at 21:10, Mihail Nikalayeu wrote:
>
> Rebased

IMO the patch is RfC; I've just updated the status of the CF item.

Thanks.

Best regards, Andrey Borodin.
Re: bt_index_parent_check and concurrently build indexes
Hello, everyone! Rebased. Best regards, Mikhail. v4-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patch Description: Binary data
Re: pg18: Virtual generated columns are not (yet) safe when superuser selects from them
On Sat, Jun 21, 2025 at 1:29 PM jian he wrote:
>
> ( the following excerpted from create_type.sql)
>
> BEGIN;
> CREATE TYPE int42;
> -- Make dummy I/O routines using the existing internal support for int4, text
> CREATE FUNCTION int42_in(cstring)
>    RETURNS int42
>    AS 'int4in'
>    LANGUAGE internal STRICT IMMUTABLE;
> CREATE FUNCTION int42_out(int42)
>    RETURNS cstring
>    AS 'int4out'
>    LANGUAGE internal STRICT IMMUTABLE;
> CREATE TYPE int42 (
>    internallength = 4,
>    input = int42_in,
>    output = int42_out,
>    alignment = int4,
>    default = 42,
>    passedbyvalue
> );
> COMMIT;
>
> CREATE TABLE gtest1 (a int42 GENERATED ALWAYS AS ('1') VIRTUAL);
> CREATE TABLE gtest2 (a int42 GENERATED ALWAYS AS ('1'::int42) VIRTUAL);
> ERROR:  generation expression uses user-defined type
> LINE 1: CREATE TABLE gtest2 (a int42 GENERATED ALWAYS AS ('1'::int42...
>                                                            ^
> DETAIL:  Virtual generated columns that make use of user-defined types
> are not yet supported.
>
> Do we need error out for the first case?

I think both of these cases should error out. If generated column expressions
do not allow user-defined types or functions, it makes sense to also disallow
virtual generated columns from using user-defined types.

The attached patch changes CheckAttributeType to do the job; related tests
are also added.

Note: support for composite types in virtual generated columns is currently
partial. For example:

CREATE TYPE double_int as (a int, b int);

-- ok
CREATE TABLE gtest4 (
  a int,
  b double_int GENERATED ALWAYS AS ((a * 2, a * 3)) VIRTUAL
);

-- not ok
CREATE TABLE gtest4 (
  a int,
  b double_int GENERATED ALWAYS AS ((a * 2, a * 3)::double_int) VIRTUAL
);

v1-0001-disallow-user-defined-type-for-virtual-generated-colum.no-cfbot
Description: Binary data
Re: Custom reloptions in TableAM
Hi! Thank you very much, I'll check it out. -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: [PATCH] Proposal to Enable/Disable Index using ALTER INDEX
> On Jun 11, 2025, at 9:00 AM, Sami Imseih wrote:
>
>> IMO, having this GUC to force the use of invisible indexes is quite
>> strange. In my view, it detracts from the guarantees that you're meant
>> to get from disabling indexes. What if some connection has
>> use_invisible_index set to true? The DBA might assume all is well
>> after having seen nobody complain and then drop the index. The user
>> might then complain.
>
> Sure, this may occur. I can also imagine cases where an index is made
> visible only for certain workloads, intentionally. But such efforts should
> be coordinated by application teams and DBAs. Someone would need to modify
> this GUC at the connection level, alter the database, or change the session
> via application code. An ad-hoc connection enabling this GUC is unlikely to
> be an issue.
>
> I don't see how we could provide the INVISIBLE index DDL without also
> providing this boolean GUC. If a user creates an index that is initially
> INVISIBLE, they need a GUC to try it out before deciding to make it
> visible.
>
> It was also pointed out in the thread above that this GUC can serve as a
> backstop for replicas if the DDL to make an index visible is delayed.

Hello,

Thank you everyone for all the discussions, and also to Robert Treat for the
feedback and the operational considerations.

It seems like there are multiple ways to solve this problem, which is
encouraging. From the discussion, there appears to be consensus on a few
things as well, including the DDL approach, which I personally am a proponent
of as well.

I believe this is a valuable feature for DBAs and engineers working with
large databases, especially since it provides the confidence to "turn off"
an index, observe the impact through their observability tools, and make an
informed decision about whether to drop it. If they're wrong, they can
quickly roll back by making the index visible again, rather than waiting for
a full index rebuild that can take 30 minutes to hours.

The primary use case I have in mind is helping engineers (ones not as
seasoned as DBAs) decide whether to drop *existing* indexes. For new indexes,
I expect most users would create them in visible mode (the default), or so
has been my experience so far.

The GUC component opens the door for additional workflows, such as creating
an index as initially invisible (as Sami points out) and testing its
performance before making it visible. I originally wasn't thinking of it
this way, but it demonstrates the flexibility of the feature and accommodates
different development approaches.

As Robert noted, both approaches have trade-offs around operational safety
and granular control. However, I think the DDL approach provides the right
balance of simplicity and system-wide consistency that most users need, while
the GUC still enables experimentation for those who want it.

I'm very much committed to iterating on this patch to address any remaining
feedback and help make progress on this. Is there something we can do here in
the spirit of "start small, think big", perhaps?

Thanks
Shayon
Re: [PATCH] Proposal to Enable/Disable Index using ALTER INDEX
On Sat, Jun 21, 2025 at 8:38 AM Shayon Mukherjee wrote:
>
> > On Jun 11, 2025, at 9:00 AM, Sami Imseih wrote:
> >
> >> IMO, having this GUC to force the use of invisible indexes is quite
> >> strange. In my view, it detracts from the guarantees that you're meant
> >> to get from disabling indexes. What if some connection has
> >> use_invisible_index set to true? The DBA might assume all is well
> >> after having seen nobody complain and then drop the index. The user
> >> might then complain.
> >
> > Sure, this may occur. I can also imagine cases where an index is made
> > visible only for certain workloads, intentionally. But such efforts should
> > be coordinated by application teams and DBAs. Someone would need to modify
> > this GUC at the connection level, alter the database, or change the session
> > via application code. An ad-hoc connection enabling this GUC is unlikely to
> > be an issue.
> >
> > I don't see how we could provide the INVISIBLE index DDL without also
> > providing this boolean GUC. If a user creates an index that is initially
> > INVISIBLE, they need a GUC to try it out before deciding to make it
> > visible.
> >
> > It was also pointed out in the thread above that this GUC can serve as a
> > backstop for replicas if the DDL to make an index visible is delayed.
>
> Hello,
>
> Thank you everyone for all the discussions and also to Robert Treat for
> feedback and the operational considerations.
>
> It seems like there are multiple ways to solve this problem, which is
> encouraging. From the discussion, there appears to be consensus on few
> things as well, including the DDL approach, which I personally am a
> proponent for as well.
>
> I believe this is a valuable feature for DBAs and engineers working with
> large databases. Esp since it provides the confidence to "turn off" an
> index to observe the impact through their observability tools and make an
> informed decision about whether to drop it. If they're wrong, they can
> quickly rollback by making the index visible again, rather than waiting for
> a full index rebuild that can take 30 minutes to hours.
>
> The primary use case I have in mind is for helping engineers (ones not so
> seasoned like DBAs) decide whether to drop *existing* indexes. For new
> indexes, I expect most users would create them in visible mode (the
> default). Or so has been my experience so far.

What I would be using this for is when the server is choosing the wrong
index, often in multi-column index scenarios. The server can be obtuse in
those situations. So I see this as a query optimization aid rather than a
"should I drop this?" aid, given that there are several ways to do that
already. I can see scenarios where I'd want the index-backed constraint to
never be used for some/all queries. ALTER driving this seems ok. It seems
more of a planner directive to me, but having potential permanent
configuration (vs. mostly temporary needs) tips the scale IMO.

ENABLE | DISABLE seems off. I would take it further, to ENABLE | DISABLE
OPTIMIZATION, for clarity and to leave room for syntax expansion.

Nice stuff. Did not review the patch.

merlin
Re: problems with toast.* reloptions
> I think we need to do something like the following to fix this:
>
> * Teach autovacuum to combine the TOAST reloptions with the main relation's
> when processing TOAST tables (with the toast.* ones winning if both are
> set).
>
> * Teach autovacuum to resolve reloptions for parameters like
> vacuum_truncate instead of relying on vacuum_rel() to fill it in.

These two points make sense here, yes.

I investigated that this afternoon and identified two potential
implementation approaches:

1) Create functions like resolve_toast_vac_opts() and resolve_toast_rel_opts().
These would then be used in table_recheck_autovac(),
NeedsAutoVacTableForXidWraparound(), and do_autovacuum() after the toast
table check.

2) When updating a table's reloptions, also update the reloptions of its
associated TOAST table if they are not already set. Similarly, a newly
created TOAST table would inherit the parent's reloptions.

Option 2 seems more reasonable to me, as it avoids requiring customers to
manually resolve these options when they have different settings for the
parent and TOAST tables.
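For reference, a minimal sketch of what an option-1 style helper could look like (hypothetical: the function name comes from the list above, only a few AutoVacOpts members are shown, and the "-1 means unset" convention for each member would need to be verified against the real struct):

#include "postgres.h"

#include "utils/rel.h"          /* AutoVacOpts */

/*
 * Hypothetical sketch of option 1: compute the effective autovacuum options
 * for a TOAST table, letting any toast.* setting win and falling back to
 * the main relation's setting otherwise.  A real implementation would cover
 * every AutoVacOpts field the same way.
 */
static void
resolve_toast_vac_opts(AutoVacOpts *result,
                       const AutoVacOpts *toast_opts,
                       const AutoVacOpts *main_opts)
{
    /* no toast.* reloptions at all: just use the main relation's */
    if (toast_opts == NULL)
    {
        if (main_opts != NULL)
            *result = *main_opts;
        return;
    }

    *result = *toast_opts;

    if (main_opts == NULL)
        return;

    /* fall back to the main relation for anything the TOAST table left unset */
    if (result->vacuum_threshold < 0)
        result->vacuum_threshold = main_opts->vacuum_threshold;
    if (result->vacuum_scale_factor < 0)
        result->vacuum_scale_factor = main_opts->vacuum_scale_factor;
    if (result->freeze_max_age < 0)
        result->freeze_max_age = main_opts->freeze_max_age;
    /* ... and so on for the remaining members ... */
}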