Re: Some shared memory chunks are allocated even if related processes won't start
On 2024-Mar-04, Hayato Kuroda (Fujitsu) wrote: > Dear hackers, > > While reading codes, I found that ApplyLauncherShmemInit() and > AutoVacuumShmemInit() are always called even if they would not be > launched. Note that there are situations where the autovacuum launcher is started even though autovacuum is nominally turned off, and I suspect your proposal would break that. IIRC this occurs when the Xid or multixact counters cross the max_freeze_age threshold. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Porque Kim no hacía nada, pero, eso sí, con extraordinario éxito" ("Kim", Kipling)
Re: Avoid stack frame setup in performance critical routines using tail calls
Hi, On 2024-03-04 17:43:50 +1300, David Rowley wrote: > On Thu, 29 Feb 2024 at 00:29, David Rowley wrote: > > I switched over to working on doing what you did in 0002 for > > generation.c and slab.c. > > > > See the attached patch which runs the same test as in [1] (aset.c is > > just there for comparisons between slab and generation) > > > > The attached includes some additional tuning to generation.c: > > I've now pushed this. Thanks for working on all these, much appreciated! Greetings, Andres Freund
Re: Comments on Custom RMGRs
> On 29 Feb 2024, at 19:47, Danil Anisimow wrote: > > Answering your questions might take some time as I want to write a sample > patch for pg_stat_statements and make some tests. > What do you think about putting the patch into the commitfest, as it is closing in a > few hours? I’ve switched the patch to “Waiting on Author” to indicate that the patch is not available yet. Please flip it back when it’s available for review. Thanks! Best regards, Andrey Borodin.
Re: initdb's -c option behaves wrong way?
> On 4 Mar 2024, at 02:01, Tom Lane wrote: > > Daniel Gustafsson writes: >> I took the liberty to add this to the upcoming CF to make sure we don't lose >> track of it. > > Thanks for doing that, because the cfbot pointed out a problem: > I should have written pg_strncasecmp not strncasecmp. If this > version tests cleanly, I'll push it. +1, LGTM. -- Daniel Gustafsson
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi, On Sun, Mar 03, 2024 at 03:44:34PM -0600, Nathan Bossart wrote: > On Sun, Mar 03, 2024 at 11:40:00PM +0530, Bharath Rupireddy wrote: > > On Sat, Mar 2, 2024 at 3:41 AM Nathan Bossart > > wrote: > >> Would you ever see "conflict" as false and "invalidation_reason" as > >> non-null for a logical slot? > > > > No. Because both conflict and invalidation_reason are decided based on > > the invalidation reason i.e. value of slot_contents.data.invalidated. > > IOW, a logical slot that reports conflict as true must have been > > invalidated. > > > > Do you have any thoughts on reverting 007693f and introducing > > invalidation_reason? > > Unless I am misinterpreting some details, ISTM we could rename this column > to invalidation_reason and use it for both logical and physical slots. I'm > not seeing a strong need for another column. Yeah, having two columns was more for convenience purposes. Without the "conflict" one, a slot conflicting with recovery would be "a logical slot having a non-NULL invalidation_reason". I'm also fine with one column if most of you prefer it that way. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: CF entries for 17 to be reviewed
> On 3 Mar 2024, at 01:19, Melanie Plageman wrote: > > I'm not > sure if the ones that do have a target version = 17 are actually all > the patches targeting 17. Yes, for me it’s only a hint where to bump things up. I will extend the scope to other versions when I finish a pass through the entries with version 17. Here’s my take on the “Miscellaneous” section.
* Permute underscore separated components of columns before fuzzy matching The patch received some review, but not for the latest version. I pinged the reviewer for an update.
* Add pg_wait_for_lockers() function A Citus-related patch, but it may be of use to other distributed systems. This patch is worth attention, at least because the author replied to himself 10+ times. An interesting addition, with some feedback from Andres and Laurenz.
* Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set A relatively simple change to improve user-friendliness. Daniel Gustafsson expressed interest recently.
* Fix log_line_prefix to display the transaction id (%x) for statements not in a transaction block A reasonable log improvement; the code is simple but in a tricky place. There was some feedback, and I've asked whether the respondent can be a reviewer.
* Add Index-level REINDEX with multiple jobs An addition to the DBA toolset. Some unaddressed feedback; I pinged the authors.
* Add LSN <-> time conversion facility There's an ongoing discussion between Melanie and Tomas. A relatively heavyweight patchset, but given the solid technical level of the discussion this might land in 17. Take your chance to chime in with a review! :)
* date_trunc function in interval version Some time tricks. There are review notes by Tomas. I pinged the authors.
* Adding comments to help understand psql hidden queries A couple of patches for psql --echo-hidden. Seems useful and simple, though there are no reviews at all. I moved the patch to "Clients" to reflect the actual patch purpose and lighten the generic “Miscellaneous” section.
* Improving EXPLAIN's display of SubPlan nodes Some EXPLAIN changes; Alexander Alekseev was looking into this. I've asked him if he would be the reviewer.
* Should we remove -Wdeclaration-after-statement? Not really a patch, more of an opinion poll. The tally is now around -BigNumber; I see no chance for this to get into 17, but maybe in the future.
* Add to_regtypemod() SQL function A nice function; some reviewers approved the patch. I took a glance at the patch, it seems fine, so I switched it to "Ready for Committer". There are some unrelated changes to genbki.pl, but according to the thread they were needed for something.
* un-revert MAINTAIN privilege and pg_maintain predefined role The work seems to be going on.
* Checkpoint extension hook The patch is not provided yet. I've pinged the thread.
Stay tuned; I hope everyone interested in reviewing will find themselves a cool, interesting patch or two. Best regards, Andrey Borodin.
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Fri, Mar 01, 2024 at 03:03:14PM -0600, Justin Pryzby wrote: > I think if the user sets something "explicitly", the catalog should > reflect what they set. Tablespaces have dattablespace, but AMs don't -- > it's a simpler case. Okay. > For 001: we don't *need* to support "ALTER SET AM default" for leaf > tables. It doesn't do anything that's not already possible. But, if > AMs for partitioned tables are optional rather than required, then seems > to be needed to allow (re)settinng relam=0. Indeed, for non-partitioned tables DEFAULT is a sugar flavor. Not mandatory, still it's nice to have to not have to type an AM. > But for partitioned tables, I think it should set relam=0 directly. > Currently it 1) falls through to default_table_am; and 2) detects that > it's the default, so then sets relam to 0. > > Since InvalidOid is already taken, I guess you might need to introduce a > boolean flag, like set_relam, indicating that the statement has an > ACCESS METHOD clause. Yes, I don't see an alternative. The default needs a different field to be tracked down to the execution. >> + * method defined so as their children can inherit it; however, this is >> handled > > so that > >> + * Do nothing: access methods is a setting that partitions can > > method (singular), or s/is/are/ Indeed. Fixed both. > In any case, it'd be a bit confusing for the error message to still say: > > postgres=# CREATE TABLE a(i int) PARTITION BY RANGE(a) USING heap2; > ERROR: specifying a table access method is not supported on a partitioned > table I was looking at this one as well and I don't see why we could not remove it, so you are right (missed the tablespace part last week). A partitioned table created as a partition of a partitioned table would inherit the relam of its parent (0 if default is set, or non-0 is something is set). I have added some regression tests for that. And I'm finishing with the attached. To summarize SET ACCESS METHOD on a partitioned table, the semantics are: - DEFAULT sets the relam to 0, any partitions with storage would use the GUC at creation time. Partitioned tables use a relam of 0. - If a value is set for the am, relam becomes non-0. Any partitions created on it inherit it (partitioned as well as non-partitioned tables). - No USING clause means to set its relam to 0. 0001 seems OK here, 0002 needs more eyes. The bulk of the changes is in the regression tests to cover all the cases I could think of. -- Michael From 4d8a5de79b829e4f1be875c668f85bbfa6d3f37a Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 1 Mar 2024 10:22:22 +0900 Subject: [PATCH v3 1/2] Add DEFAULT option to ALTER TABLE SET ACCESS METHOD --- src/backend/commands/tablecmds.c| 3 ++- src/backend/parser/gram.y | 10 -- src/test/regress/expected/create_am.out | 21 + src/test/regress/sql/create_am.sql | 11 +++ doc/src/sgml/ref/alter_table.sgml | 6 -- 5 files changed, 46 insertions(+), 5 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index f798794556..3b1c2590fd 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -15212,7 +15212,8 @@ ATPrepSetAccessMethod(AlteredTableInfo *tab, Relation rel, const char *amname) Oid amoid; /* Check that the table access method exists */ - amoid = get_table_am_oid(amname, false); + amoid = get_table_am_oid(amname ? 
amname : default_table_access_method, + false); if (rel->rd_rel->relam == amoid) return; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 130f7fc7c3..c6e2f679fd 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -338,6 +338,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type alter_identity_column_option_list %type alter_identity_column_option %type set_statistics_value +%type set_access_method_name %type createdb_opt_list createdb_opt_items copy_opt_list transaction_mode_list @@ -2859,8 +2860,8 @@ alter_table_cmd: n->newowner = $3; $$ = (Node *) n; } - /* ALTER TABLE SET ACCESS METHOD */ - | SET ACCESS METHOD name + /* ALTER TABLE SET ACCESS METHOD { | DEFAULT } */ + | SET ACCESS METHOD set_access_method_name { AlterTableCmd *n = makeNode(AlterTableCmd); @@ -3076,6 +3077,11 @@ set_statistics_value: | DEFAULT { $$ = NULL; } ; +set_access_method_name: + ColId { $$ = $1; } + | DEFAULT { $$ = NULL; } + ; + PartitionBoundSpec: /* a HASH partition */ FOR VALUES WITH '(' hash_partbound ')' diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out index b50293d514..e843d39ee7 100644 --- a/src/test/regress/expected/create_am.out +++ b/src/test/regress/expected/create_am.out @@ -283,6 +283,27 @@ SELECT COUNT(a), COUNT(1) FILTER(WHERE a=1) FROM heaptable;
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 29, 2024 at 03:38:59PM +0530, Amit Kapila wrote: > On Thu, Feb 29, 2024 at 9:13 AM Peter Smith wrote: > > > > On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila > > wrote: > > > > > > > > Also, adding wait sounds > > > more like a boolean. So, I don't see the proposed names any better > > > than the current one. > > > > > > > Anyway, the point is that the current GUC name 'standby_slot_names' is > > not ideal IMO because it doesn't have enough meaning by itself -- e.g. > > you have to read the accompanying comment or documentation to have any > > idea of its purpose. > > > > Yeah, one has to read the description but that is true for other > parameters like "temp_tablespaces". I don't have any better ideas but > open to suggestions. What about "non_lagging_standby_slots"? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro, Thanks for the comments! > > While reading codes, I found that ApplyLauncherShmemInit() and > > AutoVacuumShmemInit() are always called even if they would not be > > launched. > > Note that there are situations where the autovacuum launcher is started > even though autovacuum is nominally turned off, and I suspect your > proposal would break that. IIRC this occurs when the Xid or multixact > counters cross the max_freeze_age threshold. Right. In GetNewTransactionId(), SetTransactionIdLimit() and some other places, PMSIGNAL_START_AUTOVAC_LAUNCHER is sent to the postmaster when the xid exceeds autovacuum_freeze_max_age. This behavior is already described in the docs [1]:
```
To ensure that this does not happen, autovacuum is invoked on any table that might contain unfrozen rows with XIDs older than the age specified by the configuration parameter autovacuum_freeze_max_age. (This will happen even if autovacuum is disabled.)
```
This means that my first idea won't work well. Even if the postmaster does not initially allocate the shared memory, backends may request to start autovacuum and use the region. However, the second idea, allocating the shared memory dynamically, is still valid. This would be a bit more efficient for systems where tuples are never frozen. Thoughts? [1]: https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
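[Editor's note: to make the "allocate dynamically" idea concrete, here is a minimal C sketch using the recently added DSM registry (GetNamedDSMSegment()). The segment name, callback, and helper function are illustrative assumptions, not the actual proposal.]
```
/*
 * Minimal sketch of the "allocate on demand" idea inside autovacuum.c,
 * using the DSM registry.  Names are illustrative only.
 */
#include "postgres.h"
#include "storage/dsm_registry.h"

static void
autovac_shmem_init_cb(void *ptr)
{
	AutoVacuumShmemStruct *shmem = (AutoVacuumShmemStruct *) ptr;

	/* runs once, in whichever process attaches the segment first */
	memset(shmem, 0, sizeof(AutoVacuumShmemStruct));
	/* ... initialize worker lists etc., as AutoVacuumShmemInit() does ... */
}

static AutoVacuumShmemStruct *
GetAutoVacuumShmem(void)
{
	bool		found;

	/* attach, initializing on first use, instead of reserving the memory
	 * unconditionally at postmaster startup */
	return (AutoVacuumShmemStruct *)
		GetNamedDSMSegment("AutoVacuumData", sizeof(AutoVacuumShmemStruct),
						   autovac_shmem_init_cb, &found);
}
```
Note that GetNamedDSMSegment() always returns the same region for a given name, so this sketch does not cover growing the allocation later; that limitation comes up again downthread.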
Re: Injection points: some tools to wait and wake
On Mon, 4 Mar 2024 at 06:27, Michael Paquier wrote: > I have mentioned that on a separate thread, Yeah, I didn't read all emails related to this feature > Perhaps we could consider that as an exception in "contrib", or have a > separate path for test modules we're OK to install (the calls had > better be superuser-only if we do that). Yeah, it makes sense that you'd want to backport fixes/changes to this. As long as you put a disclaimer in the docs that you can do that for this module, I think it would be fine. Our tests fairly regularly break anyway when changing minor versions of postgres in our CI, e.g. due to improvements in the output of isolationtester. So if changes to this module require some changes that's fine by me. Seems much nicer than having to copy-paste the code.
Re: Synchronizing slots from primary to standby
On Mon, Mar 4, 2024 at 9:35 AM Peter Smith wrote: > > OK, if the code will remain as-is wouldn't it be better to anyway > change the function name to indicate what it really does? > > e.g. NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys > This seems too long. I would prefer the current name NeedToWaitForWal as waiting for WAL means waiting to flush the WAL and waiting to replicate it to standby. On similar lines, the variable name standby_slot_oldest_flush_lsn looks too long. How about ss_oldest_flush_lsn (where ss indicates standy_slots)? Apart from this, I have made minor modifications in the attached. -- With Regards, Amit Kapila. diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index aa477015bd..1530e7720c 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -2390,8 +2390,8 @@ validate_standby_slots(char **newval) { /* * We cannot validate the replication slot if the replication slots' -* data has not been initialized. This is ok as we will validate the -* specified slot when waiting for them to catch up. See +* data has not been initialized. This is ok as we will anyway validate +* the specified slot when waiting for them to catch up. See * StandbySlotsHaveCaughtup for details. */ } @@ -2473,8 +2473,7 @@ assign_standby_slot_names(const char *newval, void *extra) char *standby_slot_names_cpy = extra; /* -* The standby slots may have changed, so we need to recompute the oldest -* LSN. +* The standby slots may have changed, so we must recompute the oldest LSN. */ standby_slot_oldest_flush_lsn = InvalidXLogRecPtr; @@ -2664,6 +2663,7 @@ StandbySlotsHaveCaughtup(XLogRecPtr wait_for_lsn, int elevel) if (caught_up_slot_num != list_length(standby_slot_names_list)) return false; + /* The standby_slot_oldest_flush_lsn must not retreat. */ Assert(XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) || min_restart_lsn >= standby_slot_oldest_flush_lsn); diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index b71408d533..580f9dabd3 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1805,10 +1805,10 @@ NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, /* * Check if the standby slots have caught up to the flushed position. It -* is good to wait up to flushed position and then let it send the changes -* to logical subscribers one by one which are already covered in flushed -* position without needing to wait on every change for standby -* confirmation. +* is good to wait up to the flushed position and then let the WalSender +* send the changes to logical subscribers one by one which are already +* covered by the flushed position without needing to wait on every change +* for standby confirmation. */ if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) return true;
Re: date_trunc function in interval version
Tomas Vondra wrote on 18.02.2024 01:29: Hi, Please don't too-post on this list. The custom is to bottom-post or reply inline, and it's much easier to follow such replies. On 12/23/23 23:45, Przemysław Sztoch wrote: date_bin has big problem with DST. In example, if you put origin in winter zone, then generated bin will be incorrect for summer input date. date_trunc is resistant for this problem. My version of date_trunc is additionally more flexible, you can select more granular interval, 12h, 8h, 6h, 15min, 10 min etc... I'm not very familiar with date_bin(), but is this issue inherent or could we maybe fix date_bin() to handle DST better? Apparently the functionality is identical to date_bin. When I saw date_bin in the documentation, I thought it solved all my problems. Unfortunately, DST problems have many corner cases. I tried to change date_bin several times, but unfortunately in some cases it would start working differently than before. In any case, the patch needs to add the new stuff to the SGML docs (to doc/src/sgml/func.sgml), which now documents the date_trunc(text,...) variant only. Updated. -- Przemysław Sztoch | Mobile +48 509 99 00 66 diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e5fa82c161..95cdfab2d0 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9472,6 +9472,23 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); + + + + date_trunc + + date_trunc ( interval, timestamp with time zone ) + timestamp with time zone + + + Truncate to specified precision in the specified time zone. Interval has to be a divisor of a day, week or century. + + + date_trunc('30 minutes'::interval, timestamp '2001-02-16 20:38:40+00') + 2001-02-16 20:30:00+00 + + + date_trunc ( text, timestamp with time zone, text ) @@ -9487,6 +9504,24 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); + + + date_trunc ( interval, timestamp with time zone, text ) + timestamp with time zone + + + Truncate to specified precision in the specified time zone. Interval has to be a divisor of a day, week or century. + + + date_trunc('3 hour'::interval, timestamptz '2001-02-16 21:38:40+00', 'Europe/Warsaw') + 2001-02-16 20:00:00+00 + + + date_trunc('15 minutes'::interval, timestamptz '2001-02-16 21:38:40+00', 'Europe/Warsaw') + 2001-02-16 21:30:00+00 + + + date_trunc ( text, interval ) diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index 7a016a6923..e376968c49 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -4999,6 +4999,177 @@ timestamptz_trunc_zone(PG_FUNCTION_ARGS) PG_RETURN_TIMESTAMPTZ(result); } +/* + * Common code for timestamptz_trunc_int() and timestamptz_trunc_int_zone(). + * + * tzp identifies the zone to truncate with respect to. We assume + * infinite timestamps have already been rejected. 
+ */ +static TimestampTz +timestamptz_trunc_int_internal(Interval *interval, TimestampTz timestamp, pg_tz *tzp) +{ + TimestampTz result; + int tz; + int interval_parts = 0; + boolbad_interval = false; + boolredotz = false; + fsec_t fsec; + struct pg_tm tt, + *tm = &tt; + + if (interval->month != 0) + { + interval_parts++; + /* 1200 = hundred years */ + if ((1200/interval->month) * interval->month != 1200) + bad_interval = true; + } + if (interval->day != 0) + { + interval_parts++; + if (interval->day != 1 && interval->day != 7) + bad_interval = true; + } + if (interval->time != 0) + { + interval_parts++; + if (interval->time > USECS_PER_SEC) + { + if ((interval->time % USECS_PER_SEC) != 0) + bad_interval = true; + if ((USECS_PER_DAY/interval->time) * interval->time != USECS_PER_DAY) + bad_interval = true; + } + else if (interval->time < USECS_PER_SEC && (USECS_PER_SEC/interval->time) * interval->time != USECS_PER_SEC) + bad_interval = true; + } + if (interval_parts != 1 || bad_interval) + { + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("interval has to be a divisor of a day, week or century.")))
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > Hearty congratulations. Well deserved. -- Best Wishes, Ashutosh Bapat
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > > Thank you and congratulations to all! > Congratulations to all! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
Hi, On Sun, Mar 03, 2024 at 07:56:32AM +, Zhijie Hou (Fujitsu) wrote: > Here is the V104 patch which addressed above and Peter's comments. Thanks! A few more random comments: 1 === +The function may be blocked if the specified slot is a failover enabled s/blocked/waiting/ ? 2 === +* specified slot when waiting for them to catch up. See +* StandbySlotsHaveCaughtup for details. s/StandbySlotsHaveCaughtup/StandbySlotsHaveCaughtup()/ ? 3 === + /* Now verify if the specified slots really exist and have correct type */ remove "really"? 4 === + /* +* Don't need to wait for the standbys to catch up if there is no value in +* standby_slot_names. +*/ + if (standby_slot_names_list == NIL) + return true; + + /* +* Don't need to wait for the standbys to catch up if we are on a standby +* server, since we do not support syncing slots to cascading standbys. +*/ + if (RecoveryInProgress()) + return true; + + /* +* Don't need to wait for the standbys to catch up if they are already +* beyond the specified WAL location. +*/ + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && + standby_slot_oldest_flush_lsn >= wait_for_lsn) + return true; What about using OR conditions instead? 5 === +static bool +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, +uint32 *wait_event) Not a big deal but does it need to return a bool? (I mean it all depends of the *wait_event value). Is it for better code readability in the caller? 6 === +static bool +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, +uint32 *wait_event) Same questions as for NeedToWaitForStandby(). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > > Thank you and congratulations to all! +1. Congratulations to all! -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Mar 4, 2024 at 4:52 PM Bertrand Drouvot wrote: > > On Sun, Mar 03, 2024 at 07:56:32AM +, Zhijie Hou (Fujitsu) wrote: > > Here is the V104 patch which addressed above and Peter's comments. > > Thanks! > > > 4 === > > + /* > +* Don't need to wait for the standbys to catch up if there is no > value in > +* standby_slot_names. > +*/ > + if (standby_slot_names_list == NIL) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if we are on a > standby > +* server, since we do not support syncing slots to cascading > standbys. > +*/ > + if (RecoveryInProgress()) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if they are already > +* beyond the specified WAL location. > +*/ > + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && > + standby_slot_oldest_flush_lsn >= wait_for_lsn) > + return true; > > What about using OR conditions instead? > I think we could use OR conditions, but the code seems easier to follow this way; it's just a matter of personal choice. > 5 === > > +static bool > +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > +uint32 *wait_event) > > Not a big deal but does it need to return a bool? (I mean it all depends of > the *wait_event value). Is it for better code readability in the caller? > Yes, I think so. Adding checks based on wait_events would look a bit awkward. -- With Regards, Amit Kapila.
Re: a wrong index choose when statistics is out of date
David Rowley writes: > On Sun, 3 Mar 2024 at 20:08, Andy Fan wrote: >> The issue can be reproduced with the following steps: >> >> create table x_events (.., created_at timestamp, a int, b int); >> >> create index idx_1 on t(created_at, a); >> create index idx_2 on t(created_at, b); >> >> query: >> select * from t where create_at = current_timestamp and b = 1; >> >> index (created_at, a) rather than (created_at, b) may be chosen for the >> above query if the statistics think "create_at = current_timestamp" has >> no rows; then both indexes look OK, but actually that is true only because >> the statistics are out of date. > > I don't think there's really anything too special about the fact that > the created_at column is always increasing. We commonly get 1-row > estimates after multiplying the selectivities from individual stats. > Your example just seems like yet another reason that this could > happen. You are right that there are more cases which can lead to this. However, this is the only case where the "created_at = $1" trick works, which was the problem I wanted to resolve when I was writing. > I've been periodically talking about introducing "risk" as a factor > that the planner should consider. I did provide some detail in [1] > about the design that was in my head at that time. I'd not previously > thought that it could also solve this problem, but after reading your > email, I think it can. Haha, I remember you were against the "risk factor" before at [1]; at that time we were talking about exactly the same topic as here, and I proposed another risk factor. Without an agreement, I implemented it in my own internal version and got burned by it, in that I didn't pay enough attention to Bitmap Index Scan vs. Index Scan. Then I forgot about the "risk factor". > > I don't think it would be right to fudge the costs in any way, but I > think the risk factor for IndexPaths could take into account the > number of unmatched index clauses and increment the risk factor, or > "certainty_factor" as it is currently in my brain-based design. That > way add_path() would be more likely to prefer the index that matches > the most conditions. Is this somewhat similar to my proposal at [1]? What do you think about treating 'col op const' as 'col op $1' for the marked column? This would resolve only a subset of the questions in your mind, but the method seems to have a solid rationale. Currently I treat the risk factor the way you did before, but this may be another occasion for me to change my mind again. [1] https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com -- Best Regards Andy Fan
Re: a wrong index choose when statistics is out of date
Andrei Lepikhov writes: > On 3/3/2024 14:01, Andy Fan wrote: >> 1. We can let the user define the column as the value is increased day by >> day. the syntax may be: >> ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED. >> then when a query like 'create_at op const', the statistics module >> can >> treat it as 'created_at = $1'. so the missing statistics doesn't make >> difference. Then I think the above issue can be avoided. > Let me write some words to support your efforts in that way. > I also have some user cases where they periodically insert data in large > chunks. These chunks contain 'always increased' values, and it causes > trouble each time they start an analytic query over this new data before > the analyze command. > I have thought about that issue before but invented nothing special > except a more aggressive analysis of such tables. I have to say we ran into exactly the same situation and used the same trick to solve the problem, and we know that no matter how aggressive the analysis is, the problem may still happen. > Your trick can work, but it needs a new parameter in pg_type and a lot > of additional code for such a rare case. > I'm looking forward to the demo patch. Maybe my word "auto_increased" sounds too much like a type, but what I actually want is to add a new attribute to pg_attribute which is tied to one column in one relation. When we compute a selectivity for this *column*, we apply the trick. This probably doesn't need much code. -- Best Regards Andy Fan
Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)
On Mon, 4 Mar 2024 at 03:18, Andrey M. Borodin wrote: > > > > On 14 Jan 2024, at 18:55, John Naylor wrote: > > > > On Sat, Jan 13, 2024 at 9:36 PM Ranier Vilela > wrote: > >> > >> On Tue, 9 Jan 2024 at 06:31, John Naylor < johncnaylo...@gmail.com> wrote: > > > >>> This just moves an operation from one place to the other, while > >>> obliterating the explanatory comment, so I don't see an advantage. > >> > >> Well, I think that is precisely the case for using memset. > >> The way initialization is currently done is much slower and harmful to branch prediction. > >> Of course, the gain should be small, but it is fully justified for switching to memset. > > > > We haven't seen any evidence or reasoning for that. Simple > > rules-of-thumb are not enough. > > > > Hi Ranier, > > I’ll mark CF entry [0] as “Returned with Feedback”. Feel free to reopen the > item in this CF or submit it to the next, if you want to continue working on > this. > > I took a glance into the patch, and I would agree that setting fields to > nonzero values with memset() is somewhat unusual. Please provide stronger > evidence to do so. I counted the calls with non-zero memset in the entire Postgres code, and there are about 183 such calls. At least 4 of them use -1:
File contrib\pg_trgm\trgm_op.c: memset(lastpos, -1, sizeof(int) * len);
File src\backend\executor\execPartition.c: memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
File src\backend\partitioning\partprune.c: memset(subplan_map, -1, nparts * sizeof(int)); memset(subpart_map, -1, nparts * sizeof(int));
Does filling a memory area one by one, with branches, really need strong evidence to show it is better than filling the memory area all at once, without branches? best regards, Ranier Vilela
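[Editor's note: for readers following along, the reason -1 is one of the few non-zero values that is safe to memset into an int array is purely byte-level. This small standalone example (not from the patch) shows the effect:]
```
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int		a[4];

	/* memset writes the byte 0xFF into every byte; an int whose bytes are
	 * all 0xFF is -1 on two's-complement machines, so this fill is safe */
	memset(a, -1, sizeof(a));
	printf("%d\n", a[0]);		/* prints -1 */

	/* this does NOT generalize to other values: filling with the byte
	 * 0x01 yields 0x01010101 per int, which is 16843009, not 1 */
	memset(a, 1, sizeof(a));
	printf("%d\n", a[0]);		/* prints 16843009 */

	return 0;
}
```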
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada wrote: > > On Sun, Mar 3, 2024 at 2:43 PM John Naylor wrote: > > Right, I should have said "reset". Resetting a context will delete > > its children as well, and it seems like it should work to reset the tree > > context, and we don't have to know whether that context actually > > contains leaves at all. That should allow copying "tree context" to > > "leaf context" in the case where we have no special context for > > leaves. > > Resetting the tree->context seems to work. But I think we should note > for callers that the dsa_area passed to RT_CREATE should be created in > a different context than the context passed to RT_CREATE because > otherwise RT_FREE() will also free the dsa_area. For example, the > following code in test_radixtree.c will no longer work: > > dsa = dsa_create(tranche_id); > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id); > : > rt_free(radixtree); > dsa_detach(dsa); // dsa is already freed. > > So I think that the practical usage of the radix tree will be that the > caller creates a memory context for a radix tree and passes it to > RT_CREATE(). That sounds workable to me. > I've attached an updated patch set: > > - 0008 updates RT_FREE_RECURSE(). Thanks! > - 0009 is an updated version of the radix tree memory-handling cleanup. Looks pretty good, as does the rest. I'm going through again, squashing and making tiny adjustments to the template. The only thing not done is changing the test with many values to resemble the perf test more. I wrote: > > Secondly, I thought about my recent work to skip checking if we first > > need to create a root node, and that has a harmless (for vacuum at > > least) but slightly untidy behavior: When RT_SET is first called, and > > the key is bigger than 255, new nodes will go on top of the root node. > > These have chunk '0'. If all subsequent keys are big enough, the > > original root node will stay empty. If all keys are deleted, there will > > be a chain of empty nodes remaining. Again, I believe this is > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > > on this, but likely not today. > > This turns out to be a lot trickier than it looked, so it seems best > to allow a trivial amount of waste, as long as it's documented > somewhere. It also wouldn't be terrible to re-add those branches, > since they're highly predictable. I put a little more work into this and got it working; it just needs a small amount of finicky coding. I'll share tomorrow. I have a question about RT_FREE_RECURSE: + check_stack_depth(); + CHECK_FOR_INTERRUPTS(); I'm not sure why these are here: the first seems overly paranoid, although harmless, but the second is probably a bad idea. Why should the user be able to interrupt the freeing of memory? Also, I'm not quite happy that RT_ITER has a copy of a pointer to the tree, leading to coding like "iter->tree->ctl->root". I *think* it would be easier to read if the tree was a parameter to these iteration functions. That would require an API change, so the tests/tidstore would have some churn. I can do that, but before trying I wanted to see what you think -- is there some reason to keep the current way?
Re: Eager aggregation, take 3
Richard Guo writes: > Hi All, > > Eager aggregation is a query optimization technique that partially > pushes a group-by past a join, and finalizes it once all the relations > are joined. Eager aggregation reduces the number of input rows to the > join and thus may result in a better overall plan. This technique is > thoroughly described in the 'Eager Aggregation and Lazy Aggregation' > paper [1]. This is a really helpful and far-from-easy task. Even though I am not sure when I can find time to study this, I want to say "Thanks for working on this!" first, and I hope we can make real progress on this topic. Good luck! -- Best Regards Andy Fan
Re: CF entries for 17 to be reviewed
> On 4 Mar 2024, at 13:42, Andrey M. Borodin wrote: > > Here’s my take on “Miscellaneous” section. I’ve read the other small sections.
Monitoring & Control
* Logging parallel worker draught A patchset improving the logging of resource starvation when parallel workers are spawned. Some major refactoring was proposed. I've switched it to WoA.
* System username in pg_stat_activity There's an active discussion on whether to extend pg_stat_activity.
Testing
* Add basic tests for the low-level backup method Michael Paquier provided feedback, so I switched it to WoA.
System Administration
* recovery modules There was a very active discussion, but it stalled after April 2023, and only some rebases have happened since. Maybe a fresh look could revive the thread.
* Possibility to disable `ALTER SYSTEM` The discussion seems active, but inconclusive.
* Add checkpoint/redo LSNs to recovery errors. Michael Paquier provided feedback, so I switched it to WoA.
Security
* add not_before and not_after timestamps to sslinfo extension and pg_stat_ssl The most recent version was provided by Daniel Gustafsson, but the thread stalled in September 2023. Maybe Jacob or some other reviewer could refresh it; IMO this might land in 17.
Thanks! Best regards, Andrey Borodin.
Re: Some shared memory chunks are allocated even if related processes won't start
On 2024-Mar-04, Hayato Kuroda (Fujitsu) wrote: > However, the second idea is still valid, which allows the allocation > of shared memory dynamically. This is a bit efficient for the system > which tuples won't be frozen. Thought? I think it would be worth allocating AutoVacuumShmem->av_workItems using dynamic shmem allocation, particularly to prevent workitems from being discarded just because the array is full¹; but other than that, the struct is just 64 bytes long so I doubt it's useful to allocate it dynamically. ¹ I mean, if the array is full, just allocate another array, point to it from the original one, and keep going. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "The problem with the facetime model is not just that it's demoralizing, but that the people pretending to work interrupt the ones actually working." -- Paul Graham, http://www.paulgraham.com/opensource.html
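[Editor's note: a hypothetical sketch of the chained-array idea in the footnote; the struct layout, field names, and use of a DSA pointer are illustrative assumptions, not a worked-out design.]
```
/*
 * Hypothetical layout for the "allocate another array" idea: each block
 * of work items links to the next block, allocated on demand from
 * dynamic shared memory when the current block fills up.
 */
typedef struct AutoVacuumWorkItems
{
	int			avs_usedItems;	/* slots used in this block */
	dsa_pointer avs_next;		/* next block, or InvalidDsaPointer */
	AutoVacuumWorkItem avs_workItems[NUM_WORKITEMS];
} AutoVacuumWorkItems;
```
A requester that finds every slot in a block occupied would follow avs_next, allocating a fresh block when it reaches the end of the chain, instead of silently discarding the work item.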
Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery
Hi hackers, I wanted to hook into the EXPLAIN output for queries and add some extra information, but since there is no standard_ExplainOneQuery() I had to copy the code and create my own version. Since the pattern with other hooks for a function WhateverFunction() seems to be that there is a standard_WhateverFunction() for each WhateverFunction_hook, I created a patch to follow this pattern for your consideration. I was also considering adding a callback so that you can annotate any node with explanatory information that is not a custom scan node. This could be used to propagate and summarize information from custom scan nodes, but I had no immediate use for that so did not add it here. I would still be interested in hearing if you think this is something that would be useful to the community. Best wishes, Mats Kindahl, Timescale From eaab4d7c088ff3ee9b0e6ec3de96677bafd184c0 Mon Sep 17 00:00:00 2001 From: Mats Kindahl Date: Mon, 4 Mar 2024 12:38:05 +0100 Subject: Improve support for ExplainOneQuery hook There is a hook ExplainOneQuery_hook, but there is no corresponding standard_ExplainOneQuery and the code that belongs there is written in-line in ExplainOneQuery(). As a result, anybody adding a hook for ExplainOneQuery_hook has to copy the code from ExplainOneQuery() to run the standard ExplainOneQuery before adding their own messages. This commit fixes this by refactoring ExplainOneQuery() and factor out the standard code into standard_ExplainOneQuery() and call it from ExplainOneQuery() in a similar manner to how it is done for other hooks. --- src/backend/commands/explain.c | 106 ++--- src/include/commands/explain.h | 4 ++ 2 files changed, 61 insertions(+), 49 deletions(-) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 78754bc6ba..3b7bed3ca2 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -391,63 +391,71 @@ ExplainOneQuery(Query *query, int cursorOptions, /* if an advisor plugin is present, let it manage things */ if (ExplainOneQuery_hook) - (*ExplainOneQuery_hook) (query, cursorOptions, into, es, - queryString, params, queryEnv); + (*ExplainOneQuery_hook)(query, cursorOptions, into, es, + queryString, params, queryEnv); else - { - PlannedStmt *plan; - instr_time planstart, - planduration; - BufferUsage bufusage_start, - bufusage; - MemoryContextCounters mem_counters; - MemoryContext planner_ctx = NULL; - MemoryContext saved_ctx = NULL; - - if (es->memory) - { - /* - * Create a new memory context to measure planner's memory - * consumption accurately. Note that if the planner were to be - * modified to use a different memory context type, here we would - * be changing that to AllocSet, which might be undesirable. - * However, we don't have a way to create a context of the same - * type as another, so we pray and hope that this is OK. 
- */ - planner_ctx = AllocSetContextCreate(CurrentMemoryContext, -"explain analyze planner context", -ALLOCSET_DEFAULT_SIZES); - saved_ctx = MemoryContextSwitchTo(planner_ctx); - } + standard_ExplainOneQuery(query, cursorOptions, into, es, + queryString, params, queryEnv); +} - if (es->buffers) - bufusage_start = pgBufferUsage; - INSTR_TIME_SET_CURRENT(planstart); +void +standard_ExplainOneQuery(Query *query, int cursorOptions, + IntoClause *into, ExplainState *es, + const char *queryString, ParamListInfo params, + QueryEnvironment *queryEnv) +{ + PlannedStmt *plan; + instr_time planstart, +planduration; + BufferUsage bufusage_start, +bufusage; + MemoryContextCounters mem_counters; + MemoryContext planner_ctx = NULL; + MemoryContext saved_ctx = NULL; + + if (es->memory) + { + /* + * Create a new memory context to measure planner's memory consumption + * accurately. Note that if the planner were to be modified to use a + * different memory context type, here we would be changing that to + * AllocSet, which might be undesirable. However, we don't have a way + * to create a context of the same type as another, so we pray and + * hope that this is OK. + */ + planner_ctx = AllocSetContextCreate(CurrentMemoryContext, + "explain analyze planner context", + ALLOCSET_DEFAULT_SIZES); + saved_ctx = MemoryContextSwitchTo(planner_ctx); + } - /* plan the query */ - plan = pg_plan_query(query, queryString, cursorOptions, params); + if (es->buffers) + bufusage_start = pgBufferUsage; + INSTR_TIME_SET_CURRENT(planstart); - INSTR_TIME_SET_CURRENT(planduration); - INSTR_TIME_SUBTRACT(planduration, planstart); + /* plan the query */ + plan = pg_plan_query(query, queryString, cursorOptions, params); - if (es->memory) - { - MemoryContextSwitchTo(saved_ctx); - MemoryContextMemConsumed(planner_ctx, &mem_counters); - } + INSTR_TIME_SET_CURRENT(planduration); + INSTR_TIME_SUBTRACT(planduration, plans
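[Editor's note: to illustrate the usage pattern the patch enables, here is a minimal, hypothetical extension hook that calls the new standard_ExplainOneQuery() and appends one extra property. The hook body and label are illustrative; a real extension should also save and chain any previously installed hook.]
```
/* Sketch of an extension using the refactored hook; assumes the patch
 * above is applied. */
#include "postgres.h"
#include "fmgr.h"
#include "commands/explain.h"

PG_MODULE_MAGIC;

static void
my_explain_one_query(Query *query, int cursorOptions, IntoClause *into,
					 ExplainState *es, const char *queryString,
					 ParamListInfo params, QueryEnvironment *queryEnv)
{
	/* run the standard EXPLAIN machinery first... */
	standard_ExplainOneQuery(query, cursorOptions, into, es,
							 queryString, params, queryEnv);

	/* ...then append extra information without copying any core code */
	ExplainPropertyText("Extension Note", "extra information here", es);
}

void
_PG_init(void)
{
	ExplainOneQuery_hook = my_explain_one_query;
}
```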
Re: Shared detoast Datum proposal
On 3/4/24 02:29, Andy Fan wrote: > > Tomas Vondra writes: > >> On 3/3/24 07:10, Andy Fan wrote: >>> >>> Hi, >>> >>> Here is a updated version, the main changes are: >>> >>> 1. an shared_detoast_datum.org file which shows the latest desgin and >>> pending items during discussion. >>> 2. I removed the slot->pre_detoast_attrs totally. >>> 3. handle some pg_detoast_datum_slice use case. >>> 4. Some implementation improvement. >>> >> >> I only very briefly skimmed the patch, and I guess most of my earlier >> comments still apply. > > Yes, the overall design is not changed. > >> But I'm a bit surprised the patch needs to pass a >> MemoryContext to so many places as a function argument - that seems to >> go against how we work with memory contexts. Doubly so because it seems >> to only ever pass CurrentMemoryContext anyway. So what's that about? > > I think you are talking about the argument like this: > > /* -- > - * detoast_attr - > + * detoast_attr_ext - > * > * Public entry point to get back a toasted value from compression > * or external storage. The result is always non-extended varlena form. > * > + * ctx: The memory context which the final value belongs to. > + * > * Note some callers assume that if the input is an EXTERNAL or COMPRESSED > * datum, the result will be a pfree'able chunk. > * -- > */ > > +extern struct varlena * > +detoast_attr_ext(struct varlena *attr, MemoryContext ctx) > > This is mainly because 'detoast_attr' will apply more memory than the > "final detoast datum" , for example the code to scan toast relation > needs some memory as well, and what I want is just keeping the memory > for the final detoast datum so that other memory can be released sooner, > so I added the function argument for that. > What exactly does detoast_attr allocate in order to scan toast relation? Does that happen in master, or just with the patch? If with master, I suggest to ignore that / treat that as a separate issue and leave it for a different patch. In any case, the custom is to allocate results in the context that is set in CurrentMemoryContext at the moment of the call, and if there's substantial amount of allocations that we want to free soon, we either do that by explicit pfree() calls, or create a small temporary context in the function (but that's more expensive). I don't think we should invent a new convention where the context is passed as an argument, unless absolutely necessary. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
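[Editor's note: a minimal sketch of the convention Tomas describes, where the caller, not the callee, decides where the result lives by switching CurrentMemoryContext around the call. The detoast_into() helper and result_context are illustrative, not existing code.]
```
#include "postgres.h"
#include "access/detoast.h"

/* Illustrative helper: detoast a value into a caller-chosen context
 * without changing detoast_attr()'s signature. */
static struct varlena *
detoast_into(struct varlena *attr, MemoryContext result_context)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(result_context);

	/* for EXTERNAL/COMPRESSED inputs the result is palloc'd, and it now
	 * lands in result_context */
	struct varlena *result = detoast_attr(attr);

	MemoryContextSwitchTo(oldcxt);
	return result;
}
```
Under this convention, short-lived scratch allocations are handled inside the callee, with explicit pfree() calls or a small temporary context, as described above.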
Re: Commitfest Manager for March
Hi, > >> The call for a CFM volunteer is still open. > > > > I always wanted to try. And most of the stuff I'm interested in is already > > committed. > > > > But given importance of last commitfest before feature freeze, we might be > > interested in more experienced CFM. > > If I can do something useful - I'm up for it. > > I'm absolutely convinced you have more than enough experience with postgres > hacking to do an excellent job. I'm happy to give a hand as well. > > Thanks for volunteering! > > (someone from pginfra will give you the required admin permissions on the CF > app) Thanks for volunteering, Andrey! If you need any help please let me know. -- Best regards, Aleksander Alekseev
Re: PostgreSQL Contributors Updates
> > All, > > > > The PostgreSQL Contributor Page > > (https://www.postgresql.org/community/contributors/) includes people who > > have made substantial, long-term contributions of time and effort to the > > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > > following people for their contributions. > > > > New PostgreSQL Contributors: > > > > * Bertrand Drouvot > > * Gabriele Bartolini > > * Richard Guo > > > > New PostgreSQL Major Contributors: > > > > * Alexander Lakhin > > * Daniel Gustafsson > > * Dean Rasheed > > * John Naylor > > * Melanie Plageman > > * Nathan Bossart > > > > Thank you and congratulations to all! > > +1. Congratulations to all! Congratulations to all! -- Best regards, Aleksander Alekseev
Re: Shared detoast Datum proposal
Tomas Vondra writes: >>> But I'm a bit surprised the patch needs to pass a >>> MemoryContext to so many places as a function argument - that seems to >>> go against how we work with memory contexts. Doubly so because it seems >>> to only ever pass CurrentMemoryContext anyway. So what's that about? >> >> I think you are talking about the argument like this: >> >> /* -- >> - * detoast_attr - >> + * detoast_attr_ext - >> * >> * Public entry point to get back a toasted value from compression >> * or external storage. The result is always non-extended varlena form. >> * >> + * ctx: The memory context which the final value belongs to. >> + * >> * Note some callers assume that if the input is an EXTERNAL or COMPRESSED >> * datum, the result will be a pfree'able chunk. >> * -- >> */ >> >> +extern struct varlena * >> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx) >> >> This is mainly because 'detoast_attr' will apply more memory than the >> "final detoast datum" , for example the code to scan toast relation >> needs some memory as well, and what I want is just keeping the memory >> for the final detoast datum so that other memory can be released sooner, >> so I added the function argument for that. >> > > What exactly does detoast_attr allocate in order to scan toast relation? > Does that happen in master, or just with the patch? It is in the current master; for example, the TupleTableSlot creation needed to scan the toast relation requires a memory allocation. > If with master, I > suggest to ignore that / treat that as a separate issue and leave it for > a different patch. OK, I can make it a separate commit in the next version. > In any case, the custom is to allocate results in the context that is > set in CurrentMemoryContext at the moment of the call, and if there's > substantial amount of allocations that we want to free soon, we either > do that by explicit pfree() calls, or create a small temporary context > in the function (but that's more expensive). > > I don't think we should invent a new convention where the context is > passed as an argument, unless absolutely necessary. Hmm, in this case, if we don't add the new argument, we have to get the detoast datum in Context-1 and copy it to the desired memory context, which is exactly what I want to avoid. I think we have the same decision to make on the TOAST cache method as well. -- Best Regards Andy Fan
Re: Extract numeric filed in JSONB more effectively
On 09.02.24 10:05, Andy Fan wrote: 2. Where has the current feature been blocked for the past few months? It's an error-message compatibility issue! Continuing with the above setup:
master:
select * from tb where (a->'b')::numeric > 3::numeric;
ERROR: cannot cast jsonb string to type numeric
select * from tb where (a->'b')::int4 > 3::numeric;
ERROR: cannot cast jsonb string to type integer
You can see that the error messages differ (numeric vs. integer). Patched: we can still get the same error message as master, BUT the code looks odd.
select * from tb where (a->'b')::int4 > 3;
QUERY PLAN
---
Seq Scan on public.tb
  Output: a
  Filter: ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 'b'::text), '23'::oid))::integer > 3)
(3 rows)
You can see "jsonb_finish_numeric(.., '23'::oid)"; the '23'::oid is there only to produce *"integer"* in the error message: "cannot cast jsonb string to type *integer*". Now the situation is that either we use the odd argument ('23'::oid) in jsonb_finish_numeric, or we emit an error message incompatible with the previous version. I'm not sure which way is better, but this is where the current feature is blocked. I'm not bothered by that. It also happens on occasion in the backend C code that we pass around extra information to be able to construct better error messages. The functions here are not backend C code, but they are internal functions, so similar considerations can apply. But I have a different question about this patch set. This has some overlap with the JSON_VALUE function that is being discussed at [0][1]. For example, if I apply the patch v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run select count(*) from tb where json_value(a, '$.a' returning numeric) = 2; and I get a noticeable performance boost over select count(*) from tb where cast (a->'a' as numeric) = 2; So some questions to think about: 1. Compare performance of base case vs. this patch vs. json_value. 2. Can json_value be optimized further? 3. Is this patch still needed? 3a. If yes, should the internal rewriting make use of json_value or share code with it? [0]: https://www.postgresql.org/message-id/flat/CA+HiwqE4XTdfb1nW=ojoy_tqsrhyt-q_kb6i5d4xckyrlc1...@mail.gmail.com [1]: https://commitfest.postgresql.org/47/4377/
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
Hi hackers, >> I did not see any test addition for this, can we add it? > > > Ok, added it in the attached version. > > This was an experimental patch; I was looking for comments on the proposed > approach -- whether we could simply skip the throwaway NOT NULL constraint for > a deferred PK constraint. Moreover, skip that for any PK constraint. I confirm that the patch fixes the bug. All the tests pass. Looks like RfC to me. -- Best regards, Aleksander Alekseev
Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery
Hi, > I wanted to hook into the EXPLAIN output for queries and add some extra > information, but since there is no standard_ExplainOneQuery() I had to copy > the code and create my own version. > > Since the pattern with other hooks for a function WhateverFunction() seems to > be that there is a standard_WhateverFunction() for each > WhateverFunction_hook, I created a patch to follow this pattern for your > consideration. > > I was also considering adding a callback so that you can annotate any node > with explanatory information that is not a custom scan node. This could be > used to propagate and summarize information from custom scan nodes, but I had > no immediate use for that so did not add it here. I would still be interested > in hearing if you think this is something that would be useful to the > community. Thanks for the patch. LGTM. I registered the patch on the nearest open CF [1] and marked it as RfC. It is a pretty straightforward refactoring. [1]: https://commitfest.postgresql.org/48/4879/ -- Best regards, Aleksander Alekseev
Re: abi-compliance-checker
On 27.02.24 14:25, Alvaro Herrera wrote: I like this idea and I think we should integrate it with the objective of it becoming the workhorse of ABI-stability testing. However, I do not think that the subsequent patches should be part of the tree at all; certainly not the produced .xml files in your 0004, as that would be far too unstable and would cause a lot of pointless churn. Looking at this again, if we don't want the .xml files in the tree, then we don't really need this patch series. Most of the delicate work in the 0001 patch was to find the right combination of abidw options to reduce the variability in the .xml output files (--no-show-locs is an obvious example). If we don't want to record the .xml files in the tree, then we don't need all these complications. For example, if we want to check the postgres backend ABI across minor versions, we could just compile it multiple times and compare the binaries directly:
git checkout REL_16_0
meson setup build-0
meson compile -C build-0
git checkout REL_16_STABLE
meson setup build-1
meson compile -C build-1
abidiff --no-added-syms build-0/src/backend/postgres build-1/src/backend/postgres
The way I see this working is that we set up a buildfarm animal [per architecture] that runs the new rules produced by the abidw option and stores the result locally, so that for stable branches it can turn red when an ABI-breaking change relative to the previous minor release of the same branch is introduced. There's no point in it ever turning red on the master branch, since we're obviously not concerned with ABI changes there. Maybe the way forward here is to write a buildfarm module for this, and then see from there what further needs there are.
Re: type cache cleanup improvements
Hi Teodor, > I'd like to suggest two independent patches to improve the performance of type cache > cleanup. I found a case where type cache cleanup was a reason for low > performance. In short, a customer creates 100 thousand temporary tables in one > transaction. > > 1 mapRelType.patch > It just adds a local map between a relation and its type, as was suggested in the > comment above TypeCacheRelCallback(). Unfortunately, using the syscache here was > impossible because this callback can be called outside a transaction, which > makes catalog lookups impossible. > > 2 hash_seq_init_with_hash_value.patch > TypeCacheTypCallback() loops over the type hash to find the entry with a given hash > value. There are two problems here: 1) dynahash has no interface to search for an > entry with a given hash value, and 2) the hash calculation algorithm differs from > the system cache's, even though the incoming hash value comes from the system cache. > The patch addresses both issues. It adds a hash_seq_init_with_hash_value() call, > which initializes a hash sequential scan over the single bucket that could contain > an entry with the given hash value, so hash_seq_search() iterates only over such > entries. And the patch changes the hash algorithm to match the syscache. The patch > also makes a small refactoring of dynahash, introducing a common function > hash_do_lookup() which does the initial lookup in the hash. > > An artificial performance test is in the attachment; the command to test is 'time > psql < custom_types_and_array.sql'; here I show only the last rollback time and > total execution time:
> 1) master 92d2ab7554f92b841ea71bcc72eaa8ab11aae662
> Time: 33353,288 ms (00:33,353)
> psql < custom_types_and_array.sql 0,82s user 0,71s system 1% cpu 1:28,36 total
> 2) mapRelType.patch
> Time: 7455,581 ms (00:07,456)
> psql < custom_types_and_array.sql 1,39s user 1,19s system 6% cpu 41,220 total
> 3) hash_seq_init_with_hash_value.patch
> Time: 24975,886 ms (00:24,976)
> psql < custom_types_and_array.sql 1,33s user 1,25s system 3% cpu 1:19,77 total
> 4) both
> Time: 89,446 ms
> psql < custom_types_and_array.sql 0,72s user 0,52s system 10% cpu 12,137 total
These changes look very promising. Unfortunately, the proposed patches conflict with each other regardless of the order of applying: ``` error: patch failed: src/backend/utils/cache/typcache.c:356 error: src/backend/utils/cache/typcache.c: patch does not apply ``` So it's difficult to confirm case 4, not to mention the fact that we are unable to test the patches on cfbot. Could you please rebase the patches against the recent master branch (in any order) and submit the result of `git format-patch`? -- Best regards, Aleksander Alekseev
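[Editor's note: to help reviewers picture the second patch, here is a hypothetical sketch of how TypeCacheTypCallback() could use the proposed interface. The hash_seq_init_with_hash_value() signature and loop body are assumptions inferred from the description above, not taken from the actual patch.]
```
/* Hypothetical use of the proposed single-bucket scan in typcache.c. */
static void
TypeCacheTypCallback(Datum arg, int cacheid, uint32 hashvalue)
{
	HASH_SEQ_STATUS status;
	TypeCacheEntry *typentry;

	/* visit only the bucket that can hold entries with this hash value,
	 * instead of walking the whole TypeCacheHash */
	hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue);
	while ((typentry = (TypeCacheEntry *) hash_seq_search(&status)) != NULL)
	{
		/* ... invalidate the matching entry, as the existing loop does ... */
	}
}
```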
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro,

Thanks for discussing!

> I think it would be worth allocating AutoVacuumShmem->av_workItems using
> dynamic shmem allocation, particularly to prevent workitems from being
> discarded just because the array is full¹; but other than that, the
> struct is just 64 bytes long so I doubt it's useful to allocate it
> dynamically.
>
> ¹ I mean, if the array is full, just allocate another array, point to it
> from the original one, and keep going.

OK, I understand that my initial proposal is not so valuable, so I can withdraw it.

About the suggestion: you were imagining AutoVacuumRequestWork() and brininsert(), right? I agree it sounds good, but I don't think it can be implemented with the current interface. The interface for dynamically allocating shared memory is GetNamedDSMSegment(), and it returns the same shared memory region if the input names are the same. Therefore, there is no way to re-allocate the shared memory.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Replication conflicts not processed in ClientWrite
When a backend is blocked on writing data (such as with a network error or a very slow client), indicated with wait event ClientWrite, it appears to not properly notice that it's overrunning max_standby_streaming_delay, and therefore does not cancel the transaction on the backend.

I've reproduced this repeatedly on Ubuntu 20.04 with PostgreSQL 15 out of the Debian packages. Curiously enough, if I install the debug symbols and restart in order to get a backtrace, it starts processing the cancellation again and I can no longer reproduce it. So it sounds like some timing issue around it.

My simple test was, with session 1 on the standby and session 2 on the primary:

Session 1: begin transaction isolation level repeatable read;
Session 1: select count(*) from testtable;
Session 2: alter table testtable rename to testtable2;
Session 1: select * from testtable t1 cross join testtable t2;
kill -STOP

At this point, replication lag starts growing on the standby and it never terminates the session. If I then SIGCONT it, it will get terminated by a replication conflict. If I kill the session hard, the replication lag recovers immediately.

AFAICT if the conflict happens at ClientRead, for example, it's picked up immediately, but there's something in ClientWrite that prevents it.

My first thought would be OpenSSL, but this is reproducible both on TLS-over-TCP and on Unix sockets.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
> These are "my" animals (running at a local university). There's a couple
> of interesting details:

Hi Tomas, do you still have the failing cluster data?

Noah pointed me to this thread, and it looks a bit similar to the FSM corruption issue I'm facing: https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop

So if you still have the data, it would be nice to see if you indeed have a corrupted FSM, and if you have any indication of when it happened.

Best regards,

--
Ronan Dunklau
RE: Synchronizing slots from primary to standby
On Monday, March 4, 2024 7:22 PM Bertrand Drouvot wrote: > > Hi, > > On Sun, Mar 03, 2024 at 07:56:32AM +, Zhijie Hou (Fujitsu) wrote: > > Here is the V104 patch which addressed above and Peter's comments. > > Thanks! > > A few more random comments: Thanks for the comments! > > 1 === > > +The function may be blocked if the specified slot is a failover > + enabled > > s/blocked/waiting/ ? Changed. > > 2 === > > +* specified slot when waiting for them to catch up. See > +* StandbySlotsHaveCaughtup for details. > > s/StandbySlotsHaveCaughtup/StandbySlotsHaveCaughtup()/ ? Changed. > > 3 === > > + /* Now verify if the specified slots really exist and have > + correct type */ > > remove "really"? Changed. > > 4 === > > + /* > +* Don't need to wait for the standbys to catch up if there is no > value in > +* standby_slot_names. > +*/ > + if (standby_slot_names_list == NIL) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if we are on a > standby > +* server, since we do not support syncing slots to cascading > standbys. > +*/ > + if (RecoveryInProgress()) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if they are already > +* beyond the specified WAL location. > +*/ > + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && > + standby_slot_oldest_flush_lsn >= wait_for_lsn) > + return true; > > What about using OR conditions instead? > > 5 === > > +static bool > +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > +uint32 *wait_event) > > Not a big deal but does it need to return a bool? (I mean it all depends of > the > *wait_event value). Is it for better code readability in the caller? > > 6 === > > +static bool > +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > +uint32 *wait_event) > > Same questions as for NeedToWaitForStandby(). I also feel the current style looks a bit cleaner, so didn’t change these. Best Regards, Hou zj
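For illustration, here is the OR-combined form of the three early-exit checks from the quoted hunk, as Bertrand's fourth comment suggests; this is just a sketch of the suggestion, not the committed code:

```c
/*
 * Sketch only: the three early-exit checks collapsed into one OR
 * condition, using the variables from the quoted hunk.
 */
if (standby_slot_names_list == NIL ||
	RecoveryInProgress() ||
	(!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) &&
	 standby_slot_oldest_flush_lsn >= wait_for_lsn))
	return true;
```

The trade-off is that the three per-check comments in the original would have to be merged or dropped, which may be why the separate statements were kept.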
RE: Synchronizing slots from primary to standby
On Monday, March 4, 2024 9:55 AM Peter Smith wrote: > > On Sun, Mar 3, 2024 at 6:51 PM Zhijie Hou (Fujitsu) > wrote: > > > > On Sunday, March 3, 2024 7:47 AM Peter Smith > wrote: > > > > > > > 3. > > > + > > > + > > > + Value * is not accepted as it is > > > inappropriate to > > > + block logical replication for physical slots that either lack > > > + associated standbys or have standbys associated that are > > > + not > > > enabled > > > + for replication slot synchronization. (see > > > + > > linkend="logicaldecoding-replication-slots-synchronization"/>). > > > + > > > + > > > > > > Why does the document need to provide an excuse/reason for the rule? > > > You could just say something like: > > > > > > SUGGESTION > > > The slots must be named explicitly. For example, specifying wildcard > > > values like * is not permitted. > > > > As suggested by Amit, I moved this to code comments. > > Was the total removal of this note deliberate? I only suggested removing the > *reason* for the rule, not the entire rule. Otherwise, the user won't know to > avoid doing this until they try it and get an error. OK, Added. > > > > > > > > 9. NeedToWaitForWal > > > > > > + /* > > > + * Check if the standby slots have caught up to the flushed position. > > > + It > > > + * is good to wait up to flushed position and then let it send the > > > + changes > > > + * to logical subscribers one by one which are already covered in > > > + flushed > > > + * position without needing to wait on every change for standby > > > + * confirmation. Note that after receiving the shutdown signal, an > > > + ERROR > > > + * is reported if any slots are dropped, invalidated, or inactive. > > > + This > > > + * measure is taken to prevent the walsender from waiting indefinitely. > > > + */ > > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > > > + return true; > > > > > > I felt it was confusing things for this function to also call to the > > > other one -- it seems an overlapping/muddling of the purpose of these. > > > I think it will be easier to understand if the calling code just > > > calls to one or both of these functions as required. > > > > Same as Amit, I didn't change this. > > AFAICT my previous review comment was misinterpreted. Please see [1] for > more details. > > > > Here are some more review comments for v104-1 Thanks! > > == > Commit message > > 1. > Additionally, The SQL functions pg_logical_slot_get_changes, > pg_logical_slot_peek_changes and pg_replication_slot_advance are modified > to wait for the replication slots specified in 'standby_slot_names' to catch > up > before returning. > > ~ > > Maybe that should be expressed using similar wording as the docs... > > SUGGESTION > Additionally, The SQL functions ... are modified. Now, when used with > failover-enabled logical slots, these functions will block until all physical > slots > specified in 'standby_slot_names' have confirmed WAL receipt. Changed. > > == > doc/src/sgml/config.sgml > > 2. > +pg_logical_slot_peek_changes, > +when used with failover enabled logical slots, will block until all > +physical slots specified in > standby_slot_names have > +confirmed WAL receipt. > > /failover enabled logical slots/failover-enabled logical slots/ Changed. Note that for this comment and remaining comments, I used the later version we agreed(logical failover slot). > > == > doc/src/sgml/func.sgml > > 3. 
> +The function may be blocked if the specified slot is a failover > enabled > +slot and linkend="guc-standby-slot-names">standby_slot_names me> > +is configured. > > > /a failover enabled slot//a failover-enabled slot/ Changed. > > ~~~ > > 4. > +slot may return to an earlier position. The function may be blocked > if > +the specified slot is a failover enabled slot and > + linkend="guc-standby-slot-names">standby_slot_names me> > +is configured. > > /a failover enabled slot//a failover-enabled slot/ Changed. > > == > src/backend/replication/slot.c > > 5. > +/* > + * Wait for physical standbys to confirm receiving the given lsn. > + * > + * Used by logical decoding SQL functions that acquired failover enabled > slot. > + * It waits for physical standbys corresponding to the physical slots > +specified > + * in the standby_slot_names GUC. > + */ > +void > +WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) > > /failover enabled slot/failover-enabled slot/ Changed. > > ~~~ > > 6. > + /* > + * Don't need to wait for the standby to catch up if the current > + acquired > + * slot is not a failover enabled slot, or there is no value in > + * standby_slot_names. > + */ > > /failover enabled slot/failover-enabled slot/ Changed. > > == > src/backend/replication/slotfuncs.c > > 7. > + > + /* > + * Wake up logical wals
RE: Synchronizing slots from primary to standby
On Monday, March 4, 2024 5:52 PM Amit Kapila wrote:
>
> On Mon, Mar 4, 2024 at 9:35 AM Peter Smith wrote:
> >
> > OK, if the code will remain as-is wouldn't it be better to anyway
> > change the function name to indicate what it really does?
> >
> > e.g. NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys
> >
>
> This seems too long. I would prefer the current name NeedToWaitForWal, as
> waiting for WAL means waiting to flush the WAL and waiting to replicate it
> to the standby. On similar lines, the variable name
> standby_slot_oldest_flush_lsn looks too long. How about ss_oldest_flush_lsn
> (where ss indicates standby_slots)?
>
> Apart from this, I have made minor modifications in the attached.

Thanks, I have merged it.

Attach the V105 patch set, which addressed Peter's, Amit's and Bertrand's comments. This version also includes the following changes:

* We found a string matching issue for query_until() and fixed it.
* Removed one unused parameter from NeedToWaitForStandbys.
* Disabled the subscription before testing pg_logical_slot_get_changes in 040.pl, to prevent this test from catching the warning from another walsender.
* Ran pgindent.

Best Regards,
Hou zj

v105-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
v105-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On 3/4/24 14:16, Ronan Dunklau wrote:
> On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
>> These are "my" animals (running at a local university). There's a couple
>> of interesting details:
>
> Hi Tomas,
> do you still have the failing cluster data?
>
> Noah pointed me to this thread, and it looks a bit similar to the FSM
> corruption issue I'm facing:
> https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop
>
> So if you still have the data, it would be nice to see if you indeed have
> a corrupted FSM, and if you have any indication of when it happened.

Sorry, I nuked the buildroot so I don't have the data anymore. Let's see if it fails again.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On Mon, 4 Mar 2024 at 12:34, Aleksander Alekseev wrote: > > > This was an experimental patch, I was looking for the comment on the > > proposed > > approach -- whether we could simply skip the throwaway NOT NULL constraint > > for > > deferred PK constraint. Moreover, skip that for any PK constraint. > > I confirm that the patch fixes the bug. All the tests pass. Looks like > RfC to me. > I don't think that this is the right fix. ISTM that the real issue is that dropping a NOT NULL constraint should not mark the column as nullable if it is part of a PK, whether or not that PK is deferrable -- a deferrable PK still marks a column as not nullable. The reason pg_dump creates these throwaway NOT NULL constraints is to avoid a table scan to check for NULLs when the PK is later created. That rationale still applies to deferrable PKs, so we still want the throwaway NOT NULL constraints in that case, otherwise we'd be hurting performance of restore. Regards, Dean
Re: Some shared memory chunks are allocated even if related processes won't start
Hello Hayato,

On 2024-Mar-04, Hayato Kuroda (Fujitsu) wrote:

> OK, I understand that my initial proposal is not so valuable, so I can
> withdraw it.

Yeah, that's what it seems to me.

> About the suggestion: you were imagining AutoVacuumRequestWork() and
> brininsert(), right?

Correct.

> I agree it sounds good, but I don't think it can be implemented with
> the current interface. The interface for dynamically allocating shared
> memory is GetNamedDSMSegment(), and it returns the same shared memory
> region if the input names are the same. Therefore, there is no way to
> re-allocate the shared memory.

Yeah, I was imagining something like this: the workitem-array becomes a struct, which has a name, a "next" pointer, and a variable number of workitem slots; the AutoVacuumShmem struct has pointers to the first and last workitem-structs in the list.

When a workitem is requested by brininsert, we initially allocate, via GetNamedDSMSegment("workitem-0"), a workitem-struct with a smallish number of elements. If we request another workitem and the array is full, we allocate another array via GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0 (so that the list can be followed by an autovacuum worker that's processing the database); it's also set as the tail of the list in AutoVacuumShmem (so that we know where to store further work items). When all items in a workitem-struct are processed, we can free it (I guess via dsm_unpin_segment) and make AutoVacuumShmem->av_workitems point to the next one in the list.

This way, the "array" can grow arbitrarily.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
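To make the shape of this design concrete, here is a minimal sketch of the struct layout being described; the field names and sizes are illustrative assumptions, not code from any posted patch:

```c
/*
 * Hypothetical layout for a dynamically-grown workitem list, following
 * the description above.  All names here are illustrative.
 */
typedef struct AutoVacuumWorkItems
{
	dsm_handle	next_segment;	/* next array in the list, or DSM_HANDLE_INVALID */
	int			num_items;		/* length of the items[] array */
	AutoVacuumWorkItem items[FLEXIBLE_ARRAY_MEMBER];
} AutoVacuumWorkItems;
```

An autovacuum worker would start from the head pointer in AutoVacuumShmem and chase next_segment handles; once every slot in a segment is processed, that segment can be released with dsm_unpin_segment() and the head advanced, as described above.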
Provide a pg_truncate_freespacemap function
Hello, As we are currently experiencing a FSM corruption issue [1], we need to rebuild FSM when we detect it. I noticed we have something to truncate a visibility map, but nothing for the freespace map, so I propose the attached (liberally copied from the VM counterpart) to allow to truncate a FSM without incurring downtime, as currently our only options are to either VACUUM FULL the table or stop the cluster and remove the FSM manually. Does that seem correct ? [1] https://www.postgresql.org/message-id/flat/ 1925490.taCxCBeP46%40aivenlaptop#7ace95c95cab17b6d92607e5362984ac -- Ronan Dunklau >From b3e28e4542311094ee7177b39cc9cdf3672d1b55 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 1 Mar 2024 08:57:49 +0100 Subject: [PATCH] Provide a pg_truncate_freespacemap function. Similar to the pg_truncate_visibilitymap function, this one allows to avoid downtime to fix an FSM in the case a corruption event happens --- contrib/pg_freespacemap/Makefile | 5 +- .../pg_freespacemap--1.2--1.3.sql | 13 + contrib/pg_freespacemap/pg_freespacemap.c | 58 +++ .../pg_freespacemap/pg_freespacemap.control | 2 +- 4 files changed, 75 insertions(+), 3 deletions(-) create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile index b48e4b255bc..0f4b52f5812 100644 --- a/contrib/pg_freespacemap/Makefile +++ b/contrib/pg_freespacemap/Makefile @@ -6,8 +6,9 @@ OBJS = \ pg_freespacemap.o EXTENSION = pg_freespacemap -DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \ - pg_freespacemap--1.0--1.1.sql +DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \ + pg_freespacemap--1.1--1.2.sql \ + pg_freespacemap--1.0--1.1.sql PGFILEDESC = "pg_freespacemap - monitoring of free space map" REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql new file mode 100644 index 000..1f25877a175 --- /dev/null +++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql @@ -0,0 +1,13 @@ +/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. 
\quit + +CREATE FUNCTION pg_truncate_freespace_map(regclass) +RETURNS void +AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map' +LANGUAGE C STRICT +PARALLEL UNSAFE; + + + diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c index b82cab2d97e..17f6a631ad8 100644 --- a/contrib/pg_freespacemap/pg_freespacemap.c +++ b/contrib/pg_freespacemap/pg_freespacemap.c @@ -9,8 +9,12 @@ #include "postgres.h" #include "access/relation.h" +#include "access/xloginsert.h" +#include "catalog/storage_xlog.h" #include "funcapi.h" #include "storage/freespace.h" +#include "storage/smgr.h" +#include "utils/rel.h" PG_MODULE_MAGIC; @@ -20,6 +24,9 @@ PG_MODULE_MAGIC; */ PG_FUNCTION_INFO_V1(pg_freespace); +/* Truncate the free space map, in case of corruption */ +PG_FUNCTION_INFO_V1(pg_truncate_freespace_map); + Datum pg_freespace(PG_FUNCTION_ARGS) { @@ -40,3 +47,54 @@ pg_freespace(PG_FUNCTION_ARGS) relation_close(rel, AccessShareLock); PG_RETURN_INT16(freespace); } + + +Datum +pg_truncate_freespace_map(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + Relation rel; + ForkNumber fork; + BlockNumber block; + + rel = relation_open(relid, AccessExclusiveLock); + + /* Only some relkinds have a freespacemap */ + if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind)) + ereport(ERROR, +(errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("relation \"%s\" is of wrong relation kind", + RelationGetRelationName(rel)), + errdetail_relkind_not_supported(rel->rd_rel->relkind))); + + + /* Forcibly reset cached file size */ + RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber; + + /* Just pretend we're going to wipeout the whole rel */ + block = FreeSpaceMapPrepareTruncateRel(rel, 0); + + if (BlockNumberIsValid(block)) + { + fork = FSM_FORKNUM; + smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block); + } + + if (RelationNeedsWAL(rel)) + { + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rlocator = rel->rd_locator; + xlrec.flags = SMGR_TRUNCATE_FSM; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + } + + relation_close(rel, AccessExclusiveLock); + + PG_RETURN_VOID(); +} diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control index ac8fc5050a9..1992320691b 100644 --- a/contrib/pg_freespacemap/pg_freespacemap.control +++ b/contrib/pg_freespacemap/pg_freespacemap.control @@ -
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
> On 6 Jan 2024, at 00:03, Nathan Bossart wrote:
> I gave it a try.

Looking at this again, I think this is about ready to go in. My only comment is that doc/src/sgml/archive-modules.sgml should probably be updated to refer to setting the errdetail, especially since we document the error message there.

-- Daniel Gustafsson
Re: Statistics Import and Export
Hi, On Tue, Feb 20, 2024 at 02:24:52AM -0500, Corey Huinker wrote: > On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker > wrote: > > > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), > > which address many of the concerns listed earlier. > > > > Leaving the export/import scripts off for the time being, as they haven't > > changed and the next likely change is to fold them into pg_dump. > > > > > > > v6 posted below. Thanks! I had in mind to look at it but it looks like a rebase is needed. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: remaining sql/json patches
On 3/4/24 at 10:40, Amit Langote wrote:
> Hi Jian,
>
> Thanks for the reviews and sorry for the late reply. Replying to all
> emails in one.
>
> [v40-0001-Add-SQL-JSON-query-functions.patch]
> [v40-0002-Show-function-name-in-TableFuncScan.patch]
> [v40-0003-JSON_TABLE.patch]

In my hands (applying with patch), the patches, esp. 0001, do not apply. But I see the cfbot builds without problem, so maybe just ignore these FAILED lines. Better get them merged - so I can test there...

Erik

checking file doc/src/sgml/func.sgml
checking file src/backend/catalog/sql_features.txt
checking file src/backend/executor/execExpr.c
Hunk #1 succeeded at 48 with fuzz 2 (offset -1 lines).
Hunk #2 succeeded at 88 (offset -1 lines).
Hunk #3 succeeded at 2419 (offset -1 lines).
Hunk #4 succeeded at 4195 (offset -1 lines).
checking file src/backend/executor/execExprInterp.c
Hunk #1 succeeded at 72 (offset -1 lines).
Hunk #2 succeeded at 180 (offset -1 lines).
Hunk #3 succeeded at 485 (offset -1 lines).
Hunk #4 succeeded at 1560 (offset -1 lines).
Hunk #5 succeeded at 4242 (offset -1 lines).
checking file src/backend/jit/llvm/llvmjit_expr.c
checking file src/backend/jit/llvm/llvmjit_types.c
checking file src/backend/nodes/makefuncs.c
Hunk #1 succeeded at 856 (offset -1 lines).
checking file src/backend/nodes/nodeFuncs.c
Hunk #1 succeeded at 233 (offset -1 lines).
Hunk #2 succeeded at 517 (offset -1 lines).
Hunk #3 succeeded at 1019 (offset -1 lines).
Hunk #4 succeeded at 1276 (offset -1 lines).
Hunk #5 succeeded at 1617 (offset -1 lines).
Hunk #6 succeeded at 2381 (offset -1 lines).
Hunk #7 succeeded at 3429 (offset -1 lines).
Hunk #8 succeeded at 4164 (offset -1 lines).
checking file src/backend/optimizer/path/costsize.c
Hunk #1 succeeded at 4878 (offset -1 lines).
checking file src/backend/optimizer/util/clauses.c
Hunk #1 succeeded at 50 (offset -3 lines).
Hunk #2 succeeded at 415 (offset -3 lines).
checking file src/backend/parser/gram.y
checking file src/backend/parser/parse_expr.c
checking file src/backend/parser/parse_target.c
Hunk #1 succeeded at 1988 (offset -1 lines).
checking file src/backend/utils/adt/formatting.c
Hunk #1 succeeded at 4465 (offset -1 lines).
checking file src/backend/utils/adt/jsonb.c
Hunk #1 succeeded at 2159 (offset -4 lines).
checking file src/backend/utils/adt/jsonfuncs.c
checking file src/backend/utils/adt/jsonpath.c
Hunk #1 FAILED at 68.
Hunk #2 succeeded at 1239 (offset -1 lines).
1 out of 2 hunks FAILED
checking file src/backend/utils/adt/jsonpath_exec.c
Hunk #1 succeeded at 229 (offset -5 lines).
Hunk #2 succeeded at 2866 (offset -5 lines).
Hunk #3 succeeded at 3751 (offset -5 lines).
checking file src/backend/utils/adt/ruleutils.c
Hunk #1 succeeded at 474 (offset -1 lines).
Hunk #2 succeeded at 518 (offset -1 lines).
Hunk #3 succeeded at 8303 (offset -1 lines).
Hunk #4 succeeded at 8475 (offset -1 lines).
Hunk #5 succeeded at 8591 (offset -1 lines).
Hunk #6 succeeded at 9808 (offset -1 lines).
Hunk #7 succeeded at 9858 (offset -1 lines).
Hunk #8 succeeded at 10039 (offset -1 lines).
Hunk #9 succeeded at 10909 (offset -1 lines).
checking file src/include/executor/execExpr.h
checking file src/include/nodes/execnodes.h
checking file src/include/nodes/makefuncs.h
checking file src/include/nodes/parsenodes.h
checking file src/include/nodes/primnodes.h
checking file src/include/parser/kwlist.h
checking file src/include/utils/formatting.h
checking file src/include/utils/jsonb.h
checking file src/include/utils/jsonfuncs.h
checking file src/include/utils/jsonpath.h
checking file src/interfaces/ecpg/preproc/ecpg.trailer
checking file src/test/regress/expected/sqljson_queryfuncs.out
checking file src/test/regress/parallel_schedule
checking file src/test/regress/sql/sqljson_queryfuncs.sql
checking file src/tools/pgindent/typedefs.list
Re: remaining sql/json patches
On 2024-Mar-04, Erik Rijkers wrote: > In my hands (applying with patch), the patches, esp. 0001, do not apply. > But I see the cfbot builds without problem so maybe just ignore these FAILED > lines. Better get them merged - so I can test there... It's because of dbbca2cf299b. It should apply cleanly if you do "git checkout dbbca2cf299b^" first ... That commit is so recent that evidently the cfbot hasn't had a chance to try this patch again since it went in, which is why it's still green. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint)
Re: pgsql: Improve performance of subsystems on top of SLRU
On 2024-Mar-03, Tom Lane wrote: > This is certainly simpler, but I notice that it holds the current > LWLock across the line > > ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember)); > > where the old code did not. Could the palloc take long enough that > holding the lock is bad? Hmm, I guess most of the time it shouldn't be much of a problem (if the length is small so we can palloc without malloc'ing); but it could be in the worst case. But the fix is simple: just release the lock before. There's no correctness argument for holding it all the way down. I was just confused about how the original code worked. > Also, with this coding the "lock = NULL;" assignment just before > "goto retry" is a dead store. Not sure if Coverity or other static > analyzers would whine about that. Oh, right. I removed that. On 2024-Mar-04, Dilip Kumar wrote: > I am not sure about the other changes, I mean that makes the code much > simpler but now we are not releasing the 'MultiXactOffsetCtl' related > bank lock, and later in the following loop, we are comparing that lock > against 'MultiXactMemberCtl' related bank lock. This makes code > simpler because now in the loop we are sure that we are always holding > the lock but I do not like comparing the bank locks for 2 different > SLRUs, although there is no problem as there would not be a common > lock address, True. This can be addressed in the same way Tom's first comment is: just release the lock before entering the second loop, and setting lock to NULL. This brings the code to a similar state than before, except that the additional LWLock * variables are in a tighter scope. That's in 0001. Now, I had a look at the other users of slru.c and noticed in subtrans.c that StartupSUBTRANS we have some duplicate code that I think could be rewritten by making the "while" block test the condition at the end instead of at the start; changed that in 0002. I'll leave this one for later, because I want to add some test code for it -- right now it's pretty much test-uncovered code. I also looked at slru.c for uses of shared->bank_locks and noticed a couple that could be made simpler. That's 0003 here. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "If it is not right, do not do it. If it is not true, do not say it." (Marcus Aurelius, Meditations) >From a6be7cac5de83a36e056147f3e81bea2eb1096bd Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Sun, 3 Mar 2024 15:20:36 +0100 Subject: [PATCH v2 1/3] rework locking code in GetMultiXactIdMembers Per Coverity --- src/backend/access/transam/multixact.c | 53 +++--- 1 file changed, 22 insertions(+), 31 deletions(-) diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index cd476b94fa..83b578dced 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1247,14 +1247,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members, MultiXactOffset offset; int length; int truelength; - int i; MultiXactId oldestMXact; MultiXactId nextMXact; MultiXactId tmpMXact; MultiXactOffset nextOffset; MultiXactMember *ptr; LWLock *lock; - LWLock *prevlock = NULL; debug_elog3(DEBUG2, "GetMembers: asked for %u", multi); @@ -1361,18 +1359,9 @@ retry: pageno = MultiXactIdToOffsetPage(multi); entryno = MultiXactIdToOffsetEntry(multi); - /* - * If this page falls under a different bank, release the old bank's lock - * and acquire the lock of the new bank. - */ + /* Acquire the bank lock for the page we need. 
*/ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno); - if (lock != prevlock) - { - if (prevlock != NULL) - LWLockRelease(prevlock); - LWLockAcquire(lock, LW_EXCLUSIVE); - prevlock = lock; - } + LWLockAcquire(lock, LW_EXCLUSIVE); slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi); offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno]; @@ -1407,17 +1396,19 @@ retry: if (pageno != prev_pageno) { + LWLock *newlock; + /* * Since we're going to access a different SLRU page, if this page * falls under a different bank, release the old bank's lock and * acquire the lock of the new bank. */ - lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno); - if (prevlock != lock) + newlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno); + if (newlock != lock) { -LWLockRelease(prevlock); -LWLockAcquire(lock, LW_EXCLUSIVE); -prevlock = lock; +LWLockRelease(lock); +LWLockAcquire(newlock, LW_EXCLUSIVE); +lock = newlock; } slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact); } @@ -1429,8 +1420,7 @@ retry: if (nextMXOffset == 0) { /* Corner case 2: next multixact is still being filled in */ - LWLockRelease(prevlock); - prevlock = NULL; + LWLockRelease(lock); CHECK_FO
Re: Experiments with Postgres and SSL
On 01/03/2024 23:49, Jacob Champion wrote:
> On Wed, Feb 28, 2024 at 4:10 AM Heikki Linnakangas wrote:
>> I think we'd want to *avoid* changing the major protocol version in a
>> way that would introduce a new roundtrip, though.
>
> I'm starting to get up to speed with this patchset. So far I'm mostly
> testing how it works; I have yet to take an in-depth look at the
> implementation.

Thank you!

> I'll squint more closely at the MITM-protection changes in 0008 later.
> First impressions, though: it looks like that code has gotten much less
> straightforward, which I think is dangerous given the attack it's
> preventing. (Off-topic: I'm skeptical of future 0-RTT support. Our
> protocol doesn't seem particularly replay-safe to me.)

Let's drop that patch. AFAICS it's not needed by the rest of the patches.

> If we're interested in ALPN negotiation in the future, we may also want
> to look at GREASE [1] to keep those options open in the presence of
> third-party implementations. Unfortunately OpenSSL doesn't do this
> automatically yet.

Can you elaborate? Do we need to do something extra in the server to be compatible with GREASE?

> If we don't have a reason not to, it'd be good to follow the strictest
> recommendations from [2] to avoid cross-protocol attacks. (For anyone
> currently running web servers and Postgres on the same host, they
> really don't want browsers "talking" to their Postgres servers.) That
> would mean checking the negotiated ALPN on both the server and client
> side, and failing if it's not what we expect.

Hmm, I thought that's what the patches do. But looking closer, libpq is not checking that ALPN was used. We should add that. Am I right?

> I'm not excited about the proliferation of connection options. I don't
> have a lot of ideas on how to fix it, though, other than to note that
> the current sslnegotiation option names are very unintuitive to me:
>
> - "postgres": only legacy handshakes
> - "direct": might be direct... or maybe legacy
> - "requiredirect": only direct handshakes... unless other options are
>   enabled and then we fall back again to legacy?
>
> How many people willing to break TLS compatibility with old servers via
> "requiredirect" are going to be okay with lazy fallback to GSS or
> otherwise?

Yeah, this is my biggest complaint about all this. Not so much the names of the options, but the number of combinations of different options, and how we're going to test them all. I don't have any great solutions, except adding a lot of tests to cover them, like Matthias did.

> Heikki mentioned possibly hard-coding a TLS alert if direct SSL is
> attempted without server TLS support. I think that's a cool idea, but
> without an official "TLS not supported" alert code (which, honestly,
> would be strange to standardize) I'm kinda -0.5 on it. If the client
> tells me about a handshake_failure or similar, I'm going to start
> investigating protocol versions and ciphersuites; I'm not going to
> think to myself that maybe the server lacks TLS support altogether.

Agreed.

> (Plus, we need to have a good error message when connecting to older
> servers anyway. I think we should be able to key off of the EOF coming
> back from OpenSSL; it'd be a good excuse to give that part of the code
> some love.)

Hmm, if OpenSSL sends the ClientHello and the server responds with a Postgres error packet, OpenSSL will presumably consume the error packet, or at least part of it. But with our custom BIO, we can peek at the server response before handing it to OpenSSL.

If it helps, we could backport a nicer error message to old server versions, similar to what we did with SCRAM in commit 96d0f988b1.

> For the record, I'm adding some one-off tests for this feature to a
> local copy of my OAuth pytest suite, which is designed to do the kinds
> of testing you're running into trouble with. It's not in any way viable
> for a PG17 commit, but if you're interested I can make the patches
> available.

Yes please, it would be nice to see what tests you've performed, and have it archived.

--
Heikki Linnakangas
Neon (https://neon.tech)
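For reference, the client-side check being discussed could look roughly like the sketch below, built on OpenSSL's SSL_get0_alpn_selected(). The protocol name "postgresql" is an assumption for illustration, not something settled in this thread:

```c
#include <stdbool.h>
#include <string.h>
#include <openssl/ssl.h>

/*
 * Sketch: return false if the server did not negotiate the expected
 * ALPN protocol.  "postgresql" is an assumed protocol name.
 */
static bool
alpn_negotiated_ok(SSL *ssl)
{
	const unsigned char *proto;
	unsigned int proto_len;

	SSL_get0_alpn_selected(ssl, &proto, &proto_len);

	/* NULL / zero length means no ALPN protocol was negotiated */
	if (proto == NULL || proto_len == 0)
		return false;

	return proto_len == strlen("postgresql") &&
		memcmp(proto, "postgresql", proto_len) == 0;
}
```

The same shape works on the server side via the ALPN selection callback, which is where a mismatch can be rejected during the handshake itself.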
Re: Synchronizing slots from primary to standby
Hi,

On Mon, Mar 04, 2024 at 01:28:04PM +0000, Zhijie Hou (Fujitsu) wrote:
> Attach the V105 patch set

Thanks!

Sorry, I missed these during the previous review:

1 ===

Commit message: "these functions will block until"

s/block/wait/ ?

2 ===

+when used with logical failover slots, will block until all

s/block/wait/ ?

It seems those are the 2 remaining "block"s that could deserve the change proposed above.

3 ===

+ invalidated = slot->data.invalidated != RS_INVAL_NONE;
+ inactive = slot->active_pid == 0;

invalidated = (slot->data.invalidated != RS_INVAL_NONE);
inactive = (slot->active_pid == 0);

instead? I think it's easier to read, and it looks like this is the way it's written in other places (at least the few I checked).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: LogwrtResult contended spinlock
Thanks for looking into this. On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis wrote: > > > 3. > > @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI) > > If at all, a read > > barrier is warranted here, we can use atomic read with full barrier > > I don't think we need a full barrier but I'm fine with using > pg_atomic_read_membarrier_u64() if it's better for whatever reason. For the sake of clarity and correctness, I've used pg_atomic_read_membarrier_u64 everywhere for reading XLogCtl->LogwrtResult.Write and XLogCtl->LogwrtResult.Flush. > > 5. I guess we'd better use pg_atomic_read_u64_impl and > > pg_atomic_compare_exchange_u64_impl in > > pg_atomic_monotonic_advance_u64 > > to reduce one level of function indirections. > > Aren't they inlined? Yes, all of them are inlined. But, it seems like XXX_impl functions are being used in implementing exposed functions as a convention. Therefore, having pg_atomic_read_u64_impl and pg_atomic_compare_exchange_u64_impl doesn't sound bad IMV. > > 6. > > + * Full barrier semantics (even when value is unchanged). > > > > +currval = pg_atomic_read_u64(ptr); > > +if (currval >= target_) > > +{ > > +pg_memory_barrier(); > > I don't think they are exactly equivalent: in the current patch, the > first pg_atomic_read_u64() could be reordered with earlier reads; > whereas that wouldn't work if using pg_atomic_read_membarrier_u64() it > could not be. I'm not sure whether that could create a performance > problem or not. I left it as-is. > > 9. > > +copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy); > > +if (startptr + count > copyptr) > > +ereport(WARNING, > > +(errmsg("request to read past end of generated WAL; > > request %X/%X, current position %X/%X", > > +LSN_FORMAT_ARGS(startptr + count), > > LSN_FORMAT_ARGS(copyptr; > > > > Any specific reason for this to be a WARNING rather than an ERROR? > > Good question. WaitXLogInsertionsToFinish() uses a LOG level message > for the same situation. They should probably be the same log level, and > I would think it would be either PANIC or WARNING. I have no idea why > LOG was chosen. WaitXLogInsertionsToFinish adjusts upto after LOG so that the wait is never past the current insert position even if a caller asks for reading WAL that doesn't yet exist. And the comment there says "Here we just assume that to mean that all WAL that has been reserved needs to be finished." In contrast, WALReadFromBuffers kind of enforces callers to do WaitXLogInsertionsToFinish (IOW asks callers to send in the WAL that exists in the server). Therefore, an ERROR seems a reasonable choice to me, if PANIC sounds rather strong affecting all the postgres processes. FWIW, a PANIC when requested to flush past the end of WAL in WaitXLogInsertionsToFinish instead of LOG seems to be good. CF bot animals don't complain - https://github.com/BRupireddy2/postgres/tree/be_harsh_when_request_to_flush_past_end_of_WAL_WIP. > 0001: > > * The comments on the two versions of the functions are redundant, and > the style in that header seems to be to omit the comment from the u64 > version. Removed comments atop 64-bit version. > * I'm not sure the AssertPointerAlignment is needed in the u32 version? Borrowed them from pg_atomic_read_u32 and pg_atomic_compare_exchange_u32, just like how they assert before calling XXX_impl versions. I don't see any problem with them. > 0002: > > * All updates to the non-shared LogwrtResult should update both values. 
> It's confusing to update those local values independently, because it > violates the invariant that LogwrtResult.Flush <= LogwrtResult.Write. > > > 2. I guess we need to update both the Write and Flush local copies in > > AdvanceXLInsertBuffer. > > I agree. Whenever we update the non-shared LogwrtResult, let's update > the whole thing. Otherwise it's confusing. Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs. > * pg_memory_barrier() is not needed right before a spinlock Got rid of it as we read both Flush and Write local copies with pg_atomic_read_membarrier_u64. > * As mentioned above, I think GetFlushRecPtr() needs two read barriers. > Also, I think the same for GetXLogWriteRecPtr(). Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs. > * In general, for any place using both Write and Flush, I think Flush > should be loaded first, followed by a read barrier, followed by a load > of the Write pointer. Why read Flush first rather than Write? I think it's enough to do {read Write, read barrier, read Flush}. This works because Write is monotonically advanced first before Flush using full barriers and we don't get reordering issues between the readers and writers no? Am I missing anything here? > And I think in most of those places there shou
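To ground the ordering discussion above, here is a sketch of the compare-and-swap loop behind a monotonic-advance primitive with the "full barrier even when the value is unchanged" semantics quoted earlier; the function name comes from the patch, but this particular body is illustrative:

```c
/*
 * Sketch: advance *ptr to target if target is greater, providing full
 * barrier semantics even when the value is left unchanged.
 */
static inline void
pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target)
{
	uint64		currval = pg_atomic_read_u64(ptr);

	while (currval < target)
	{
		/* CAS acts as a full barrier whether it succeeds or fails */
		if (pg_atomic_compare_exchange_u64(ptr, &currval, target))
			return;
		/* on failure, currval now holds the value actually seen; retry */
	}

	/*
	 * The value was already >= target: issue an explicit full barrier so
	 * callers get the same ordering guarantees on this path too.
	 */
	pg_memory_barrier();
}
```

Both exit paths give callers the same ordering guarantee, which is what the quoted review comment about "full barrier semantics (even when value is unchanged)" asks for.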
Re: Shared detoast Datum proposal
On 3/4/24 02:23, Andy Fan wrote:
>
> Tomas Vondra writes:
>
>> On 2/26/24 16:29, Andy Fan wrote:
>>>
>>> ...
>>>
>>> I can understand the benefits of the TOAST cache, but the following
>>> issues look not good to me IIUC:
>>>
>>> 1). We can't put the result into tts_values[*], since without the
>>> planner decision we don't know if this will break the small_tlist
>>> logic. But if we put it into the cache only, that means a cache lookup
>>> is an unavoidable overhead.
>>
>> True - if you're comparing having the detoasted value in tts_values[*]
>> directly with having to do a lookup in a cache, then yes, there's a bit
>> of an overhead.
>>
>> But I think from the discussion it's clear having to detoast the value
>> into tts_values[*] has some weaknesses too, in particular:
>>
>> - It requires decisions about which attributes to detoast eagerly, which
>>   is quite invasive (having to walk the plan, ...).
>>
>> - I'm sure there will be cases where we choose not to detoast, but it
>>   would be beneficial to detoast.
>>
>> - Detoasting just the initial slices does not seem compatible with this.
>>
>> IMHO the overhead of the cache lookup would be negligible compared to
>> the repeated detoasting of the value (which is the current baseline). I
>> somewhat doubt the difference compared to tts_values[*] will even be
>> measurable.
>>
>>> 2). It is hard to decide which entry should be evicted without
>>> attaching it to the TupleTableSlot's life cycle; how can we guarantee
>>> that the entry we keep is the entry we will reuse soon?
>>
>> True. But is that really a problem? I imagined we'd set some sort of
>> memory limit on the cache (work_mem?), and evict the oldest entries. So
>> the entries would eventually get evicted, and the memory limit would
>> ensure we don't consume arbitrary amounts of memory.
>
> Here is a copy from the shared_detoast_datum.org in the patch. I feel
> that when / which entry to free is a key problem, while the run-time
> (detoast_attr) overhead vs the createplan.c overhead is a small
> difference; the overhead I paid in createplan.c/setref.c doesn't look
> huge either.

I decided to whip up a PoC patch implementing this approach, to get a better idea of how it compares to the original proposal, both in terms of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but differ in when exactly the caching happens:

toast-cache-v1 - caching happens in toast_fetch_datum, which means this happens before decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

I started with v1, but that did not help with the example workloads (from the original post) at all. Only after I changed this to cache decompressed data (in v2) did it become competitive.

There's a GUC to enable the cache (enable_toast_cache, off by default), to make experimentation easier.

The cache is invalidated at the end of a transaction - I think this should be OK, because the rows may be deleted but can't be cleaned up (or replaced with a new TOAST value). This means the cache could cover multiple queries - not sure if that's a good thing or a bad thing.

I haven't implemented any sort of cache eviction. I agree that's a hard problem in general, but I doubt we can do better than LRU. I also didn't implement any memory limit.

FWIW I think it's a fairly serious issue that Andy's approach does not have any way to limit the memory. It detoasts the values eagerly, but what if the rows have multiple large values?
What if we end up keeping multiple such rows (from different parts of the plan) at once?

I also haven't implemented caching for sliced data. I think modifying the code to support this should not be difficult - it'd require tracking how much data we have in the cache, and considering that during lookup and when adding entries to the cache.

I've implemented the cache as "global". Maybe it should be tied to the query or executor state, but then it'd be harder to access it from detoast.c (we'd have to pass the cache pointer in some way, and it wouldn't work for use cases that don't go through the executor).

I think implementation-wise this is pretty non-invasive.

Performance-wise, these patches are slower than Andy's patch. For example for the jsonb Q1, I see master at ~500ms, Andy's patch at ~100ms, and v2 at ~150ms. I'm sure there are a number of optimization opportunities that could get v2 closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow as master, because the data set is small enough to be fully cached, so there's no I/O and the decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some cases. If there's enough memory pressure to evict TOAST, it's quite possible the I/O would become the bottleneck. At that point it might be a good trade-off to cache compressed data, because then we'd fit more TOAST values in the cache. OTOH it's likely the access to TOAST values is localiz
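To make the PoC's shape easier to picture, here is an illustrative sketch of what a cache entry in detoast.c could look like. The key fields mirror what a TOAST pointer already carries (the TOAST table's OID and the value's OID), but the struct names and layout are assumptions, not the actual patch:

```c
/*
 * Illustrative cache entry for detoasted values; not the actual PoC.
 * A TOAST pointer identifies a value by the containing TOAST table's
 * OID plus the value's OID, so that pair serves as the hash key.
 */
typedef struct ToastCacheKey
{
	Oid			toastrelid;		/* OID of the TOAST table */
	Oid			valueid;		/* OID identifying the value within it */
} ToastCacheKey;

typedef struct ToastCacheEntry
{
	ToastCacheKey key;
	struct varlena *value;		/* detoasted (in v2: decompressed) copy */
	Size		size;			/* entry size, for a future memory limit */
} ToastCacheEntry;
```

With such a key, end-of-transaction invalidation is just destroying the hash table, and the v1-versus-v2 distinction is only about whether value holds compressed or decompressed bytes.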
Re: Extract numeric filed in JSONB more effectively
Peter Eisentraut writes:

> On 09.02.24 10:05, Andy Fan wrote:
>> 2. Where is the current feature blocked for the past few months?
>> It's an error-message compatibility issue! Continue with the above setup:
>>
>> master:
>>
>> select * from tb where (a->'b')::numeric > 3::numeric;
>> ERROR: cannot cast jsonb string to type numeric
>>
>> select * from tb where (a->'b')::int4 > 3::numeric;
>> ERROR: cannot cast jsonb string to type integer
>>
>> You can see the error message is different (numeric vs integer).
>>
>> Patched:
>>
>> We can still get the same error message as master BUT the code looks odd:
>>
>> select * from tb where (a->'b')::int4 > 3;
>> QUERY PLAN
>> ---------------------------------------------------------------------
>> Seq Scan on public.tb
>>   Output: a
>>   Filter: ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 'b'::text), '23'::oid))::integer > 3)
>> (3 rows)
>>
>> You can see "jsonb_finish_numeric(.., '23'::oid)"; the '23'::oid is just
>> for the *"integer"* output in the error message:
>> "cannot cast jsonb string to type *integer*"
>>
>> Now the situation is that either we use the odd argument (23::oid) in
>> jsonb_finish_numeric, or we use an error message incompatible with the
>> previous version. I'm not sure which way is better, but this is the
>> place where the current feature is blocked.
>
> I'm not bothered by that. It also happens on occasion in the backend C
> code that we pass around extra information to be able to construct
> better error messages. The functions here are not backend C code, but
> they are internal functions, so similar considerations can apply.

Thanks for speaking on this!

> But I have a different question about this patch set. This has some
> overlap with the JSON_VALUE function that is being discussed at [0][1].
> For example, if I apply the patch
> v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run
>
> select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;
>
> and I get a noticeable performance boost over
>
> select count(*) from tb where cast (a->'a' as numeric) = 2;

Here are my test and profile for the above two queries:

create table tb(a jsonb);
insert into tb select jsonb_build_object('a', i) from generate_series(1, 1)i;

cat a.sql
select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;

cat b.sql
select count(*) from tb where cast (a->'a' as numeric) = 2;

pgbench -n -f xxx.sql postgres -T100 | grep lat

Then here are the results:

| query | master  | patched (patch here and jsonb_value) |
|-------+---------+--------------------------------------|
| a.sql | /       | 2.59 (ms)                            |
| b.sql | 3.34 ms | 1.75 (ms)                            |

As we can see, the patch here has the best performance (does this result look different from yours?). After checking the code, I am sure both patches *don't* have the problem in master where it gets a JsonbValue first, converts it to jsonb, and then casts to numeric.

Then I profiled both queries with perf and found the following:

JSON_VALUE:

- 32.02% 4.30% postgres postgres [.] JsonPathValue
   - 27.72% JsonPathValue
      - 22.63% executeJsonPath (inlined)
         - 19.97% executeItem (inlined)
            - executeItemOptUnwrapTarget
               - 17.79% executeNextItem
                  - 15.49% executeItem (inlined)
                     - executeItemOptUnwrapTarget
                        + 8.50% getKeyJsonValueFromContainer (note here...)
                        + 2.30% executeNextItem
                          0.79% findJsonbValueFromContainer
                  + 0.68% check_stack_depth
         + 1.51% jspGetNext
         + 0.73% check_stack_depth
           1.27% jspInitByBuffer
           0.85% JsonbExtractScalar
   + 4.91% DatumGetJsonbP (inlined)

Patch here, for b.sql:

- 19.98% 2.10% postgres postgres [.] jsonb_object_field_start
   - 17.88% jsonb_object_field_start
      - 17.70% jsonb_object_field_internal (inlined)
         + 11.03% getKeyJsonValueFromContainer
         - 6.26% DatumGetJsonbP (inlined)
            + 5.78% detoast_attr
   + 1.21% _start
   + 0.54% 0x55ddb44552a

JSON_VALUE has a much longer path to reach getKeyJsonValueFromContainer, so I think JSON_VALUE is probably designed for more complex paths, which have to pay extra effort; that accounts for the performance difference above. I added Amit and Alvaro to this thread in case they have more insight on this.

> So some questions to think about:
>
> 1. Compare performance of base case vs. this patch vs. json_value.

Done, as above.

> 2. Can json_value be optimized further?

Hmm, I have some trouble understanding A's performance boost over B and deciding which is better. But in my test above, it looks like the patch here is better in the given case, and the difference may come from JS
Re: a wrong index choose when statistics is out of date
On 3/4/24 06:33, David Rowley wrote:
> On Sun, 3 Mar 2024 at 20:08, Andy Fan wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table x_events (.., created_at timestamp, a int, b int);
>>
>> create index idx_1 on x_events(created_at, a);
>> create index idx_2 on x_events(created_at, b);
>>
>> query:
>> select * from x_events where created_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for the
>> above query if the statistics think "created_at = current_timestamp" has
>> no rows; then both indexes look OK, but actually it is true just because
>> the statistics are out of date.
>
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could happen.
>
> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider. I did provide some detail in [1]
> about the design that was in my head at that time. I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.
>
> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently in my brain-based design. That
> way add_path() would be more likely to prefer the index that matches
> the most conditions.
>
> The exact maths to calculate the certainty_factor for this case I
> don't quite have worked out yet. I plan to work on documenting the
> design of this and try and get a prototype patch out sometime during
> this coming southern hemisphere winter so that there's at least a full
> cycle of feedback opportunity before the PG18 freeze.

I've been thinking about this stuff too, so I'm curious to hear what kind of plan you come up with. Every time I tried to formalize a more concrete plan, I ended up with a very complex (and possibly yet more fragile) approach.

I think we'd need to consider a couple of things:

1) reliability of cardinality estimates

I think this is pretty much the same concept as confidence intervals, i.e. knowing not just the regular estimate, but also a range where the actual value lies with high confidence (e.g. 90%). For a single clause this might not be terribly difficult, but for more complex cases (multiple conditions, ...) it seems far more complex :-(

For example, let's say we know confidence intervals for two conditions. What's the confidence interval when combined using AND or OR? (A first-cut sketch of this is included after this message.)

2) robustness of the paths

Knowing just the confidence intervals does not seem sufficient, though. The other thing that matters is how this affects the paths, how robust the paths are. I mean, if we have alternative paths with costs that flip somewhere in the confidence interval - which one to pick? Surely one thing to consider is how fast the costs change for each path.

3) complexity of the model

I suppose we'd prefer a systematic approach (and not some ad hoc solution for one particular path/plan type). So this would be somehow integrated into the cost model, making it yet more complex. I'm quite worried about this (not necessarily because of performance reasons).

I wonder if trying to improve the robustness solely by changes in the planning phase is a very promising approach.
I mean - sure, we should improve that, but by definition it relies on a priori information. And not only the stats may be stale - it's a very lossy approximation of the actual data. Even if the stats are perfectly up to date / very detailed, there's still going to be a lot of uncertainty. I wonder if we'd be better off if we experimented with more robust plans, like SmoothScan [1] or g-join [2]. regards [1] https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/smooth_vldbj.pdf [2] http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/haerder/2011/JoinAndGrouping.pdf -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
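As flagged above, a first-cut sketch of the AND/OR combination question, assuming the two clauses are independent (the same assumption the planner's point estimates already make) and selectivities $s_1 \in [l_1, u_1]$, $s_2 \in [l_2, u_2]$:

$$
\begin{aligned}
s_{\mathrm{AND}} &= s_1 s_2 \in [\,l_1 l_2,\; u_1 u_2\,], \\
s_{\mathrm{OR}} &= s_1 + s_2 - s_1 s_2 \in [\,l_1 + l_2 - l_1 l_2,\; u_1 + u_2 - u_1 u_2\,],
\end{aligned}
$$

because both combinators are monotone non-decreasing in each argument over $[0,1]$. The caveat is the confidence level itself: if each input interval holds with probability 0.9, a union bound only guarantees the combined interval with probability at least 0.8, and any correlation between the columns voids the independence assumption entirely - which may be part of why the formalization keeps turning out complex.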
Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)
> On 4 Mar 2024, at 16:47, Ranier Vilela wrote:
>
> Does filling a memory area, one by one, with branches, need strong
> evidence to prove that it is better than filling a memory area, all at
> once, without branches?

I apologise for being too quick to decide to mark the patch RwF. Would you mind if I restore the patch to "Needs review" so we can get more feedback?

Best regards, Andrey Borodin.
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
On Mon, Mar 04, 2024 at 03:21:59PM +0100, Daniel Gustafsson wrote: > Looking at this again I think this is about ready to go in. My only comment > is > that doc/src/sgml/archive-modules.sgml probably should be updated to refer to > setting the errdetail, especially since we document the errormessage there. Thanks for reviewing. How does this look? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 437e4fa9ec27ecf870530fc579cd7673dfcf96af Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 4 Mar 2024 11:15:37 -0600 Subject: [PATCH v5 1/1] Add macro for customizing an archive module WARNING message. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Presently, if an archive module's check_configured_cb callback returns false, a generic WARNING about the archive module misconfiguration is emitted, which unfortunately provides no actionable details about the source of the miconfiguration. This commit introduces a macro that archive module authors can use to add a DETAIL line to this WARNING. Co-authored-by: Tung Nguyen Reviewed-by: Daniel Gustafsson, Álvaro Herrera Discussion: https://postgr.es/m/4109578306242a7cd5661171647e11b2%40oss.nttdata.com --- contrib/basic_archive/basic_archive.c | 7 ++- doc/src/sgml/archive-modules.sgml | 12 src/backend/archive/shell_archive.c | 7 ++- src/backend/postmaster/pgarch.c | 8 +++- src/include/archive/archive_module.h | 8 5 files changed, 39 insertions(+), 3 deletions(-) diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index 804567e919..6b102e9072 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -161,7 +161,12 @@ check_archive_directory(char **newval, void **extra, GucSource source) static bool basic_archive_configured(ArchiveModuleState *state) { - return archive_directory != NULL && archive_directory[0] != '\0'; + if (archive_directory != NULL && archive_directory[0] != '\0') + return true; + + arch_module_check_errdetail("%s is not set.", +"basic_archive.archive_directory"); + return false; } /* diff --git a/doc/src/sgml/archive-modules.sgml b/doc/src/sgml/archive-modules.sgml index 7064307d9e..cf7438a759 100644 --- a/doc/src/sgml/archive-modules.sgml +++ b/doc/src/sgml/archive-modules.sgml @@ -114,6 +114,18 @@ WARNING: archive_mode enabled, yet archiving is not configured In the latter case, the server will periodically call this function, and archiving will proceed only when it returns true. + + + + When returning false, it may be useful to append some + additional information to the generic warning message. To do that, provide + a message to the arch_module_check_errdetail macro + before returning false. Like + errdetail(), this macro accepts a format string + followed by an optional list of arguments. The resulting string will be + emitted as the DETAIL line of the warning message. 
+ + diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c index c95b732495..bff0ab800d 100644 --- a/src/backend/archive/shell_archive.c +++ b/src/backend/archive/shell_archive.c @@ -45,7 +45,12 @@ shell_archive_init(void) static bool shell_archive_configured(ArchiveModuleState *state) { - return XLogArchiveCommand[0] != '\0'; + if (XLogArchiveCommand[0] != '\0') + return true; + + arch_module_check_errdetail("%s is not set.", +"archive_command"); + return false; } static bool diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index bb0eb13a89..f97035ca03 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -88,6 +88,7 @@ typedef struct PgArchData } PgArchData; char *XLogArchiveLibrary = ""; +char *arch_module_check_errdetail_string; /* -- @@ -401,12 +402,17 @@ pgarch_ArchiverCopyLoop(void) */ HandlePgArchInterrupts(); + /* Reset variables that might be set by the callback */ + arch_module_check_errdetail_string = NULL; + /* can't do anything if not configured ... */ if (ArchiveCallbacks->check_configured_cb != NULL && !ArchiveCallbacks->check_configured_cb(archive_module_state)) { ereport(WARNING, - (errmsg("archive_mode enabled, yet archiving is not configured"))); + (errmsg("archive_mode enabled, yet archiving is not configured"), + arch_module_check_errdetail_string ? + errdetail_internal("%s", arch_module_check_errdetail_string) : 0)); return; } diff --git a/src/include/archive/archive_module.h b/src/include/archive/archive_module.h index fd59b9faf4..763af76e54 100644 --- a/src/include/archive/archive_module.h +++ b/src/include/archive/archive_module.h @@ -56,4 +56,12 @@ typedef const ArchiveModuleCallbacks *(*ArchiveModuleInit) (void); extern PGDLLEXPORT const ArchiveModuleCallback
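For third-party archive modules, usage of the new macro would look roughly like the following (a hypothetical sketch based on the patch above; the module and GUC names "my_archive" / "my_archive.archive_directory" are illustrative only, and the module boilerplate - init function, callback struct, GUC definition - is omitted):

    static bool
    my_archive_configured(ArchiveModuleState *state)
    {
        if (my_archive_directory != NULL && my_archive_directory[0] != '\0')
            return true;

        /* becomes the DETAIL line of the generic WARNING */
        arch_module_check_errdetail("%s is not set.",
                                    "my_archive.archive_directory");
        return false;
    }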
Re: pgsql: Improve performance of subsystems on top of SLRU
FWIW there's a stupid bug in 0002, which is fixed here. I'm writing a simple test for it. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Now I have my system running, not a byte was off the shelf; It rarely breaks and when it does I fix the code myself. It's stable, clean and elegant, and lightning fast as well, And it doesn't cost a nickel, so Bill Gates can go to hell." >From ecf613092a3cbc4c5112d30af7eea55b668decec Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 4 Mar 2024 11:49:01 +0100 Subject: [PATCH v3] Rework redundant loop in subtrans.c --- src/backend/access/transam/subtrans.c | 29 +++ 1 file changed, 7 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index dc9566fb51..50bb1d8cfc 100644 --- a/src/backend/access/transam/subtrans.c +++ b/src/backend/access/transam/subtrans.c @@ -311,7 +311,7 @@ StartupSUBTRANS(TransactionId oldestActiveXID) FullTransactionId nextXid; int64 startPage; int64 endPage; - LWLock *prevlock; + LWLock *prevlock = NULL; LWLock *lock; /* @@ -324,42 +324,27 @@ StartupSUBTRANS(TransactionId oldestActiveXID) nextXid = TransamVariables->nextXid; endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid)); - prevlock = SimpleLruGetBankLock(SubTransCtl, startPage); - LWLockAcquire(prevlock, LW_EXCLUSIVE); - while (startPage != endPage) + for (;;) { lock = SimpleLruGetBankLock(SubTransCtl, startPage); - - /* - * Check if we need to acquire the lock on the new bank then release - * the lock on the old bank and acquire on the new bank. - */ if (prevlock != lock) { - LWLockRelease(prevlock); + if (prevlock) +LWLockRelease(prevlock); LWLockAcquire(lock, LW_EXCLUSIVE); prevlock = lock; } (void) ZeroSUBTRANSPage(startPage); + if (startPage == endPage) + break; + startPage++; /* must account for wraparound */ if (startPage > TransactionIdToPage(MaxTransactionId)) startPage = 0; } - lock = SimpleLruGetBankLock(SubTransCtl, startPage); - - /* - * Check if we need to acquire the lock on the new bank then release the - * lock on the old bank and acquire on the new bank. - */ - if (prevlock != lock) - { - LWLockRelease(prevlock); - LWLockAcquire(lock, LW_EXCLUSIVE); - } - (void) ZeroSUBTRANSPage(startPage); LWLockRelease(lock); } -- 2.39.2
Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
Hi,

The function varstr_cmp is a critical function. Inside the function there is a shortcut condition which allows a quick exit.

Unfortunately, the current code calls a very expensive function beforehand; if the test is true, all of that call's time is wasted. So, IMO, it's better to postpone the function call until it is actually necessary.

best regards, Ranier Vilela

avoid-call-expensive-function-varlena.patch Description: Binary data
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
On Mon, 4 Mar 2024 at 18:39, Ranier Vilela wrote:
>
> Hi,
>
> The function varstr_cmp is a critical function. Inside the function there is a shortcut condition which allows a quick exit.
>
> Unfortunately, the current code calls a very expensive function beforehand; if the test is true, all of that call's time is wasted. So, IMO, it's better to postpone the function call until it is actually necessary.

Thank you for your contribution.

I agree it would be better, but your current patch is incorrect, because we need to check if the user has access to the locale (and throw an error if not) before we return that the two strings are equal.

Kind regards, Matthias van de Meent Neon (https://neon.tech)
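For context, the reordering under discussion amounts to roughly this (a simplified sketch of varstr_cmp()'s equality fast path, not the verbatim varlena.c source):

    /* current shape: the collation lookup always runs first */
    mylocale = pg_newlocale_from_collation(collid);    /* expensive, and it validates the collation */
    if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
        return 0;

    /* proposed shape: take the equality shortcut first */
    if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
        return 0;    /* but the lookup's checks are now skipped for equal inputs */
    mylocale = pg_newlocale_from_collation(collid);

The objection is that pg_newlocale_from_collation() is not pure overhead: it can raise errors, so hoisting the memcmp() above it changes user-visible behavior for equal strings.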
Re: Shared detoast Datum proposal
> > I decided to whip up a PoC patch implementing this approach, to get a
> > better idea of how it compares to the original proposal, both in terms
> > of performance and code complexity.
> >
> > Attached are two patches that both add a simple cache in detoast.c, but
> > differ in when exactly the caching happens.
> >
> > toast-cache-v1 - caching happens in toast_fetch_datum, which means this
> > happens before decompression
> >
> > toast-cache-v2 - caching happens in detoast_attr, after decompression
...
> > I think implementation-wise this is pretty non-invasive.
..
> > Performance-wise, these patches are slower than with Andy's patch. For
> > example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
> > v2 at ~150ms. I'm sure there's a number of optimization opportunities
> > and could get v2 closer to the 100ms.
> >
> > v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
> > as master, because the data set is small enough to be fully cached, so
> > there's no I/O and the decompression is the actual bottleneck.
> >
> > That being said, I'm not convinced v1 would be a bad choice for some
> > cases. If there's memory pressure enough to evict TOAST, it's quite
> > possible the I/O would become the bottleneck. At which point it might be
> > a good trade off to cache compressed data, because then we'd cache more
> > detoasted values.
> >
> > OTOH it's likely the access to TOAST values is localized (in a temporal
> > sense), i.e. we read it from disk, detoast it a couple times in a short
> > time interval, and then move to the next row. That'd mean v2 is probably
> > the correct trade off.

I don't have a different view on the above, and thanks for the PoC code, which will make later testing much easier.

>>
>> """
>> An alternative design: toast cache
>> -
>>
>> This method is provided by Tomas during the review process. IIUC, this
>> method would maintain a local HTAB which maps a toast datum to a detoasted
>> datum and the entry is maintained / used in the detoast_attr
>> function. Within this method, the overall design is pretty clear and the
>> code modification can be contained in the toasting system only.
>
> Right.
>
>> I assumed that releasing all of the memory at the end of the executor once
>> is not an option since it may consume too much memory. Then, when and
>> which entry to release becomes a trouble for me. For example:
>>
>> QUERY PLAN
>> --
>> Nested Loop
>>    Join Filter: (t1.a = t2.a)
>>    -> Seq Scan on t1
>>    -> Seq Scan on t2
>> (4 rows)
>>
>> In this case t1.a needs a longer lifespan than t2.a since it is
>> in the outer relation. Without the help of the slot's life-cycle system, I
>> can't think of an answer for the above question.
>
> This is true, but how likely are such plans? I mean, surely no one would
> do a nested loop with sequential scans on reasonably large tables, so how
> representative is this example?

Actually this is the simplest join case; we still have the same problem with Nested Loop + Index Scan, which will be pretty common.

> Also, this leads me to the need of having some sort of memory limit. If
> we may be keeping entries for extended periods of time, and we don't
> have any way to limit the amount of memory, that does not seem great.
>
> AFAIK if we detoast everything into tts_values[] there's no way to
> implement and enforce such a limit. What happens if there's a row with
> multiple large-ish TOAST values? What happens if those rows are in
> different (and distant) parts of the plan?
I think this can be done by tracking the memory usage at the EState level or in a global variable, disabling it if it exceeds the limit and resuming it when we free detoasted datums we no longer need. I think no other changes need to be done.

> It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction are the key issues now. IMO, there is still a difference even when both methods support a memory limit. The reason is that my patch can guarantee the cached memory will be reused, so if we set the limit to 10MB, we know all of the 10MB is useful; the TOAST cache method probably can't guarantee that, so to make it effective we would have to set a higher limit.

>> Another difference between the 2 methods is my method has many
>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>> it can save some run-time effort, like the hash_search find / enter run-time
>> of method 2, since I put the values directly into tts_values[*].
>>
>> I'm not sure factor 2 makes a real measurable difference in real
>> cases, so my current concern mainly comes from factor 1.
>> """

> This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention :(

> It'd be good to do
> some measurements, especially on simple queries that can't benefit from
> the detoasting (e.g. because there's no varlena attribute).

This testing for
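For readers following along, the toast-cache design being compared boils down to a small backend-local hash table consulted in detoast.c, along these lines (an illustrative sketch only; names such as toast_cache are assumptions, and the v1/v2 PoC patches differ in whether the cached copy is compressed):

    typedef struct ToastCacheKey
    {
        Oid         toastrelid;     /* TOAST table OID, from the toast pointer */
        Oid         valueid;        /* chunk_id identifying the value */
    } ToastCacheKey;

    typedef struct ToastCacheEntry
    {
        ToastCacheKey key;
        struct varlena *data;       /* cached copy (compressed in v1, detoasted in v2) */
    } ToastCacheEntry;

    /* in detoast_attr() / toast_fetch_datum(), before doing any I/O */
    entry = hash_search(toast_cache, &key, HASH_FIND, &found);
    if (found)
        return entry->data;         /* hit: skip the I/O (and, in v2, the decompression) */

    /* ... fetch (and possibly decompress) as today, then remember the result ... */
    entry = hash_search(toast_cache, &key, HASH_ENTER, &found);
    entry->data = result;

A total-size counter checked on entry, plus an LRU list threaded through the entries, is the natural place to hang the memory limit and eviction policy debated above.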
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
On Mon, Mar 4, 2024 at 14:54, Matthias van de Meent < boekewurm+postg...@gmail.com> wrote:
> On Mon, 4 Mar 2024 at 18:39, Ranier Vilela wrote:
> >
> > Hi,
> >
> > The function varstr_cmp is a critical function. Inside the function there is a shortcut condition which allows a quick exit.
> >
> > Unfortunately, the current code calls a very expensive function beforehand; if the test is true, all of that call's time is wasted. So, IMO, it's better to postpone the function call until it is actually necessary.
>
> Thank you for your contribution.
>
> I agree it would be better, but your current patch is incorrect, because we need to check if the user has access to the locale (and throw an error if not) before we return that the two strings are equal.

I can't see any user validation in the function pg_newlocale_from_collation.

meson test passes all checks.

best regards, Ranier Vilela
Re: RFC: Logging plan of the running query
On Sat, Mar 2, 2024 at 10:46 AM James Coleman wrote: > If I can rephrase this idea: it's basically "delay this interrupt > until inline to the next ExecProcNode execution". Yes, but it's not just that. It also means that the code which would handle the interrupt doesn't need to be called at every ExecProcNode. Only when the interrupt actually arrives do we enable the code that handles it. -- Robert Haas EDB: http://www.enterprisedb.com
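Roughly, the technique leans on the executor's existing ExecProcNode indirection; a sketch of the idea (illustrative only - the flag and wrapper names here are not from the actual patch):

    static TupleTableSlot *
    ExecProcNodeWithPlanLogging(PlanState *node)
    {
        if (LogQueryPlanPending)            /* set by the interrupt handler */
        {
            LogQueryPlanPending = false;
            /* safe point: log the plan of the running query here */
        }

        /* restore the real function so later calls pay no overhead */
        node->ExecProcNode = node->ExecProcNodeReal;
        return node->ExecProcNodeReal(node);
    }

The interrupt handler only sets the flag and swaps in the wrapper; until an interrupt actually arrives, every ExecProcNode() call goes straight to the node's real function.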
Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)
Hi,

On 2024-03-04 08:47:11 -0300, Ranier Vilela wrote:
> Does filling a memory area, one by one, with branches, need strong evidence to prove that it is better than filling a memory area, all at once, without branches?

That's a bogus comparison:

a) the memset version modifies the whole array, rather than just the elements that correspond to an item - often there will be far fewer items than the maximum possible number

b) the memset version initializes array elements that will be set to an actual value anyway

c) switching to memset does not elide any branches, as the branch is still needed

And even without those, it'd still not be obviously better to use an ahead-of-time memset(), as storing lots of values at once is more likely to be bound by memory bandwidth than interleaving different work.

Andres
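To make (a)-(c) concrete, the two shapes being compared look roughly like this (an illustrative sketch, not the actual pruneheap.c code):

    /* per-item fill: writes only slots up to maxoff; the branch picks the value */
    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId itemid = PageGetItemId(page, offnum);

        htsv[offnum] = ItemIdIsNormal(itemid) ?
            heap_prune_satisfies_vacuum(...) : -1;
    }

    /* memset-first: touches all MaxHeapTuplesPerPage slots up front, and the
     * same branch is still needed afterwards to fill in the real items */
    memset(htsv, -1, sizeof(htsv));
    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId itemid = PageGetItemId(page, offnum);

        if (ItemIdIsNormal(itemid))
            htsv[offnum] = heap_prune_satisfies_vacuum(...);
    }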
Re: incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY
On Sat, Feb 24, 2024 at 12:10 PM Noah Misch wrote: > Agreed, those don't touch relation data files. I think you've got all > relation data file mutations. XLOG_DBASE_CREATE_FILE_COPY and XLOG_DBASE_DROP > are the only record types that touch a relation data file without mentioning > it in XLogRecordBlockHeader, XACT_XINFO_HAS_RELFILELOCATORS, or an RM_SMGR_ID > rlocator field. Thanks for the review. I have committed this. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Exponential backoff for auth_delay
Hi,

On Wed, Feb 21, 2024 at 10:26:11PM +0100, Tomas Vondra wrote:
> Thanks for the patch. I took a closer look at v3, so let me share some review comments. Please push back if you happen to disagree with some of it, some of this is going to be more a matter of personal preference.

Thanks! As my patch was based on a previous patch, some of the decisions were carry-overs I am not overly attached to.

> 1) I think it's a bit weird to have two options specifying amount of time, but one is in seconds and one in milliseconds. Won't that be unnecessarily confusing? Could we do both in milliseconds?

Alright, I changed that. See below for a discussion about the GUCs in general.

> 2) The C code defines the GUC as auth_delay.exponential_backoff, but the SGML docs say auth_delay.exp_backoff.

Right, an oversight from the last version where the GUC name got changed but I forgot to change the documentation, fixed.

> 3) Do we actually need the exponential_backoff GUC? Wouldn't it be sufficient to have the maximum value, and if it's -1 treat that as no backoff?

That is a good question, I guess that makes sense. The next question then is: what should the default for (now) auth_delay.max_milliseconds be in this case, -1? Or do we say that, as the default for auth_delay.milliseconds is 0 anyway, why would somebody not want exponential backoff when they switch it on and keep it at the current 10s/1ms? I have not changed that for now, pending further input.

> 4) I think the SGML docs should more clearly explain that the delay is initialized to auth_delay.milliseconds, and reset to this value after successful authentication. The wording kinda implies that, but it's not quite clear I think.

Ok, I added some text to that end. I also added a note that auth_delay.max_milliseconds will mean that the delay doubling never stops.

> 4) I've been really confused by the change from
>
> if (status != STATUS_OK)
> to
> if (status == STATUS_ERROR)
>
> in auth_delay_checks, until I realized those two codes are not the only ones, and we intentionally ignore STATUS_EOF. I think it'd be good to mention that in a comment, it's not quite obvious (I only realized it because the e-mail mentioned it).

Yeah, I agree, I tried to explain that now.

> 5) I kinda like the custom that functions in a contrib module start with a clear common prefix, say auth_delay_ in this case. Matter of personal preference, ofc.

Ok, I changed the functions to have an auth_delay_ prefix throughout.

> 6) I'm a bit skeptical about some acr_array details. Firstly, why would 50 entries be enough? Seems like a pretty low and arbitrary number ... Also, what happens if someone attempts to authenticate, fails to do so, and then disconnects and never tries again? Or just changes IP? Seems like the entry will remain in the array forever, no?

Yeah, that is how v3 of this patch worked. I have changed that now, see below.

> Seems like a way to cause a "limited" DoS - do auth failure from 50 different hosts, to fill the table, and now everyone has to wait the maximum amount of time (if they fail to authenticate).

Right, though the problem would only exist on authentication failures, so it is really rather limited.

> I think it'd be good to consider:
>
> - Making the acr_array a hash table, and larger than 50 entries (I wonder if that should be dynamic / customizable by GUC?).

I would say a GUC would be overkill for this as this would mostly be an implementation detail.
More generally, I think now that entries are expired (see below), this should be less of a concern, so I have not changed this to a hash table for now but doubled MAX_CONN_RECORDS to 100 entries. > - Make sure the entries are eventually expired, based on time (for > example after 2*max_delay?). I went with 5*max_milliseconds - the AuthConnRecord struct now has a last_failed_auth timestamp member; if we increase the delay for a host, we check if any other host expired in the meantime and remove it if so. > - It would be a good idea to log something when we get into the "full > table" and start delaying everyone by max_delay_seconds. (How would > anyone know that's what's happening, seems rather confusing.) Right, I added a log line for that. However, I made it LOG instead of WARNING as I don't think the client would ever see it, would he? Attached is v4 with the above changes. Cheers, Michael >From 6520fb1e768bb8bc26cafad014d08d7e9ac7f12a Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Wed, 27 Dec 2023 15:55:39 +0100 Subject: [PATCH v4] Add optional exponential backoff to auth_delay contrib module. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds two new GUCs for auth_delay, exponential_backoff and max_milliseconds. The former controls whether exponential backoff should be used or not, the latter sets an maximum delay (default is 10s) in case exponential backoff is act
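To illustrate the resulting behavior with assumed settings of auth_delay.milliseconds = 1000 and auth_delay.max_milliseconds = 10000: consecutive failures from one host would sleep roughly 1s, 2s, 4s, 8s, and then 10s (clamped) for every further failure; a successful authentication resets the host's delay back to 1s, and a host record that sees no authentication attempts for 5 * 10s = 50s is expired and its slot reclaimed.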
Re: Add system identifier to backup manifest
On Mon, Mar 4, 2024 at 12:35 AM Amul Sul wrote:
> Yes, you are correct. Both of the current callers of get_controlfile() have
> access to the root directory.
>
> I have dropped the 0002 patch -- I don't have a very strong opinion to
> refactor get_controlfile() apart from saying that it might be good to have
> both versions :) .

I don't have an enormously strong opinion on what the right thing to do is here either, but I am not convinced that the change proposed by Michael is an improvement. After all, that leaves us with the situation where we know the path to the control file in three different places. First, verify_backup_file() does a strcmp() against the string "global/pg_control" to decide whether to call verify_system_identifier(). Then, verify_system_identifier() uses that string to construct a pathname to the file that it will read. Then, get_controlfile() reconstructs the same pathname using its own logic. That's all pretty disagreeable. Hard-coded constants are hard to avoid completely, but here it looks an awful lot like we're trying to hardcode the same constant into as many different places as we can. The now-dropped patch seems like an effort to avoid this, and while it's possible that it wasn't the best way to avoid this, I still think avoiding it somehow is probably the right idea.

I get a compiler warning with 0002, too:

../pgsql/src/backend/backup/basebackup_incremental.c:960:22: warning: call to undeclared function 'GetSystemIdentifier'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] system_identifier = GetSystemIdentifier(); ^ 1 warning generated.

But I've committed 0001.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: postgres_fdw fails to see that array type belongs to extension
I wrote: > Now that the multirange issue is fixed (3e8235ba4), here's a > new version that includes the needed recursion in ALTER EXTENSION. > I spent some more effort on a proper test case, too. I looked this over again and pushed it. regards, tom lane
Re: Fix a typo in pg_rotate_logfile
> On 14 Feb 2024, at 21:48, Daniel Gustafsson wrote: >> On 14 Feb 2024, at 19:51, Nathan Bossart wrote: >> On Wed, Feb 14, 2024 at 10:04:49AM -0500, Tom Lane wrote: >>> Daniel Gustafsson writes: Attached is a diff to show what it would look like to remove adminpack (catalog version bump omitted on purpose to avoid conflicts until commit). >>> >>> I don't see any references you missed, so +1. >> >> Seems reasonable to me, too. > > Thanks! I'll put this in the next CF to keep it open for comments a bit > longer, but will close it early in the CF. This has now been pushed, adminpack has left the building. -- Daniel Gustafsson
Re: Shared detoast Datum proposal
On 3/4/24 18:08, Andy Fan wrote:
> ...
>>
>>> I assumed that releasing all of the memory at the end of the executor once
>>> is not an option since it may consume too much memory. Then, when and
>>> which entry to release becomes a trouble for me. For example:
>>>
>>> QUERY PLAN
>>> --
>>> Nested Loop
>>>    Join Filter: (t1.a = t2.a)
>>>    -> Seq Scan on t1
>>>    -> Seq Scan on t2
>>> (4 rows)
>>>
>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>> in the outer relation. Without the help of the slot's life-cycle system, I
>>> can't think of an answer for the above question.
>>
>> This is true, but how likely are such plans? I mean, surely no one would
>> do a nested loop with sequential scans on reasonably large tables, so how
>> representative is this example?
>
> Actually this is the simplest join case; we still have the same problem with
> Nested Loop + Index Scan, which will be pretty common.

Yes, I understand there are cases where LRU eviction may not be the best choice - like here, where the "t1" value should stay in the cache. But there are also cases where this is the wrong choice, and LRU would be better.

For example, a couple paragraphs down you suggest enforcing the memory limit by disabling detoasting when the limit is reached. That means the detoasting can get disabled because there's a single access to the attribute somewhere "up the plan tree". But what if the other attributes (which now won't be detoasted) are accessed many times until then?

I think LRU is a pretty good "default" algorithm if we don't have a very good idea of the exact life span of the values, etc. Which is why other nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is accessed (or is likely to be). That might be used to inform a better eviction strategy.

Also, we don't need to evict the whole entry - we could evict just the data part (guaranteed to be fairly large), but keep the header, and keep the counts, expected number of hits, and other info. And use this to e.g. release entries that reached the expected number of hits. But I'd probably start with LRU and only do this as an improvement later.

>> Also, this leads me to the need of having some sort of memory limit. If
>> we may be keeping entries for extended periods of time, and we don't
>> have any way to limit the amount of memory, that does not seem great.
>>
>> AFAIK if we detoast everything into tts_values[] there's no way to
>> implement and enforce such a limit. What happens if there's a row with
>> multiple large-ish TOAST values? What happens if those rows are in
>> different (and distant) parts of the plan?
>
> I think this can be done by tracking the memory usage at the EState level
> or in a global variable, disabling it if it exceeds the limit and
> resuming it when we free detoasted datums we no longer need. I think
> no other changes need to be done.

That seems like a fair amount of additional complexity. And what if the toasted values are accessed in a context without an EState (I haven't checked how common / important that is)?

And I'm not sure about just disabling detoasting as a way to enforce a memory limit, as explained earlier.

>> It seems far easier to limit the memory with the toast cache.
>
> I think the memory limit and entry eviction are the key issues now. IMO,
> there is still a difference even when both methods support a memory limit.
> The reason is that my patch can guarantee the cached memory will be reused,
> so if we set the limit to 10MB, we know all of the 10MB is useful; the
> TOAST cache method probably can't guarantee that, so to make it effective
> we would have to set a higher limit.
Can it actually guarantee that? It can guarantee the slot may be used, but I don't see how it could guarantee the detoasted value will be used. We may be keeping the slot for other reasons. And even if it could guarantee the detoasted value will be used, does that actually prove it's better to keep that value? What if it's only used once, but it's blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value on the outer side of the Nested Loop, what if the inner path has multiple accesses to another 10MB value that now can't be detoasted (as a shared value)?

>>> Another difference between the 2 methods is my method has many
>>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>>> it can save some run-time effort, like the hash_search find / enter run-time
>>> of method 2, since I put the values directly into tts_values[*].
>>>
>>> I'm not sure factor 2 makes a real measurable difference in real
>>> cases, so my current concern mainly comes from factor 1.
>>> """
>>
>> This seems a bit dismissive of the (possible) issue.
>
> Hmm, sorry about this, that is not my intention :(

No need to apologize.

>> It'd be good to do
>> some measurements, especially on simple queries that
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
> On 4 Mar 2024, at 18:22, Nathan Bossart wrote: > > On Mon, Mar 04, 2024 at 03:21:59PM +0100, Daniel Gustafsson wrote: >> Looking at this again I think this is about ready to go in. My only comment >> is >> that doc/src/sgml/archive-modules.sgml probably should be updated to refer to >> setting the errdetail, especially since we document the errormessage there. > > Thanks for reviewing. How does this look? Looks good from a read-through, I like it. A few comments on the commit message only: actionable details about the source of the miconfiguration. This s/miconfiguration/misconfiguration/ Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera Alvaro's name seems wrong. -- Daniel Gustafsson
Re: [PATCH] Exponential backoff for auth_delay
On Mon, Mar 4, 2024 at 2:43 PM Michael Banck wrote:
> > 3) Do we actually need the exponential_backoff GUC? Wouldn't it be
> > sufficient to have the maximum value, and if it's -1 treat that as no
> > backoff?
>
> That is a good question, I guess that makes sense.
>
> The next question then is: what should the default for (now)
> auth_delay.max_milliseconds be in this case, -1? Or do we say that, as
> the default for auth_delay.milliseconds is 0 anyway, why would somebody
> not want exponential backoff when they switch it on and keep it at the
> current 10s/1ms?
>
> I have not changed that for now, pending further input.

I agree that two GUCs here seems to be one more than necessary, but I wonder whether we couldn't just say 0 means no exponential backoff and any other value is the maximum time.

The idea that 0 means unlimited doesn't seem useful in practice. There's always going to be some limit, at least by the number of bits we have in the data type that we're using to do the calculation. But that limit is basically never the right answer. I mean, I think 2^31 milliseconds is roughly 25 days, but it seems unlikely to me that delays measured in days are helpfully more secure than delays measured in minutes, and they don't seem very convenient for users either, and do you really want a failed connection to linger for days before failing? That seems like you're DOSing yourself.

If somebody wants to configure a very large value explicitly, cool, they can do as they like, but we don't need to complicate the interface to make it easier for them to do so.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan wrote:
>
> On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan wrote:
> > Attached is v11, which now says something like that in the commit
> > message.
>
> Attached is v12.

Some initial comments on the documentation:

> +that searches for multiple values together. Queries that use certain
> +SQL constructs to search for rows matching any value
> +out of a list (or an array) of multiple scalar values might perform
> +multiple primitive index scans (up to one primitive scan
> +per scalar value) at runtime. See <xref linkend="functions-comparisons"/>
> +for details.

I don't think the "see ... for details" is correctly worded: The surrounding text implies that it would contain details about in which precise situations multiple primitive index scans would be consumed, while it only contains documentation about IN/NOT IN/ANY/ALL/SOME. Something like the following would fit better IMO:

+that searches for multiple values together. Queries that use certain
+SQL constructs to search for rows matching any value
+out of a list or array of multiple scalar values (such as those described in
+<xref linkend="functions-comparisons"/>) might perform multiple primitive
+index scans (up to one primitive scan per scalar value) at runtime.

Then there is a second issue in the paragraph: Inverted indexes such as GIN might well decide to start counting more than one "primitive scan" per scalar value, because they may need to go through their internal structure more than once to produce results for a single scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a trigram index would likely use at least 4 descents here: one for each of "som", "ome", "wor", "ord".

> > All that really remains now is to research how we might integrate this
> > work with the recently added continuescanPrechecked/haveFirstMatch
> > stuff from Alexander Korotkov, if at all.
>
> The main change in v12 is that I've integrated both the
> continuescanPrechecked and the haveFirstMatch optimizations. Both of
> these fields are now page-level state, shared between the _bt_readpage
> caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they
> appear next to the new home for _bt_checkkeys' continuescan field, in
> the new page state struct).

Cool. I'm planning to review the rest of this patch this week/tomorrow; could you take some time to review some of my btree patches, too?

Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Adding deprecation notices to pgcrypto documentation
> On 16 Nov 2023, at 21:49, Daniel Gustafsson wrote:
>
> In the "Allow tests to pass in OpenSSL FIPS mode" thread [0] it was discovered
> that 3DES is joining the ranks of NIST disallowed algorithms. The attached
> patch adds a small note to the pgcrypto documentation about deprecated uses of
> algorithms. I've kept it to "official" notices such as RFC's and NIST SP's.
> There might be more that deserve a notice, but this seemed like a good start.
>
> Any thoughts on whether this would be helpful?

Cleaning out my inbox I came across this, which I still think is worth doing; any objections to going ahead with it?

-- Daniel Gustafsson
RE: Popcount optimization using AVX512
Hi,

First, apologies about the patch. Find the updated version re-attached.

Now I have some questions.

#1

> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
> +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31)
> +<< 31))
>
> IME this means that the autoconf you are using has been patched. A quick
> search on the mailing lists seems to indicate that it might be specific to
> Debian [1].

I am not sure what the ask is here. I made changes to configure.ac and ran autoconf2.69 to get builds to succeed. Do you have separate feedback here?

#2

As for the refactoring, this was done to satisfy previous review feedback about applying the AVX512 CFLAGS to the entire pg_bitutils.c file, mainly to avoid segfaults due to the AVX512 flags. If it's OK, I would prefer to make a single commit as the change is pretty small and straightforward.

#3

I am not sure I understand the comment about the SIZEOF_VOID_P checks. Aren't they necessary to choose which functions to call based on 32- or 64-bit architectures?

#4

Would this change qualify for Workflow A as described in [0] and can it be picked up by a committer, given it has been reviewed by multiple committers so far? The scope of the change is pretty contained as well.

[0] https://wiki.postgresql.org/wiki/Submitting_a_Patch

Thanks, Paul

-Original Message- From: Nathan Bossart Sent: Friday, March 1, 2024 1:45 PM To: Amonson, Paul D Cc: Andres Freund ; Alvaro Herrera ; Shankaran, Akash ; Noah Misch ; Tom Lane ; Matthias van de Meent ; pgsql-hackers@lists.postgresql.org Subject: Re: Popcount optimization using AVX512

Thanks for the new version of the patch. I didn't see a commitfest entry for this one, and unfortunately I think it's too late to add it for the March commitfest. I would encourage you to add it to July's commitfest [0] so that we can get some routine cfbot coverage.

On Tue, Feb 27, 2024 at 08:46:06PM +, Amonson, Paul D wrote:
> After consulting some Intel internal experts on MSVC the linking issue
> as it stood was not resolved. Instead, I created an MSVC-only work-around.
> This adds one extra function call on the Windows builds (the linker
> resolves a real function just fine but not a function pointer of the
> same name). This extra latency does not exist on any of the other
> platforms. I also believe I addressed all issues raised in the
> previous reviews. The new pg_popcnt_x86_64_accel.c file is now the
> ONLY file compiled with the AVX512 compiler flags. I added support for
> the MSVC compiler flag as well. Both meson and autoconf are updated
> with the new refactor.
>
> I am attaching the new patch.

I think this patch might be missing the new files.

-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) +<< 31))

IME this means that the autoconf you are using has been patched. A quick search on the mailing lists seems to indicate that it might be specific to Debian [1].

-static int pg_popcount32_slow(uint32 word); -static int pg_popcount64_slow(uint64 word); +int pg_popcount32_slow(uint32 word); +int pg_popcount64_slow(uint64 word); +uint64 pg_popcount_slow(const char *buf, int bytes);

This patch appears to do a lot of refactoring. Would it be possible to break out the refactoring parts into a prerequisite patch that could be reviewed and committed independently from the AVX512 stuff?

-#if SIZEOF_VOID_P >= 8 +#if SIZEOF_VOID_P == 8 /* Process in 64-bit chunks if the buffer is aligned.
*/ - if (buf == (const char *) TYPEALIGN(8, buf)) + if (buf == (const char *)TYPEALIGN(8, buf)) { - const uint64 *words = (const uint64 *) buf; + const uint64 *words = (const uint64 *)buf; while (bytes >= 8) { @@ -309,9 +213,9 @@ pg_popcount(const char *buf, int bytes) bytes -= 8; } - buf = (const char *) words; + buf = (const char *)words; } -#else +#elif SIZEOF_VOID_P == 4 /* Process in 32-bit chunks if the buffer is aligned. */ if (buf == (const char *) TYPEALIGN(4, buf)) { Most, if not all, of these changes seem extraneous. Do we actually need to more strictly check SIZEOF_VOID_P? [0] https://commitfest.postgresql.org/48/ [1] https://postgr.es/m/20230211020042.uthdgj72kp3xlqam%40awork3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch Description: v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch
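For reference, the function-pointer dispatch under discussion follows the pattern pg_bitutils.c already uses for pg_popcount32/64, roughly (a sketch; avx512_popcount_available() stands in for the real CPUID/XGETBV probing):

    static uint64 pg_popcount_choose(const char *buf, int bytes);
    static uint64 (*pg_popcount_impl) (const char *buf, int bytes) = pg_popcount_choose;

    static uint64
    pg_popcount_choose(const char *buf, int bytes)
    {
        /* first call only: probe the CPU once, then redirect the pointer */
        if (avx512_popcount_available())
            pg_popcount_impl = pg_popcount_avx512;
        else
            pg_popcount_impl = pg_popcount_slow;

        return pg_popcount_impl(buf, bytes);
    }

The MSVC-only workaround mentioned above adds one more plain function call in front of this, since that linker reportedly resolves a real function but not a function pointer of the same name.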
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
On Mon, Mar 04, 2024 at 09:27:23PM +0100, Daniel Gustafsson wrote: > Looks good from a read-through, I like it. A few comments on the commit > message only: > > actionable details about the source of the miconfiguration. This > s/miconfiguration/misconfiguration/ I reworded the commit message a bit to avoid the word "misconfiguration," as it felt a bit misleading to me. In any case, this was fixed, albeit indirectly. > Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera > Alvaro's name seems wrong. Hm. It looks alright to me. I copied the name from his e-mail signature, which has an accent over the first 'A'. I assume that's why it's not showing up correctly in some places. Anyway, I've committed this now. Thanks for taking a look! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Exponential backoff for auth_delay
Hi, On Mon, Mar 04, 2024 at 03:50:07PM -0500, Robert Haas wrote: > I agree that two GUCs here seems to be one more than necessary, but I > wonder whether we couldn't just say 0 means no exponential backoff and > any other value is the maximum time. Alright, I have changed it so that auth_delay.milliseconds and auth_delay.max_milliseconds are the only GUCs, their default being 0. If the latter is 0, the former's value is always taken. If the latter is non-zero and larger than the former, exponential backoff is applied with the latter's value as maximum delay. If the latter is smaller than the former then auth_delay just sets the delay to the latter, I don't think this is problem or confusing, or should this be considered a misconfiguration? > The idea that 0 means unlimited doesn't seem useful in practice. Yeah, that was more how it was coded than a real policy decision, so let's do away with it. V5 attached. Michael >From 3563d77061480b7e022255b968a39086b0cc8814 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Wed, 27 Dec 2023 15:55:39 +0100 Subject: [PATCH v5] Add optional exponential backoff to auth_delay contrib module. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds the new auth_delay.max_milliseconds GUC. If set (its default is 0), auth_delay adds exponential backoff with this GUC's value as maximum delay. The exponential backoff is tracked per remote host and doubled for every failed login attempt (i.e., wrong password, not just missing pg_hba line or database) and reset to auth_delay.milliseconds after a successful authentication or when no authentication attempts have been made for 5*max_milliseconds from that host. Authors: Michael Banck, based on an earlier patch by 成之焕 Reviewed-by: Abhijit Menon-Sen, Tomas Vondra Discussion: https://postgr.es/m/ahwaxacqiwivoehs5yejpqog.1.1668569845751.hmail.zhch...@ceresdata.com --- contrib/auth_delay/auth_delay.c | 216 ++- doc/src/sgml/auth-delay.sgml | 31 - src/tools/pgindent/typedefs.list | 1 + 3 files changed, 244 insertions(+), 4 deletions(-) diff --git a/contrib/auth_delay/auth_delay.c b/contrib/auth_delay/auth_delay.c index ff0e1fd461..5fb123d133 100644 --- a/contrib/auth_delay/auth_delay.c +++ b/contrib/auth_delay/auth_delay.c @@ -14,24 +14,50 @@ #include #include "libpq/auth.h" +#include "miscadmin.h" #include "port.h" +#include "storage/dsm_registry.h" +#include "storage/ipc.h" +#include "storage/lwlock.h" +#include "storage/shmem.h" #include "utils/guc.h" #include "utils/timestamp.h" PG_MODULE_MAGIC; +#define MAX_CONN_RECORDS 100 + /* GUC Variables */ static int auth_delay_milliseconds = 0; +static int auth_delay_max_milliseconds = 0; /* Original Hook */ static ClientAuthentication_hook_type original_client_auth_hook = NULL; +typedef struct AuthConnRecord +{ + char remote_host[NI_MAXHOST]; + double sleep_time; /* in milliseconds */ + TimestampTz last_failed_auth; +} AuthConnRecord; + +static shmem_startup_hook_type shmem_startup_next = NULL; +static AuthConnRecord *acr_array = NULL; + +static AuthConnRecord *auth_delay_find_acr_for_host(char *remote_host); +static AuthConnRecord *auth_delay_find_free_acr(void); +static double auth_delay_increase_delay_after_failed_conn_auth(Port *port); +static void auth_delay_cleanup_conn_record(Port *port); +static void auth_delay_expire_conn_records(Port *port); + /* * Check authentication */ static void auth_delay_checks(Port *port, int status) { + double delay = auth_delay_milliseconds; + /* * Any other plugins which use 
ClientAuthentication_hook. */ @@ -39,20 +65,190 @@ auth_delay_checks(Port *port, int status) original_client_auth_hook(port, status); /* - * Inject a short delay if authentication failed. + * We handle both STATUS_ERROR and STATUS_OK - the third option + * (STATUS_EOF) is disregarded. + * + * In case of STATUS_ERROR we inject a short delay, optionally with + * exponential backoff. + */ + if (status == STATUS_ERROR) + { + if (auth_delay_max_milliseconds > 0) + { + /* + * Delay by 2^n seconds after each authentication failure from a + * particular host, where n is the number of consecutive + * authentication failures. + */ + delay = auth_delay_increase_delay_after_failed_conn_auth(port); + + /* + * Clamp delay to a maximum of auth_delay_max_milliseconds. + */ + delay = Min(delay, auth_delay_max_milliseconds); + } + + if (delay > 0) + { + elog(DEBUG1, "Authentication delayed for %g seconds due to auth_delay", delay / 1000.0); + pg_usleep(1000L * (long) delay); + } + + /* + * Expire delays from other hosts after auth_delay_max_milliseconds * + * 5. + */ + auth_delay_expire_conn_records(port); + } + + /* + * Remove host-specific delay if authentication succeeded. + */ + if (status == STATUS_OK) + auth_delay_cleanup_conn_record(port); +} + +static double +auth_delay_increase_delay_after_fai
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
Nathan Bossart writes: > On Mon, Mar 04, 2024 at 09:27:23PM +0100, Daniel Gustafsson wrote: >> Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera >> Alvaro's name seems wrong. > Hm. It looks alright to me. I copied the name from his e-mail signature, > which has an accent over the first 'A'. I assume that's why it's not > showing up correctly in some places. I think that git has an expectation of commit log entries being in UTF8. The committed message looks okay from my end, but maybe some encoding mangling happened to the version Daniel was looking at? regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Alvaro Herrera writes: > Pushed that way, but we can discuss further wording improvements/changes > if someone wants to propose any. I just noticed that drongo is complaining about two lines added by commit 53c2a97a9: drongo| 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int' drongo| 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int' These lines are Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY); Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY); These are comparing the address of something with an enum value, which surely cannot be sane. Is the "&" operator incorrect? It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) the numeric value of zero, so I guess the majority of our BF animals are understanding this as "address != NULL". But that doesn't look like a useful test to be making. regards, tom lane
Re: Popcount optimization using AVX512
(Please don't top-post on the Postgres lists.) On Mon, Mar 04, 2024 at 09:39:36PM +, Amonson, Paul D wrote: > First, apologies on the patch. Find re-attached updated version. Thanks for the new version of the patch. >> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) >> +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) >> +<< 31)) >> >> IME this means that the autoconf you are using has been patched. A >> quick search on the mailing lists seems to indicate that it might be >> specific to Debian [1]. > > I am not sure what the ask is here? I made changes to the configure.ac > and ran autoconf2.69 to get builds to succeed. Do you have a separate > feedback here? These LARGE_OFF_T changes are unrelated to the patch at hand and should be removed. This likely means that you are using a patched autoconf that is making these extra changes. > As for the refactoring, this was done to satisfy previous review feedback > about applying the AVX512 CFLAGS to the entire pg_bitutils.c file. Mainly > to avoid segfault due to the AVX512 flags. If its ok, I would prefer to > make a single commit as the change is pretty small and straight forward. Okay. The only reason I suggest this is to ease review. For example, if there is some required refactoring that doesn't involve any functionality changes, it can be advantageous to get that part reviewed and committed first so that reviewers can better focus on the code for the new feature. But, of course, that isn't necessary and/or isn't possible in all cases. > I am not sure I understand the comment about the SIZE_VOID_P checks. > Aren't they necessary to choose which functions to call based on 32 or 64 > bit architectures? Yes. My comment was that the patch appeared to make unnecessary changes to this code. Perhaps I am misunderstanding something here. > Would this change qualify for Workflow A as described in [0] and can be > picked up by a committer, given it has been reviewed by multiple > committers so far? The scope of the change is pretty contained as well. I think so. I would still encourage you to create an entry for this so that it is automatically tested via cfbot [0]. [0] http://commitfest.cputube.org/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Injection points: some tools to wait and wake
On Mon, Mar 04, 2024 at 10:51:41AM +0100, Jelte Fennema-Nio wrote:
> Yeah, it makes sense that you'd want to backport fixes/changes to
> this. As long as you put a disclaimer in the docs that you can do that
> for this module, I think it would be fine. Our tests fairly regularly
> break anyway when changing minor versions of postgres in our CI, e.g.
> due to improvements in the output of isolationtester. So if changes to
> this module require some changes that's fine by me. Seems much nicer
> than having to copy-paste the code.

In my experience, anybody who does serious testing with their product integrated with Postgres has two or three types of builds with their own scripts: one with assertions, -DG and other developer-oriented options enabled, and one for production deployments with more optimized options like -O2. Once there are custom scripts to build and package Postgres, do we really need to move that to contrib/ at all? make install would work for a test module as long as the command is run locally in its directory.

-- Michael signature.asc Description: PGP signature
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I wrote: > It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) > the numeric value of zero, so I guess the majority of our BF > animals are understanding this as "address != NULL". But that > doesn't look like a useful test to be making. In hopes of noticing whether there are other similar thinkos, I permuted the order of the SlruPageStatus enum values, and now I get the expected warnings from gcc: In file included from ../../../../src/include/postgres.h:45, from slru.c:59: slru.c: In function ‘SimpleLruWaitIO’: slru.c:436:38: warning: comparison between pointer and integer Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY); ^~ ../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’ if (!(condition)) \ ^ slru.c: In function ‘SimpleLruWritePage’: slru.c:717:43: warning: comparison between pointer and integer Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY); ^~ ../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’ if (!(condition)) \ ^ So it looks like it's just these two places. regards, tom lane
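Presumably the intended assertions simply drop the &, i.e. (assuming the tests were meant to inspect the status values themselves):

    Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);
    Assert(ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);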
Re: Adding deprecation notices to pgcrypto documentation
On Mon, Mar 04, 2024 at 10:03:13PM +0100, Daniel Gustafsson wrote: > Cleaning out my inbox I came across this which I still think is worth doing, > any objections to going ahead with it? I think the general idea is reasonable, but I have two small comments: * Should this be a "Warning" and/or moved to the top of the page? This seems like a relatively important notice that folks should see when beginning to use pgcrypto. * Should we actually document the exact list of algorithms along with detailed reasons? This list seems prone to becoming outdated. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
On Mon, Mar 04, 2024 at 03:08:03PM -0300, Ranier Vilela wrote:
> I can't see any user validation in the function pg_newlocale_from_collation.

Matthias is right, look closer. I can see more than one check; especially note the one related to the collation version mismatch, which should not be silently ignored.

> meson test passes all checks.

Collations are harder to test because they depend on the environment where the test happens, especially with ICU.

-- Michael signature.asc Description: PGP signature
Re: a wrong index choose when statistics is out of date
On Tue, 5 Mar 2024 at 00:37, Andy Fan wrote:
>
> David Rowley writes:
> > I don't think it would be right to fudge the costs in any way, but I
> > think the risk factor for IndexPaths could take into account the
> > number of unmatched index clauses and increment the risk factor, or
> > "certainty_factor" as it is currently in my brain-based design. That
> > way add_path() would be more likely to prefer the index that matches
> > the most conditions.
>
> This is somewhat similar to my proposal at [1]? What do you think
> about treating 'col op const' as 'col op $1' for the marked column?
> This could just resolve a subset of the questions in your mind, but the
> method looks to have a solid reason.

Do you mean this?

> + /*
> + * To make the planner more robust to handle some inaccurate statistics
> + * issue, we will add a extra cost to qpquals so that the less qpquals
> + * the lower cost it has.
> + */
> + cpu_run_cost += 0.01 * list_length(qpquals);

I don't think it's a good idea to add cost penalties like you proposed there. This is what I meant by "I don't think it would be right to fudge the costs in any way".

If you modify the costs to add some small penalty so that the planner is more likely to favour some other plan, what happens if we then decide the other plan has some issue and we want to penalise that for some other reason? Adding the 2nd penalty might result in the original plan choice again. Which one should be penalised more? I think the uncertainty needs to be tracked separately.

Fudging the costs like this is also unlikely to play nicely with add_path's use of STD_FUZZ_FACTOR. There'd be an incentive to do things like total_cost *= STD_FUZZ_FACTOR; to ensure we get a large enough penalty.

David

> [1] https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com
Re: Add new error_action COPY ON_ERROR "log"
On Mon, Mar 04, 2024 at 05:00:00AM +0530, Bharath Rupireddy wrote: > How about an extra option to error_action ignore-with-verbose which is > similar to ignore but when specified emits one NOTICE per malformed > row? With this, one can say COPY x FROM stdin (ON_ERROR > ignore-with-verbose);. > > Alternatively, we can think of adding a new option verbose altogether > which can be used for not only this but for reporting some other COPY > related info/errors etc. With this, one can say COPY x FROM stdin > (VERBOSE on, ON_ERROR ignore);. I would suggest a completely separate option, because that offers more flexibility as each option has a separate meaning. My main concern in using one option to control them all is that one may want in the future to be able to specify more combinations of actions at query level, especially if more modes are added to the ON_ERROR mode. One option would prevent that. Perhaps ERROR_VERBOSE or ERROR_VERBOSITY would be better names, but I'm never wedded to my naming suggestions. Bad history with the matter. > There's also another way of having a separate GUC, but -100 from me > for it. Because, it not only increases the total number of GUCs by 1, > but also might set a wrong precedent to have a new GUC for controlling > command level outputs. What does this have to do with GUCs? The ON_ERROR option does not have one. -- Michael signature.asc Description: PGP signature
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
Michael Paquier writes:
> On Mon, Mar 04, 2024 at 03:08:03PM -0300, Ranier Vilela wrote:
>> I can't see any user validation in the function pg_newlocale_from_collation.

> Matthias is right, look closer. I can see more than one check,
> especially note the one related to the collation version mismatch that
> should not be silently ignored.

The fast path through that code doesn't include any checks, true, but the point is that finding the entry proves that we made those checks previously. I can't agree with making those semantics squishy in order to save a few cycles in the exact-equality case.

regards, tom lane
Re: [PATCH] updates to docs about HOT updates for BRIN
On Tue, 2024-02-27 at 09:48 -0500, Stephen Frost wrote:
> Attached is an updated patch which drops the 'such as' and adds a
> sentence mentioning that BRIN is the only in-core summarizing index.

The original patch reads more clearly to me. In v4, summarizing (the exception) feels like it's dominating the description.

Also, is it standard practice to backport this kind of doc update? I ordinarily wouldn't be inclined to do so, but v4 seems targeted at 16 as well.

Attached is my own suggested wording, which hopefully addresses Stephen and Alvaro's concerns. I agree that it's tricky to write, so I took a more minimalist approach:

* I got rid of the "In summary" sentence because (a) it's confusing now that we're talking about summarizing indexes; and (b) it's not summarizing anything, it's just redundant.

* I removed the mention of partial or expression indexes. It's a bit redundant and doesn't seem especially helpful in this context.

If this is agreeable I can commit it.

Regards, Jeff Davis

From d17ecaf1af52ba5b055c19b465d77cc53f825747 Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Mon, 26 Feb 2024 17:17:54 -0500 Subject: [PATCH v5] docs: Update HOT update docs for 19d8e2308b Commit 19d8e2308b changed when the HOT update optimization is possible but neglected to update the Heap-Only Tuples (HOT) documentation. This patch updates that documentation accordingly. Author: Elizabeth Christensen Reviewed-By: Stephen Frost, Alvaro Herrera Discussion: https://postgr.es/m/CABoUFXRjisr58Ct_3VsFEdQx+fJeQTWTdJnM7XAp=8mubto...@mail.gmail.com --- doc/src/sgml/storage.sgml | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml index 652946db7d..2e1914b95b 100644 --- a/doc/src/sgml/storage.sgml +++ b/doc/src/sgml/storage.sgml @@ -1097,8 +1097,10 @@ data. Empty in ordinary tables. - The update does not modify any columns referenced by the table's - indexes, including expression and partial indexes. + The update does not modify any columns referenced by the table's indexes, + not including summarizing indexes. The only summarizing index method in + the core PostgreSQL distribution is BRIN. @@ -1114,7 +1116,8 @@ data. Empty in ordinary tables. - New index entries are not needed to represent updated rows. + New index entries are not needed to represent updated rows, however, + summary indexes may still need to be updated. @@ -1130,14 +1133,12 @@ data. Empty in ordinary tables. - In summary, heap-only tuple updates can only be created - if columns used by indexes are not updated. You can - increase the likelihood of sufficient page space for + You can increase the likelihood of sufficient page space for HOT updates by decreasing a table's fillfactor. - If you don't, HOT updates will still happen because - new rows will naturally migrate to new pages and existing pages with - sufficient free space for new row versions. The system view fillfactor. If you + don't, HOT updates will still happen because new rows + will naturally migrate to new pages and existing pages with sufficient free + space for new row versions. The system view pg_stat_all_tables allows monitoring of the occurrence of HOT and non-HOT updates. -- 2.34.1
Re: Synchronizing slots from primary to standby
Here are some review comments for v105-0001

==
doc/src/sgml/config.sgml

1.
+
+        The standbys corresponding to the physical replication slots in
+        standby_slot_names must configure
+        sync_replication_slots = true so they can receive
+        logical failover slots changes from the primary.
+

/slots changes/slot changes/

==
doc/src/sgml/func.sgml

2.
+        The function may be waiting if the specified slot is a logical failover
+        slot and standby_slot_names
+        is configured.

I know this has been through multiple versions already, but this latest wording "may be waiting..." doesn't seem very good to me. How about one of these?

* The function may not be able to return immediately if the specified slot is a logical failover slot and standby_slot_names is configured.

* The function return might be blocked if the specified slot is a logical failover slot and standby_slot_names is configured.

* If the specified slot is a logical failover slot then the function will block until all physical slots specified in standby_slot_names have confirmed WAL receipt.

* If the specified slot is a logical failover slot then the function will not return until all physical slots specified in standby_slot_names have confirmed WAL receipt.

~~~

3.
+        slot may return to an earlier position. The function may be waiting if
+        the specified slot is a logical failover slot and
+        standby_slot_names

Same as previous review comment #2.

==
src/backend/replication/slot.c

4. WaitForStandbyConfirmation

+ * Used by logical decoding SQL functions that acquired logical failover slot.

IIUC it doesn't work like that. pg_logical_slot_get_changes_guts() calls here unconditionally (i.e. the SQL functions don't even check if they are failover slots before calling this), so the comment seems misleading/redundant.

==
src/backend/replication/walsender.c

5. NeedToWaitForWal

+	/*
+	 * Check if the standby slots have caught up to the flushed position. It
+	 * is good to wait up to the flushed position and then let the WalSender
+	 * send the changes to logical subscribers one by one which are already
+	 * covered by the flushed position without needing to wait on every change
+	 * for standby confirmation.
+	 */
+	if (NeedToWaitForStandbys(flushed_lsn, wait_event))
+		return true;
+
+	*wait_event = 0;
+	return false;
+}
+

5a.
The comment (or part of it?) seems misplaced, because it is talking about the WalSender sending changes, but that is not happening in this function. Also, isn't what this is saying already described by the other comment in the caller? e.g.:

+		/*
+		 * Within the loop, we wait for the necessary WALs to be flushed to
+		 * disk first, followed by waiting for standbys to catch up if there
+		 * are enough WALs or upon receiving the shutdown signal. To avoid the
+		 * scenario where standbys need to catch up to a newer WAL location in
+		 * each iteration, we update our idea of the currently flushed
+		 * position only if we are not waiting for standbys to catch up.
+		 */

~

5b.
Most of the code is unnecessary. AFAICT all this is exactly the same as just one line:

return NeedToWaitForStandbys(flushed_lsn, wait_event);

~~~

6. WalSndWaitForWal

+		/*
+		 * Within the loop, we wait for the necessary WALs to be flushed to
+		 * disk first, followed by waiting for standbys to catch up if there
+		 * are enough WALs or upon receiving the shutdown signal. To avoid the
+		 * scenario where standbys need to catch up to a newer WAL location in
+		 * each iteration, we update our idea of the currently flushed
+		 * position only if we are not waiting for standbys to catch up.
+		 */

Regarding that 1st sentence: maybe this logic used to be done explicitly "within the loop", but IIUC this logic is now hidden inside NeedToWaitForWal(), so the comment should mention that.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
On Mon, Feb 26, 2024 at 02:01:45PM +0000, Bertrand Drouvot wrote:
> Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
> (see the attached).

I have looked at what you have here.

First, in a build where 818fefd8fd is included, this makes the test script a lot slower. Most of the logic is quick, but we're spending 10s or so checking that catalog_xmin has advanced. Could it be possible to make that faster?

A second issue is the failure mode when 818fefd8fd is reverted. The test is getting stuck when we are waiting on the standby to catch up, until a timeout decides to kick in to fail the test, and all the previous tests pass. Could it be possible to make that more responsive? I assume that in the failure mode we would get an incorrect conflict_reason for injection_inactiveslot, succeeding in checking the failure.

+my $terminated = 0;
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	if ($node_standby->log_contains(
+		'terminating process .* to release replication slot \"injection_activeslot\"', $logstart))
+	{
+		$terminated = 1;
+		last;
+	}
+	usleep(100_000);
+}
+ok($terminated, 'terminating process holding the active slot is logged with injection point');

The LOG exists when we are sure that the startup process is waiting in the injection point, so this loop could be replaced with something like:

+ $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
+ ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)

Nit: the name of the injection point should be terminate-process-holding-slot rather than TerminateProcessHoldingSlot, to be consistent with the other ones.
--
Michael
Re: Preserve subscription OIDs during pg_upgrade
On Mon, Feb 26, 2024 at 09:51:40AM +0530, Robert Haas wrote: > > I am not sure that it is a good idea to relax that for PG17 at this > > stage of the development cycle, though, as we have already done a lot > > in this area for pg_upgrade and it may require more tweaks during the > > beta period depending on the feedback received, so I would suggest to > > do more improvements for the 18 cycle instead once we have a cleaner > > picture of the whole. > > That's fair. > > I want to say that, unlike Tom, I'm basically in favor of preserving > OIDs in more places across updates. It seems to have little downside > and improve the understandability of the outcome. But that's separate > from whether it is a good idea to build on that infrastructure in any > particular way in the time we have left for this release. Yes, the _minimal_ approach has changed in the past few years to make pg_upgrade debugging easier. The original design was ultra-conservative where it could be, considering how radical the core functionality was. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 4, 2024 at 8:48 PM John Naylor wrote:
>
> On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada wrote:
> >
> > On Sun, Mar 3, 2024 at 2:43 PM John Naylor wrote:
> > > Right, I should have said "reset". Resetting a context will delete
> > > its children as well, and seems like it should work to reset the tree
> > > context, and we don't have to know whether that context actually
> > > contains leaves at all. That should allow copying "tree context" to
> > > "leaf context" in the case where we have no special context for
> > > leaves.
> >
> > Resetting the tree->context seems to work. But I think we should note
> > for callers that the dsa_area passed to RT_CREATE should be created in
> > a different context than the context passed to RT_CREATE, because
> > otherwise RT_FREE() will also free the dsa_area. For example, the
> > following code in test_radixtree.c will no longer work:
> >
> > dsa = dsa_create(tranche_id);
> > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> > :
> > rt_free(radixtree);
> > dsa_detach(dsa); // dsa is already freed.
> >
> > So I think that a practical usage of the radix tree will be that the
> > caller creates a memory context for a radix tree and passes it to
> > RT_CREATE().
>
> That sounds workable to me.
>
> > I've attached an updated patch set:
> >
> > - 0008 updates RT_FREE_RECURSE().
>
> Thanks!
>
> > - 0009 patch is an updated version of cleanup radix tree memory handling.
>
> Looks pretty good, as does the rest. I'm going through again,
> squashing and making tiny adjustments to the template. The only thing
> not done is changing the test with many values to resemble the perf
> test more.
>
> I wrote:
> > > Secondly, I thought about my recent work to skip checking if we first
> > > need to create a root node, and that has a harmless (for vacuum at
> > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > the key is bigger than 255, new nodes will go on top of the root node.
> > > These have chunk '0'. If all subsequent keys are big enough, the
> > > original root node will stay empty. If all keys are deleted, there will
> > > be a chain of empty nodes remaining. Again, I believe this is
> > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP
> > > to call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > on this, but likely not today.
> >
> > This turns out to be a lot trickier than it looked, so it seems best
> > to allow a trivial amount of waste, as long as it's documented
> > somewhere. It also wouldn't be terrible to re-add those branches,
> > since they're highly predictable.
>
> I put a little more work into this, and got it working, just needs a
> small amount of finicky coding. I'll share tomorrow.
>
> I have a question about RT_FREE_RECURSE:
>
> + check_stack_depth();
> + CHECK_FOR_INTERRUPTS();
>
> I'm not sure why these are here: The first seems overly paranoid,
> although harmless, but the second is probably a bad idea. Why should
> the user be able to interrupt the freeing of memory?

Good catch. We should not check for interrupts there.

> Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> tree, leading to coding like "iter->tree->ctl->root". I *think* it
> would be easier to read if the tree was a parameter to these iteration
> functions. That would require an API change, so the tests/tidstore
> would have some churn. I can do that, but before trying I wanted to
> see what you think -- is there some reason to keep the current way?

I considered both usages; there are two reasons for the current style. I'm concerned that if we pass both the tree and RT_ITER to the iteration functions, the caller could mistakenly pass a different tree than the one that was specified to create the RT_ITER. And the second reason is just to make it consistent with other data structures such as dynahash.c and dshash.c, but I now realize that in simplehash.h we pass both the hash table and the iterator.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
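[Editor's note: rough fragments to visualize the two API shapes being compared; the actual RT_* template signatures differ and none of this is taken from the patch.]

    /* Current style: the iterator carries its tree pointer internally. */
    RT_ITER    *iter = RT_BEGIN_ITERATE(tree);
    uint64      key;
    RT_VALUE_TYPE *value;

    while ((value = RT_ITERATE_NEXT(iter, &key)) != NULL)
        do_something(key, value);   /* do_something is a placeholder */
    RT_END_ITERATE(iter);

    /*
     * Alternative (like simplehash.h): the tree is passed to each call, so
     * "iter->tree" goes away, but a caller could pass a mismatched tree.
     */
    RT_ITER     iter2;

    RT_BEGIN_ITERATE(tree, &iter2);
    while ((value = RT_ITERATE_NEXT(tree, &iter2, &key)) != NULL)
        do_something(key, value);

The trade-off is exactly the one stated above: the first form cannot be misused with the wrong tree, while the second reads more directly and avoids the extra pointer hop inside the iterator.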
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro,

Thanks for giving comments!

> > I agreed it sounds good, but I don't think it can be implemented by
> > the current interface. An interface for dynamically allocating memory is
> > GetNamedDSMSegment(), and it returns the same shared memory region if
> > the input names are the same. Therefore, there is no way to re-alloc the
> > shared memory.

> Yeah, I was imagining something like this: the workitem-array becomes a
> struct, which has a name and a "next" pointer and a variable number of
> workitem slots; the AutoVacuumShmem struct has a pointer to the first
> workitem-struct and the last one; when a workitem is requested by
> brininsert, we initially allocate via GetNamedDSMSegment("workitem-0") a
> workitem-struct with a smallish number of elements; if we request
> another workitem and the array is full, we allocate another array via
> GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0
> (so that the list can be followed by an autovacuum worker that's
> processing the database), and it's also set as the tail of the list in
> AutoVacuumShmem (so that we know where to store further work items).
> When all items in a workitem-struct are processed, we can free it
> (I guess via dsm_unpin_segment), and make AutoVacuumShmem->av_workitems
> point to the next one in the list.
>
> This way, the "array" can grow arbitrarily.

Basically sounds good. My concerns are:

* GetNamedDSMSegment() does not return a raw pointer to the dsm_segment. This means that it may be difficult to do dsm_unpin_segment on the caller side.

* Dynamic shared memory is recorded in a dshash table (dsm_registry_table) and the entry won't be deleted, so a stale reference to the chunk might remain.

Overall, the dsm_registry may also need to be extended. I have not started working on this yet, so I will share results after trying it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
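[Editor's note: a rough C sketch of the chained-chunk layout discussed above. All names and sizes are invented, not code from any patch; the link is stored as the next segment's registry name rather than a raw pointer, on the assumption that GetNamedDSMSegment() may map a segment at different addresses in different backends.]

    /* Hypothetical chunk layout for a chained work-item list. */
    #define WORKITEMS_PER_CHUNK 256

    typedef struct AutoVacuumWorkItemChunk
    {
        char        next_name[64];  /* registry name of the next chunk, or ""
                                     * if this chunk is the tail of the list */
        int         num_used;       /* number of slots used in this chunk */
        AutoVacuumWorkItem items[WORKITEMS_PER_CHUNK];
    } AutoVacuumWorkItemChunk;

    /*
     * Sketch of growing the list when the tail chunk is full; the DSM
     * registry creates the segment on first use of a new name.
     */
    static AutoVacuumWorkItemChunk *
    workitem_chunk_extend(AutoVacuumWorkItemChunk *tail, int next_no)
    {
        bool        found;

        snprintf(tail->next_name, sizeof(tail->next_name),
                 "autovacuum-workitem-%d", next_no);
        return GetNamedDSMSegment(tail->next_name,
                                  sizeof(AutoVacuumWorkItemChunk),
                                  NULL, &found);
    }

As the mail points out, the hard part is the release path: the registry keeps its entry (and the segment pinned) after GetNamedDSMSegment(), so actually freeing a drained chunk would require extending the registry itself.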
Re: Switching XLog source from archive to streaming when primary available
cfbot claims that this one needs another rebase.

I've spent some time thinking about this one. I'll admit I'm a bit worried about adding more complexity to this state machine, but I also haven't thought of any other viable approaches, and this still seems like a useful feature. So, for now, I think we should continue with the current approach.

+        fails to switch to stream mode, it falls back to archive mode. If this
+        parameter value is specified without units, it is taken as
+        milliseconds. Default is 5min. With a lower value

Does this really need to be milliseconds? I would think that any reasonable setting would be at least on the order of seconds.

+        attempts. To avoid this, it is recommended to set a reasonable value.

I think we might want to suggest what a "reasonable value" is.

+	static bool canSwitchSource = false;
+	bool		switchSource = false;

IIUC "canSwitchSource" indicates that we are trying to force a switch to streaming, but we are currently exhausting anything that's present in the pg_wal directory, while "switchSource" indicates that we should force a switch to streaming right now. Furthermore, "canSwitchSource" is static while "switchSource" is not. Is there any way to simplify this? For example, would it be possible to make an enum that tracks the streaming_replication_retry_interval state?

 	/*
 	 * Don't allow any retry loops to occur during nonblocking
-	 * readahead. Let the caller process everything that has been
-	 * decoded already first.
+	 * readahead if we failed to read from the current source. Let the
+	 * caller process everything that has been decoded already first.
 	 */
-	if (nonblocking)
+	if (nonblocking && lastSourceFailed)
 		return XLREAD_WOULDBLOCK;

Why do we skip this when "switchSource" is set?

+	/* Reset the WAL source switch state */
+	if (switchSource)
+	{
+		Assert(canSwitchSource);
+		Assert(currentSource == XLOG_FROM_STREAM);
+		Assert(oldSource == XLOG_FROM_ARCHIVE);
+		switchSource = false;
+		canSwitchSource = false;
+	}

How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE? Is there no way it could be XLOG_FROM_PG_WAL?

+#streaming_replication_retry_interval = 5min	# time after which standby
+				# attempts to switch WAL source from archive to
+				# streaming replication
+				# in milliseconds; 0 disables

I think we might want to turn this feature off by default, at least for the first release.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
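[Editor's note: one possible shape for the enum suggested above, as a minimal sketch; the names are invented here, not taken from the patch.]

    /*
     * Hypothetical single state variable replacing the two booleans.
     */
    typedef enum WALSourceSwitchState
    {
        SWITCH_SOURCE_NONE,     /* retry interval has not elapsed */
        SWITCH_SOURCE_PENDING,  /* interval elapsed; still draining WAL
                                 * already available from pg_wal */
        SWITCH_SOURCE_NOW       /* ready to force the switch to streaming */
    } WALSourceSwitchState;

    static WALSourceSwitchState switchState = SWITCH_SOURCE_NONE;

A single variable would make the invalid combination (switchSource set while canSwitchSource is not) unrepresentable, so the Assert(canSwitchSource) in the reset block above could go away.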
Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
On Mon, Oct 23, 2023 at 03:25:42PM -0500, Nathan Bossart wrote: > rebased I saw that this thread was referenced elsewhere [0], so I figured I'd take a fresh look. From a quick glance, I'd say 0001 and 0002 are pretty reasonable and could probably be committed for v17. 0003 probably requires some more attention. If there is indeed interest in these changes, I'll try to spend some more time on it. [0] https://postgr.es/m/E0D2F0CE-D27C-49B1-902B-AD8D2427F07E%40yandex-team.ru -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
Hi,

Here are some review comments for v7-0001

1.
/*
 * binaryheap_free
 *
 * Releases memory used by the given binaryheap.
 */
void
binaryheap_free(binaryheap *heap)
{
	pfree(heap);
}

Shouldn't the above function (not modified by the patch) also first free the memory allocated for heap->bh_nodes?

~~~

2.
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}

Strictly speaking, this function doesn't really "make sure" of anything, because the caller checks whether we need more space. All that happens here is allocating more space. Maybe this function comment should say something like "Double the space allocated for nodes."

--
Kind Regards,
Peter Smith.
Fujitsu Australia
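[Editor's note: for concreteness, the fix implied by comment #1 would presumably be just the following — an untested sketch, assuming bh_nodes is a separately palloc'd array as in this patch (repalloc'd by bh_enlarge_node_array above).]

    void
    binaryheap_free(binaryheap *heap)
    {
        /* free the separately allocated node array before the struct itself */
        pfree(heap->bh_nodes);
        pfree(heap);
    }

Without the extra pfree, the node array would leak whenever the heap's containing memory context outlives the heap itself.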
Re: Support "Right Semi Join" plan shapes
On Mon, Mar 4, 2024 at 10:33 AM wenhui qiu wrote:
> Hi Richard,
> Now that the last commitfest for v17 is starting, can you respond to
> Alena Rybakina's points?

Thanks for the reminder. Will do that soon.

Thanks
Richard
Re: Support "Right Semi Join" plan shapes
On Tue, Jan 30, 2024 at 2:51 PM Alena Rybakina wrote:
> I have reviewed your patch and I think it is better to add an Assert for
> JOIN_RIGHT_SEMI to the ExecMergeJoin and ExecNestLoop functions to
> prevent the use of RIGHT_SEMI for these types of joins (NestedLoop
> and MergeJoin).

Hmm, I don't see why this is necessary. The planner should already guarantee that we won't have nestloops/mergejoins with right-semi joins.

Thanks
Richard
Re: PostgreSQL Contributors Updates
On Mon, 4 Mar 2024 at 17:43, Aleksander Alekseev wrote: > > > > All, > > > > > > The PostgreSQL Contributor Page > > > (https://www.postgresql.org/community/contributors/) includes people who > > > have made substantial, long-term contributions of time and effort to the > > > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > > > following people for their contributions. > > > > > > New PostgreSQL Contributors: > > > > > > * Bertrand Drouvot > > > * Gabriele Bartolini > > > * Richard Guo > > > > > > New PostgreSQL Major Contributors: > > > > > > * Alexander Lakhin > > > * Daniel Gustafsson > > > * Dean Rasheed > > > * John Naylor > > > * Melanie Plageman > > > * Nathan Bossart > > > > > > Thank you and congratulations to all! > > > > +1. Congratulations to all! > > Congratulations to all! Congratulations to all!
Re: Shared detoast Datum proposal
>
>> 2. more likely to use up all the memory which is allowed. for example:
>> if we set the limit to 1MB, then we keep more data which will not be
>> used and then consume all of the 1MB.
>>
>> My method resolves this with some help from other modules (kind of
>> invasive) but can control the eviction well and use as little memory
>> as possible.
>>
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us in different directions, so I want to explain more about this, to see if we can get some agreement here.

It is a bit hard to decide which memory limit to set: 1MB, 10MB, 40MB, or 100MB. In my current case it is 40MB at least. Too low a memory limit makes the cache ineffective, while too high a limit causes a potential memory use-up issue in the TOAST cache design. But in my method, even if we set a higher value, memory use is limited to what the query really (nearly) needs, and it would not cache more values until the limit is hit. This would make a noticeable difference when we want to set a high limit and we have some highly active sessions, like 100 * 40MB = 4GB.

> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>> I assumed that releasing all of the memory at the end of the executor
>>>> at once is not an option since it may consume too much memory. Then,
>>>> when and which entry to release becomes a trouble for me. For example:
>>>>
>>>> QUERY PLAN
>>>> ------------------------------
>>>>  Nested Loop
>>>>    Join Filter: (t1.a = t2.a)
>>>>    ->  Seq Scan on t1
>>>>    ->  Seq Scan on t2
>>>> (4 rows)
>>>>
>>>> In this case t1.a needs a longer lifespan than t2.a since it is in the
>>>> outer relation. Without the help from the slot's life-cycle system, I
>>>> can't think of an answer to the above question.
>>>
>>> This is true, but how likely are such plans? I mean, surely no one would
>>> do a nested loop with sequential scans on reasonably large tables, so how
>>> representative is this example?
>>
>> Actually this is the simplest join case; we still have the same problem
>> with Nested Loop + Index Scan, which will be pretty common.
>>
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the cache. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example, a couple of paragraphs down you suggest enforcing the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?

I am not sure I follow here, but I want to explain more about the disable-detoast-sharing logic when the memory limit is hit. When this happens, the detoast sharing is disabled, but since the detoasted datum will be released very soon, when the slot->tts_values[*] is discarded, the 'disable' turns back to 'enable' quickly. So it is not true that once it gets disabled, it can't be enabled any more for the given query.

> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.
>
> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.

Yes, but in the current issue we can get a better estimation with the help of the plan shape, and Memoize depends on some planner information as well. If we bypass the planner information and try to resolve it at the cache level, the code may become too complex as well, and all of that cost is run-time overhead, while the other way is planning-time overhead.

> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.

A great lesson learnt here, thanks for sharing!

As for the current use case, what I want to highlight is that we are "caching" "user data" "locally". USER DATA indicates it might be very huge; it is not common to have 1M tables, but it is much more common to have 1M tuples in one scan, so keeping the headers might add extra memory usage as well, like 10M * 24 bytes = 240MB. LOCALLY means it is not friendly to multiple active sessions. CACHE indicates it is hard to evict correctly. My method also has the USER DATA and LOCALLY attributes, but it would be better at eviction, and eviction then leads to the memory usage issue which is discr