Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada wrote: > > On Mon, Jan 8, 2024 at 8:35 PM John Naylor wrote: > > > > On Wed, Jan 3, 2024 at 9:10 PM John Naylor wrote: > > > > > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada > > > wrote: > > > > > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > > > calls part. I think that even if we expose them, we will still need to > > > > do something like "if (TidStoreIsShared(ts)) > > > > shared_rt_lock_share(ts->tree.shared)", no? > > > > > > I'll come back to this topic separately. > > > > To answer your question, sure, but that "if (TidStoreIsShared(ts))" > > part would be pushed down into a function so that only one place has > > to care about it. > > > > However, I'm starting to question whether we even need that. Meaning, > > lock the tidstore separately. To "lock the tidstore" means to take a > > lock, _separate_ from the radix tree's internal lock, to control > > access to two fields in a separate "control object": > > > > +typedef struct TidStoreControl > > +{ > > + /* the number of tids in the store */ > > + int64 num_tids; > > + > > + /* the maximum bytes a TidStore can use */ > > + size_t max_bytes; > > > > I'm pretty sure max_bytes does not need to be in shared memory, and > > certainly not under a lock: Thinking of a hypothetical > > parallel-prune-phase scenario, one way would be for a leader process > > to pass out ranges of blocks to workers, and when the limit is > > exceeded, stop passing out blocks and wait for all the workers to > > finish. > > True. I agreed that it doesn't need to be under a lock anyway, as it's > read-only. > > > > > As for num_tids, vacuum previously put the similar count in > > > > @@ -176,7 +179,8 @@ struct ParallelVacuumState > > PVIndStats *indstats; > > > > /* Shared dead items space among parallel vacuum workers */ > > - VacDeadItems *dead_items; > > + TidStore *dead_items; > > > > VacDeadItems contained "num_items". What was the reason to have new > > infrastructure for that count? And it doesn't seem like access to it > > was controlled by a lock -- can you confirm? If we did get parallel > > pruning, maybe the count would belong inside PVShared? > > I thought that since the tidstore is a general-purpose data structure > the shared counter should be protected by a lock. One thing I'm > concerned about is that we might need to update both the radix tree > and the counter atomically in some cases. But that's true we don't > need it for lazy vacuum at least for now. Even given the parallel scan > phase, probably we won't need to have workers check the total number > of stored tuples during a parallel scan. > > > > > The number of tids is not that tightly bound to the tidstore's job. I > > believe tidbitmap.c (a possible future client) doesn't care about the > > global number of tids -- not only that, but AND/OR operations can > > change the number in a non-obvious way, so it would not be convenient > > to keep an accurate number anyway. But the lock would still be > > mandatory with this patch. > > Very good point. > > > > > If we can make vacuum work a bit closer to how it does now, it'd be a > > big step up in readability, I think. Namely, getting rid of all the > > locking logic inside tidstore.c and let the radix tree's locking do > > the right thing. We'd need to make that work correctly when receiving > > pointers to values upon lookup, and I already shared ideas for that. 
> > But I want to see if there is any obstacle in the way of removing the > > tidstore control object and its separate lock. > > So I agree to remove both max_bytes and num_items from the control > object. Also, as you mentioned, we can remove the tidstore control > object itself. TidStoreGetHandle() returns a radix tree handle, and we > can pass it to TidStoreAttach(). I'll try it. > I realized that if we remove the whole tidstore control object including max_bytes, processes that have attached to the shared tidstore cannot actually use TidStoreIsFull(), as it always returns true. They cannot use TidStoreReset() either, since it needs to pass max_bytes to RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it could be problematic for general use. If we remove it, we probably need a safeguard to prevent those who attached the tidstore from calling these functions. Or we can keep the control object but remove the lock and num_tids. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
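For illustration, here is a minimal sketch of pushing the "is it shared?" check down into a pair of helpers, so that only one place in tidstore.c has to care whether the underlying radix tree lives in shared memory. This is not actual code: TidStoreSharedLock() and the helper names are hypothetical, and the thread's later idea of relying purely on the radix tree's internal locking would remove even this.

```c
#include "postgres.h"
#include "storage/lwlock.h"

/* assumed to exist elsewhere in tidstore.c; shown only as prototypes */
typedef struct TidStore TidStore;
extern bool TidStoreIsShared(TidStore *ts);
extern LWLock *TidStoreSharedLock(TidStore *ts);

/* the only spot that needs to know whether the store is shared */
static inline void
tidstore_lock_exclusive(TidStore *ts)
{
    if (TidStoreIsShared(ts))
        LWLockAcquire(TidStoreSharedLock(ts), LW_EXCLUSIVE);
}

static inline void
tidstore_unlock(TidStore *ts)
{
    if (TidStoreIsShared(ts))
        LWLockRelease(TidStoreSharedLock(ts));
}
```

Callers would then simply bracket their accesses with tidstore_lock_exclusive()/tidstore_unlock(), whether the store is local or shared.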
plpgsql memory leaks
Hi,

I would like to report a very memory-expensive pattern:

CREATE OR REPLACE FUNCTION public.fx(iter integer)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
declare
  c cursor(m bigint) for select distinct i from generate_series(1, m) g(i);
  t bigint;
  s bigint;
begin
  for i in 1..iter
  loop
    open c(m := i * 1);
    s := 0;
    loop
      fetch c into t;
      exit when not found;
      s := s + t;
    end loop;
    close c;
    raise notice '%=%', i, s;
  end loop;
end;
$function$;

For 100 iterations this script takes 100MB, but the rewritten version

CREATE OR REPLACE FUNCTION public.fx(iter integer)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
declare
  t bigint;
  s bigint;
begin
  for i in 1..iter
  loop
    s := 0;
    for t in select ic from generate_series(1, i * 1) g(ic)
    loop
      s := s + t;
    end loop;
    raise notice '%=%', i, s;
  end loop;
end;
$function$

takes a lot of megabytes of memory too.

Regards

Pavel
Re: alter table add x wrong error position
On 2024-Jan-08, jian he wrote: > hi. > Maybe this is a small printout err_position bug. > > create table atacc2 ( test int, a int, b int) ; > success tests: > alter table atacc2 add CONSTRAINT x PRIMARY KEY (id, b ); > alter table atacc2 add CONSTRAINT x PRIMARY KEY (id, b a); > alter table atacc2 add CONSTRAINT x PRIMARY KEYa (id, b); > > tests have problem: > alter table atacc2 add constraints x unique (test, a, b); > ERROR: syntax error at or near "(" > LINE 1: alter table atacc2 add constraints x unique (test, a, b); > > ^ > ADD either following with the optional keyword "COLUMN" or > "CONSTRAINT" as the doc. > so I should expect the '^' point at "constraints"? Here you're saying to add a column called constraints, of type x; then UNIQUE is parsed by columnDef as ColQualList, which goes to the ColConstraintElem production starting with the UNIQUE keyword: | UNIQUE opt_unique_null_treatment opt_definition OptConsTableSpace so the next thing could be opt_unique_null_treatment or opt_definition or OptConsTableSpace or going back to ColQualList, but none of those start with a '(' parens. So the ( doesn't have a match and you get the syntax error. If you don't misspell CONSTRAINT as "constraints", there's no issue. I don't see any way to improve this. You can't forbid misspellings of the keyword CONSTRAINT, because they can be column names. alter table atacc2 add cnstrnt x unique (test, a, b); ERROR: error de sintaxis en o cerca de «(» LÍNEA 1: alter table atacc2 add cnstrnt x unique (test, a, b); ^ -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
Re: plpgsql memory leaks
pá 12. 1. 2024 v 10:27 odesílatel Pavel Stehule napsal: > Hi > > I have reported very memory expensive pattern: > > CREATE OR REPLACE FUNCTION public.fx(iter integer) > RETURNS void > LANGUAGE plpgsql > AS $function$ > declare > c cursor(m bigint) for select distinct i from generate_series(1, m) g(i); > t bigint; > s bigint; > begin > for i in 1..iter > loop > open c(m := i * 1); > s := 0; > loop > fetch c into t; > exit when not found; > s := s + t; > end loop; > close c; raise notice '%=%', i, s; > end loop; > end; > $function$ > ; > > This script takes for 100 iterations 100MB > > but rewritten > > CREATE OR REPLACE FUNCTION public.fx(iter integer) > RETURNS void > LANGUAGE plpgsql > AS $function$ > declare > t bigint; > s bigint; > begin > for i in 1..iter > loop > s := 0; > for t in select ic from generate_series(1, i * 1) g(ic) > loop > s := s + t; > end loop; > raise notice '%=%', i, s; > end loop; > end; > $function$ > > takes lot of megabytes of memory too. > The megabytes leaks are related to JIT. With JIT off the memory consumption is significantly less although there are some others probably. regards Pavel > Regards > > Pavel >
Re: A failure in t/038_save_logical_slots_shutdown.pl
On Fri, Jan 12, 2024 at 9:28 AM Amit Kapila wrote: > > On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy > wrote: > > > > On Thu, Jan 11, 2024 at 4:35 PM vignesh C wrote: > > > > > > On further analysis, it was found that in the failing test, > > > CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL > > > page header present just before the CHECKPOINT_SHUTDOWN which was > > > causing the failure. We could alternatively reproduce the issue by > > > switching the WAL file before restarting the server like in the > > > attached test change patch. > > > There are a couple of ways to fix this issue a) one by switching the > > > WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN > > > does not get inserted in a new page as in the attached test_fix.patch > > > b) by using pg_walinspect to check that the next WAL record is > > > CHECKPOINT_SHUTDOWN. I have to try this approach. > > > > > > Thanks to Bharath and Kuroda-san for offline discussions and helping > > > in getting to the root cause. > > > > IIUC, the problem the commit e0b2eed tries to solve is to ensure there > > are no left-over decodable WAL records between confirmed_flush LSN and > > a shutdown checkpoint, which is what it is expected from the > > t/038_save_logical_slots_shutdown.pl. How about we have a PG function > > returning true if there are any decodable WAL records between the > > given start_lsn and end_lsn? > > > > But, we already test this in 003_logical_slot during a successful > upgrade. Having an explicit test to do the same thing has some merits > but not sure if it is worth it. If the code added by commit e0b2eed is covered by the new upgrade test, why not remove 038_save_logical_slots_shutdown.pl altogether? > The current test tries to ensure that > during shutdown after we shutdown walsender and ensures that it sends > all the wal records and receipts an ack for the same, there is no > other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes > the test robust enough that it shouldn't show spurious failures like > the current one reported by you, so shall we try to proceed with that? Do you mean something like [1]? It ensures the test passes unless any writes are added (in future) before the publisher restarts in the test which can again make the tests flaky. How do we ensure no one adds anything in before the publisher restart 038_save_logical_slots_shutdown.pl? A note before the restart perhaps? I might be okay with a simple solution like [1] with a note before the restart instead of other complicated ones. [1] diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl index 5a4f5dc1d4..493fdbce2f 100644 --- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl +++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl @@ -60,6 +60,14 @@ $node_subscriber->start; $node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)"); $node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)"); +# On some machines, it was detected that the shutdown checkpoint WAL record +# that gets generated as part of the publisher restart below falls exactly in +# the new page in the WAL file. Due to this, the latest checkpoint location and +# confirmed flush check in compare_confirmed_flush() was failing. Hence, we +# advance WAL by 1 segment before generating some data so that the shutdown +# checkpoint doesn't fall exactly in the new WAL file page. 
+$node_publisher->advance_wal(1); + # Insert some data $node_publisher->safe_psql('postgres', "INSERT INTO test_tbl VALUES (generate_series(1, 5));"); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
plperl and perl 5.38
Perl 5.38 has landed in Debian unstable, and plperl doesn't like it: diff -U3 /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_elog_1.out /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_elog.out --- /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_elog_1.out 2023-07-24 12:47:52.124583553 + +++ /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_elog.out 2024-01-12 10:09:51.065265341 + @@ -76,6 +76,7 @@ RETURN 1; END; $$; +WARNING: could not determine encoding for locale "C.utf8": codeset is "ANSI_X3.4-1968" select die_caller(); NOTICE: caught die die_caller diff -U3 /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_call.out /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_call.out --- /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_call.out 2023-10-17 09:40:01.365865484 + +++ /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_call.out 2024-01-12 10:09:51.413278511 + @@ -64,6 +64,7 @@ RAISE NOTICE '_a: %, _b: %', _a, _b; END $$; +WARNING: could not determine encoding for locale "C.utf8": codeset is "ANSI_X3.4-1968" NOTICE: a: 10, b: NOTICE: _a: 10, _b: 20 DROP PROCEDURE test_proc1; Same problem in 17devel and 16. (Did not try the older branches yet.) Christoph
RE: speed up a logical replica setup
Dear Euler, Here are comments for your v5 patch. 01. In the document, two words target/standby are used as almost the same meaning. Can you unify them? 02. ``` pg_subscriber option ``` There are some mandatory options like -D/-S/-P. It must be listed in Synopsis chapter. 03. ``` pg_subscriber takes the publisher and subscriber connection strings, a cluster directory from a standby server and a list of database names and it sets up a new logical replica using the physical recovery process. ``` I briefly checked other pages and they do not describe accepted options here. A summary of the application should be mentioned. Based on that, how about: ``` pg_subscriber creates a new subscriber from a physical standby server. This allows users to quickly set up logical replication system. ``` 04. ``` The pg_subscriber should be run at the target server. The source server (known as publisher server) should accept logical replication connections from the target server (known as subscriber server). The target server should accept local logical replication connection. ``` I'm not native speaker, but they are not just recommmendations - they are surely required. So, should we replace s/should/has to/? 05. ``` -S conninfo --subscriber-conninfo=conninfo The connection string to the subscriber. For details see . ``` I became not sure whether it is "The connection string to the subscriber.". The server is still physical standby at that time. 06. ``` * IDENTIFICATION * src/bin/pg_subscriber/pg_subscriber.c ``` The identification is not correct. 07. I felt that there were too many global variables and LogicalRepInfo should be refactored. Because... * Some info related with clusters(e.g., subscriber_dir, conninfo, ...) should be gathered in one struct. * pubconninfo/subsconninfo are stored per db, but it is not needed if we have one base_conninfo. * pubname/subname are not needed because we have fixed naming rule. * pg_ctl_path and pg_resetwal_path can be conbimed into one bindir. * num_dbs should not be alone. ... Based on above, how about using structures like below? ``` typedef struct LogicalRepPerdbInfo { Oid oid; char *dbname; boolmade_replslot; /* replication slot was created */ boolmade_publication; /* publication was created */ boolmade_subscription; /* subscription was created */ } LogicalRepPerdbInfo; typedef struct PrimaryInfo { char *base_conninfo; boolmade_transient_replslot; } PrimaryInfo; typedef struct StandbyInfo { char *base_conninfo; char *bindir; char *pgdata; char *primary_slot_name; } StandbyInfo; typedef struct LogicalRepInfo { int num_dbs; LogicalRepPerdbInfo*perdb; PrimaryInfo*primary; StandbyInfo*standby; } LogicalRepInfo; ``` 08. ``` char *subconninfo;/* subscription connection string for logical * replication */ ``` Not sure how we should notate because the target has not been subscriber yet. 09. ``` enum WaitPMResult { POSTMASTER_READY, POSTMASTER_STANDBY, POSTMASTER_STILL_STARTING, POSTMASTER_FAILED }; ``` This enum has been already defined in pg_ctl.c. Not sure we can use the same name. Can we rename to PGSWaitPMResult. or export pre-existing one? 10. ``` /* Options */ static const char *progname; ``` I think it is not an option. 11. ``` /* * Validate a connection string. Returns a base connection string that is a * connection string without a database name plus a fallback application name. * Since we might process multiple databases, each database name will be * appended to this base connection string to provide a final connection string. 
* If the second argument (dbname) is not null, returns dbname if the provided * connection string contains it. If option --database is not provided, uses * dbname as the only database to setup the logical replica. * It is the caller's responsibility to free the returned connection string and * dbname. */ static char * get_base_conninfo(char *conninfo, char *dbname, const char *noderole) ``` Just FYI - adding fallback_application_name may be too optimistic. Currently the output is used by both pg_subscriber and the subscription connection. 12. Can we add an option not to remove log files even if the operations succeeded? 13. ``` /* * Since the standby server is running, check if it is using an * existing replication slot for WAL retention purposes. This * replication slot has no use after the transformation, hence, it * will
Re: tablecmds.c/MergeAttributes() cleanup
On 2024-Jan-11, Alvaro Herrera wrote: > If you look closely at InsertPgAttributeTuples and accompanying code, it > all looks a bit archaic. They seem to be treating TupleDesc as a > glorified array of Form_pg_attribute elements in a convenient packaging. > It's probably cleaner to change these APIs so that they deal with a > Form_pg_attribute array, and not TupleDesc anymore. But we can hack on > that some other day. In addition, it also occurs to me now that maybe it would make sense to change the TupleDesc implementation to use a List of Form_pg_attribute instead of an array, and do away with ->natts. This would let us change all "for ( ... natts ...)" loops into foreach_ptr() loops ... there are only five hundred of them or so --rolls eyes--. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "El sudor es la mejor cura para un pensamiento enfermo" (Bardia)
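For what it's worth, a hypothetical before/after sketch of such a loop conversion (assuming the APIs around TupleDesc handed back a List of Form_pg_attribute; today it is an array plus natts, so the second function is purely illustrative):

```c
#include "postgres.h"
#include "access/tupdesc.h"
#include "catalog/pg_attribute.h"
#include "nodes/pg_list.h"

/* today: index-based loop over the attribute array */
void
walk_attributes_array(TupleDesc tupdesc)
{
    for (int i = 0; i < tupdesc->natts; i++)
    {
        Form_pg_attribute att = TupleDescAttr(tupdesc, i);

        if (att->attisdropped)
            continue;
        /* ... work on att ... */
    }
}

/* hypothetical: the same loop over a List of Form_pg_attribute */
void
walk_attributes_list(List *attrs)
{
    foreach_ptr(FormData_pg_attribute, att, attrs)
    {
        if (att->attisdropped)
            continue;
        /* ... work on att ... */
    }
}
```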
Re: plpgsql memory leaks
Hi, On Fri, Jan 12, 2024 at 11:02:14AM +0100, Pavel Stehule wrote: > pá 12. 1. 2024 v 10:27 odesílatel Pavel Stehule > napsal: > > > Hi > > > > I have reported very memory expensive pattern: [...] > > takes lot of megabytes of memory too. > > The megabytes leaks are related to JIT. With JIT off the memory consumption > is significantly less although there are some others probably. I cannot readily reproduce this. Which version of Postgres is this and on which platform/distribution? Did you try keep jit on but set jit_inline_above_cost to 0? The back-branches have a fix for the above case, i.e. llvmjit memleaks that can be worked-around by setting jit_inline_above_cost=0. Michael
Re: Make attstattarget nullable
On 2024-Jan-11, Peter Eisentraut wrote: > On 10.01.24 14:16, Alvaro Herrera wrote: > > Seems reasonable. Do we really need a catversion bump for this? > > Yes, this changes the order of the fields in pg_attribute. Ah, right. > > In get_attstattarget() I think we should return 0 for dropped columns > > without reading attstattarget, which is useless anyway, and if it did > > happen to return non-null, it might cause us to do stuff, which would be > > a waste. > > I ended up deciding to get rid of get_attstattarget() altogether and just do > the fetching inline in examine_attribute(). Because the previous API and > what you are discussing here is over-designed, since the only caller doesn't > call it with dropped columns or system columns anyway. This way these > issues are contained in the ANALYZE code, not in a very general place like > lsyscache.c. Sounds good. Maybe instead of having examine_attribute hand a -1 target to the analyze functions, we could just put default_statistics_target there. Analyze functions would never receive negative values, and we could remove that from the analyze functions. Maybe make VacAttrStats->attstattarget unsigned while at it. (This could be a separate patch.) > > It's annoying that the new code in index_concurrently_swap() is more > > verbose than the code being replaced, but it seems OK to me, since it > > allows us to distinguish a null value in attstattarget from actual 0 > > without complicating the get_attstattarget API (which I think we would > > have to do if we wanted to use it here.) > > Yeah, this was annoying. Originally, I had it even more complicated, > because I was trying to check if the actual (non-null) values are the same. > But then I realized the new value is never set at this point. I think what > the code is actually about is clearer now. Yeah, it's neat and the comment is clear enough. > And of course the 0003 patch gets rid of it anyway. I again didn't look at 0002 and 0003 very closely, but from 10,000 feet it looks mostly reasonable -- but I think naming the struct FormData_pg_attribute_extra is not a great idea, as it looks like there would have to be a catalog named pg_attribute_extra -- and I don't think I would make the "non-Data" pointer-to-struct typedef either. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Make attstattarget nullable
BTW I wanted to but didn't say so explicitly, so here goes: 0001 looks ready to go in. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
Re: doc: add LITERAL tag to RETURNING
On 2024-Jan-12, Ashutosh Bapat wrote: > On Fri, Jan 12, 2024 at 11:27 AM torikoshia > wrote: > > > > RETURNING is usually tagged with appropriate tags, such as <literal>, > > but not in the 'query' section of COPY. > The patch looks good. Good catch, pushed. It has user-visible effect, so I backpatched it. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "People get annoyed when you try to debug them." (Larry Wall)
Re: Synchronizing slots from primary to standby
On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila wrote: > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila wrote: > > > > +static bool > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > { > > ... > > + /* Slot ready for sync, so sync it. */ > > + else > > + { > > + /* > > + * Sanity check: With hot_standby_feedback enabled and > > + * invalidations handled appropriately as above, this should never > > + * happen. > > + */ > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > + elog(ERROR, > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > + " to remote slot's LSN(%X/%X) as synchronization" > > + " would move it backwards", remote_slot->name, > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > ... > > } > > > > I was thinking about the above code in the patch and as far as I can > > think this can only occur if the same name slot is re-created with > > prior restart_lsn after the existing slot is dropped. Normally, the > > newly created slot (with the same name) will have higher restart_lsn > > but one can mimic it by copying some older slot by using > > pg_copy_logical_replication_slot(). > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > temporarily set to off, the above shouldn't happen. It can only lead > > to invalidated slots on standby. > > > > To close the above race, I could think of the following ways: > > 1. Drop and re-create the slot. > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > ahead of local_slot's LSN then we can update it; but as mentioned in > > your previous comment, we need to update all other fields as well. If > > we follow this then we probably need to have a check for catalog_xmin > > as well. > > > > The second point as mentioned is slightly misleading, so let me try to > rephrase it once again: Emit LOG/WARNING in this case and once > remote_slot's LSN moves ahead of local_slot's LSN then we can update > it; additionally, we need to update all other fields like two_phase as > well. If we follow this then we probably need to have a check for > catalog_xmin as well along remote_slot's restart_lsn. > > > Now, related to this the other case which needs some handling is what > > if the remote_slot's restart_lsn is greater than local_slot's > > restart_lsn but it is a re-created slot with the same name. In that > > case, I think the other properties like 'two_phase', 'plugin' could be > > different. So, is simply copying those sufficient or do we need to do > > something else as well? > > > > Bertrand, Dilip, Sawada-San, and others, please share your opinion on > this problem as I think it is important to handle this race condition. Is there any good use case of copying a failover slot in the first place? If it's not a normal use case and we can probably live without it, why not always disable failover during the copy? FYI we always disable two_phase on copied slots. It seems to me that copying a failover slot could lead to problems, as long as we synchronize slots based on their names. IIUC without the copy, this pass should never happen. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
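To make option (2) from Amit's list above concrete, a rough sketch of what could replace the elog(ERROR) inside synchronize_one_slot() is shown below. The surrounding function and field names are taken from the quoted patch and are assumed here, not actual committed code:

```c
/*
 * Hypothetical replacement for the elog(ERROR): log and skip this slot,
 * retrying on a later sync cycle once the remote slot has moved ahead of
 * the local one.
 */
if (remote_slot->restart_lsn < slot->data.restart_lsn ||
    TransactionIdPrecedes(remote_slot->catalog_xmin,
                          slot->data.catalog_xmin))
{
    ereport(WARNING,
            errmsg("skipping sync of slot \"%s\" for now", remote_slot->name),
            errdetail("The remote slot is behind the local slot; it may have been dropped and re-created with the same name."));
    return false;
}
```

As discussed above, this would also require refreshing the other properties (plugin, two_phase, and so on) once the remote slot catches up.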
Re: Compile warnings in dbcommands.c building with meson
On 2024-Jan-12, jian he wrote: > I saw it sometimes, sometimes not. > Now I think the reason is: > it will appear when you do `-Dbuildtype=release`. > > but it will not occur when I do: > `-Dbuildtype=debug` > > my current meson version is 1.3.1, my ninja version is 1.10.1. Hmm, but why doesn't it happen for other arguments of get_db_info that have pretty much identical code, say src_istemplate? -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "But static content is just dynamic content that isn't moving!" http://smylers.hates-software.com/2007/08/15/fe244d0c.html
Re: Make mesage at end-of-recovery less scary.
Hi, > The errors occurred in a part of the tests for end-of-WAL detection > added in the master branch. These failures were primarily due to > changes in the message contents introduced by this patch. During the > revision, I discovered an issue with the handling of empty pages that > appear in the middle of reading continuation records. In the previous > version, such empty pages were mistakenly identified as indicating a > clean end-of-WAL (that is a LOG). However, they should actually be > handled as a WARNING, since the record curently being read is broken > at the empty pages. The following changes have been made in this > version: > > 1. Adjusting the test to align with the error message changes > introduced by this patch. > > 2. Adding tests for the newly added messages. > > 3. Correcting the handling of empty pages encountered during the > reading of continuation records. (XLogReaderValidatePageHeader) > > 4. Revising code comments. > > 5. Changing the term "log segment" to "WAL > segment". (XLogReaderValidatePageHeader) > > regards. Thanks for the updated patch. ``` +p = (char *) record; +pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + +while (p < pe && *p == 0) +p++; + +if (p == pe) ``` Just as a random thought: perhaps we should make this a separate function, as a part of src/port/. It seems to me that this code could benefit from using vector instructions some day, similarly to memcmp(), memset() etc. Surprisingly there doesn't seem to be a standard C function for this. Alternatively one could argue that one cycle doesn't make much code to reuse and that the C compiler will place SIMD instructions for us. However a counter-counter argument would be that we could use a macro or even better an inline function and have the same effect except getting a slightly more readable code. ``` - * This is just a convenience subroutine to avoid duplicated code in + * This is just a convenience subroutine to avoid duplicate code in ``` This change doesn't seem to be related to the patch. Personally I don't mind it though. All in all I find v28 somewhat scary. It does much more than "making one message less scary" as it was initially intended and what bugged me personally, and accordingly touches many more places including xlogreader.c, xlogrecovery.c, etc. Particularly I have mixed feeling about this: ``` +/* + * Consider it as end-of-WAL if all subsequent bytes of this page + * are zero. We don't bother checking the subsequent pages since + * they are not zeroed in the case of recycled segments. + */ ``` If I understand correctly, if somehow several FS blocks end up being zeroed (due to OS bug, bit rot, restoring from a corrupted for whatever reason backup, hardware failures, ...) there is non-zero chance that PG will interpret this as a normal situation. To my knowledge this is not what we typically do - typically PG would report an error and ask a human to figure out what happened. Of course the possibility of such a scenario is small, but I don't think that as DBMS developers we can ignore it. Does anyone agree or maybe I'm making things up? -- Best regards, Aleksander Alekseev
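For the record, a self-contained sketch of the kind of helper suggested above (the name is made up here; as far as I know nothing like it exists in src/port today):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the given buffer contains only zero bytes.  A plain byte
 * loop like this is usually auto-vectorized by the compiler; an explicit
 * SIMD version could be added later without changing callers.
 */
static inline bool
pg_mem_is_all_zeros(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;

    while (p < end)
    {
        if (*p != 0)
            return false;
        p++;
    }
    return true;
}
```

The xlogreader check quoted above would then reduce to a single call with the remainder of the page as the length.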
Re: Improve the connection failure error messages
Thanks for the review. Attached v2 patch with suggested changes. Please find my response inline. On Fri, Jan 12, 2024 at 8:20 AM Peter Smith wrote: > > Thanks for the patch! Here are a couple of review comments for it. > > == > src/backend/commands/subscriptioncmds.c > > 1. > @@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate, > CreateSubscriptionStmt *stmt, > if (!wrconn) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), > - errmsg("could not connect to the publisher: %s", err))); > + errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, > err))); > > In practice, these commands give errors like: > > test_sub=# create subscription sub1 connection 'dbname=bogus' publication > pub1; > ERROR: could not connect to the publisher: connection to server on > socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "bogus" does not > exist > > and logs like: > > 2024-01-12 12:45:05.177 AEDT [13108] ERROR: could not connect to the > publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed: > FATAL: database "bogus" does not exist > 2024-01-12 12:45:05.177 AEDT [13108] STATEMENT: create subscription > sub1 connection 'dbname=bogus' publication pub1; > > Since the subscription name is already obvious from the statement that > caused the error I'm not sure it benefits much to add this to the > error message (but maybe it is useful if the error message was somehow > read in isolation from the statement). > > Anyway, I felt at least it should include the word "subscription" for > better consistency with the other messages in this patch: > > SUGGESTION > subscription \"%s\" could not connect to the publisher: %s Done. > == > > 2. > + appname = cluster_name[0] ? cluster_name : "walreceiver"; > + > /* Establish the connection to the primary for XLOG streaming */ > - wrconn = walrcv_connect(conninfo, false, false, > - cluster_name[0] ? cluster_name : "walreceiver", > - &err); > + wrconn = walrcv_connect(conninfo, false, false, appname, &err); > if (!wrconn) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), > - errmsg("could not connect to the primary server: %s", err))); > + errmsg("%s could not connect to the primary server: %s", appname, err))); > > I think your new %s should be quoted according to the guidelines at [1]. Done. > == > src/test/regress/expected/subscription.out > > 3. > Apparently, there is no existing regression test case for the ALTER > "could not connect" message because if there was, it would have > failed. Maybe a test should be added? > The ALTER SUBSCRIPTION command does not error out on the user interface if updated with a bad connection string and the connection failure error can only be seen in the respective log file. Due to this behavior, it is not possible to add a test to show the error message as it is done for CREATE SUBSCRIPTION. Let me know if you think there is another way to add this test. -- Thanks, Nisha v2-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Re: Synchronizing slots from primary to standby
On Fri, Jan 12, 2024 at 5:30 PM Masahiko Sawada wrote: > > On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila wrote: > > > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ > > > + else > > > + { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > > + elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > > temporarily set to off, the above shouldn't happen. It can only lead > > > to invalidated slots on standby. > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. If > > > we follow this then we probably need to have a check for catalog_xmin > > > as well. > > > > > > > The second point as mentioned is slightly misleading, so let me try to > > rephrase it once again: Emit LOG/WARNING in this case and once > > remote_slot's LSN moves ahead of local_slot's LSN then we can update > > it; additionally, we need to update all other fields like two_phase as > > well. If we follow this then we probably need to have a check for > > catalog_xmin as well along remote_slot's restart_lsn. > > > > > Now, related to this the other case which needs some handling is what > > > if the remote_slot's restart_lsn is greater than local_slot's > > > restart_lsn but it is a re-created slot with the same name. In that > > > case, I think the other properties like 'two_phase', 'plugin' could be > > > different. So, is simply copying those sufficient or do we need to do > > > something else as well? > > > > > > > Bertrand, Dilip, Sawada-San, and others, please share your opinion on > > this problem as I think it is important to handle this race condition. > > Is there any good use case of copying a failover slot in the first > place? If it's not a normal use case and we can probably live without > it, why not always disable failover during the copy? FYI we always > disable two_phase on copied slots. It seems to me that copying a > failover slot could lead to problems, as long as we synchronize slots > based on their names. IIUC without the copy, this pass should never > happen. > > Regards, > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com There are multiple approaches discussed and tried when it comes to starting a slot-sync worker. 
I am summarizing all here: 1) Make slotsync worker as an Auxiliary Process (like checkpointer, walwriter, walreceiver etc). The benefit this approach provides is, it can control begin and stop in a more flexible way as each auxiliary process could have different checks before starting and can have different stop conditions. But it needs code duplication for process management(start, stop, crash handling, signals etc) and currently it does not support db-connection smoothly (none of the auxiliary process has one so far) We attempted to make slot-sync worker as an auxiliary process and faced some challenges. The slot sync worker needs db-connection and thus needs InitPostgres(). But AuxiliaryProcessMain() and InitPostgres() are not compatible as both invoke common functions and end up setting many callbacks functions twice (with different args). Also InitPostgres() does 'MyBackendId' initialization (which further triggers some stuff) which is not needed for AuxiliaryProcess and so on. And thus in order to make slot-sync worker as an auxiliary process, we need something similar to InitPostgres (trimmed down working version) which needs further detailed analysis. 2) Make slotsync worker as a 'special' process like AutoVacLauncher which is neither an Auxiliary process nor a bgworker one. It allows db-connection and also provides flexibility to have start and stop conditions for a process. But it needs a lot of code-duplication ar
Re: plpgsql memory leaks
pá 12. 1. 2024 v 11:54 odesílatel Michael Banck napsal: > Hi, > > On Fri, Jan 12, 2024 at 11:02:14AM +0100, Pavel Stehule wrote: > > pá 12. 1. 2024 v 10:27 odesílatel Pavel Stehule > > > napsal: > > > > > Hi > > > > > > I have reported very memory expensive pattern: > > [...] > > > > takes lot of megabytes of memory too. > > > > The megabytes leaks are related to JIT. With JIT off the memory > consumption > > is significantly less although there are some others probably. > > I cannot readily reproduce this. > > Which version of Postgres is this and on which platform/distribution? > It was tested on master branch (pg 17) on Fedora 39 > > Did you try keep jit on but set jit_inline_above_cost to 0? > > The back-branches have a fix for the above case, i.e. llvmjit memleaks > that can be worked-around by setting jit_inline_above_cost=0. > I'll do recheck Pavel > > > Michael >
Re: Show WAL write and fsync stats in pg_stat_io
Hi, On Thu, 11 Jan 2024 at 17:28, Melanie Plageman wrote: > > On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz wrote: > > > > On Thu, 11 Jan 2024 at 08:01, Michael Paquier wrote: > > > > > > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote: > > > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz > > > > wrote: > > > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier > > > >> wrote: > > > >> Oh, I understand it now. Yes, that makes sense. > > > >> I thought removing op_bytes completely ( as you said "This patch > > > >> extends it with two more operation sizes, and there are even cases > > > >> where it may be a variable" ) from pg_stat_io view then adding > > > >> something like {read | write | extend}_bytes and {read | write | > > > >> extend}_calls could be better, so that we don't lose any information. > > > > > > > > Upthread, Michael says: > > > > > > > >> I find the use of 1 in this context a bit confusing, because when > > > >> referring to a counter at N, then it can be understood as doing N > > > >> times a operation, > > > > > > > > I didn't understand this argument, so I'm not sure if I agree or > > > > disagree with it. > > > > > > Nazir has mentioned upthread one thing: what should we do for the case > > > where a combination of (io_object,io_context) does I/O with a > > > *variable* op_bytes, because that may be the case for the WAL > > > receiver? For this case, he has mentioned that we should set op_bytes > > > to 1, but that's something I find confusing because it would mean that > > > we are doing read, writes or extends 1 byte at a time. My suggestion > > > would be to use op_bytes = -1 or NULL for the variable case instead, > > > with reads, writes and extends referring to a number of bytes rather > > > than a number of operations. > > > > I agree but we can't do this only for the *variable* cases since > > B_WAL_RECEIVER and other backends use the same > > pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean > > is, if two backends use the same pgstat_count_io_op_time() function > > call in the code; they must count the same thing (number of calls, > > bytes, etc.). It could be better to count the number of bytes for all > > the IOOBJECT_WAL IOs. > > I'm a bit confused by this. pgstat_count_io_op_time() can check > MyBackendType. In fact, you do this to ban the wal receiver already. > It is true that you would need to count all wal receiver normal > context wal object IOOps in the variable way, but I don't see how > pgstat_count_io_op_time() is the limiting factor as long as the > callsite is always doing either the number of bytes or the number of > calls. Apologies for not being clear. Let me try to explain this by giving examples: Let's assume that there are 3 different pgstat_count_io_op_time() calls in the code base and they are labeled as 1, 2 and 3. And let's' assume that: WAL receiver uses 1st and 2nd pgstat_count_io_op_time(), autovacuum uses 2nd and 3rd pgstat_count_io_op_time() and checkpointer uses 3rd pgstat_count_io_op_time() to count IOs. The 1st one is the only pgstat_count_io_op_time() call that must count the number of bytes because of the variable cases and the others count the number of calls or blocks. a) WAL receiver uses both 1st and 2nd => 1st and 2nd pgstat_count_io_op_time() must count the same thing => 2nd pgstat_count_io_op_time() must count the number of bytes as well. 
b) 2nd pgstat_count_io_op_time() started to count the number of bytes => Autovacuum will start to count the number of bytes => 2nd and 3rd both are used by autocavuum => 3rd pgstat_count_io_op_time() must count the number of bytes as well. c) 3rd pgstat_count_io_op_time() started to count the number of bytes => Checkpointer will start to count the number of bytes. The list goes on like this and if we don't have [write | read | extend]_bytes, this effect will be multiplied. > > > > I think these are the three proposals for handling WAL reads: > > > > > > > > 1) setting op_bytes to 1 and the number of reads is the number of bytes > > > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the > > > > number of calls to pg_pread() or similar > > > > 3) setting op_bytes to NULL and the number of reads is the number of > > > > calls to pg_pread() or similar > > > > > > 3) could be a number of bytes, actually. > > > > One important point is that we can't change only reads, if we decide > > to count the number of bytes for the reads; writes and extends should > > be counted as a number of bytes as well. > > Yes, that is true. > > > > > Looking at the patch, I think it is still doing 2. > > > > > > The patch disables stats for the WAL receiver, while the startup > > > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to > > > discard the variable case. > > > > > > > For an unpopular idea: we could add separate [IOOp]_bytes columns for > > > > all those IOOps for which it would be relevant. It kind of stinks but > > > > it wo
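As a self-contained illustration of the "count both" idea (the types and names below are invented for the example and are not the real pgstat structures), tracking calls and bytes side by side would let fixed-size block I/O and variable-size WAL receiver I/O share one accounting path:

```c
#include <stdint.h>

typedef enum SketchIOOp
{
    SK_IO_READ,
    SK_IO_WRITE,
    SK_IO_EXTEND,
    SK_IO_NUM_OPS
} SketchIOOp;

typedef struct SketchIOCounters
{
    uint64_t    calls[SK_IO_NUM_OPS];   /* number of read/write/extend calls */
    uint64_t    bytes[SK_IO_NUM_OPS];   /* bytes moved by those calls */
} SketchIOCounters;

/*
 * One code path for every caller: block-based callers pass nbytes = BLCKSZ
 * (or XLOG_BLCKSZ), while the WAL receiver passes whatever it actually
 * read or wrote.
 */
static inline void
sketch_count_io(SketchIOCounters *c, SketchIOOp op, uint64_t nbytes)
{
    c->calls[op] += 1;
    c->bytes[op] += nbytes;
}
```

A view on top of this could still derive op_bytes for the fixed-size cases (bytes / calls) while showing NULL for the variable ones.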
Re: Compile warnings in dbcommands.c building with meson
On Fri, Jan 12, 2024 at 8:03 PM Alvaro Herrera wrote: > > On 2024-Jan-12, jian he wrote: > > > I saw it sometimes, sometimes not. > > Now I think the reason is: > > it will appear when you do `-Dbuildtype=release`. > > > > but it will not occur when I do: > > `-Dbuildtype=debug` > > > > my current meson version is 1.3.1, my ninja version is 1.10.1. > > Hmm, but why doesn't it happen for other arguments of get_db_info that > have pretty much identical code, say src_istemplate? > git at commit 6780b79d5c580586ae6feb37b9c8b8bf33367886 (HEAD -> master, origin/master, origin/HEAD) the minimum setup that will generate the warning: meson setup --reconfigure ${BUILD} \ -Dprefix=${PG_PREFIX} \ -Dpgport=5462 \ -Dbuildtype=release gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Re: Improve the connection failure error messages
Hi, Thanks for the patch. > Due to this behavior, it is not possible to add a test to show the > error message as it is done for CREATE SUBSCRIPTION. > Let me know if you think there is another way to add this test. I believe it can be done with TAP tests, see for instance: contrib/auto_explain/t/001_auto_explain.pl However I wouldn't insist on including the test in scope of this particular patch. Such a test doesn't currently exist, it can be added as a separate patch, and whether this is actually a useful test (all the tests consume time after all...) is somewhat debatable. Personally I agree that it would be nice to have though. This being said... > The ALTER SUBSCRIPTION command does not error out on the user > interface if updated with a bad connection string and the connection > failure error can only be seen in the respective log file. I wonder if we should fix this. Again, not necessarily in scope of this work, but if this is not a difficult task, again, it would be nice to have. Other than that v2 looks good. -- Best regards, Aleksander Alekseev
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi, On Fri, Jan 12, 2024 at 02:00:01PM +0300, Alexander Lakhin wrote: > Hi, > > 12.01.2024 10:15, Bertrand Drouvot wrote: > > > > For this one, the "good" news is that it looks like that we don’t see the > > "terminating" message not followed by an "obsolete" message (so the engine > > behaves correctly) anymore. > > > > There is simply nothing related to the row_removal_activeslot at all (the > > catalog_xmin > > advanced and there is no conflict). > > Yes, judging from all the failures that we see now, it looks like the > 0001-Fix-race-condition...patch works as expected. Yeah, thanks for confirming, I'll create a dedicated hackers thread for that one. > > Let's give v5 a try? (please apply > > v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch > > too). > > Unfortunately, I've got the failure again (please see logs attached). > (_primary.log can confirm that I have used exactly v5 — I see no > txid_current() calls there...) Okay ;-( Thanks for the testing. Then I can think of: 1) Michael's proposal up-thread (means tweak the test with a retry logic, retrying things if such a standby snapshot is found). 2) Don't report a test error for active slots in case its catalog_xmin advanced. I'd vote for 2) as: - this is a corner case and the vast majority of the animals don't report any issues (means the active slot conflict detection is already well covered). - even on the same animal it should be pretty rare to not have an active slot conflict detection not covered at all (and the "failing" one would be probably moving over time). - It may be possible that 1) ends up failing (as we'd need to put a limit on the retry logic anyhow). What do you think? And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's fully correct, so I'll give it another look. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: plpgsql memory leaks
Hi, On Fri, Jan 12, 2024 at 01:35:24PM +0100, Pavel Stehule wrote: > pá 12. 1. 2024 v 11:54 odesílatel Michael Banck napsal: > > Which version of Postgres is this and on which platform/distribution? > > It was tested on master branch (pg 17) on Fedora 39 > > > Did you try keep jit on but set jit_inline_above_cost to 0? > > > > The back-branches have a fix for the above case, i.e. llvmjit memleaks > > that can be worked-around by setting jit_inline_above_cost=0. I got that wrong, it needs to be -1 to disable it. But if you are already running the master branch, it is probably a separate issue. Michael
Re: reorganize "Shared Memory and LWLocks" section of docs
Hi, > I recently began trying to write documentation for the dynamic shared > memory registry feature [0], and I noticed that the "Shared Memory and > LWLocks" section of the documentation might need some improvement. I know that feeling. > Thoughts? """ Any registered shmem_startup_hook will be executed shortly after each backend attaches to shared memory. """ IMO the word "each" here can give the wrong impression as if there are certain guarantees about synchronization between backends. Maybe we should change this to simply "... will be executed shortly after [the?] backend attaches..." """ should ensure that only one process allocates a new tranche_id (LWLockNewTrancheId) and initializes each new LWLock (LWLockInitialize). """ Personally I think that reminding the corresponding function name here is redundant and complicates reading just a bit. But maybe it's just me. Except for these nitpicks the patch looks good. -- Best regards, Aleksander Alekseev
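Since the documentation section under discussion is exactly about this pattern, here is a compact example of the flow being described. It is a sketch of a hypothetical extension (names like "my_extension" are invented); a real module would adapt the names and add error handling:

```c
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

typedef struct MySharedState
{
    LWLock      lock;
    int         tranche_id;
    int64       counter;
} MySharedState;

static MySharedState *my_state = NULL;
static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static void
my_shmem_request(void)
{
    if (prev_shmem_request_hook)
        prev_shmem_request_hook();
    RequestAddinShmemSpace(sizeof(MySharedState));
}

static void
my_shmem_startup(void)
{
    bool        found;

    if (prev_shmem_startup_hook)
        prev_shmem_startup_hook();

    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
    my_state = ShmemInitStruct("my_extension", sizeof(MySharedState), &found);
    if (!found)
    {
        /* only the first process to get here allocates and initializes */
        my_state->tranche_id = LWLockNewTrancheId();
        LWLockInitialize(&my_state->lock, my_state->tranche_id);
        my_state->counter = 0;
    }
    LWLockRelease(AddinShmemInitLock);

    /* every backend registers a display name for the tranche */
    LWLockRegisterTranche(my_state->tranche_id, "my_extension");
}

void
_PG_init(void)
{
    prev_shmem_request_hook = shmem_request_hook;
    shmem_request_hook = my_shmem_request;
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = my_shmem_startup;
}
```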
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman wrote: > Yes, this is a good point. Seems like writing maintainable code is > really about never lying and then figuring out when hiding the truth > is also lying. Darn! I think that's pretty true. In this particular case, this code has a fair number of preexisting problems of this type, IMHO. It's been touched by many hands, each person with their own design ideas, and the result isn't as coherent as if it were written by a single mind from scratch. But, because this code is also absolutely critical to the system not eating user data, any changes have to be approached with the highest level of caution. I think it's good to be really careful about this sort of refactoring anywhere, because finding out a year later that you broke something and having to go back and fix it is no fun even if the consequences are minor ... here, they might not be. > Ah! I think you are right. Good catch. I could fix this with logic like this: > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > if ((vacrel->nindexes == 0 && space_freed) || > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) Perhaps that would be better written as space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point you might not need to introduce the variable. > I think I made this mistake when working on a different version of > this that combined the prune and no prune cases. I noticed that > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > wonder why this is (and why we don't do it in the prune case). Yeah, this all seems distinctly under-commented. As noted above, this code has grown organically, and some things just never got written down. Looking through the git history, I see that this behavior seems to date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced lazy_scan_noprune(). The comments don't explain the reasoning, but my guess is that it was just an accident. It's not entirely evident to me whether there might ever be good reasons to update the freespace map for a page where we haven't freed up any space -- after all, the free space map isn't crash-safe, so you could always postulate that updating it will correct an existing inaccuracy. But I really doubt that there's any good reason for lazy_scan_prune() and lazy_scan_noprune() to have different ideas about whether to update the FSM or not, especially in an obscure corner case like this. -- Robert Haas EDB: http://www.enterprisedb.com
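Spelling that suggestion out, the combined condition would look roughly like the fragment below. It keeps Melanie's space_freed variable; the surrounding lazy_scan_heap() loop context (blkno, freespace) is assumed rather than shown:

```c
bool        space_freed = vacrel->tuples_deleted > tuples_already_deleted;

/*
 * Record the page's free space now if pruning freed space, or if there
 * will be no second heap pass to do it later (indexes exist but index
 * vacuuming is disabled).
 */
if (space_freed ||
    (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming))
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
```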
Re: cleanup patches for incremental backup
On Thu, Jan 11, 2024 at 8:00 PM Shinoda, Noriyoshi (HPE Services Japan - FSIP) wrote: > Thank you for developing the new tool. I have attached a patch that corrects > the spelling of the --individual option in the SGML file. Thanks, committed. -- Robert Haas EDB: http://www.enterprisedb.com
[PATCH] New predefined role pg_manage_extensions
Hi, I propose to add a new predefined role to Postgres, pg_manage_extensions. The idea is that it allows Superusers to delegate the rights to create, update or delete extensions to other roles, even if those extensions are not trusted or those users are not the database owner. I have attached a WIP patch for this. Thoughts? Michael >From 59497e825184f0de30a18573ffd7d331be3b233d Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Fri, 12 Jan 2024 13:56:59 +0100 Subject: [PATCH] Add new pg_manage_extensions predefined role. This allows any role that is granted this new predefined role to CREATE, UPDATE or DROP extensions, no matter whether they are trusted or not. --- doc/src/sgml/user-manag.sgml | 5 + src/backend/commands/extension.c | 11 ++- src/include/catalog/pg_authid.dat | 5 + 3 files changed, 16 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/user-manag.sgml b/doc/src/sgml/user-manag.sgml index 92a299d2d3..ebb82801ec 100644 --- a/doc/src/sgml/user-manag.sgml +++ b/doc/src/sgml/user-manag.sgml @@ -693,6 +693,11 @@ DROP ROLE doomed_role; database to issue CREATE SUBSCRIPTION. + + pg_manage_extensions + Allow creating, removing or updating extensions, even if the + extensions are untrusted or the user is not the database owner. + diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 226f85d0e3..71481d9a73 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -882,13 +882,14 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, ListCell *lc2; /* - * Enforce superuser-ness if appropriate. We postpone these checks until - * here so that the control flags are correctly associated with the right + * Enforce superuser-ness/membership of the pg_manage_extensions + * predefined role if appropriate. We postpone these checks until here + * so that the control flags are correctly associated with the right * script(s) if they happen to be set in secondary control files. */ if (control->superuser && !superuser()) { - if (extension_is_trusted(control)) + if (extension_is_trusted(control) || has_privs_of_role(GetUserId(), ROLE_PG_MANAGE_EXTENSIONS)) switch_to_superuser = true; else if (from_version == NULL) ereport(ERROR, @@ -897,7 +898,7 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, control->name), control->trusted ? errhint("Must have CREATE privilege on current database to create this extension.") - : errhint("Must be superuser to create this extension."))); + : errhint("Only roles with privileges of the \"%s\" role are allowed to create this extension.", "pg_manage_extensions"))); else ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), @@ -905,7 +906,7 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, control->name), control->trusted ? 
errhint("Must have CREATE privilege on current database to update this extension.") - : errhint("Must be superuser to update this extension."))); + : errhint("Only roles with privileges of the \"%s\" role are allowed to update this extension.", "pg_manage_extensions"))); } filename = get_extension_script_filename(control, from_version, version); diff --git a/src/include/catalog/pg_authid.dat b/src/include/catalog/pg_authid.dat index 82a2ec2862..ac70603d26 100644 --- a/src/include/catalog/pg_authid.dat +++ b/src/include/catalog/pg_authid.dat @@ -94,5 +94,10 @@ rolcreaterole => 'f', rolcreatedb => 'f', rolcanlogin => 'f', rolreplication => 'f', rolbypassrls => 'f', rolconnlimit => '-1', rolpassword => '_null_', rolvaliduntil => '_null_' }, +{ oid => '8801', oid_symbol => 'ROLE_PG_MANAGE_EXTENSIONS', + rolname => 'pg_manage_extensions', rolsuper => 'f', rolinherit => 't', + rolcreaterole => 'f', rolcreatedb => 'f', rolcanlogin => 'f', + rolreplication => 'f', rolbypassrls => 'f', rolconnlimit => '-1', + rolpassword => '_null_', rolvaliduntil => '_null_' }, ] -- 2.39.2
Re: tablecmds.c/MergeAttributes() cleanup
On Fri, Jan 12, 2024 at 5:32 AM Alvaro Herrera wrote: > On 2024-Jan-11, Alvaro Herrera wrote: > > If you look closely at InsertPgAttributeTuples and accompanying code, it > > all looks a bit archaic. They seem to be treating TupleDesc as a > > glorified array of Form_pg_attribute elements in a convenient packaging. > > It's probably cleaner to change these APIs so that they deal with a > > Form_pg_attribute array, and not TupleDesc anymore. But we can hack on > > that some other day. > > In addition, it also occurs to me now that maybe it would make sense to > change the TupleDesc implementation to use a List of Form_pg_attribute > instead of an array, and do away with ->natts. This would let us change > all "for ( ... natts ...)" loops into foreach_ptr() loops ... there are > only five hundred of them or so --rolls eyes--. Open-coding stuff like this can easily work out to a loss, and I personally think we're overly dependent on List. It's not a particularly good abstraction, IMHO, and if we do a lot of work to start using it everywhere because a list is really an array, then what happens when somebody decides that a list really ought to be a skip-list, or a hash table, or some other crazy thing? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Stack overflow issue
On 11/01/2024 19:37, Robert Haas wrote: On Wed, Jan 10, 2024 at 4:25 PM Heikki Linnakangas wrote: The problem with CommitTransactionCommand (or rather AbortCurrentTransaction() which has the same problem) and ShowTransactionStateRec is that they get called in a state where aborting can lead to a panic. If you add a "check_stack_depth()" to them and try to reproducer scripts that Egor posted, you still get a panic. Hmm, that's unfortunate. I'm not sure what to do about that. But I'd still suggest looking for a goto-free approach. Here's one goto-free attempt. It adds a local loop to where the recursion was, so that if you have a chain of subtransactions that need to be aborted in CommitTransactionCommand, they are aborted iteratively. The TBLOCK_SUBCOMMIT case already had such a loop. I added a couple of comments in the patch marked with "REVIEWER NOTE", to explain why I changed some things. They are to be removed before committing. I'm not sure if this is better than a goto. In fact, even if we commit this, I think I'd still prefer to replace the remaining recursive calls with a goto. Recursion feels a weird to me, when we're unwinding the states from the stack as we go. Of course we could use a "for (;;) { ... continue }" construct around the whole function, instead of a goto, but I don't think that's better than a goto in this case. -- Heikki Linnakangas Neon (https://neon.tech) From 44568e6feef79d604bbe168c0839ab2e03c99f8a Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 12 Jan 2024 17:07:25 +0200 Subject: [PATCH v2 1/1] Turn tail recursion into iteration in AbortCurrentTransaction() Usually the compiler will optimize away the tail recursion anyway, but if it doesn't, you can drive the function into stack overflow. For example: (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null The usual fix would be to add a check_stack_depth() to the function, but that's not good in AbortCurrentTransaction(), as you are already trying to clean up after a failure. ereporting there leads to a panic. Fix CurrentTransactionCommand() in a similar fashion, although ereporting would be less bad there. Report by Egor Chindyaskin and Alexander Lakhin. Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru --- src/backend/access/transam/xact.c | 141 -- 1 file changed, 75 insertions(+), 66 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 464858117e0..329a139c991 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -3194,47 +3194,62 @@ CommitTransactionCommand(void) CommitSubTransaction(); s = CurrentTransactionState; /* changed by pop */ } while (s->blockState == TBLOCK_SUBCOMMIT); + /* + * REVIEWER NOTE: We used to have duplicated code here, to handle + * the TBLOCK_END and TBLOCK_PREPARE cases exactly the same way + * that we do above. I replaced that with a recursive call to + * CommitTransactionCommand(), to make this consistent with the + * TBLOCK_SUBABORT_END and TBLOCK_SUBABORT_PENDING handling. This + * can only recurse once, so there's no risk of stack overflow. 
+ */ /* If we had a COMMIT command, finish off the main xact too */ - if (s->blockState == TBLOCK_END) - { -Assert(s->parent == NULL); -CommitTransaction(); -s->blockState = TBLOCK_DEFAULT; -if (s->chain) -{ - StartTransaction(); - s->blockState = TBLOCK_INPROGRESS; - s->chain = false; - RestoreTransactionCharacteristics(&savetc); -} - } - else if (s->blockState == TBLOCK_PREPARE) - { -Assert(s->parent == NULL); -PrepareTransaction(); -s->blockState = TBLOCK_DEFAULT; - } - else + if (s->blockState != TBLOCK_END && s->blockState != TBLOCK_PREPARE) elog(ERROR, "CommitTransactionCommand: unexpected state %s", BlockStateAsString(s->blockState)); + CommitTransactionCommand(); break; /* - * The current already-failed subtransaction is ending due to a - * ROLLBACK or ROLLBACK TO command, so pop it and recursively - * examine the parent (which could be in any of several states). + * The current subtransaction is ending due to a ROLLBACK or + * ROLLBACK TO command. Pop it, as well as any of it parents that + * are also being rolled back. */ case TBLOCK_SUBABORT_END: - CleanupSubTransaction(); - CommitTransactionCommand(); - break; + case TBLOCK_SUBABORT_PENDING: + do { +/* + * The difference between SUBABORT_END and SUBABORT_PENDING is + * whether the subtransaction has already been aborted and we + * just need to clean it up, or if we need to also abort it + * first. + */ +if (s->blockState == TBLOCK_SUBABORT_PENDING) + AbortSubTransaction(); +CleanupSubTransaction(); +s = CurrentTransa
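[For readers skimming the patch above: the transformation it performs is the classic rewrite of tail recursion into a loop so that an arbitrarily long chain of states cannot overflow the stack. A self-contained sketch of the general shape follows; "State" and do_cleanup() are invented names for illustration, not xact.c code.]

    typedef struct State
    {
        struct State *parent;
    } State;

    static void
    do_cleanup(State *s)
    {
        /* release whatever resources belong to s */
    }

    /*
     * Tail-recursive form: a long parent chain can overflow the stack if
     * the compiler does not optimize the tail call away.
     */
    static void
    cleanup_recursive(State *s)
    {
        do_cleanup(s);
        if (s->parent != NULL)
            cleanup_recursive(s->parent);
    }

    /* Iterative form: the same work, but with constant stack depth. */
    static void
    cleanup_iterative(State *s)
    {
        while (s != NULL)
        {
            do_cleanup(s);
            s = s->parent;
        }
    }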
Re: [PATCH] New predefined role pg_manage_extensions
On Fri, 12 Jan 2024 at 15:53, Michael Banck wrote: > I propose to add a new predefined role to Postgres, > pg_manage_extensions. The idea is that it allows Superusers to delegate > the rights to create, update or delete extensions to other roles, even > if those extensions are not trusted or those users are not the database > owner. I agree that extension creation is one of the main reasons people require superuser access, and I think it would be beneficial to try to reduce that. But I'm not sure that such a pg_manage_extensions role would have any fewer permissions than superuser in practice. AFAIK, many extensions that are not marked as trusted are untrusted precisely because they would allow fairly trivial privilege escalation to superuser if they were.
Re: Make all Perl warnings fatal
On 11.01.24 12:29, Bharath Rupireddy wrote: On Sat, Dec 30, 2023 at 12:57 AM Peter Eisentraut wrote: committed With the commit c5385929 converting perl warnings to FATAL, use of psql/safe_psql with timeout parameters [1] fail with the following error: Use of uninitialized value $ret in bitwise and (&) at /home/ubuntu/postgres/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2015. I think what is actually relevant is the timed_out parameter, otherwise the psql/safe_psql function ends up calling "die" and you don't get any further. Perhaps assigning a default error code to $ret instead of undef in PostgreSQL::Test::Cluster - psql() function is the solution. I would put this code my $core = $ret & 128 ? " (core dumped)" : ""; die "psql exited with signal " . ($ret & 127) . "$core: '$$stderr' while running '@psql_params'" if $ret & 127; $ret = $ret >> 8; inside a if (defined $ret) block. Then the behavior would be that the whole function returns undef on timeout, which is usefully different from returning 0 (and matches previous behavior).
Re: tablecmds.c/MergeAttributes() cleanup
Robert Haas writes: > On Fri, Jan 12, 2024 at 5:32 AM Alvaro Herrera > wrote: >> In addition, it also occurs to me now that maybe it would make sense to >> change the TupleDesc implementation to use a List of Form_pg_attribute >> instead of an array, and do away with ->natts. This would let us change >> all "for ( ... natts ...)" loops into foreach_ptr() loops ... there are >> only five hundred of them or so --rolls eyes--. > Open-coding stuff like this can easily work out to a loss, and I > personally think we're overly dependent on List. It's not a > particularly good abstraction, IMHO, and if we do a lot of work to > start using it everywhere because a list is really an array, then what > happens when somebody decides that a list really ought to be a > skip-list, or a hash table, or some other crazy thing? Without getting into opinions on whether List is a good abstraction, I'm -1 on this idea. It would be a large amount of code churn, with attendant back-patching pain, and I just don't see that there is much to be gained. regards, tom lane
Re: the s_lock_stuck on perform_spin_delay
On Wed, Jan 10, 2024 at 10:17 PM Andy Fan wrote: > fixed in v2. Timing the spinlock wait seems like a separate patch from the new sanity checks. I suspect that the new sanity checks should only be armed in assert-enabled builds. I'm doubtful that this method of tracking the wait time will be accurate. And I don't know that we can make it accurate with reasonable overhead. But I don't think we can assume that the time we tried to wait for and the time that we were actually descheduled are the same. -- Robert Haas EDB: http://www.enterprisedb.com
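[To illustrate the measurement problem Robert describes, here is a minimal sketch using the existing instr_time.h macros; it is an editor's illustration, not code from the patch. It shows the gap between the delay that was requested and the wall-clock time that actually elapsed, which is exactly the gap that makes simply summing the requested delays inaccurate, while timing every sleep like this adds its own overhead.]

    #include "portability/instr_time.h"

    /*
     * The requested delay and the elapsed time routinely differ, because
     * the scheduler may keep the process off the CPU for longer than the
     * pg_usleep() argument asks for.
     */
    static uint64
    timed_delay(long request_usecs)
    {
        instr_time  start;
        instr_time  duration;

        INSTR_TIME_SET_CURRENT(start);
        pg_usleep(request_usecs);
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);

        /* often noticeably larger than request_usecs on a busy machine */
        return INSTR_TIME_GET_MICROSEC(duration);
    }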
Re: reorganize "Shared Memory and LWLocks" section of docs
Thanks for reviewing. On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote: > """ > Any registered shmem_startup_hook will be executed shortly after each > backend attaches to shared memory. > """ > > IMO the word "each" here can give the wrong impression as if there are > certain guarantees about synchronization between backends. Maybe we > should change this to simply "... will be executed shortly after > [the?] backend attaches..." I see what you mean, but I don't think the problem is the word "each." I think the problem is the use of passive voice. What do you think about something like Each backend will execute the registered shmem_startup_hook shortly after it attaches to shared memory. > """ > should ensure that only one process allocates a new tranche_id > (LWLockNewTrancheId) and initializes each new LWLock > (LWLockInitialize). > """ > > Personally I think that reminding the corresponding function name here > is redundant and complicates reading just a bit. But maybe it's just > me. Yeah, I waffled on this one. I don't mind removing it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
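[For context, the pattern that quoted sentence describes looks roughly like the sketch below. The LWLock and shmem functions are the real APIs; "MyShmemState" and the string names are invented for illustration.]

    typedef struct MyShmemState
    {
        int     tranche_id;
        LWLock  lock;
    } MyShmemState;

    static MyShmemState *state = NULL;

    static void
    my_shmem_startup(void)
    {
        bool    found;

        LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
        state = ShmemInitStruct("my extension state", sizeof(MyShmemState), &found);
        if (!found)
        {
            /* only the process that gets here first does the initialization */
            state->tranche_id = LWLockNewTrancheId();
            LWLockInitialize(&state->lock, state->tranche_id);
        }
        LWLockRelease(AddinShmemInitLock);

        /* every backend registers the tranche name in its own process */
        LWLockRegisterTranche(state->tranche_id, "my_extension_tranche");
    }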
Re: tablecmds.c/MergeAttributes() cleanup
On Fri, Jan 12, 2024 at 10:09 AM Robert Haas wrote: > Open-coding stuff like this can easily work out to a loss, and I > personally think we're overly dependent on List. It's not a > particularly good abstraction, IMHO, and if we do a lot of work to > start using it everywhere because a list is really an array, then what > happens when somebody decides that a list really ought to be a > skip-list, or a hash table, or some other crazy thing? This paragraph was a bit garbled. I meant to say that open-coding can be better than relying on a canned abstraction, but it came out the other way around. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Make all Perl warnings fatal
On Fri, Jan 12, 2024 at 9:03 PM Peter Eisentraut wrote: > > On 11.01.24 12:29, Bharath Rupireddy wrote: > > On Sat, Dec 30, 2023 at 12:57 AM Peter Eisentraut > > wrote: > >> > >> committed > > > > With the commit c5385929 converting perl warnings to FATAL, use of > > psql/safe_psql with timeout parameters [1] fail with the following > > error: > > > > Use of uninitialized value $ret in bitwise and (&) at > > /home/ubuntu/postgres/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm > > line 2015. > > I think what is actually relevant is the timed_out parameter, otherwise > the psql/safe_psql function ends up calling "die" and you don't get any > further. > > > Perhaps assigning a default error code to $ret instead of undef in > > PostgreSQL::Test::Cluster - psql() function is the solution. > > I would put this code > > my $core = $ret & 128 ? " (core dumped)" : ""; > die "psql exited with signal " >. ($ret & 127) >. "$core: '$$stderr' while running '@psql_params'" >if $ret & 127; > $ret = $ret >> 8; > > inside a if (defined $ret) block. > > Then the behavior would be that the whole function returns undef on > timeout, which is usefully different from returning 0 (and matches > previous behavior). WFM. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: System username in pg_stat_activity
On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot wrote: > > Hi, > > On Thu, Jan 11, 2024 at 02:24:58PM +0100, Magnus Hagander wrote: > > On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot > > wrote: > > > > > > If we go the 2 fields way, then what about auth_identity and auth_method > > > then? > > > > > > Here is an updated patch based on this idea. > > Thanks! > > + > + > + auth_method text > + > + > + The authentication method used for authenticating the connection, or > + NULL for background processes. > + > > I'm wondering if it would make sense to populate it for parallel workers too. > I think it's doable thanks to d951052, but I'm not sure it's worth it (one > could > join based on the leader_pid though). OTOH that would be consistent with > how the SYSTEM_USER behaves with parallel workers (it's populated). I guess one could conceptually argue that "authentication happens int he leader". But we do populate it with the other user records, and it'd be weird if this one was excluded. The tricky thing is that pgstat_bestart() is called long before we deserialize the data. But from what I can tell it should be safe to change it per the attached? That should be AFAICT an extremely short window of time longer before we report it, not enough to matter. > + > + auth_identity text > + > + > + The identity (if any) that the user presented during the > authentication > + cycle before they were assigned a database role. Contains the same > + value as > > Same remark regarding the parallel workers case +: > > - Would it be better to use the `name` datatype for auth_identity? I've been going back and forth. And I think my conclusion is that it's not a postgres identifier, so it shouldn't be. See the earlier discussion, and for example that that's what we do for cert names when SSL is used. > - what about "Contains the same value as the identity part in linkend="system-user" />"? > > + /* > +* Trust doesn't set_authn_id(), but we still need to > store the > +* auth_method > +*/ > + MyClientConnectionInfo.auth_method = uaTrust; > > +1, I think it is useful here to provide "trust" and not a NULL value in the > context of this patch. Yeah, that's probably "independently correct", but actually useful here. > +# pg_stat_activity shold contain trust and empty string for trust auth > > typo: s/shold/should/ Ops. > +# Users with md5 auth should show both auth method and name in > pg_stat_activity > > what about "show both auth method and identity"? Good spot, yeah, I changed it over to identity everywhere else so it should be here as well. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/ diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index 849a03e4b6..269ab4e586 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -1529,6 +1529,9 @@ ParallelWorkerMain(Datum main_arg) /* Attach to the leader's serializable transaction, if SERIALIZABLE. */ AttachSerializableXact(fps->serializable_xact_handle); + /* Report to the stats system that we are started */ + pgstat_bestart(); + /* * We've initialized all of our state now; nothing should change * hereafter. 
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1ad3367159..ce090f6864 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -22,6 +22,7 @@
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
+#include "access/parallel.h"
 #include "access/session.h"
 #include "access/sysattr.h"
 #include "access/tableam.h"
@@ -1235,7 +1236,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	process_session_preload_libraries();

 	/* report this backend in the PgBackendStatus array */
-	if (!bootstrap)
+	if (!bootstrap && !IsParallelWorker())
 		pgstat_bestart();

 	/* close the transaction we started above */
Re: Error "initial slot snapshot too large" in create replication slot
On Thu, Jan 11, 2024 at 9:47 PM Kyotaro Horiguchi wrote: > I understand. So, summirizing the current state briefly, I believe it > as follows: > > a. snapbuild lacks a means to differentiate between top and sub xids > during snapshot building. > > b. Abusing takenDuringRecovery could lead to potential issues, so it > has been rejected. > > c. Dynamic sizing of xip is likely to have a significant impact on > performance (as mentioned in the comments of GetSnapshotData). > > d. (new!) Adding a third storing method is not favored. > > As a method to satisfy these prerequisites, I think one direction > could be to split takenDuringRecovery into flags indicating the > storage method and creation timing. I present this as a last-ditch > effort. It's a rough proposal, and further validation should be > necessary. If this direction is also not acceptable, I'll proceed with > removing the CF entry. I think that the idea of evolving takenDuringRecovery could potentially work for this problem and maybe for some other things as well. I remember studying that flag before and coming to the conclusion that it had some pretty specific and surprising semantics that weren't obvious from the name -- I don't remember the details -- and so I think improving it could be useful. Also, I think that multiple storage methods could be more palatable if there were a clear way to distinguish which one was in use and good comments explaining the distinction in relevant places. However, I wonder whether this whole area is in need of a bigger rethink. There seem to be a number of situations in which the split into xip and subxip arrays is not very convenient, and also some situations where it's quite important. Sometimes we want to record what's committed, and sometimes what isn't. It's all a bit messy and inconsistent. The necessity of limiting snapshot size is annoying, too. I have no real idea what can be done about all of this, but what strikes me is that the current system has grown up incrementally: we started with a data structure designed for the original use case, and now by gradually adding new use cases things have gotten complicated. If we were designing things over from scratch, maybe we'd do it differently and end up with something less messy. And maybe someone can imagine a redesign that is likewise less messy. But on the other hand, maybe not. Perhaps we can't really do any better than what we have. Then the question becomes whether this case is important enough to justify additional code complexity. I don't think I've personally seen users run into this problem so I have no special reason to think that it's important, but if it's causing issues for other people then maybe it is. -- Robert Haas EDB: http://www.enterprisedb.com
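[For readers without snapshot.h at hand, these are the SnapshotData fields the thread keeps referring to, in a paraphrased excerpt; see the header for the authoritative definition and comments. As I understand it, a snapshot taken during recovery stores the known XIDs in subxip[] without separating top-level from subtransaction IDs, which is what item (a) above is getting at.]

    typedef struct SnapshotData
    {
        /* ... */
        TransactionId *xip;         /* in-progress top-level XIDs */
        uint32      xcnt;           /* # of entries in xip[] */
        TransactionId *subxip;      /* in-progress subtransaction XIDs */
        int32       subxcnt;        /* # of entries in subxip[] */
        bool        suboverflowed; /* subxip[] overflowed, so it is incomplete */
        bool        takenDuringRecovery;    /* taken during recovery? */
        /* ... */
    } SnapshotData;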
Re: pgsql: Add new pg_walsummary tool.
On Thu, Jan 11, 2024 at 12:56 PM Robert Haas wrote: > Add new pg_walsummary tool. culicidae is unhappy with this, but I don't yet understand why. The output is: # Failed test 'stdout shows block 0 modified' # at t/002_blocks.pl line 85. # 'TS 1663, DB 5, REL 16384, FORK main: blocks 0..1' # doesn't match '(?^m:FORK main: block 0$)' The test is expecting block 0 to be modified, but block 1 to be unmodified, but here, both blocks are modified. That would maybe make sense if this machine had a really big block size, but that doesn't seem to be the case. Or, maybe the test has erred in failing to disable autovacuum -- though it does take other precautions to try to prevent that from interfering. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pgsql: Add new pg_walsummary tool.
On Thu, Jan 11, 2024 at 1:49 PM Robert Haas wrote: > On Thu, Jan 11, 2024 at 12:56 PM Robert Haas wrote: > > Add new pg_walsummary tool. > > culicidae is unhappy with this, but I don't yet understand why. The output is: > > # Failed test 'stdout shows block 0 modified' > # at t/002_blocks.pl line 85. > # 'TS 1663, DB 5, REL 16384, FORK main: blocks 0..1' > # doesn't match '(?^m:FORK main: block 0$)' > > The test is expecting block 0 to be modified, but block 1 to be > unmodified, but here, both blocks are modified. That would maybe make > sense if this machine had a really big block size, but that doesn't > seem to be the case. Or, maybe the test has erred in failing to > disable autovacuum -- though it does take other precautions to try to > prevent that from interfering. It's not autovacuum, the test is flaky. I ran it in a loop locally until it failed, and then ran pg_waldump, finding this: rmgr: Heaplen (rec/tot): 73/ 8249, tx:738, lsn: 0/0158AEE8, prev 0/01588EB8, desc: UPDATE old_xmax: 738, old_off: 2, old_infobits: [], flags: 0x03, new_xmax: 0, new_off: 76, blkref #0: rel 1663/5/16384 blk 1 FPW, blkref #1: rel 1663/5/16384 blk 0 I'm slightly puzzled, here. I would have expected that if I inserted a bunch of records into the table and then updated one of them, the new record would have gone into a new page at the end of the table, and also that even if it didn't extend the relation, it would go into the same page every time the test was run. But here the behavior seems to be nondeterministic. -- Robert Haas EDB: http://www.enterprisedb.com
Re: index prefetching
Hi, Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review. I'll briefly go through the main changes in the patch, and then will respond in-line to Robert's points. 1) I moved the code from indexam.c to (new) execPrefetch.c. All the prototypes / typedefs now live in executor.h, with only minimal changes in execnodes.h (adding it to scan descriptors). I believe this finally moves the code to the right place - it feels much nicer and cleaner than in indexam.c. And it allowed me to hide a bunch of internal structs and improve the general API, I think. I'm sure there's stuff that could be named differently, but the layering feels about right, I think. 2) A bunch of stuff got renamed to start with IndexPrefetch... to make the naming consistent / clearer. I'm not entirely sure IndexPrefetch is the right name, though - it's still a bit misleading, as it might seem it's about prefetching index stuff, but really it's about heap pages from indexes. Maybe IndexScanPrefetch() or something like that? 3) If there's a way to make this work with the streaming I/O API, I'm not aware of it. But the overall design seems somewhat similar (based on "next" callback etc.) so hopefully that'd make it easier to adopt it. 4) I initially relied on parallelModeOK to disable prefetching, which kinda worked, but not really. Robert suggested to use the execute_once flag directly, and I think that's much better - not only is it cleaner, it also seems more appropriate (the parallel flag considers other stuff that is not quite relevant to prefetching). Thinking about this, I think it should be possible to make prefetching work even for plans with execute_once=false. In particular, when the plan changes direction it should be possible to simply "walk back" the prefetch queue, to get to the "correct" place in in the scan. But I'm not sure it's worth it, because plans that change direction often can't really benefit from prefetches anyway - they'll often visit stuff they accessed shortly before anyway. For plans that don't change direction but may pause, we don't know if the plan pauses long enough for the prefetched pages to get evicted or something. So I think it's OK that execute_once=false means no prefetching. 5) I haven't done anything about the xs_heap_continue=true case yet. 6) I went through all the comments and reworked them considerably. The main comment at execPrefetch.c start, with some overall design etc. And then there are comments for each function, explaining that bit in more detail. Or at least that's the goal - there's still work to do. There's two trivial FIXMEs, but you can ignore those - it's not that there's a bug, but that I'd like to rework something and just don't know how yet. There's also a couple of XXX comments. Some are a bit wild ideas for the future, others are somewhat "open questions" to be discussed during a review. Anyway, there should be no outright obsolete comments - if there's something I missed, let me know. Now to Robert's message ... On 1/9/24 21:31, Robert Haas wrote: > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra > wrote: >> Here's a somewhat reworked version of the patch. 
My initial goal was to >> see if it could adopt the StreamingRead API proposed in [1], but that >> turned out to be less straight-forward than I hoped, for two reasons: > > I guess we need Thomas or Andres or maybe Melanie to comment on this. > Yeah. Or maybe Thomas if he has thoughts on how to combine this with the streaming I/O stuff. >> Perhaps a bigger change is that I decided to move this into a separate >> API on top of indexam.c. The original idea was to integrate this into >> index_getnext_tid/index_getnext_slot, so that all callers benefit from >> the prefetching automatically. Which would be nice, but it also meant >> it's need to happen in the indexam.c code, which seemed dirty. > > This patch is hard to review right now because there's a bunch of > comment updating that doesn't seem to have been done for the new > design. For instance: > > + * XXX This does not support prefetching of heap pages. When such > prefetching is > + * desirable, use index_getnext_tid(). > > But not any more. > True. And this is now even more obsolete, as the prefetching was moved from indexam.c layer to the executor. > + * XXX The prefetching may interfere with the patch allowing us to evaluate > + * conditions on the index tuple, in which case we may not need the heap > + * tuple. Maybe if there's such filter, we should prefetch only pages that > + * are not all-visible (and the same idea would also work for IOS), but > + * it also makes the indexing a bit "aware" of the visibility stuff (which > + * seems a somewhat wrong). Also, maybe we should consider the filter > selectivity > > I'm not sure whether all the problems in this
Re: psql not responding to SIGINT upon db reconnection
On Mon, Jan 8, 2024 at 1:03 AM Tristan Partin wrote: > I think the way to go is to expose some variation of libpq's > pqSocketPoll(), which I would be happy to put together a patch for. > Making frontends, psql in this case, have to reimplement the polling > logic doesn't strike me as fruitful, which is essentially what I have > done. I encourage further exploration of this line of attack. I fear that if I were to commit something like what you've posted up until now, people would complain that that code was too ugly to live, and I'd have a hard time telling them that they're wrong. -- Robert Haas EDB: http://www.enterprisedb.com
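[For context, the hand-rolled logic in question is essentially an interruptible socket wait, something like the standard-POSIX sketch below; this is an editor's illustration, not the submitted patch. Having every frontend re-implement a loop like this is what makes exposing a variant of libpq's internal pqSocketPoll() attractive.]

    #include <errno.h>
    #include <poll.h>
    #include <signal.h>
    #include <stdbool.h>

    /* set by a SIGINT handler installed elsewhere, e.g. with pqsignal() */
    static volatile sig_atomic_t cancel_pressed = 0;

    /* Wait for the socket, but give up promptly once SIGINT has arrived. */
    static bool
    wait_for_socket(int sock)
    {
        while (!cancel_pressed)
        {
            struct pollfd pfd = {.fd = sock, .events = POLLIN | POLLOUT};
            int         rc = poll(&pfd, 1, 100);    /* poll in 100 ms slices */

            if (rc > 0)
                return true;
            if (rc < 0 && errno != EINTR)
                return false;
        }
        return false;
    }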
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas wrote: > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. Why do you think that lazy_scan_prune() and lazy_scan_noprune() have different ideas about whether to update the FSM or not? Barring certain failsafe edge cases, we always call PageGetHeapFreeSpace() exactly once for each scanned page. While it's true that we won't always do that in the first heap pass (in cases where VACUUM has indexes), that's only because we expect to do it in the second heap pass instead -- since we only want to do it once. It's true that lazy_scan_noprune unconditionally calls PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases we know that there won't be any second heap pass, and so in both cases we always call PageGetHeapFreeSpace() in the first heap pass. It's just that it's a bit harder to see that in the lazy_scan_prune case. No? -- Peter Geoghegan
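[In code form, the once-per-scanned-page work Peter is describing is just this pair of calls, issued either in the first heap pass or, when index vacuuming will happen, deferred to the second pass in lazy_vacuum_heap_rel(). This is an editor's sketch, not a quote from vacuumlazy.c.]

    Size        freespace = PageGetHeapFreeSpace(page);

    UnlockReleaseBuffer(buf);
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);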
Re: Report planning memory in EXPLAIN ANALYZE
I think this patch is mostly OK and decided to use it to test something else. In doing so I noticed one small thing that bothers me: if planning uses buffers, and those were requested to be reported, we get something like this at the end of the explain Planning: Buffers: shared hit=120 read=30 Planning Memory: used=67944 bytes allocated=73728 bytes Planning Time: 0.892 ms This looks a bit unpleasant. Wouldn't it be much nicer if the Planning thingies were all reported together in the Planning group? Planning: Buffers: shared hit=120 read=30 Memory: used=67944 bytes allocated=73728 bytes Time: 0.892 ms This is easier said than done. First, moving "Time" to be in the same group would break tools that are already parsing the current format. So maybe we should leave that one alone. So that means we end up with this: Planning: Buffers: shared hit=120 read=30 Memory: used=67944 bytes allocated=73728 bytes Planning Time: 0.892 ms Better than the original, I think, even if not perfect. Looking at the code, this is a bit of a mess to implement, because of the way show_buffer_usage is coded; currently it has an internal increment/decrement of indentation, so in order to report memory we would have to move the indentation change to outside show_buffer_usage (I think it does belong outside, but the determination of whether to print or not is quite messy, so we'd need a new function returning boolean). Alternatively we could move the memory reportage to within show_buffer_usage, but that seems worse. Or we could leave it as you have it, but to me that's akin to giving up on doing it nicely. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "I think my standards have lowered enough that now I think 'good design' is when the page doesn't irritate the living f*ck out of me." (JWZ)
Re: index prefetching
Not a full response, but just to address a few points: On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra wrote: > Thinking about this, I think it should be possible to make prefetching > work even for plans with execute_once=false. In particular, when the > plan changes direction it should be possible to simply "walk back" the > prefetch queue, to get to the "correct" place in in the scan. But I'm > not sure it's worth it, because plans that change direction often can't > really benefit from prefetches anyway - they'll often visit stuff they > accessed shortly before anyway. For plans that don't change direction > but may pause, we don't know if the plan pauses long enough for the > prefetched pages to get evicted or something. So I think it's OK that > execute_once=false means no prefetching. +1. > > + * XXX We do add the cache size to the request in order not to > > + * have issues with uint64 underflows. > > > > I don't know what this means. > > > > There's a check that does this: > > (x + PREFETCH_CACHE_SIZE) >= y > > it might also be done as "mathematically equivalent" > > x >= (y - PREFETCH_CACHE_SIZE) > > but if the "y" is an uint64, and the value is smaller than the constant, > this would underflow. It'd eventually disappear, once the "y" gets large > enough, ofc. The problem is, I think, that there's no particular reason that someone reading the existing code should imagine that it might have been done in that "mathematically equivalent" fashion. I imagined that you were trying to make a point about adding the cache size to the request vs. adding nothing, whereas in reality you were trying to make a point about adding from one side vs. subtracting from the other. > > + * We reach here if the index only scan is not parallel, or if we're > > + * serially executing an index only scan that was planned to be > > + * parallel. > > > > Well, this seems sad. > > Stale comment, I believe. However, I didn't see much benefits with > parallel index scan during testing. Having I/O from multiple workers > generally had the same effect, I think. Fair point, likely worth mentioning explicitly in the comment. > Yeah. I renamed all the structs and functions to IndexPrefetchSomething, > to keep it consistent. And then the constants are all capital, ofc. It'd still be nice to get table or heap in there, IMHO, but maybe we can't, and consistency is certainly a good thing regardless of the details, so thanks for that. -- Robert Haas EDB: http://www.enterprisedb.com
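[A tiny standalone illustration of the underflow being discussed; this is an editor's example, with PREFETCH_CACHE_SIZE and the variable names used as placeholders, and uint64 being PostgreSQL's unsigned 64-bit typedef.]

    #define PREFETCH_CACHE_SIZE 1024

    static bool
    within_cache_window(uint64 x, uint64 y)
    {
        /*
         * If y < PREFETCH_CACHE_SIZE, the "mathematically equivalent" form
         *     x >= (y - PREFETCH_CACHE_SIZE)
         * wraps around to a huge unsigned value and gives the wrong answer,
         * so the constant is added on the left-hand side instead.
         */
        return (x + PREFETCH_CACHE_SIZE) >= y;
    }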
Re: Built-in CTYPE provider
On Thu, 2024-01-11 at 18:02 -0800, Jeff Davis wrote: > Jeremy also raised a problem with old versions of psql connecting to > a > new server: the \l and \dO won't work. Not sure exactly what to do > there, but I could work around it by adding a new field rather than > renaming (though that's not ideal). I did a bit of research for a precedent, and the closest I could find is that \dp was broken between 14 and 15 by commit 07eee5a0dc. That is some precedent, but it's more narrow. I think that might justify breaking \dO in older clients, but \l is used frequently. It seems safer to just introduce new columns "datbuiltinlocale" and "collbuiltinlocale". They'll be nullable anyway. If we want to clean this up we can do so as a separate commit. There are other parts of the catalog representation related to collation that we might want to clean up as well. Regards, Jeff Davis
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas wrote: > > On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman > wrote: > > Ah! I think you are right. Good catch. I could fix this with logic like > > this: > > > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > > if ((vacrel->nindexes == 0 && space_freed) || > > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) > > Perhaps that would be better written as space_freed || > (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point > you might not need to introduce the variable. As I revisited this, I realized why I had written this logic for updating the FSM: if (vacrel->nindexes == 0 || !vacrel->do_index_vacuuming || lpdead_items == 0) It is because in master, in lazy_scan_heap(), when there are no indexes and no space has been freed, we actually keep going and hit the other logic to update the FSM at the bottom of the loop: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) The effect is that if there are no indexes and no space is freed we always update the FSM. When combined, that means that if there are no indexes, we always update the FSM. With the logic you suggested, we fail a pageinspect test which expects the FSM to exist when it doesn't. So, adding the logic: if (space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming) || vacrel->nindexes == 0) We pass that pageinspect test. But, we fail a freespacemap test which expects some freespace reported which isn't: id| blkno | is_avail -+---+-- - freespace_tab | 0 | t + freespace_tab | 0 | f which I presume is because space_freed is not the same as !has_lpdead_items. I can't really see what logic would be right here. > > I think I made this mistake when working on a different version of > > this that combined the prune and no prune cases. I noticed that > > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > > wonder why this is (and why we don't do it in the prune case). > > Yeah, this all seems distinctly under-commented. As noted above, this > code has grown organically, and some things just never got written > down. > > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. All of this makes me wonder why we would update the FSM in vacuum when no space was freed. I thought that RelationAddBlocks() and RelationGetBufferForTuple() would handle making sure the FSM gets created and updated if the reason there is freespace is because the page hasn't been filled yet. - Melanie
Re: Custom explain options
On 10/21/23 14:16, Konstantin Knizhnik wrote: > Hi hackers, > > EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, > COST,...) which help to provide useful details of query execution. > In Neon we have added PREFETCH option which shows information about page > prefetching during query execution (prefetching is more critical for Neon > architecture because of separation of compute and storage, so it is > implemented not only for bitmap heap scan as in Vanilla Postgres, but > also for seqscan, indexscan and indexonly scan). Another possible > candidate for explain options is local file cache (extra caching layer > above shared buffers which is used to somehow replace file system cache > in standalone Postgres). Not quite related to this patch about EXPLAIN options, but can you share some details how you implemented prefetching for the other nodes? I'm asking because I've been working on prefetching for index scans, so I'm wondering if there's a better way to do this, or how to do it in a way that would allow neon to maybe leverage that too. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: psql not responding to SIGINT upon db reconnection
On Fri Jan 12, 2024 at 10:45 AM CST, Robert Haas wrote: On Mon, Jan 8, 2024 at 1:03 AM Tristan Partin wrote: > I think the way to go is to expose some variation of libpq's > pqSocketPoll(), which I would be happy to put together a patch for. > Making frontends, psql in this case, have to reimplement the polling > logic doesn't strike me as fruitful, which is essentially what I have > done. I encourage further exploration of this line of attack. I fear that if I were to commit something like what you've posted up until now, people would complain that that code was too ugly to live, and I'd have a hard time telling them that they're wrong. Completely agree. Let me look into this. Perhaps I can get something up next week or the week after. -- Tristan Partin Neon (https://neon.tech)
Re: introduce dynamic shared memory registry
Here's a new version of the patch set in which I've attempted to address the feedback in this thread. Note that 0001 is being tracked in a separate thread [0], but it is basically a prerequisite for adding the documentation for this feature, so that's why I've also added it here. [0] https://postgr.es/m/20240112041430.GA3557928%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v6 1/3] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..0ba52b41d4 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend sould obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... +ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. 
Like with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks) - from your shmem_request_hook. This will ensure that an array of - num_lwlocks LWLocks is available under the name - tranche_name. Use GetNamedLWLockTranche - to get a pointer to this array. - - - An example of a shmem_request_hook can be found in - contrib/pg_stat_statements/pg_stat_statements.c in the - PostgreSQL source tree. - - - There is another, more flexible method of obtaining LWLocks. First, - allocate a tranche_id from a shared counter by - calling: + This ensures that an array of num_lwlocks LWLocks is + available under the name tranche_name. A pointer to + this array can be obtained by calling: -int LWLockNewTrancheId(void) +LWLockPadded *GetNamedLWLockTranche(const char *tranche_name) - Next, each individual process using the tranche_id - should associate it with
Re: Report planning memory in EXPLAIN ANALYZE
At 2024-01-12 17:52:27 +0100, alvhe...@alvh.no-ip.org wrote: > > I think this patch is mostly OK (After the last few rounds of changes, it looks fine to me too.) > Planning: >Buffers: shared hit=120 read=30 >Memory: used=67944 bytes allocated=73728 bytes > Planning Time: 0.892 ms > > […] > > Or we could leave it as you have it, but to me that's akin to giving up > on doing it nicely. For completeness, there's a third option, which is easier to write and a bit more friendly to the sort of thing that might already be parsing "Planning Time", viz., Planning Buffers: shared hit=120 read=30 Planning Memory: used=67944 bytes allocated=73728 bytes Planning Time: 0.892 ms (Those "bytes" look slightly odd to me in the midst of all the x=y pieces, but that's probably not worth thinking about.) -- Abhijit
Re: reorganize "Shared Memory and LWLocks" section of docs
On Fri, Jan 12, 2024 at 09:46:50AM -0600, Nathan Bossart wrote: > On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote: >> """ >> Any registered shmem_startup_hook will be executed shortly after each >> backend attaches to shared memory. >> """ >> >> IMO the word "each" here can give the wrong impression as if there are >> certain guarantees about synchronization between backends. Maybe we >> should change this to simply "... will be executed shortly after >> [the?] backend attaches..." > > I see what you mean, but I don't think the problem is the word "each." I > think the problem is the use of passive voice. What do you think about > something like > > Each backend will execute the registered shmem_startup_hook shortly > after it attaches to shared memory. > >> """ >> should ensure that only one process allocates a new tranche_id >> (LWLockNewTrancheId) and initializes each new LWLock >> (LWLockInitialize). >> """ >> >> Personally I think that reminding the corresponding function name here >> is redundant and complicates reading just a bit. But maybe it's just >> me. > > Yeah, I waffled on this one. I don't mind removing it. Here is a new version of the patch with these changes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v2 1/1] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..0ba52b41d4 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend sould obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... 
+ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. Like with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, in
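[To complement the examples in the patch above, here is the _PG_init() boilerplate that typically goes with these hooks, modeled loosely on pg_stat_statements. This is a sketch only; the my_* names and MyShmemState are invented.]

    typedef struct MyShmemState
    {
        int     something;      /* extension-defined shared state goes here */
    } MyShmemState;

    static shmem_request_hook_type prev_shmem_request_hook = NULL;
    static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

    static void
    my_shmem_request(void)
    {
        if (prev_shmem_request_hook)
            prev_shmem_request_hook();

        RequestAddinShmemSpace(sizeof(MyShmemState));
        RequestNamedLWLockTranche("my tranche name", 1);
    }

    static void
    my_shmem_startup(void)
    {
        if (prev_shmem_startup_hook)
            prev_shmem_startup_hook();

        /* ShmemInitStruct() under AddinShmemInitLock, as shown in the docs */
    }

    void
    _PG_init(void)
    {
        if (!process_shared_preload_libraries_in_progress)
            return;

        prev_shmem_request_hook = shmem_request_hook;
        shmem_request_hook = my_shmem_request;
        prev_shmem_startup_hook = shmem_startup_hook;
        shmem_startup_hook = my_shmem_startup;
    }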
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan wrote: > > On Fri, Jan 12, 2024 at 9:44 AM Robert Haas wrote: > > Looking through the git history, I see that this behavior seems to > > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > > lazy_scan_noprune(). The comments don't explain the reasoning, but my > > guess is that it was just an accident. It's not entirely evident to me > > whether there might ever be good reasons to update the freespace map > > for a page where we haven't freed up any space -- after all, the free > > space map isn't crash-safe, so you could always postulate that > > updating it will correct an existing inaccuracy. But I really doubt > > that there's any good reason for lazy_scan_prune() and > > lazy_scan_noprune() to have different ideas about whether to update > > the FSM or not, especially in an obscure corner case like this. > > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? > > Barring certain failsafe edge cases, we always call > PageGetHeapFreeSpace() exactly once for each scanned page. While it's > true that we won't always do that in the first heap pass (in cases > where VACUUM has indexes), that's only because we expect to do it in > the second heap pass instead -- since we only want to do it once. > > It's true that lazy_scan_noprune unconditionally calls > PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same > behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases > we know that there won't be any second heap pass, and so in both cases > we always call PageGetHeapFreeSpace() in the first heap pass. It's > just that it's a bit harder to see that in the lazy_scan_prune case. > No? So, I think this is the logic in master: Prune case, first pass - indexes == 0 && space_freed -> update FSM - indexes == 0 && (!space_freed || !index_vacuuming) -> update FSM - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM which reduces to: - indexes == 0 || !space_freed || !index_vacuuming No Prune (except aggressive vacuum), first pass: - indexes == 0 -> update FSM - indexes > 0 && !space_freed -> update FSM which reduces to: - indexes == 0 || !space_freed which is, during the first pass, if we will not do index vacuuming, either because we have no indexes, because there is nothing to vacuum, or because do_index_vacuuming is false, make sure we update the freespace map now. but what about no prune when do_index_vacuuming is false? I still don't understand why vacuum is responsible for updating the FSM per page when no line pointers have been set unused. That is how PageGetFreeSpace() figures out if there is free space, right? - Melanie
Re: introduce dynamic shared memory registry
At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote: > > From: Nathan Bossart > Date: Thu, 11 Jan 2024 21:55:25 -0600 > Subject: [PATCH v6 1/3] reorganize shared memory and lwlocks documentation > > --- > doc/src/sgml/xfunc.sgml | 182 +--- > 1 file changed, 114 insertions(+), 68 deletions(-) > > diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml > index 89116ae74c..0ba52b41d4 100644 > --- a/doc/src/sgml/xfunc.sgml > +++ b/doc/src/sgml/xfunc.sgml > @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS > anyarray > > > […] > - from your shmem_request_hook. > - > - > - LWLocks are reserved by calling: > + Each backend sould obtain a pointer to the reserved shared memory by sould → should > + Add-ins can reserve LWLocks on server startup. Like with shared > memory, (Would "As with shared memory" read better? Maybe, but then again maybe it should be left alone because you also write "Unlike with" elsewhere.) -- Abhijit
Re: Built-in CTYPE provider
Jeff Davis wrote: > > Jeremy also raised a problem with old versions of psql connecting to > > a > > new server: the \l and \dO won't work. Not sure exactly what to do > > there, but I could work around it by adding a new field rather than > > renaming (though that's not ideal). > > I did a bit of research for a precedent, and the closest I could find > is that \dp was broken between 14 and 15 by commit 07eee5a0dc. Another one is that version 12 broke \d in older psql by removing pg_class.relhasoids. ISTM that in general the behavior of old psql vs new server does not weight much against choosing optimal catalog changes. There's also that warning at start informing users about it: WARNING: psql major version X, server major version Y. Some psql features might not work. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman wrote: > So, I think this is the logic in master: > > Prune case, first pass > > ... > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM What is "space_freed"? Isn't that something from your uncommitted patch? As I said, the aim is to call PageGetHeapFreeSpace() (*not* PageGetFreeSpace(), which is only used for index pages) exactly once per heap page scanned. This is supposed to happen independently of whatever specific work was/will be required for the heap page. In general, we don't ever trust that the FSM is already up-to-date. Presumably because the FSM isn't crash safe. On master, prunestate.has_lpdead_items may be set true when our VACUUM wasn't actually the thing that performed pruning that freed tuple storage -- typically when some other backend was the one that did all required pruning at some earlier point in time, often via opportunistic pruning. For better or worse, the only thing that VACUUM aims to do is make sure that PageGetHeapFreeSpace() gets called exactly once per scanned page. > which is, during the first pass, if we will not do index vacuuming, > either because we have no indexes, because there is nothing to vacuum, > or because do_index_vacuuming is false, make sure we update the > freespace map now. > > but what about no prune when do_index_vacuuming is false? Note that do_index_vacuuming cannot actually affect the "nindexes == 0" case. We have an assertion that's kinda relevant to this point: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) { /* * Wait until lazy_vacuum_heap_rel() to save free space. * */ Assert(vacrel->nindexes > 0); UnlockReleaseBuffer(buf); } else { } We should never get this far down in the lazy_scan_heap() loop when "nindexes == 0 && prunestate.has_lpdead_items". That's handled in the special "single heap pass" branch after lazy_scan_prune(). As for the case where we use lazy_scan_noprune() for a "nindexes == 0" VACUUM's heap page, we also can't get this far down. (Actually, lazy_scan_noprune lies if it has to in order to make sure that lazy_scan_heap never has to deal with a "nindexes == 0" VACUUM with LP_DEAD items from a no-cleanup-lock page. This is a kludge -- see comments about how this is "dishonest" inside lazy_scan_noprune().) > I still don't understand why vacuum is responsible for updating the > FSM per page when no line pointers have been set unused. That is how > PageGetFreeSpace() figures out if there is free space, right? You mean PageGetHeapFreeSpace? Not really. (Though even pruning can set line pointers unused, or heap-only tuples.) Even if pruning doesn't happen in VACUUM, that doesn't mean that the FSM is up-to-date. In short, we do these things with the free space map because it is a map of free space (which isn't crash safe) -- nothing more. I happen to agree that that general design has a lot of problems, but those seem out of scope here. -- Peter Geoghegan
Re: Built-in CTYPE provider
On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote: > ISTM that in general the behavior of old psql vs new server does > not weight much against choosing optimal catalog changes. +1. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
On Fri, 2024-01-12 at 19:00 +0100, Daniel Verite wrote: > Another one is that version 12 broke \d in older psql by > removing pg_class.relhasoids. > > ISTM that in general the behavior of old psql vs new server does > not weigh much against choosing optimal catalog changes. > > There's also that warning at startup informing users about it: > WARNING: psql major version X, server major version Y. > Some psql features might not work. Good point, I'll leave it as-is then. If someone complains I can rework it. Also, the output of \l changes from version to version, so if there are automated tools processing the output then they'd have to change anyway. Regards, Jeff Davis
Re: speed up a logical replica setup
Hi, On Fri, 12 Jan 2024 at 09:32, Hayato Kuroda (Fujitsu) wrote: > > Good, all warnings were removed. However, the patch failed to pass tests on > FreeBSD twice. > I'm quite not sure the ERROR, but is it related with us? The FreeBSD errors started after FreeBSD's CI image was updated [1]. I do not think the error is related to this. [1] https://cirrus-ci.com/task/4700394639589376 -- Regards, Nazir Bilal Yavuz Microsoft
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan wrote: > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? When lazy_scan_prune() is called, we call RecordPageWithFreeSpace if vacrel->nindexes == 0 && prunestate.has_lpdead_items. See the code near the comment that begins "Consider the need to do page-at-a-time heap vacuuming". When lazy_scan_noprune() is called, whether we call RecordPageWithFreeSpace depends on the output parameter recordfreespace. That is set to true whenever vacrel->nindexes == 0 || lpdead_items == 0. See the code near the comment that begins "Save any LP_DEAD items". The case where I thought there was a behavior difference is when vacrel->nindexes == 0 and lpdead_items == 0 and thus prunestate.has_lpdead_items is false. But now I see (from Melanie's email) that this isn't really true, because in that case we fall through to the logic that we use when indexes are present, giving us a second chance to call RecordPageWithFreeSpace(), which we take when (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) comes out false, as it always does in the scenario that I postulated. P.S. to Melanie: I'll respond to your further emails next but I wanted to respond to this one from Peter first so he didn't think I was rudely ignoring him. :-) P.P.S. to everyone: Yikes, this logic is really confusing. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? Yes, I was mixing the two together. I just want to make sure that we agree that, on master, when lazy_scan_prune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items || !index_vacuuming and when lazy_scan_noprune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items Those seem different to me. - Melanie
Re: Confine vacuum skip logic to lazy_scan_skip
On 1/11/24 5:50 PM, Melanie Plageman wrote: On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz wrote: On Fri, 5 Jan 2024 at 02:25, Jim Nasby wrote: On 1/4/24 2:23 PM, Andres Freund wrote: On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var nskippable_blocks I think these may lead to worse code - the compiler has to reload vacrel->rel_pages/next_unskippable_block for every loop iteration, because it can't guarantee that they're not changed within one of the external functions called in the loop body. Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go away in the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of "here's the next X number of blocks that we need to vacuum"? Sorry if I misunderstood. AFAIU, with the help of the vectored IO; "the next X number of blocks that need to be vacuumed" will be prefetched by calculating the unskippable blocks ( using the lazy_scan_skip() function ) and the X will be determined by Postgres itself. Do you have something different in your mind? I think you are both right. As we gain more control of readahead from within Postgres, we will likely want to revisit this heuristic as it may not serve us anymore. But the streaming read interface/vectored I/O is also not a drop-in replacement for it. To change anything and ensure there is no regression, we will probably have to do cross-platform benchmarking, though. That being said, I would absolutely love to get rid of the skippable ranges because I find them very error-prone and confusing. Hopefully now that the skipping logic is isolated to a single function, it will be easier not to trip over it when working on lazy_scan_heap(). Yeah, arguably it's just a matter of semantics, but IMO it's a lot clearer to simply think in terms of "here's the next blocks we know we want to vacuum" instead of "we vacuum everything, but sometimes we skip some blocks". -- Jim Nasby, Data Architect, Austin TX
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? > > As I said, the aim is to call PageGetHeapFreeSpace() (*not* > PageGetFreeSpace(), which is only used for index pages) exactly once > per heap page scanned. This is supposed to happen independently of > whatever specific work was/will be required for the heap page. In > general, we don't ever trust that the FSM is already up-to-date. > Presumably because the FSM isn't crash safe. > > On master, prunestate.has_lpdead_items may be set true when our VACUUM > wasn't actually the thing that performed pruning that freed tuple > storage -- typically when some other backend was the one that did all > required pruning at some earlier point in time, often via > opportunistic pruning. For better or worse, the only thing that VACUUM > aims to do is make sure that PageGetHeapFreeSpace() gets called > exactly once per scanned page. ... > > I still don't understand why vacuum is responsible for updating the > > FSM per page when no line pointers have been set unused. That is how > > PageGetFreeSpace() figures out if there is free space, right? > > You mean PageGetHeapFreeSpace? Not really. (Though even pruning can > set line pointers unused, or heap-only tuples.) > > Even if pruning doesn't happen in VACUUM, that doesn't mean that the > FSM is up-to-date. > > In short, we do these things with the free space map because it is a > map of free space (which isn't crash safe) -- nothing more. I happen > to agree that that general design has a lot of problems, but those > seem out of scope here. So, there are 3 issues I am trying to understand: 1) How often should vacuum update the FSM (not vacuum as in the second pass but vacuum as in the whole thing that is happening in lazy_scan_heap())? 2) What is the exact logic in master that ensures that vacuum implements the cadence in 1)? 3) How can the logic in 2) be replicated exactly in my patch that sets would-be dead items LP_UNUSED during pruning? From what Peter is saying, I think 1) is decided and is once per page (across all passes). For 2), see my previous email. And for 3), TBD until 2) is agreed upon. - Melanie
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Jan 12, 2024 at 2:02 PM Jim Nasby wrote: > > On 1/11/24 5:50 PM, Melanie Plageman wrote: > > On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz wrote: > >> > >> On Fri, 5 Jan 2024 at 02:25, Jim Nasby wrote: > >>> > >>> On 1/4/24 2:23 PM, Andres Freund wrote: > >>> > >>> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: > >>> > >>> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var > >>> rel_pages > >>> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var > >>> nskippable_blocks > >>> > >>> I think these may lead to worse code - the compiler has to reload > >>> vacrel->rel_pages/next_unskippable_block for every loop iteration, > >>> because it > >>> can't guarantee that they're not changed within one of the external > >>> functions > >>> called in the loop body. > >>> > >>> Admittedly I'm not up to speed on recent vacuum changes, but I have to > >>> wonder if the concept of skipping should go away in the context of vector > >>> IO? Instead of thinking about "we can skip this range of blocks", why not > >>> maintain a list of "here's the next X number of blocks that we need to > >>> vacuum"? > >> > >> Sorry if I misunderstood. AFAIU, with the help of the vectored IO; > >> "the next X number of blocks that need to be vacuumed" will be > >> prefetched by calculating the unskippable blocks ( using the > >> lazy_scan_skip() function ) and the X will be determined by Postgres > >> itself. Do you have something different in your mind? > > > > I think you are both right. As we gain more control of readahead from > > within Postgres, we will likely want to revisit this heuristic as it > > may not serve us anymore. But the streaming read interface/vectored > > I/O is also not a drop-in replacement for it. To change anything and > > ensure there is no regression, we will probably have to do > > cross-platform benchmarking, though. > > > > That being said, I would absolutely love to get rid of the skippable > > ranges because I find them very error-prone and confusing. Hopefully > > now that the skipping logic is isolated to a single function, it will > > be easier not to trip over it when working on lazy_scan_heap(). > > Yeah, arguably it's just a matter of semantics, but IMO it's a lot > clearer to simply think in terms of "here's the next blocks we know we > want to vacuum" instead of "we vacuum everything, but sometimes we skip > some blocks". Even "we vacuum some stuff, but sometimes we skip some blocks" would be okay. What we have now is "we vacuum some stuff, but sometimes we skip some blocks, but only if we would skip enough blocks, and, when we decide to do that we can't go back and actually get visibility information for those blocks we skipped because we are too cheap" - Melanie
Re: plpgsql memory leaks
On Fri, Jan 12, 2024 at 2:53 PM, Michael Banck wrote: > Hi, > > On Fri, Jan 12, 2024 at 01:35:24PM +0100, Pavel Stehule wrote: > > On Fri, Jan 12, 2024 at 11:54 AM, Michael Banck wrote: > > > Which version of Postgres is this and on which platform/distribution? > > > > It was tested on master branch (pg 17) on Fedora 39 > > > > > Did you try keeping jit on but setting jit_inline_above_cost to 0? > > > > > > The back-branches have a fix for the above case, i.e. llvmjit memleaks > > > that can be worked around by setting jit_inline_above_cost=0. > > I got that wrong, it needs to be -1 to disable it. > > But if you are already running the master branch, it is probably a > separate issue. > I tested this code: CREATE OR REPLACE FUNCTION public.fx(iter integer) RETURNS void LANGUAGE plpgsql AS $function$ declare c cursor(m bigint) for select distinct i from generate_series(1, m) g(i); t bigint; s bigint; begin for i in 1..iter loop s := 0; for r in c(i*1) loop s := s + r.i; end loop; raise notice '%=%', i, s; end loop; end; $function$ default master branch - res 190MB ram jit_inline_above_cost = -1 doesn't help disabling JIT doesn't help either, so it looks like the wrong hypothesis, and the problem is maybe somewhere else :-/ Regards Pavel > > > Michael >
Re: Custom explain options
On 12/01/2024 7:03 pm, Tomas Vondra wrote: On 10/21/23 14:16, Konstantin Knizhnik wrote: Hi hackers, EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which help to provide useful details of query execution. In Neon we have added PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for Neon architecture because of separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is local file cache (extra caching layer above shared buffers which is used to somehow replace file system cache in standalone Postgres). Not quite related to this patch about EXPLAIN options, but can you share some details of how you implemented prefetching for the other nodes? I'm asking because I've been working on prefetching for index scans, so I'm wondering if there's a better way to do this, or how to do it in a way that would allow neon to maybe leverage that too. regards Yes, I am looking at your PR. What we have implemented in Neon is more specific to the Neon architecture, where storage is separated from compute. So each page not found in shared buffers has to be downloaded from the page server. It adds quite noticeable latency because of the network roundtrip. While vanilla Postgres can rely on the OS file system cache when a page is not found in shared buffers (access to the OS file cache is certainly slower than to shared buffers because of the syscall and copying of the page, but the performance penalty is not very large - less than 15%), Neon has no local files and so has to send a request over the socket. This is why we have to perform aggressive prefetching whenever it is possible (when it is possible to predict the order of subsequent pages). Unlike vanilla Postgres, which implements prefetch only for bitmap heap scan, we have implemented it for seqscan, index scan, indexonly scan, bitmap heap scan, vacuum, pg_prewarm. The main difference between Neon prefetch and vanilla Postgres prefetch is that the first one is backend-specific. So each backend prefetches only pages which it needs. This is why we have to rewrite prefetch for bitmap heap scan, which uses `fadvise` and assumes that pages prefetched into the file cache by one backend can be used by any other backend. Concerning index scans we have implemented two different approaches: for index only scan we try to prefetch leaf pages, and for index scan we prefetch referenced heap pages. In both cases we start from prefetch distance 0 and increase it until it reaches `effective_io_concurrency` for this relation. In doing so we try to avoid prefetching useless pages and slowing down "point" lookups that return one or a few records. If you are interested, you can look at our implementation in the neon repo: all sources are available. But briefly speaking, each backend has its own prefetch ring (prefetch requests which are waiting for a response). The key idea is that we can send several prefetch requests to the page server and then receive multiple replies. It allows increasing the speed of OLAP queries up to 10 times. Heikki thinks that prefetch can be somehow combined with the async-io proposal (based on io_uring). But right now they have nothing in common.
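A rough sketch of the per-backend prefetch ring with an adaptive distance described above might look like the following. This is purely illustrative C (the names and structure are assumptions, not Neon's actual code); in stock PostgreSQL the closest analogue of sending a request to the page server is PrefetchBuffer().

    /* Illustrative sketch only -- not Neon's implementation. */
    #define PREFETCH_RING_SIZE 128

    typedef struct PrefetchRing
    {
        BlockNumber blocks[PREFETCH_RING_SIZE]; /* in-flight prefetch requests */
        int         head;           /* oldest outstanding request */
        int         tail;           /* next slot to fill */
        int         distance;       /* current prefetch distance */
        int         max_distance;   /* capped at effective_io_concurrency */
    } PrefetchRing;

    static void
    prefetch_ahead(PrefetchRing *ring, Relation rel, BlockNumber next_blkno)
    {
        /* Ramp up gradually so "point" lookups don't prefetch uselessly. */
        if (ring->distance < ring->max_distance)
            ring->distance++;

        /* Stay 'distance' blocks ahead of the scan position. */
        while (ring->tail - ring->head < ring->distance &&
               ring->tail - ring->head < PREFETCH_RING_SIZE)
        {
            /* Neon would send a request to the page server socket here. */
            PrefetchBuffer(rel, MAIN_FORKNUM, next_blkno);
            ring->blocks[ring->tail % PREFETCH_RING_SIZE] = next_blkno;
            ring->tail++;
            next_blkno++;
        }
    }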
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman wrote: > Yes, I was mixing the two together. > > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. This analysis seems correct to me, except that "when lazy_scan_noprune() is called" should really say "when lazy_scan_noprune() is called (and returns true)", because when it returns false we fall through and call lazy_scan_prune() afterwards. Here's a draft patch to clean up the inconsistency here. It also gets rid of recordfreespace, because ISTM that recordfreespace is adding to the confusion here rather than helping anything. -- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Be-more-consistent-about-whether-to-update-the-FS.patch.nocfbot Description: Binary data
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 2:32 PM Robert Haas wrote: > On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman > wrote: > This analysis seems correct to me, except that "when > lazy_scan_noprune() is called" should really say "when > lazy_scan_noprune() is called (and returns true)", because when it > returns false we fall through and call lazy_scan_prune() afterwards. Now that I see your patch, I understand what Melanie must have meant. I agree that there is a small inconsistency here, that we could well do without. In general I am in favor of religiously eliminating such inconsistencies (between lazy_scan_prune and lazy_scan_noprune), unless there is a reason not to. Not because it's necessarily important. More because it's just too hard to be sure whether it might matter. It's usually easier to defensively assume that it matters. > Here's a draft patch to clean up the inconsistency here. It also gets > rid of recordfreespace, because ISTM that recordfreespace is adding to > the confusion here rather than helping anything. You're using "!prunestate.has_lpdead_items" as part of your test that sets "recordfreespace". But lazy_scan_noprune doesn't get passed a pointer to prunestate, so clearly you'll need to detect the same condition some other way. -- Peter Geoghegan
Re: introduce dynamic shared memory registry
On Fri, Jan 12, 2024 at 11:13:46PM +0530, Abhijit Menon-Sen wrote: > At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote: >> + Each backend sould obtain a pointer to the reserved shared memory by > > sould → should D'oh. Thanks. >> + Add-ins can reserve LWLocks on server startup. Like with shared >> memory, > > (Would "As with shared memory" read better? Maybe, but then again maybe > it should be left alone because you also write "Unlike with" elsewhere.) I think "As with shared memory..." sounds better here. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan wrote: > You're using "!prunestate.has_lpdead_items" as part of your test that > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > pointer to prunestate, so clearly you'll need to detect the same > condition some other way. OOPS. Thanks. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Extensible storage manager API - SMGR hook Redux
Thought I would show off what is possible with this patchset. Heikki, a couple of months ago in our internal Slack, said: [I would like] a debugging tool that checks that we're not missing any fsyncs. I bumped into a few missing fsync bugs with unlogged tables lately and a tool like that would've been very helpful. My task was to create such a tool, and off I went. I started with the storage manager extension patch that Matthias sent to the list last year[0]. Andres, in another thread[1], said: I've been thinking that we need a validation layer for fsyncs, it's too hard to get right without testing, and crash testing is not likel enough to catch problems quickly / resource intensive. My thought was that we could keep a shared hash table of all files created / dirtied at the fd layer, with the filename as key and the value the current LSN. We'd delete files from it when they're fsynced. At checkpoints we go through the list and see if there's any files from before the redo that aren't yet fsynced. All of this in assert builds only, of course. I took this idea and ran with it. I call it the fsync_checker™️. It is an extension that prints relations that haven't been fsynced prior to a CHECKPOINT. Note that this idea doesn't work in practice because relations might not be fsynced, but they might be WAL-logged, like in the case of createdb. See log_smgrcreate(). I can't think of an easy way to solve this problem looking at the codebase as it stands. Here is a description of the patches: 0001: This is essentially just the patch that Matthias posted earlier, but rebased and adjusted a little bit so storage managers can "inherit" from other storage managers. 0002: This is an extension of 0001, which allows for extensions to set a global storage manager. This is pretty hacky, and if it was going to be pulled into mainline, it would need some better protection. For instance, only one extension should be able to set the global storage manager. We wouldn't want extensions stepping over each other, etc. 0003: Adds a hook for extensions to inspect a checkpoint before it actually occurs. The purpose for the fsync_checker is so that I can iterate over all the relations the extension tracks to find files that haven't been synced prior to the completion of the checkpoint. 0004: This is the actual fsync_checker extension itself. It must be preloaded. Hopefully this is a good illustration of how the initial patch could be used, even though it isn't perfect. [0]: https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com [1]: https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de -- Tristan Partin Neon (https://neon.tech) From 5ffbc7c35bb3248501b2517d26f99afe02fb53d6 Mon Sep 17 00:00:00 2001 From: Matthias van de Meent Date: Tue, 27 Jun 2023 15:59:23 +0200 Subject: [PATCH v1 1/5] Expose f_smgr to extensions for manual implementation There are various reasons why one would want to create their own implementation of a storage manager, among which are block-level compression, encryption and offloading to cold storage. This patch is a first patch that allows extensions to register their own SMgr. Note, however, that this SMgr is not yet used - only the first SMgr to register is used, and this is currently the md.c smgr. Future commits will include facilities to select an SMgr for each tablespace. 
--- src/backend/postmaster/postmaster.c | 5 + src/backend/storage/smgr/md.c | 172 +++- src/backend/storage/smgr/smgr.c | 129 ++--- src/backend/utils/init/miscinit.c | 13 +++ src/include/miscadmin.h | 1 + src/include/storage/md.h| 4 + src/include/storage/smgr.h | 59 -- 7 files changed, 252 insertions(+), 131 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index feb471dd1d..a0e46fe1f2 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1010,6 +1010,11 @@ PostmasterMain(int argc, char *argv[]) */ ApplyLauncherRegister(); + /* + * Register built-in managers that are not part of static arrays + */ + register_builtin_dynamic_managers(); + /* * process any libraries that should be preloaded at postmaster start */ diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index b1e9932a29..66a93101ab 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -87,6 +87,21 @@ typedef struct _MdfdVec } MdfdVec; static MemoryContext MdCxt; /* context for all MdfdVec objects */ +SMgrId MdSMgrId; + +typedef struct MdSMgrRelationData +{ + /* parent data */ + SMgrRelationData reln; + /* + * for md.c; per-fork arrays of the number of open segments + * (md_num_open_segs) and the segments themselves (md_seg_fds). + */ + int md_num_open_segs[MAX_FORKNUM + 1]
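The bookkeeping idea described in the message above (Andres's shared hash table of created/dirtied files, keyed by path with the current LSN as value) could be sketched roughly as follows. This is a hypothetical illustration, not the fsync_checker code from the attached patches; a real implementation would keep the table in shared memory and hook it into the fd/smgr layer.

    /* Hypothetical sketch only -- not the fsync_checker extension itself. */
    #include "postgres.h"
    #include "access/xlog.h"
    #include "utils/hsearch.h"

    typedef struct PendingSyncEntry
    {
        char        path[MAXPGPATH];    /* key: file path */
        XLogRecPtr  dirtied_lsn;        /* insert LSN when last dirtied */
    } PendingSyncEntry;

    static HTAB *pending_syncs;

    static void
    pending_syncs_init(void)
    {
        HASHCTL     ctl;

        ctl.keysize = MAXPGPATH;
        ctl.entrysize = sizeof(PendingSyncEntry);
        pending_syncs = hash_create("pending fsyncs", 1024, &ctl,
                                    HASH_ELEM | HASH_STRINGS);
    }

    /* Call whenever a file is created or dirtied. */
    static void
    track_dirty(const char *path)
    {
        PendingSyncEntry *entry;

        entry = hash_search(pending_syncs, path, HASH_ENTER, NULL);
        entry->dirtied_lsn = GetXLogInsertRecPtr();
    }

    /* Call whenever a file is fsync'd. */
    static void
    track_fsync(const char *path)
    {
        (void) hash_search(pending_syncs, path, HASH_REMOVE, NULL);
    }

    /* At checkpoint time, complain about files dirtied before the redo LSN. */
    static void
    report_unsynced(XLogRecPtr redo_lsn)
    {
        HASH_SEQ_STATUS seq;
        PendingSyncEntry *entry;

        hash_seq_init(&seq, pending_syncs);
        while ((entry = hash_seq_search(&seq)) != NULL)
        {
            if (entry->dirtied_lsn < redo_lsn)
                elog(WARNING, "file \"%s\" dirtied at %X/%X was never fsynced",
                     entry->path, LSN_FORMAT_ARGS(entry->dirtied_lsn));
        }
    }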
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman wrote: > On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan wrote: > > What is "space_freed"? Isn't that something from your uncommitted patch? > > Yes, I was mixing the two together. An understandable mistake. > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. Right. As I said to Robert just now, I can now see that they're slightly different conditions. FWIW my brain was just ignoring " || !index_vacuuming". I dismissed it as an edge-case, only relevant when the failsafe has kicked in. Which it is. But that's still no reason to allow an inconsistency that we can easily just avoid. -- Peter Geoghegan
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 2:47 PM Robert Haas wrote: > > On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan wrote: > > You're using "!prunestate.has_lpdead_items" as part of your test that > > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > > pointer to prunestate, so clearly you'll need to detect the same > > condition some other way. > > OOPS. Thanks. Also, I think you should combine these in lazy_scan_noprune() now /* Save any LP_DEAD items found on the page in dead_items array */ if (vacrel->nindexes == 0) { /* Using one-pass strategy (since table has no indexes) */ if (lpdead_items > 0) { since we don't set recordfreespace in the outer if statement anymore. And I noticed you missed a reference to the recordfreespace output parameter in the function comment above lazy_scan_noprune(). - Melanie
Re: Recovering from detoast-related catcache invalidations
Xiaoran Wang writes: >> Maybe, but that undocumented hack in SetHintBits seems completely >> unacceptable. Isn't there a cleaner way to make this check? > Maybe we don't need to call 'HeapTupleSatisfiesVisibility' to check if the > tuple has been deleted. > As the tuple's xmin must been committed, so we just need to check if its > xmax is committed, I'm not super thrilled with that. Something I realized last night is that your proposal only works if "ntp" is pointing directly into the catalog's disk buffers. If something earlier than this code had made a local-memory copy of the catalog tuple, then it's possible that its header fields (particularly xmax) are out of date compared to shared buffers and would fail to tell us that some other process just invalidated the tuple. Now in fact, with the current implementation of syscache_getnext() the result is actually a live tuple and so we can expect to see any relevant updates. But I think we'd better add some Asserts that that's so; and that also provides us with a way to call HeapTupleSatisfiesVisibility fully legally, because we can get the buffer reference out of the scan descriptor too. This is uncomfortably much in bed with the tuple table slot code, perhaps, but I don't see a way to do it more cleanly unless we want to add some new provisions to that API. Andres, do you have any thoughts about that? Anyway, this approach gets rid of false positives, which is great for performance and bad for testing. Code coverage says that now we never hit the failure paths during regression tests, which is unsurprising, but I'm not very comfortable with leaving those paths unexercised. I tried to make an isolation test to exercise them, but there's no good way at the SQL level to get a session to block during the detoast step. LOCK TABLE on some catalog's toast table would do, but we disallow it. I thought about adding a small C function to regress.so to take out such a lock, but we have no infrastructure for referencing regress.so from isolation tests. What I ended up doing is adding a random failure about 0.1% of the time in USE_ASSERT_CHECKING builds --- that's intellectually ugly for sure, but doing better seems like way more work than it's worth. 
regards, tom lane diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index c0591cffe5..b1ce7bf513 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -15,6 +15,7 @@ #include "postgres.h" #include "access/genam.h" +#include "access/heapam.h" #include "access/heaptoast.h" #include "access/relscan.h" #include "access/sysattr.h" @@ -24,8 +25,10 @@ #include "catalog/pg_operator.h" #include "catalog/pg_type.h" #include "common/hashfn.h" +#include "common/pg_prng.h" #include "miscadmin.h" #include "port/pg_bitutils.h" +#include "storage/bufmgr.h" #ifdef CATCACHE_STATS #include "storage/ipc.h" /* for on_proc_exit */ #endif @@ -90,10 +93,10 @@ static void CatCachePrintStats(int code, Datum arg); static void CatCacheRemoveCTup(CatCache *cache, CatCTup *ct); static void CatCacheRemoveCList(CatCache *cache, CatCList *cl); static void CatalogCacheInitializeCache(CatCache *cache); -static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, +static CatCTup *CatalogCacheCreateEntry(CatCache *cache, + HeapTuple ntp, SysScanDesc scandesc, Datum *arguments, - uint32 hashValue, Index hashIndex, - bool negative); + uint32 hashValue, Index hashIndex); static void ReleaseCatCacheWithOwner(HeapTuple tuple, ResourceOwner resowner); static void ReleaseCatCacheListWithOwner(CatCList *list, ResourceOwner resowner); @@ -1372,6 +1375,7 @@ SearchCatCacheMiss(CatCache *cache, SysScanDesc scandesc; HeapTuple ntp; CatCTup*ct; + bool stale; Datum arguments[CATCACHE_MAXKEYS]; /* Initialize local parameter array */ @@ -1380,16 +1384,6 @@ SearchCatCacheMiss(CatCache *cache, arguments[2] = v3; arguments[3] = v4; - /* - * Ok, need to make a lookup in the relation, copy the scankey and fill - * out any per-call fields. - */ - memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys); - cur_skey[0].sk_argument = v1; - cur_skey[1].sk_argument = v2; - cur_skey[2].sk_argument = v3; - cur_skey[3].sk_argument = v4; - /* * Tuple was not found in cache, so we have to try to retrieve it directly * from the relation. If found, we will add it to the cache; if not @@ -1404,9 +1398,28 @@ SearchCatCacheMiss(CatCache *cache, * will eventually age out of the cache, so there's no functional problem. * This case is rare enough that it's not worth expending extra cycles to * detect. + * + * Another case, which we *must* handle, is that the tuple could become + * outdated during CatalogCacheCreateEntry's attempt to detoast it (since + * AcceptInvalidationMessages can run during TOAST table access). We do + * not want to return alread
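For illustration, the assert-build-only random failure described in the message above could be injected roughly like this (a sketch of the idea, not the actual hunk from the attached patch, which uses the pg_prng facility):

    /* Sketch of injecting a rare artificial failure in assert-enabled builds,
     * so that the retry path gets exercised during regression tests. */
    #ifdef USE_ASSERT_CHECKING
        if (pg_prng_double(&pg_global_prng_state) < 0.001)
            return NULL;        /* pretend the tuple went stale under us */
    #endif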
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman wrote: > Also, I think you should combine these in lazy_scan_noprune() now > > /* Save any LP_DEAD items found on the page in dead_items array */ > if (vacrel->nindexes == 0) > { > /* Using one-pass strategy (since table has no indexes) */ > if (lpdead_items > 0) > { > > Since we don't set recordfreespace in the outer if statement anymore Well, maybe, but there's an else clause attached to the outer "if", so you have to be a bit careful. I didn't think it was critical to further rejigger this. > And I noticed you missed a reference to recordfreespace output > parameter in the function comment above lazy_scan_noprune(). OK. So what's the best way to solve the problem that Peter pointed out? Should we pass in the prunestate? Maybe just replace bool *recordfreespace with bool *has_lpdead_items? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Invalidate the subscription worker in cases where a user loses their superuser status
On Tue, 2023-10-17 at 14:17 +0530, Amit Kapila wrote: > > Fair enough. I'll wait till early next week (say till Monday EOD) > > to > > see if anyone thinks otherwise and push this patch to HEAD after > > some > > more testing and review. > > > > Pushed. There was a brief discussion on backporting this here: https://www.postgresql.org/message-id/CA%2BTgmob2mYpaUMT7aoFOukYTrpxt-WdgcitJnsjWhufnbDWEeg%40mail.gmail.com It looks like you opted not to backport it. Was there a reason for that? The only risky thing I see there is a change in the Subscription structure -- I suppose that could be used by extensions? (I am fine with it not being backported, but I was inclined to think it should be backported.) Regards, Jeff Davis
cfbot is failing all tests on FreeBSD/Meson builds
It looks like every recent cfbot run has failed in the FreeBSD-13-Meson build, even if it worked in other ones. The symptoms are failures in the TAP tests that try to use interactive_psql: Can't call method "slave" on an undefined value at /usr/local/lib/perl5/site_perl/IPC/Run.pm line 2889. I suspect that we are looking at some bug in IPC::Run that exists in the version that that FreeBSD release has (but not, seemingly, elsewhere in the community), and that was mostly harmless until c53859295 made all Perl warnings fatal. Not sure what we want to do about this. regards, tom lane
Re: libpq compression (part 3)
On Sun, Dec 31, 2023 at 2:32 AM Jacob Burroughs wrote: > I ended up reworking this to use a version of this option in place of > the `none` hackery, but naming the parameters `compress` and > `decompress, so to disable compression but allow decompression you > would specify `libpq_compression=gzip:compress=off`. I'm still a bit befuddled by this interface. libpq_compression=gzip:compress=off looks a lot like it's saying "do it, except don't". I guess that you're using "compress" and "decompress" to distinguish the two directions - i.e. server to client and client to server - but of course in both directions the sender compresses and the receiver decompresses, so I don't find that very clear. I wonder if we could use "upstream" and "downstream" to be clearer? Or some other terminology? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Recovering from detoast-related catcache invalidations
I wrote: > This is uncomfortably much in bed with the tuple table slot code, > perhaps, but I don't see a way to do it more cleanly unless we want > to add some new provisions to that API. Andres, do you have any > thoughts about that? Oh! After nosing around a bit more I remembered systable_recheck_tuple, which is meant for exactly this purpose. So v4 attached. regards, tom lane diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index c0591cffe5..0dcd45d2f0 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -24,6 +24,7 @@ #include "catalog/pg_operator.h" #include "catalog/pg_type.h" #include "common/hashfn.h" +#include "common/pg_prng.h" #include "miscadmin.h" #include "port/pg_bitutils.h" #ifdef CATCACHE_STATS @@ -90,10 +91,10 @@ static void CatCachePrintStats(int code, Datum arg); static void CatCacheRemoveCTup(CatCache *cache, CatCTup *ct); static void CatCacheRemoveCList(CatCache *cache, CatCList *cl); static void CatalogCacheInitializeCache(CatCache *cache); -static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, +static CatCTup *CatalogCacheCreateEntry(CatCache *cache, + HeapTuple ntp, SysScanDesc scandesc, Datum *arguments, - uint32 hashValue, Index hashIndex, - bool negative); + uint32 hashValue, Index hashIndex); static void ReleaseCatCacheWithOwner(HeapTuple tuple, ResourceOwner resowner); static void ReleaseCatCacheListWithOwner(CatCList *list, ResourceOwner resowner); @@ -1372,6 +1373,7 @@ SearchCatCacheMiss(CatCache *cache, SysScanDesc scandesc; HeapTuple ntp; CatCTup*ct; + bool stale; Datum arguments[CATCACHE_MAXKEYS]; /* Initialize local parameter array */ @@ -1380,16 +1382,6 @@ SearchCatCacheMiss(CatCache *cache, arguments[2] = v3; arguments[3] = v4; - /* - * Ok, need to make a lookup in the relation, copy the scankey and fill - * out any per-call fields. - */ - memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys); - cur_skey[0].sk_argument = v1; - cur_skey[1].sk_argument = v2; - cur_skey[2].sk_argument = v3; - cur_skey[3].sk_argument = v4; - /* * Tuple was not found in cache, so we have to try to retrieve it directly * from the relation. If found, we will add it to the cache; if not @@ -1404,9 +1396,28 @@ SearchCatCacheMiss(CatCache *cache, * will eventually age out of the cache, so there's no functional problem. * This case is rare enough that it's not worth expending extra cycles to * detect. + * + * Another case, which we *must* handle, is that the tuple could become + * outdated during CatalogCacheCreateEntry's attempt to detoast it (since + * AcceptInvalidationMessages can run during TOAST table access). We do + * not want to return already-stale catcache entries, so we loop around + * and do the table scan again if that happens. */ relation = table_open(cache->cc_reloid, AccessShareLock); + do + { + /* + * Ok, need to make a lookup in the relation, copy the scankey and + * fill out any per-call fields. (We must re-do this when retrying, + * because systable_beginscan scribbles on the scankey.) 
+ */ + memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys); + cur_skey[0].sk_argument = v1; + cur_skey[1].sk_argument = v2; + cur_skey[2].sk_argument = v3; + cur_skey[3].sk_argument = v4; + scandesc = systable_beginscan(relation, cache->cc_indexoid, IndexScanOK(cache, cur_skey), @@ -1415,12 +1426,18 @@ SearchCatCacheMiss(CatCache *cache, cur_skey); ct = NULL; + stale = false; while (HeapTupleIsValid(ntp = systable_getnext(scandesc))) { - ct = CatalogCacheCreateEntry(cache, ntp, arguments, - hashValue, hashIndex, - false); + ct = CatalogCacheCreateEntry(cache, ntp, scandesc, NULL, + hashValue, hashIndex); + /* upon failure, we must start the scan over */ + if (ct == NULL) + { + stale = true; + break; + } /* immediately set the refcount to 1 */ ResourceOwnerEnlarge(CurrentResourceOwner); ct->refcount++; @@ -1429,6 +1446,7 @@ SearchCatCacheMiss(CatCache *cache, } systable_endscan(scandesc); + } while (stale); table_close(relation, AccessShareLock); @@ -1447,9 +1465,11 @@ SearchCatCacheMiss(CatCache *cache, if (IsBootstrapProcessingMode()) return NULL; - ct = CatalogCacheCreateEntry(cache, NULL, arguments, - hashValue, hashIndex, - true); + ct = CatalogCacheCreateEntry(cache, NULL, NULL, arguments, + hashValue, hashIndex); + + /* Creating a negative cache entry shouldn't fail */ + Assert(ct != NULL); CACHE_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples", cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup); @@ -1663,7 +1683,8 @@ SearchCatCacheList(CatCache *cache, * We have to bump the member refcounts temporarily to ensure they won't * get dropped from the c
Re: Stack overflow issue
On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas wrote: > Here's one goto-free attempt. It adds a local loop to where the > recursion was, so that if you have a chain of subtransactions that need > to be aborted in CommitTransactionCommand, they are aborted iteratively. > The TBLOCK_SUBCOMMIT case already had such a loop. > > I added a couple of comments in the patch marked with "REVIEWER NOTE", > to explain why I changed some things. They are to be removed before > committing. > > I'm not sure if this is better than a goto. In fact, even if we commit > this, I think I'd still prefer to replace the remaining recursive calls > with a goto. Recursion feels a weird to me, when we're unwinding the > states from the stack as we go. I'm not able to quickly verify whether this version is correct, but I do think the code looks nicer this way. I understand that's a question of opinion rather than fact, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: libpq compression (part 3)
> I wonder if we could use "upstream" and "downstream" to be clearer? Or > some other terminology? What about `send` and `receive`?
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 3:22 PM Robert Haas wrote: > > On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman > wrote: > > So what's the best way to solve the problem that Peter pointed out? > Should we pass in the prunestate? Maybe just replace bool > *recordfreespace with bool *has_lpdead_items? Yea, that works for now. I mean, I think the way we should do it is to update the FSM in lazy_scan_noprune(), but, for the purposes of this patch, yes. The has_lpdead_items output parameter seems fine to me. - Melanie
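For clarity, the interface change being agreed on here might look roughly like this (a hypothetical sketch; the real lazy_scan_noprune() has additional parameters that are omitted, and the eventual committed shape may differ):

    /* Hypothetical sketch of the revised interface, parameter list abridged. */
    static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
                                  BlockNumber blkno, Page page,
                                  bool *has_lpdead_items);

    /* The caller would then apply the same FSM-update rule as the
     * lazy_scan_prune() path, instead of acting on a recordfreespace
     * output parameter. */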
Re: libpq compression (part 3)
On Fri, Jan 12, 2024 at 4:02 PM Jacob Burroughs wrote: > > I wonder if we could use "upstream" and "downstream" to be clearer? Or > > some other terminology? > > What about `send` and `receive`? I think that would definitely be better than "compress" and "decompress," but I was worried that it might be unclear to the user whether the parameter that they specified was from the point of view of the client or the server. Perhaps that's a dumb thing to worry about, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: plpgsql memory leaks
Pavel Stehule writes: > default master branch - res 190MB ram > jit_inline_above_cost = -1 doesn't help > disabling JIT doesn't help either, > > so it looks like the wrong hypothesis, and the problem is maybe somewhere > else :-/ I see no leak with these examples on HEAD, either with or without --enable-llvm --- the process size stays quite stable according to "top". I wonder if you are using some extension that's contributing to the problem. regards, tom lane
Re: Make NUM_XLOGINSERT_LOCKS configurable
On 1/12/24 12:32 AM, Bharath Rupireddy wrote: Test case: ./pgbench --initialize --scale=100 --username=ubuntu postgres ./pgbench --progress=10 --client=64 --time=300 --builtin=tpcb-like --username=ubuntu postgres Setup: ./configure --prefix=$PWD/inst/ CFLAGS="-ggdb3 -O3" > install.log && make -j 8 install > install.log 2>&1 & shared_buffers = '8GB' max_wal_size = '32GB' track_wal_io_timing = on Stats measured: I've used the attached patch to measure WAL Insert Lock Acquire Time (wal_insert_lock_acquire_time) and WAL Wait for In-progress Inserts to Finish Time (wal_wait_for_insert_to_finish_time). Unfortunately this leaves open the question of how frequently WaitXLogInsertionsToFinish() is being called and by whom. One possibility here is that wal_buffers is too small, so backends are constantly having to write WAL data to free up buffers. -- Jim Nasby, Data Architect, Austin TX
Re: [DOC] Add detail regarding resource consumption wrt max_connections
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed I think it is good to warn the user about the increased allocation of memory for certain parameters so that they do not abuse it by setting it to a huge number without knowing the consequences. It is true that max_connections can increase the size of the proc array and other resources, which are allocated in shared memory, which also means less memory available for regular data operations. I am sure this is not the only parameter that affects memory allocation. "max_prepared_xacts" can also affect shared memory allocation, so the same warning message applies here as well. Maybe there are other parameters with similar effects. Instead of stating that a higher max_connections results in higher allocation, it may be better to tell the user that if the value needs to be set much higher, consider increasing the "shared_buffers" setting as well. thank you --- Cary Huang Highgo Software Canada www.highgo.ca
Re: pg_upgrade failing for 200+ million Large Objects
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: > * I did not invent a switch to control the batching of blobs; it's > just hard-wired at 1000 blobs per group here. Probably we need some > user knob for that, but I'm unsure if we want to expose a count or > just a boolean for one vs more than one blob per batch. The point of > forcing one blob per batch would be to allow exact control during > selective restore, and I'm not sure if there's any value in random > other settings. On the other hand, selective restore of blobs has > been completely broken for the last dozen years and I can't recall any > user complaints about that; so maybe nobody cares and we could just > leave this as an internal choice. I think the argument for making this configurable is that folks might have fewer larger blobs, in which case it might make sense to lower the batch size, or they might have many smaller blobs, in which case they might want to increase it. But I'm a bit skeptical that will make any sort of tremendous difference in practice, and I'm not sure how a user would decide on the right value besides trial-and-error. In any case, at the moment I think it'd be okay to keep this an internal setting and wait for feedback from the field. > * As the patch stands, we still build a separate TOC entry for each > comment or seclabel or ACL attached to a blob. If you have a lot of > blobs with non-default properties then the TOC bloat problem comes > back again. We could do something about that, but it would take a bit > of tedious refactoring, and the most obvious way to handle it probably > re-introduces too-many-locks problems. Is this a scenario that's > worth spending a lot of time on? I'll ask around about this. > Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes. This one looked good to me. > Subject: [PATCH v9 2/4] In dumps, group large objects into matching metadata > and data entries. I spent most of my review time reading this one. Overall, it looks pretty good to me, and I think you've called out most of the interesting design choices. > + char *cmdEnd = psprintf(" OWNER TO %s", > fmtId(te->owner)); > + > + IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", > cmdEnd); This is just a nitpick, but is there any reason not to have IssueCommandPerBlob() accept a format string and the corresponding arguments? > + while (n < 1000 && i + n < ntups) Another nitpick: it might be worth moving these hard-wired constants to macros. Without a control switch, that'd at least make it easy for anyone determined to change the value for their installation. > Subject: [PATCH v9 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA. This one looks reasonable, too. > In this patch I have just hard-wired pg_upgrade to use > --transaction-size 1000. Perhaps there would be value in adding > another pg_upgrade option to allow user control of that, but I'm > unsure that it's worth the trouble; I think few users would use it, > and any who did would see not that much benefit. However, we > might need to adjust the logic to make the size be 1000 divided > by the number of parallel restore jobs allowed. I wonder if it'd be worth making this configurable for pg_upgrade as an escape hatch in case the default setting is problematic. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
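Purely as an illustration of the macro suggestion above (the macro name here is made up, not taken from the patch):

    /* Illustrative only: give the hard-wired batch size a name. */
    #define MAX_BLOBS_PER_TOC_ENTRY     1000

        while (n < MAX_BLOBS_PER_TOC_ENTRY && i + n < ntups)
        {
            /* ... accumulate the next blob OID into the current batch ... */
            n++;
        }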
Re: pg_upgrade failing for 200+ million Large Objects
On Fri, Jan 05, 2024 at 03:02:34PM -0500, Tom Lane wrote: > On further reflection, there is a very good reason why it's done like > that. Because pg_upgrade is doing schema-only dump and restore, > there's next to no opportunity for parallelism within either pg_dump > or pg_restore. There's no data-loading steps, and there's no > index-building either, so the time-consuming stuff that could be > parallelized just isn't happening in pg_upgrade's usage. > > Now it's true that my 0003 patch moves the needle a little bit: > since it makes BLOB creation (as opposed to loading) parallelizable, > there'd be some hope for parallel pg_restore doing something useful in > a database with very many blobs. But it makes no sense to remove the > existing cross-database parallelism in pursuit of that; you'd make > many more people unhappy than happy. I assume the concern is that we'd end up multiplying the effective number of workers if we parallelized both in-database and cross-database? Would it be sufficient to make those separately configurable with a note about the multiplicative effects of setting both? I think it'd be unfortunate if pg_upgrade completely missed out on this improvement. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
On Fri, Jan 12, 2024 at 04:42:27PM -0600, Nathan Bossart wrote: > On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: >> +char *cmdEnd = psprintf(" OWNER TO %s", >> fmtId(te->owner)); >> + >> +IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", >> cmdEnd); > > This is just a nitpick, but is there any reason not to have > IssueCommandPerBlob() accept a format string and the corresponding > arguments? Eh, I guess you'd have to find some other way of specifying where the OID is supposed to go, which would probably be weird. Please disregard this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
Nathan Bossart writes: > On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: >> +char *cmdEnd = psprintf(" OWNER TO %s", >> fmtId(te->owner)); >> + >> +IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", >> cmdEnd); > This is just a nitpick, but is there any reason not to have > IssueCommandPerBlob() accept a format string and the corresponding > arguments? The problem is how to combine the individual per-blob OID with the command string given by the caller. If we want the caller to also be able to inject data values, I don't really see how to combine the OID with the va_args list from the caller. If we stick with the design that the caller provides separate "front" and "back" strings, but ask to be able to inject data values into those, then we need two va_args lists which C doesn't support, or else an arbitrary decision about which one gets the va_args. (Admittedly, with only one caller that needs a non-constant string, we could make such a decision; but by the same token we'd gain little.) It'd be notationally simpler if we could have the caller supply one string that includes %u where the OID is supposed to go; but then we have problems if an owner name includes %. So on the whole I'm not seeing anything much better than what I did. Maybe I missed an idea though. > Another nitpick: it might be worth moving these hard-wired constants to > macros. Without a control switch, that'd at least make it easy for anyone > determined to change the value for their installation. OK. regards, tom lane
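A tiny standalone example (assumed names; not pg_dump code) of the hazard Tom describes, i.e. what goes wrong if caller-supplied text ends up inside the format string that is later used to substitute the per-blob OID:

    #include <stdio.h>

    int
    main(void)
    {
        char        fmt[128];

        /* A caller naively folds a quoted owner name into one format string. */
        snprintf(fmt, sizeof(fmt),
                 "ALTER LARGE OBJECT %%u OWNER TO \"%s\";", "100%_sales");

        /*
         * fmt now reads: ALTER LARGE OBJECT %u OWNER TO "100%_sales";
         * Running that back through printf-style formatting to insert the OID
         * would treat "%_s" inside the identifier as a conversion spec --
         * which is why the front/back-string interface keeps caller text out
         * of the format string and composes "front" + OID + "back" instead.
         */
        puts(fmt);
        return 0;
    }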
Fix minor memory leak in connection string validation
Introduced in commit c3afe8cf5a. Someone issuing repeated "CREATE SUBSCRIPTION" commands where the connection has no password and must_have_password is true will leak malloc'd memory in the error path. Minor issue in practice, because I suspect that a user privileged enough to create a subscription could cause bigger problems. It makes me wonder if we should use the resowner mechanism to track pointers to malloc'd memory. Then we could use a standard pattern for these kinds of cases, and it would also catch more remote issues, like if a pstrdup() fails in an error path (which can happen a few lines up if the parse fails). Patch attached; intended for 16 and 17. Regards, Jeff Davis From cbce9f5a6d1f7a7ce19473c45ac4cc74bca49c0e Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Fri, 12 Jan 2024 14:39:05 -0800 Subject: [PATCH] Fix memory leak in connection string validation. Introduced in commit c3afe8cf5a. Backpatch-through: 16 --- src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 5 + 1 file changed, 5 insertions(+) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index ead30f87c9..201c36cb22 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -284,10 +284,15 @@ libpqrcv_check_conninfo(const char *conninfo, bool must_use_password) } if (!uses_password) + { + /* malloc'd, so we must free it explicitly */ + PQconninfoFree(opts); + ereport(ERROR, (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), errmsg("password is required"), errdetail("Non-superusers must provide a password in the connection string."))); + } } PQconninfoFree(opts); -- 2.34.1
Re: cfbot is failing all tests on FreeBSD/Meson builds
On Sat, Jan 13, 2024 at 9:32 AM Tom Lane wrote: > It looks like every recent cfbot run has failed in the > FreeBSD-13-Meson build, even if it worked in other ones. > The symptoms are failures in the TAP tests that try to > use interactive_psql: > > Can't call method "slave" on an undefined value at > /usr/local/lib/perl5/site_perl/IPC/Run.pm line 2889. > > I suspect that we are looking at some bug in IPC::Run that exists in > the version that that FreeBSD release has (but not, seemingly, > elsewhere in the community), and that was mostly harmless until > c53859295 made all Perl warnings fatal. Not sure what we want > to do about this. Right, I see this locally on my FreeBSD box. Reverting the fatal warnings thing doesn't help, it's more broken than that. I tried to understand https://github.com/cpan-authors/IPC-Run/blob/master/lib/IPC/Run.pm https://github.com/cpan-authors/IO-Tty/blob/master/Pty.pm but I am not good at perl. I think the error means that in ## Close all those temporary filehandles that the kids needed. for my $pty ( values %{ $self->{PTYS} } ) { close $pty->slave; } the variable $pty holds undef, so then we have to find out where undef was put into $self->{PTYS}. There is a place that does: ## Just flag the pyt's existence for now. It'll be ## converted to a real IO::Pty by _open_pipes. $self->{PTYS}->{$pty_id} = undef; We can see that this started a couple of days ago: https://github.com/postgres/postgres/commits/master/ But at older commits I see the problem locally too, so, yeah, it must be coming from somewhere else. Looking at the relevant packages p5-IPC-Run and p5-IO-Tty I see that they recently moved to 20231003.0 and 1.18, respectively. Downgrading p5-IPC-Run to p5-IPC-Run-20220807.0.pkg was not enough. Downgrading p5-IO-Tty to p5-IO-Tty-1.17.pkg allowed psql/010_tab_completion to pass with either version of p5-IPC-Run. The answer may lie in the commits between those versions here: https://github.com/cpan-authors/IO-Tty/commits/master/ I see that Debian Bookworm is still using 1.17, which fits with your guess that FreeBSD is just breaking first because it is using a newer version.
Re: cfbot is failing all tests on FreeBSD/Meson builds
Thomas Munro writes: > Looking at the relevant packages > p5-IPC-Run and p5-IO-Tty I see that they recently moved to 20231003.0 > and 1.18, respectively. Downgrading p5-IPC-Run to > p5-IPC-Run-20220807.0.pkg was not enough. Downgrading p5-IO-Tty to > p5-IO-Tty-1.17.pkg allowed psql/010_tab_completion to pass with either > version of p5-IPC-Run. > The answer may lie in the commits between those versions here: > https://github.com/cpan-authors/IO-Tty/commits/master/ > I see that Debian Bookworm is still using 1.17, which fits with your > guess that FreeBSD is just breaking first because it is using a newer > version. Yeah, I have nothing newer than 1.17 here either. Time for a bug report to IO::Tty's authors, I guess. regards, tom lane
Re: cfbot is failing all tests on FreeBSD/Meson builds
On Sat, Jan 13, 2024 at 1:51 PM Tom Lane wrote: > Time for a bug report to IO::Tty's authors, I guess. Ahh, there is one: https://github.com/cpan-authors/IO-Tty/issues/38 In the meantime, will look into whether I can pin that package to 1.17 somewhere in the pipeline, hopefully later today...
Re: Fix minor memory leak in connection string validation
On Fri, Jan 12, 2024 at 03:06:26PM -0800, Jeff Davis wrote: > It makes me wonder if we should use the resowner mechanism to track > pointers to malloc'd memory. Then we could use a standard pattern for > these kinds of cases, and it would also catch more remote issues, like > if a pstrdup() fails in an error path (which can happen a few lines up > if the parse fails). That seems worth exploring. > if (!uses_password) > + { > + /* malloc'd, so we must free it explicitly */ > + PQconninfoFree(opts); > + > ereport(ERROR, > > (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), >errmsg("password is required"), >errdetail("Non-superusers must provide > a password in the connection string."))); > + } > } > > PQconninfoFree(opts); Another option could be to surround this with PG_TRY/PG_FINALLY, but your patch seems sufficient, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
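A sketch of the PG_TRY/PG_FINALLY alternative Nathan mentions (hypothetical; the fix actually proposed upthread is the explicit PQconninfoFree before the ereport):

    /* Hypothetical alternative shape for libpqrcv_check_conninfo(). */
    opts = PQconninfoParse(conninfo, &err);
    /* ... existing handling of a NULL result ... */

    PG_TRY();
    {
        /* ... scan opts[] to see whether a password was supplied, then: */
        if (must_use_password && !uses_password)
            ereport(ERROR,
                    (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
                     errmsg("password is required"),
                     errdetail("Non-superusers must provide a password in the connection string.")));
    }
    PG_FINALLY();
    {
        PQconninfoFree(opts);
    }
    PG_END_TRY();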
Re: Fix minor memory leak in connection string validation
Nathan Bossart writes: > On Fri, Jan 12, 2024 at 03:06:26PM -0800, Jeff Davis wrote: >> It makes me wonder if we should use the resowner mechanism to track >> pointers to malloc'd memory. Then we could use a standard pattern for >> these kinds of cases, and it would also catch more remote issues, like >> if a pstrdup() fails in an error path (which can happen a few lines up >> if the parse fails). > That seems worth exploring. I'm pretty dubious about adding overhead for that, mainly because most of the direct callers of malloc in a backend are going to be code that's not under our control. Modifying the callers that we do control is not going to give a full solution, and could well be outright misleading. > Another option could be to surround this with PG_TRY/PG_FINALLY, but your > patch seems sufficient, too. Yeah, seems fine for now. If that function grows any more complexity then we could think about using PG_TRY. regards, tom lane
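As a concrete illustration of the PG_TRY/PG_FINALLY alternative mentioned above, a minimal sketch is below. It is not the committed fix: check_conn_string_has_password is an invented name and the password check is simplified. The point is only that the malloc'd libpq array gets freed on both the normal and the error path:

#include "postgres.h"

#include "libpq-fe.h"

static void
check_conn_string_has_password(const char *connstr)
{
	PQconninfoOption *opts;
	bool		uses_password = false;

	opts = PQconninfoParse(connstr, NULL);
	if (opts == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("invalid connection string syntax")));

	PG_TRY();
	{
		/* look for a non-empty password option */
		for (PQconninfoOption *opt = opts; opt->keyword != NULL; opt++)
		{
			if (strcmp(opt->keyword, "password") == 0 &&
				opt->val != NULL && opt->val[0] != '\0')
			{
				uses_password = true;
				break;
			}
		}

		if (!uses_password)
			ereport(ERROR,
					(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
					 errmsg("password is required"),
					 errdetail("Non-superusers must provide a password in the connection string.")));
	}
	PG_FINALLY();
	{
		/* malloc'd by libpq, so it must be freed explicitly on every path */
		PQconninfoFree(opts);
	}
	PG_END_TRY();
}

The resowner idea discussed above would instead register the pointer for automatic cleanup at abort, which is what makes it attractive for more remote failure paths, at the cost Tom mentions.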
Re: Synchronizing slots from primary to standby
On Fri, Jan 12, 2024 at 12:07 PM Bertrand Drouvot wrote: > > On Fri, Jan 12, 2024 at 08:42:39AM +0530, Amit Kapila wrote: > > On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot > > wrote: > > > > > > I'm not sure I follow here. If the remote slot is re-created then it > > > would > > > also be dropped / re-created locally, or am I missing something? > > > > > > > As our slot-syncing mechanism is asynchronous (from time to time we > > check the slot information on the primary), isn't it possible that a > > slot with the same name is dropped and recreated between the slot-sync worker's > > checks? > > > > Yeah, I should have thought harder ;-) So for this case: if we > had an easy way to detect that a remote slot has been dropped/re-created, then I > think > we would also drop and re-create it on the standby too. > > If so, I think we should then update all the fields (that we're currently > updating > in the "create locally" case) when we detect that (at least) one of the > following differs: > > - dboid > - plugin > - two_phase > Right, I think that even if any of the restart/confirmed LSNs or xmin have changed, there is also no harm in simply copying all the fields from remote_slot, as done by Hou-San in the latest patch. > Maybe the "best" approach would be to have a way to detect that a slot has > been > re-created on the primary (but that would mean relying on more than the slot name > to "identify" a slot and probably adding a new member to the struct to do so). > Right, I also thought so, but I'm not sure that further complicating the slot machinery is worth it to detect this case explicitly. If we see any problem with the idea discussed, then we may need to think along those lines. -- With Regards, Amit Kapila.
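To make the comparison Bertrand lists a bit more concrete, a purely illustrative sketch is below (SlotIdentity and slot_needs_recreate are invented names, not structures from the patch): the local synced slot would be dropped and re-created only when one of the identity-like properties differs, while changes to restart/confirmed LSNs or xmin are simply copied over.

#include "postgres.h"

/* Invented, simplified representation of a slot's identity-like fields. */
typedef struct SlotIdentity
{
	Oid			dboid;					/* database the slot belongs to */
	char		plugin[NAMEDATALEN];	/* output plugin name */
	bool		two_phase;				/* two-phase decoding enabled? */
} SlotIdentity;

/*
 * Return true if the remote slot can no longer be the "same" slot as the
 * local copy, i.e. it was presumably dropped and re-created on the primary,
 * so the standby should drop and re-create its synced slot too.
 */
static bool
slot_needs_recreate(const SlotIdentity *remote, const SlotIdentity *local)
{
	return remote->dboid != local->dboid ||
		strcmp(remote->plugin, local->plugin) != 0 ||
		remote->two_phase != local->two_phase;
}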
Re: Recovering from detoast-related catcache invalidations
Great! That's exactly what we need. The patch LGTM, +1. Tom Lane wrote on Sat, Jan 13, 2024 at 04:47: > I wrote: > > This is uncomfortably much in bed with the tuple table slot code, > > perhaps, but I don't see a way to do it more cleanly unless we want > > to add some new provisions to that API. Andres, do you have any > > thoughts about that? > > Oh! After nosing around a bit more I remembered systable_recheck_tuple, > which is meant for exactly this purpose. So v4 attached. > > regards, tom lane
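For anyone following along, a rough sketch of the pattern systable_recheck_tuple() enables is below (fetch_and_recheck is an invented wrapper, not code from the v4 patch): after work that may outlast a catalog invalidation, such as detoasting, the tuple's liveness is rechecked before the result is trusted, and the caller retries if it is no longer live.

#include "postgres.h"

#include "access/genam.h"
#include "access/htup.h"

/*
 * Fetch the next tuple from a catalog scan, do potentially slow work on it
 * (detoasting elided here), then verify it is still live.  Returning NULL
 * tells the caller to retry from scratch.
 */
static HeapTuple
fetch_and_recheck(SysScanDesc scan)
{
	HeapTuple	tup = systable_getnext(scan);

	if (!HeapTupleIsValid(tup))
		return NULL;			/* no more tuples */

	/* ... detoast out-of-line fields here; invalidations may arrive ... */

	if (!systable_recheck_tuple(scan, tup))
		return NULL;			/* tuple died under us; caller must retry */

	return tup;
}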
Re: plpgsql memory leaks
Hi, On Fri, Jan 12, 2024 at 22:25, Tom Lane wrote: > Pavel Stehule writes: > > default master branch - res 190MB ram > > jit_inline_above_cost = -1 doesn't help > > disabling JIT doesn't help either, > > > so it looks like the hypothesis was wrong, and the problem is maybe > somewhere > > else :-/ > > I see no leak with these examples on HEAD, either with or without > --enable-llvm --- the process size stays quite stable according > to "top". I wonder if you are using some extension that's > contributing to the problem. > Memory info after DO $$ BEGIN END $$: (2024-01-13 05:36:46) postgres=# do $$ begin end $$; DO (2024-01-13 05:37:16) postgres=# select meminfo(); NOTICE: Total non-mmapped bytes (arena): 1114112 NOTICE: # of free chunks (ordblks):11 NOTICE: # of free fastbin blocks (smblks): 0 NOTICE: # of mapped regions (hblks): 2 NOTICE: Bytes in mapped regions (hblkhd): 401408 NOTICE: Max. total allocated space (usmblks): 0 NOTICE: Free bytes held in fastbins (fsmblks): 0 NOTICE: Total allocated space (uordblks): 1039216 NOTICE: Total free space (fordblks): 74896 NOTICE: Topmost releasable block (keepcost): 67360 After script execution: NOTICE: ("1165 kB","1603 kB","438 kB") NOTICE: Total non-mmapped bytes (arena): 22548480 NOTICE: # of free chunks (ordblks):25 NOTICE: # of free fastbin blocks (smblks): 0 NOTICE: # of mapped regions (hblks): 2 NOTICE: Bytes in mapped regions (hblkhd): 401408 NOTICE: Max. total allocated space (usmblks): 0 NOTICE: Free bytes held in fastbins (fsmblks): 0 NOTICE: Total allocated space (uordblks): 1400224 NOTICE: Total free space (fordblks): 21148256 NOTICE: Topmost releasable block (keepcost): 20908384 So the attached memory is 20MB - but it is almost all free. The sum of the memory contexts is very stable, without leaks (used 1165kB). But when I modify the script to CREATE OR REPLACE FUNCTION public.fx(iter integer) RETURNS void LANGUAGE plpgsql AS $function$ declare c cursor(m bigint) for select distinct i from generate_series(1, m) g(i); t bigint; s bigint; begin for i in 1..iter loop open c(m := i * 1); s := 0; loop fetch c into t; exit when not found; s := s + t; end loop; raise notice '===before close'; raise notice '%', (select (pg_size_pretty(sum(used_bytes)), pg_size_pretty(sum(total_bytes)), pg_size_pretty(sum(free_bytes))) from pg_get_backend_memory_contexts()); --perform meminfo(); raise notice '---after close'; close c; raise notice '%=%', i, s; raise notice '%', (select (pg_size_pretty(sum(used_bytes)), pg_size_pretty(sum(total_bytes)), pg_size_pretty(sum(free_bytes))) from pg_get_backend_memory_contexts()); --perform meminfo(); end loop; end; $function$ (meminfo is a simple extension - see the attachment), I got interesting things: NOTICE: ===before close NOTICE: ("149 MB","154 MB","5586 kB") NOTICE: Total non-mmapped bytes (arena): 132960256 NOTICE: # of free chunks (ordblks):49 NOTICE: # of free fastbin blocks (smblks): 0 NOTICE: # of mapped regions (hblks): 4 NOTICE: Bytes in mapped regions (hblkhd): 51265536 NOTICE: Max. total allocated space (usmblks): 0 NOTICE: Free bytes held in fastbins (fsmblks): 0 NOTICE: Total allocated space (uordblks): 110730576 NOTICE: Total free space (fordblks): 9680 NOTICE: Topmost releasable block (keepcost): 133008 So this script really used megabytes of memory, but it is related to the query `select distinct i from generate_series(1, m) g(i);`. This may be correlated with my default work_mem of 64MB - when I set work_mem to 10MB, it consumes only 15MB. So I was confused because it uses only about 3x work_mem, which is not too bad.
Regards, Pavel
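The meminfo extension itself is only attached above, not quoted in the thread; purely as a guess at how numbers like those can be produced, a minimal sketch based on glibc's mallinfo2() (glibc 2.33 or later) might look like this, with the SQL declaration reduced to a comment:

#include "postgres.h"

#include <malloc.h>

#include "fmgr.h"

PG_MODULE_MAGIC;

/* CREATE FUNCTION meminfo() RETURNS void AS 'MODULE_PATHNAME' LANGUAGE C; */
PG_FUNCTION_INFO_V1(meminfo);

Datum
meminfo(PG_FUNCTION_ARGS)
{
	struct mallinfo2 mi = mallinfo2();

	/* report selected glibc allocator statistics as NOTICEs */
	ereport(NOTICE, (errmsg("Total non-mmapped bytes (arena): %zu", mi.arena)));
	ereport(NOTICE, (errmsg("# of free chunks (ordblks): %zu", mi.ordblks)));
	ereport(NOTICE, (errmsg("Bytes in mapped regions (hblkhd): %zu", mi.hblkhd)));
	ereport(NOTICE, (errmsg("Total allocated space (uordblks): %zu", mi.uordblks)));
	ereport(NOTICE, (errmsg("Total free space (fordblks): %zu", mi.fordblks)));
	ereport(NOTICE, (errmsg("Topmost releasable block (keepcost): %zu", mi.keepcost)));

	PG_RETURN_VOID();
}

The interesting number in the output quoted above is fordblks: roughly 20MB stays attached to the process but sits free inside glibc's arena, which is why the resident size looks large even though the memory-context totals stay flat.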
Re: Synchronizing slots from primary to standby
On Fri, Jan 12, 2024 at 5:50 PM shveta malik wrote: > > There are multiple approaches discussed and tried when it comes to > starting a slot-sync worker. I am summarizing them all here: > > 1) Make the slotsync worker an auxiliary process (like checkpointer, > walwriter, walreceiver etc.). The benefit this approach provides is that it > can control startup and shutdown in a more flexible way, as each auxiliary > process could have different checks before starting and can have > different stop conditions. But it needs code duplication for process > management (start, stop, crash handling, signals etc.) and currently it > does not support a db-connection smoothly (none of the auxiliary processes > has one so far). > As the slotsync worker needs to perform transactions and access the syscache, we can't make it an auxiliary process, as that doesn't initialize the required stuff like the syscache. Also, see the comment "Auxiliary processes don't run transactions ..." in AuxiliaryProcessMain(), which means this is not an option. > > 2) Make the slotsync worker a 'special' process like AutoVacLauncher, > which is neither an auxiliary process nor a bgworker. It allows > a db-connection and also provides the flexibility to have start and stop > conditions for a process. > Yeah, due to these reasons, I think this option is worth considering, and another plus point is that this allows us to make enable_syncslot a PGC_SIGHUP GUC rather than a PGC_POSTMASTER one. > > 3) Make the slotsync worker a bgworker. Here we just need to register our > process as a bgworker (RegisterBackgroundWorker()) by providing a > relevant start_time and restart_time, and then the process management > is well taken care of. It does not need any code duplication and > allows a db-connection smoothly in the registered process. The only thing it > lacks is that it does not provide the flexibility of having a > start condition, which then forces us to have 'enable_syncslot' as a > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > feel enable_syncslot is something which will not be changed frequently, > and with the benefits provided by the bgworker infra, it seems a > reasonably good option to choose this approach. > I agree, but it may be better to make it a PGC_SIGHUP parameter. > 4) Another option is to have the Logical Replication Launcher (or a new > process) launch the slot-sync worker. But going by the current design, > where we have only 1 slotsync worker, it may be overhead to maintain an > additional manager process. > I don't see any good reason to have an additional launcher process here. > > Thus, weighing the pros and cons of all these options, we have currently > implemented the bgworker approach (approach 3). Any feedback is > welcome. > I vote to go for (2) unless we face difficulties in doing so, but (3) is also okay, especially if others also think so. -- With Regards, Amit Kapila.
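To show how little code approach (3) needs, a hypothetical registration sketch is below; register_slotsync_worker and slotsync_worker_main are invented names for illustration and the actual patch may differ. Because a static registration like this has to happen at postmaster startup, the GUC controlling it naturally becomes PGC_POSTMASTER, which is exactly the trade-off discussed above.

#include "postgres.h"

#include "postmaster/bgworker.h"

/*
 * Register a single slot-sync worker.  For a built-in worker this would be
 * called during postmaster startup; an extension would instead call it from
 * _PG_init() of a shared_preload_libraries module.
 */
void
register_slotsync_worker(void)
{
	BackgroundWorker bgw;

	memset(&bgw, 0, sizeof(bgw));
	snprintf(bgw.bgw_name, BGW_MAXLEN, "slotsync worker");
	snprintf(bgw.bgw_type, BGW_MAXLEN, "slotsync worker");
	/* needs shared memory and a database connection (syscache access) */
	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	/* start once the standby has reached a consistent state */
	bgw.bgw_start_time = BgWorkerStart_ConsistentState;
	bgw.bgw_restart_time = 60;	/* restart 60s after a crash */
	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "slotsync_worker_main");
	bgw.bgw_main_arg = (Datum) 0;
	bgw.bgw_notify_pid = 0;

	RegisterBackgroundWorker(&bgw);
}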