Re: psql tests hangs
On Fri, May 12, 2023 at 2:40 AM Pavel Stehule wrote: > pá 12. 5. 2023 v 8:20 odesílatel Kirk Wolak napsal: > >> On Fri, May 12, 2023 at 1:46 AM Pavel Stehule >> wrote: >> >>> pá 12. 5. 2023 v 6:50 odesílatel Kirk Wolak napsal: >>> On Fri, May 12, 2023 at 12:14 AM Tom Lane wrote: > Kirk Wolak writes: > > Did you try the print statement that Andrey asked Pavel to try? > ... >>> The strange thing is hanging. Broken tests depending on locale are >>> usual. But I didn't remember hanging. >>> >>> Regards >>> >>> Pavel >>> >> >> So, if you do psql -c "..." >> with both of those \watch instructions, do either one hang? (I am now >> guessing "no") >> >> I know that perl is using a special library to "remote control psql" >> (like a pseudo terminal, I guess). >> [I had to abort some of the perl testing in Windows because that perl >> library didn't work with my psql in Windows] >> >> Next, can you detect which process is hanging? (is it perl, the library, >> psql, ?). >> > > It hangs in perl > > but now I found there is dependency on PSQL_PAGER setting > > it started pager in background, I had lot of zombie pspg processes > > Unfortunately, when I unset this variable, the test hangs still > > here is backtrace > > Missing separate debuginfos, use: dnf debuginfo-install > perl-interpreter-5.36.1-496.fc38.x86_64 > (gdb) bt > #0 0x7fbbc1129ade in select () from /lib64/libc.so.6 > #1 0x7fbbc137363b in Perl_pp_sselect () from /lib64/libperl.so.5.36 > #2 0x7fbbc1317958 in Perl_runops_standard () from > /lib64/libperl.so.5.36 > #3 0x7fbbc128259d in perl_run () from /lib64/libperl.so.5.36 > #4 0x56392bd9034a in main () > > It is waiting on reading from pipe probably > > psql is living too, and it is waiting too > > Using host libthread_db library "/lib64/libthread_db.so.1". > 0x7f071740bc37 in wait4 () from /lib64/libc.so.6 > Missing separate debuginfos, use: dnf debuginfo-install > glibc-2.37-4.fc38.x86_64 ncurses-libs-6.4-3.20230114.fc38.x86_64 > readline-8.2-3.fc38.x86_64 > (gdb) bt > #0 0x7f071740bc37 in wait4 () from /lib64/libc.so.6 > #1 0x7f07173a9a10 in _IO_proc_close@@GLIBC_2.2.5 () from > /lib64/libc.so.6 > #2 0x7f07173b51e9 in __GI__IO_file_close_it () from /lib64/libc.so.6 > #3 0x7f07173a79fb in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6 > #4 0x00406be4 in do_watch (query_buf=query_buf@entry=0x5ae540, > sleep=sleep@entry=0.01, iter=0, iter@entry=3) at command.c:5348 > #5 0x004087a5 in exec_command_watch > (scan_state=scan_state@entry=0x5ae490, > active_branch=active_branch@entry=true, query_buf=query_buf@entry=0x5ae540, > previous_buf=previous_buf@entry=0x5ae560) at command.c:2875 > #6 0x0040d4ba in exec_command (previous_buf=0x5ae560, > query_buf=0x5ae540, cstack=0x5ae520, scan_state=0x5ae490, cmd=0x5ae9a0 > "watch") at command.c:413 > #7 HandleSlashCmds (scan_state=scan_state@entry=0x5ae490, > cstack=cstack@entry=0x5ae520, query_buf=0x5ae540, previous_buf=0x5ae560) > at command.c:230 > > I am not sure, it is still doesn't work but probably there are some > dependencies on my setting > > PSQL_PAGER and PSQL_WATCH_PAGER > > so this tests fails due my setting > > [pavel@localhost postgresql.master]$ set |grep PSQL > PSQL_PAGER='pspg -X' > PSQL_WATCH_PAGER='pspg -X --stream' > > Regards > > Pavel > > Ummm... We are testing PSQL \watch and you potentially have a PSQL_WATCH_PAGER that is kicking in? By chance does that attempt to read/process/understand the \watch ? Also, if it is interfering with the stream, that would explain it. The perl library is trying to "control" psql. 
If it ends up talking to you instead... All bets are off, imo. I don't know enough about PSQL_WATCH_PAGER. Now I would be curious if you changed the test from SELECT 1 \watch c=3 0.01 to SELECT 1 \watch 0.01 because that should work. Then I would test SELECT \watch 0.01 c=3 If you are trying to parse the watch at all, that could break. Then your code might be trying to "complain", and then that is screwing up the planned interaction (Just Guessing). Kirk...
Re: Assert failure of the cross-check for nullingrels
On Fri, Mar 17, 2023 at 11:05 AM Richard Guo wrote: > Here is an updated patch with comments and test case. I also change the > code to store 'group_clause_relids' directly in RestrictInfo. > BTW, I've added an open item for this issue. Thanks Richard
Re: psql tests hangs
On 2023-May-12, Pavel Stehule wrote: > It hangs in perl I wonder if "hanging" actually means that it interpreted the sleep time as a very large integer, so it's just sleeping for a long time. About the server locale, note that the ->new() call explicitly requests the C locale -- it's only psql that is using the Czech locale. Supposedly, the Perl code should also be using the Czech locale, so the sprintf('%g') should be consistent with what psql \watch expects. However, you cannot ask the server to be consistent with that -- say, if you hypothetically tried to use "to_char(9D99)" and \gset that to use as \watch argument, it wouldn't work, because that'd use the server's C locale, not Czech. (I know because I tried.) -- Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"
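A minimal C sketch of the locale dependence being discussed (the locale name cs_CZ.UTF-8 is only an assumed example of an installed Czech locale; this is illustrative, not part of the test code):

#include <locale.h>
#include <stdio.h>

int
main(void)
{
	/* "0.01" in the default C locale */
	printf("%g\n", 0.01);

	/* "0,01" once a Czech locale governs LC_NUMERIC: comma decimal separator */
	if (setlocale(LC_NUMERIC, "cs_CZ.UTF-8") != NULL)
		printf("%g\n", 0.01);

	return 0;
}

The same C-library behaviour is why an interval formatted with '%g' under one locale may not be read back the same way under another.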
Re: Add standard collation UNICODE
On 08.05.23 17:48, Peter Eisentraut wrote: On 27.04.23 13:44, Daniel Verite wrote: This collation has an empty pg_collation.collversion column, instead of being set to the same value as "und-x-icu" to track its version. The original patch implements this as an INSERT in which it would be easy to fix I guess, but in current HEAD it comes as an entry in include/catalog/pg_collation.dat: { oid => '963', descr => 'sorts using the Unicode Collation Algorithm with default settings', collname => 'unicode', collprovider => 'i', collencoding => '-1', colliculocale => 'und' }, Should it be converted back into an INSERT or better left in this file and collversion being updated afterwards? How about we do it with an UPDATE command. We already do this for pg_database in a similar way. See attached patch. This has been committed.
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
On 10.05.23 20:04, Andres Freund wrote: This commit adds a test is(scalar @lp_off, $ROWCOUNT, "acquired row offsets"); *before* that skip_all call. This appears to be invalid. If the skip_all happens, you get a complaint like t/004_verify_heapam.pl (Wstat: 0 Tests: 1 Failed: 0) Parse errors: Bad plan. You planned 0 tests but ran 1. We could move the is() test after all the skip_all's. Any thoughts? I think the easiest fix is to just die if we can't get the offsets - it's not like we can really continue afterwards... This should do it: -is(scalar @lp_off, $ROWCOUNT, "acquired row offsets"); +scalar @lp_off == $ROWCOUNT or BAIL_OUT("row offset counts mismatch"); But I'm not sure what the latest thinking on BAIL_OUT is. It is used nearby in a similar way though.
Re: psql tests hangs
pá 12. 5. 2023 v 9:46 odesílatel Alvaro Herrera napsal: > On 2023-May-12, Pavel Stehule wrote: > > > It hangs in perl > > I wonder if "hanging" actually means that it interpreted the sleep time > as a very large integer, so it's just sleeping for a long time. > There is some interaction with pspg in stream mode The probable scenario It is starting pspg due to my setting PSQL_WATCH_PAGER. pspg is waiting on quit command, or on pipe ending. Quit command cannot to come because it is not on tty, so it is dead lock I can write to safeguard the fast ending on pspg when it is in stream mode, and tty is not available. And generally, the root perl should to reset PSQL_WATCH_PAGER env variable before executing psql. Probably it does with PSQL_PAGER, and maybe with PAGER. Regards Pavel > > About the server locale, note that the ->new() call explicitly requests > the C locale -- it's only psql that is using the Czech locale. > Supposedly, the Perl code should also be using the Czech locale, so the > sprintf('%g') should be consistent with what psql \watch expects. > However, you cannot ask the server to be consistent with that -- say, if > you hypothetically tried to use "to_char(9D99)" and \gset that to use as > \watch argument, it wouldn't work, because that'd use the server's C > locale, not Czech. (I know because I tried.) > > -- > Álvaro HerreraBreisgau, Deutschland — > https://www.EnterpriseDB.com/ > "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente" >
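A rough sketch of the kind of safeguard described above; the function name and the exact check are hypothetical, not pspg's actual source:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * In stream mode the data arrives on stdin, so the quit command has to
 * come from the controlling terminal.  If no terminal is available,
 * exit with an error instead of blocking forever.
 */
static void
check_stream_mode_tty(bool stream_mode)
{
	FILE	   *tty = fopen("/dev/tty", "r");

	if (stream_mode && tty == NULL)
	{
		fprintf(stderr,
				"stream mode can be used only in interactive mode (tty is not available)\n");
		exit(EXIT_FAILURE);
	}

	if (tty != NULL)
		fclose(tty);
}

int
main(void)
{
	check_stream_mode_tty(true);
	puts("tty available, continuing");
	return 0;
}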
Re: Allow pg_archivecleanup to remove backup history files
On 2023-05-10 17:52, Bharath Rupireddy wrote: Thanks for your comments! Just curious to know the driving point behind this proposal - is pg_archivecleanup deployed in production that was unable to clean up the history files and there were many such history files left? It will help us know how pg_archivecleanup is being used. Yes. I'm wondering if making -x generic with '-x' '.backup', is simpler than adding another option? Since according to the current semantics, deleting backup history files with -x demands not '-x .backup' but '-x .007C9330.backup' when the file name is 0001123455CD.007C9330.backup, it needs special treatment for backup history files, right? I think it would be intuitive and easier to remember than a new option. I was a little concerned about what to do when deleting both the files ending in .gz and backup history files. Is making it possible to specify both "-x .backup" and "-x .gz" the way to go? I was also concerned that someone might add ".backup" to WAL files, but does that usually not happen? Comments on the patch: 1. Why just only the backup history files? Why not remove the timeline history files too? Is it because there may not be as many tli switches happening as backups? Yeah, do you think we should also add logic for '-x .history'? 2. +sub remove_backuphistoryfile_run_check +{ Why to invent a new function when run_check() can be made generic with few arguments passed? Thanks, I'm going to reconsider it. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Re: Orphaned files in base/[oid]
12.05.2023, 14:17, "Andres Freund": Alternatively we could do something without marker files, with some added complexity: Keep track of all "uncommitted new files" in memory, and log them every checkpoint. Commit/abort records clear elements of that list. Since we always start replay at the beginning of a checkpoint, we'd always reach a moment with such an up2date list of pending-action files before reaching end-of-recovery. At end-of-recovery we can delete all unconfirmed files. To avoid out-of-memory due to too many tracked relations, we'd possibly still have to have marker files... Hi, hackers. I'm sorry, but I want to bump this thread, because there is still no good solution to solve the problem. I see there are few threads with undo-based approaches, which looks preferable, but have some pitfalls. Is there any chance to return to non-undo approaches partially discussed here? What do you think about the following solutions? 1) Make `pendingDeletes` shared and let postmaster clean all garbage in case of child process dying. Cons: Not works in case of postmaster dying. Should care about `pendingDeletes` pointers validity. 2) Catch and store all records with relfilenode during WAL replay, delete all orphaned nodes at the end of replaying. Cons: The final delete may use an incomplete list of nodes, as there was something before the latest checkpoint. The general opacity - we remove something without a corresponded WAL record (or possibly do it in wrong place in general). 3) This way is close to one I quoted and a combination of two above. `pendingDeletes` is shared. Each checkpoint creates a WAL record with a list of open transactions and created nodes. WAL replaying can use this list as base, adding nodes to it from newer records. The final delete operation has a complete list of orphaned nodes. Cons: Complexity(?). Others(?). Can it work? Are any of this approaches still relevant? -- Regards, Alex Go, C developer g...@arenadata.io, www.arenadata.tech
RE: Non-superuser subscription owners
On Tuesday, April 4, 2023 1:57 AM Robert Haas wrote: > > On Sat, Apr 1, 2023 at 12:00 PM Alexander Lakhin > wrote: > > I've managed to reproduce it using the following script: > > for ((i=1;i<=10;i++)); do > > echo "iteration $i" > > echo " > > CREATE ROLE sub_user; > > CREATE SUBSCRIPTION testsub CONNECTION 'dbname=db' > > PUBLICATION testpub WITH (connect = false); ALTER SUBSCRIPTION > > testsub ENABLE; DROP SUBSCRIPTION testsub; SELECT pg_sleep(0.001); > > DROP ROLE sub_user; " | psql psql -c "ALTER SUBSCRIPTION testsub > > DISABLE;" > > psql -c "ALTER SUBSCRIPTION testsub SET (slot_name = NONE);" > > psql -c "DROP SUBSCRIPTION testsub;" > > grep 'TRAP' server.log && break > > done > > After a bit of experimentation this repro worked for me -- I needed > -DRELCACHE_FORCE_RELEASE as well, and a bigger iteration count. I verified > that the patch fixed it, and committed the patch with the addition of a > comment. Thanks for pushing! While testing this, I found a similar problem in table sync worker, as we also invoke superuser_arg() in table sync worker which is not in a transaction. LogicalRepSyncTableStart ... /* Is the use of a password mandatory? */ must_use_password = MySubscription->passwordrequired && !superuser_arg(MySubscription->owner); #0 0x7f18bb55aaff in raise () from /lib64/libc.so.6 #1 0x7f18bb52dea5 in abort () from /lib64/libc.so.6 #2 0x00b69a22 in ExceptionalCondition (conditionName=0xda4338 "IsTransactionState()", fileName=0xda403e "catcache.c", lineNumber=1208) at assert.c:66 #3 0x00b4842a in SearchCatCacheInternal (cache=0x27cab80, nkeys=1, v1=10, v2=0, v3=0, v4=0) at catcache.c:1208 #4 0x00b48329 in SearchCatCache1 (cache=0x27cab80, v1=10) at catcache.c:1162 #5 0x00b630c7 in SearchSysCache1 (cacheId=11, key1=10) at syscache.c:825 #6 0x00b982e3 in superuser_arg (roleid=10) at superuser.c:70 I can reproduce this via gdb following similar steps in [1]. I think we need to move this call into a transaction as well and here is an attempt to do that. [1] https://www.postgresql.org/message-id/OS0PR01MB5716E596E4FB83DE46F592FE948C9%40OS0PR01MB5716.jpnprd01.prod.outlook.com Best Regards, Hou zj 0001-Fix-possible-logical-replication-table-sync-crash.patch Description: 0001-Fix-possible-logical-replication-table-sync-crash.patch
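The attached patch is not reproduced above, but a minimal sketch of the shape such a fix could take (the exact placement is a guess; the real patch may structure this differently) is simply to wrap the check in a transaction, so that the syscache lookup inside superuser_arg() is legal:

	bool		must_use_password;

	/* Hypothetical placement inside LogicalRepSyncTableStart() */
	StartTransactionCommand();
	must_use_password = MySubscription->passwordrequired &&
		!superuser_arg(MySubscription->owner);
	CommitTransactionCommand();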
Re: Orphaned files in base/[oid]
On 14.08.2017 23:56, Andres Freund wrote: Alternatively we could do something without marker files, with some added complexity: Keep track of all "uncommitted new files" in memory, and log them every checkpoint. Commit/abort records clear elements of that list. Since we always start replay at the beginning of a checkpoint, we'd always reach a moment with such an up2date list of pending-action files before reaching end-of-recovery. At end-of-recovery we can delete all unconfirmed files. To avoid out-of-memory due to too many tracked relations, we'd possibly still have to have marker files... Hi, hackers. I'm sorry, but I want to bump this thread, because there is still no good solution to solve the problem. I see there are few threads with undo-based approaches, which looks preferable, but have some pitfalls. Is there any chance to return to non-undo approaches partially discussed here? What do you think about the following solutions? 1) Make `pendingDeletes` shared and let postmaster clean all garbage in case of child process dying. Cons: Not works in case of postmaster dying. Should care about `pendingDeletes` pointers validity. 2) Catch and store all records with relfilenode during WAL replay, delete all orphaned nodes at the end of replaying. Cons: The final delete may use an incomplete list of nodes, as there was something before the latest checkpoint. The general opacity - we remove something without a corresponded WAL record (or possibly do it in wrong place in general). 3) This way is close to one I quoted and a combination of two above. `pendingDeletes` is shared. Each checkpoint creates a WAL record with a list of open transactions and created nodes. WAL replaying can use this list as base, adding nodes to it from newer records. The final delete operation has a complete list of orphaned nodes. Cons: Complexity(?). Others(?). Can it work? Are any of this approaches still relevant? I'm sorry for the last html-formatted message. Our web-based app is too smart. -- Regards, Alex Go, C developer g...@arenadata.io, www.arenadata.tech
Re: Large files for relations
Thomas Munro writes: > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: >> On Mon, May 1, 2023 at 9:29 PM Thomas Munro wrote: >>> I am not aware of any modern/non-historic filesystem[2] that can't do >>> large files with ease. Anyone know of anything to worry about on that >>> front? >> >> There is some trouble in the ambiguity of what we mean by "modern" and >> "large files". There are still a large number of users of ext4 where >> the max file size is 16TB. Switching to a single large file per >> relation would effectively cut the max table size in half for those >> users. How would a user with say a 20TB table running on ext4 be >> impacted by this change? […] > A less aggressive version of the plan would be that we just keep the > segment code for the foreseeable future with no planned cut off, and > we make all of those "piggy back" transformations that I showed in the > patch set optional. For example, I had it so that CLUSTER would > quietly convert your relation to large format, if it was still in > segmented format (might as well if you're writing all the data out > anyway, right?), but perhaps that could depend on a GUC. Likewise for > base backup. Etc. Then someone concerned about hitting the 16TB > limit on ext4 could opt out. Or something like that. It seems funny > though, that's exactly the user who should want this feature (they > have 16,000 relation segment files). If we're going to have to keep the segment code for the foreseeable future anyway, could we not get most of the benefit by increasing the segment size to something like 1TB? The vast majority of tables would fit in one file, and there would be less risk of hitting filesystem limits. - ilmari
Re: v16 regression - wrong query results with LEFT JOINs + join removal
On Thu, May 11, 2023 at 4:16 PM Kirk Wolak wrote: > Forgive the noob question... But does this trigger a regression test to be > created? > And who tracks/pushes that? Tom included one in the commit. -- Robert Haas EDB: http://www.enterprisedb.com
Re: running logical replication as the subscription owner
On Fri, May 12, 2023 at 1:49 PM Amit Kapila wrote: > > On Fri, May 12, 2023 at 9:10 AM Masahiko Sawada wrote: > > > > On Fri, May 12, 2023 at 1:12 AM Robert Haas wrote: > > > > > > On Thu, May 11, 2023 at 7:38 AM Amit Kapila > > > wrote: > > > > Do we want the initial sync to also respect 'run_as_owner' option? I > > > > might be missing something but I don't see anything in the docs about > > > > initial sync interaction with this option. In the commit a2ab9c06ea, > > > > we did the permission checking during the initial sync so I thought we > > > > should do it here as well. > > > > > > It definitely should work that way. lf it doesn't, that's a bug. > > > > After some tests, it seems that the initial sync worker respects > > 'run_as_owner' during catching up but not during COPYing. > > > > Yeah, I was worried during copy phase only. During catchup, the code > is common with apply worker code, so it will work. > I tried the following test: Repeat On the publisher and subscriber: /* Create role regress_alice with NOSUPERUSER on publisher and subscriber and a table for replication */ CREATE ROLE regress_alice NOSUPERUSER LOGIN; CREATE ROLE regress_admin SUPERUSER LOGIN; GRANT CREATE ON DATABASE postgres TO regress_alice; SET SESSION AUTHORIZATION regress_alice; CREATE SCHEMA alice; GRANT USAGE ON SCHEMA alice TO regress_admin; CREATE TABLE alice.test (i INTEGER); ALTER TABLE alice.test REPLICA IDENTITY FULL; On the publisher: postgres=> insert into alice.test values(1); postgres=> insert into alice.test values(2); postgres=> insert into alice.test values(3); postgres=> CREATE PUBLICATION alice FOR TABLE alice.test WITH (publish_via_partition_root = true); On the subscriber: /* create table admin_audit which regress_alice does not have access to */ SET SESSION AUTHORIZATION regress_admin; create table admin_audit (i integer); On the subscriber: /* Create a trigger for table alice.test which inserts on table admin_audit which the table owner of alice.test does not have access to */ SET SESSION AUTHORIZATION regress_alice; CREATE OR REPLACE FUNCTION alice.alice_audit() RETURNS trigger AS $$ BEGIN insert into public.admin_audit values(2); RETURN NEW; END; $$ LANGUAGE 'plpgsql'; create trigger test_alice after insert on alice.test for each row execute procedure alice.alice_audit(); alter table alice.test enable always trigger test_alice; On the subscriber: /* Create a subscription with run_as_owner = false */ CREATE SUBSCRIPTION admin_sub CONNECTION 'dbname=postgres host=localhost port=6972' PUBLICATION alice WITH (run_as_owner = false); === What I see is that as part of tablesync, the trigger invokes an updates admin_audit which it shouldn't, as the table owner of alice.test should not have access to the table admin_audit. This means the table copy is being invoked as the subscription owner and not the table owner. However, I see subsequent inserts fail on replication with permission denied error, so the apply worker correctly applies the inserts as the table owner. If nobody else is working on this, I can come up with a patch to fix this regards, Ajin Cherian Fujitsu Australia
Re: walsender performance regression due to logical decoding on standby changes
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand wrote: > > >> My current guess is that mis-using a condition variable is the best bet. I > >> think it should work to use ConditionVariablePrepareToSleep() before a > >> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use > >> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast() > >> would still cause the necessary wakeup. > > > > How about something like the attached? Recovery and subscription tests > > don't complain with the patch. > > I launched a full Cirrus CI test with it but it failed on one environment > (did not look in details, > just sharing this here): https://cirrus-ci.com/task/6570140767092736 Yeah, v1 had ConditionVariableInit() such that the CV was getting initialized for every backend as opposed to just once after the WAL sender shmem was created. > Also I have a few comments: Indeed, v1 was a WIP patch. Please have a look at the attached v2 patch, which has comments and passing CI runs on all platforms - https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2. On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu) wrote: > > if (AllowCascadeReplication()) > - WalSndWakeup(switchedTLI, true); > + ConditionVariableBroadcast(&WalSndCtl->cv); > > After the change, we wakeup physical walsender regardless of switchedTLI flag. > Is this intentional ? if so, I think It would be better to update the > comments above this. That's not the case with the attached v2 patch. Please have a look. On Thu, May 11, 2023 at 10:27 AM Masahiko Sawada wrote: > > We can have two condition variables for > logical and physical walsenders, and selectively wake up walsenders > sleeping on the condition variables. It should work, it seems like > much of a hack, though. Andres, rightly put it - 'mis-using' CV infrastructure. It is simple, works, and makes the WalSndWakeup() easy solving the performance regression. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 02c4770964737d1e7a3c7d579080bc6c5bc01fdc Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Fri, 12 May 2023 11:30:14 + Subject: [PATCH v2] Optimize walsender wake up logic with Conditional Variables WalSndWakeup() currently loops through all the walsenders slots, with a spinlock acquisition and release for every iteration, just to wake up only the waiting walsenders. WalSndWakeup() gets costlier even when there's a single walsender available on the standby (i.e., a cascading replica or a logical decoding client). This wasn't a problem before e101dfac3a53c. But, for allowing logical decoding on standby, we needed to wake up logical walsenders after every WAL record is applied on the standby. This really made WalSndWakeup() costly, causing performance regression. To solve this, we use condition variable (CV) to efficiently wake up walsenders in WalSndWakeup(). Every walsender prepares to sleep on a shared memory CV. Note that it just prepares to sleep on the CV (i.e., adds itself to the CV's waitlist), but not actually waits on the CV (IOW, it never calls ConditionVariableSleep()). It still uses WaitEventSetWait() for waiting, because CV infrastructure doesn't handle FeBe socket events currently. The processes (startup process, walreceiver etc.) wanting to wake up walsenders use ConditionVariableBroadcast(), which in turn calls SetLatch(), helping walsenders come out of WaitEventSetWait(). 
This approach is simple and efficient because it makes WalSndWakeup() life easy. When available, WaitEventSetWait() can be replaced with its CV's counterpart. And, we use separate shared memory CVs for physical and logical walsenders for selective wake ups, see WalSndWakeup() for more details. Reported-by: Andres Freund Suggested-by: Andres Freund Author: Bharath Rupireddy Reviewed-by: Drouvot, Bertrand Reviewed-by: Zhijie Hou Discussion: https://www.postgresql.org/message-id/20230509190247.3rrplhdgem6su6cg%40awork3.anarazel.de --- src/backend/replication/walsender.c | 72 ++--- src/include/replication/walsender_private.h | 7 ++ 2 files changed, 55 insertions(+), 24 deletions(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 45b8b3684f..dc4d376e7c 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -3309,6 +3309,9 @@ WalSndShmemInit(void) SpinLockInit(&walsnd->mutex); } + + ConditionVariableInit(&WalSndCtl->physicalWALSndCV); + ConditionVariableInit(&WalSndCtl->logicalWALSndCV); } } @@ -3330,31 +,17 @@ WalSndShmemInit(void) void WalSndWakeup(bool physical, bool logical) { - int i; - - for (i = 0; i < max_wal_senders; i++) - { - Latch *latch; - ReplicationKind kind; - WalSnd *walsnd = &WalSndCtl->walsnds[i]; - - /* - * Get latch pointer with spinlock held, for the unlikely ca
Re: psql tests hangs
pá 12. 5. 2023 v 10:31 odesílatel Pavel Stehule napsal: > > > pá 12. 5. 2023 v 9:46 odesílatel Alvaro Herrera > napsal: > >> On 2023-May-12, Pavel Stehule wrote: >> >> > It hangs in perl >> >> I wonder if "hanging" actually means that it interpreted the sleep time >> as a very large integer, so it's just sleeping for a long time. >> > > There is some interaction with pspg in stream mode > > The probable scenario > > It is starting pspg due to my setting PSQL_WATCH_PAGER. pspg is waiting on > quit command, or on pipe ending. Quit command cannot to come because it is > not on tty, so it is dead lock > > I can write to safeguard the fast ending on pspg when it is in stream > mode, and tty is not available. > > And generally, the root perl should to reset PSQL_WATCH_PAGER env variable > before executing psql. Probably it does with PSQL_PAGER, and maybe with > PAGER. > with last change in pspg, this tests fails as "expected" aster/src/bin/psql/../../../src/test/regress/pg_regress' /usr/bin/prove -I ../../../src/test/perl/ -I . t/*.pl # +++ tap check in src/bin/psql +++ t/001_basic.pl ... 59/? # Failed test '\watch with 3 iterations: no stderr' # at t/001_basic.pl line 356. # got: 'stream mode can be used only in interactive mode (tty is not available)' # expected: '' # Failed test '\watch with 3 iterations: matches' # at t/001_basic.pl line 356. # '' # doesn't match '(?^l:1\n1\n1)' # Looks like you failed 2 tests of 80. t/001_basic.pl ... Dubious, test returned 2 (wstat 512, 0x200) Failed 2/80 subtests t/010_tab_completion.pl .. ok t/020_cancel.pl .. ok Test Summary Report --- t/001_basic.pl (Wstat: 512 (exited 2) Tests: 80 Failed: 2) Failed tests: 69-70 Non-zero exit status: 2 Files=3, Tests=169, 7 wallclock secs ( 0.16 usr 0.03 sys + 3.31 cusr 1.31 csys = 4.81 CPU) Result: FAIL make: *** [Makefile:87: check] Chyba 1 Regards Pavel > > Regards > > Pavel > > >> >> About the server locale, note that the ->new() call explicitly requests >> the C locale -- it's only psql that is using the Czech locale. >> Supposedly, the Perl code should also be using the Czech locale, so the >> sprintf('%g') should be consistent with what psql \watch expects. >> However, you cannot ask the server to be consistent with that -- say, if >> you hypothetically tried to use "to_char(9D99)" and \gset that to use as >> \watch argument, it wouldn't work, because that'd use the server's C >> locale, not Czech. (I know because I tried.) >> >> -- >> Álvaro HerreraBreisgau, Deutschland — >> https://www.EnterpriseDB.com/ >> "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente" >> >
Re: psql tests hangs
Pavel Stehule writes: > And generally, the root perl should to reset PSQL_WATCH_PAGER env variable > before executing psql. Probably it does with PSQL_PAGER, and maybe with > PAGER. Oh! AFAICS, we don't do any of those things, but I agree it seems like a good idea. Can you confirm that if you unset PSQL_WATCH_PAGER then the test passes for you? regards, tom lane
Re: Large files for relations
On Thu, May 11, 2023 at 7:38 PM Thomas Munro wrote: > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: > > On Mon, May 1, 2023 at 9:29 PM Thomas Munro > wrote: > >> I am not aware of any modern/non-historic filesystem[2] that can't do > >> large files with ease. Anyone know of anything to worry about on that > >> front? > > > > There is some trouble in the ambiguity of what we mean by "modern" and > "large files". There are still a large number of users of ext4 where the > max file size is 16TB. Switching to a single large file per relation would > effectively cut the max table size in half for those users. How would a > user with say a 20TB table running on ext4 be impacted by this change? > > Hrmph. Yeah, that might be a bit of a problem. I see it discussed in > various places that MySQL/InnoDB can't have tables bigger than 16TB on > ext4 because of this, when it's in its default one-file-per-object > mode (as opposed to its big-tablespace-files-to-hold-all-the-objects > mode like DB2, Oracle etc, in which case I think you can have multiple > 16TB segment files and get past that ext4 limit). It's frustrating > because 16TB is still really, really big and you probably should be > using partitions, or more partitions, to avoid all kinds of other > scalability problems at that size. But however hypothetical the > scenario might be, it should work, > Agreed, it is frustrating, but it is not hypothetical. I have seen a number of users having single tables larger than 16TB and don't use partitioning because of the limitations we have today. The most common reason is needing multiple unique constraints on the table that don't include the partition key. Something like a user_id and email. There are workarounds for those cases, but usually it's easier to deal with a single large table than to deal with the sharp edges those workarounds introduce.
Re: Large files for relations
Greetings, * Dagfinn Ilmari Mannsåker (ilm...@ilmari.org) wrote: > Thomas Munro writes: > > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: > >> On Mon, May 1, 2023 at 9:29 PM Thomas Munro wrote: > >>> I am not aware of any modern/non-historic filesystem[2] that can't do > >>> large files with ease. Anyone know of anything to worry about on that > >>> front? > >> > >> There is some trouble in the ambiguity of what we mean by "modern" and > >> "large files". There are still a large number of users of ext4 where > >> the max file size is 16TB. Switching to a single large file per > >> relation would effectively cut the max table size in half for those > >> users. How would a user with say a 20TB table running on ext4 be > >> impacted by this change? > […] > > A less aggressive version of the plan would be that we just keep the > > segment code for the foreseeable future with no planned cut off, and > > we make all of those "piggy back" transformations that I showed in the > > patch set optional. For example, I had it so that CLUSTER would > > quietly convert your relation to large format, if it was still in > > segmented format (might as well if you're writing all the data out > > anyway, right?), but perhaps that could depend on a GUC. Likewise for > > base backup. Etc. Then someone concerned about hitting the 16TB > > limit on ext4 could opt out. Or something like that. It seems funny > > though, that's exactly the user who should want this feature (they > > have 16,000 relation segment files). > > If we're going to have to keep the segment code for the foreseeable > future anyway, could we not get most of the benefit by increasing the > segment size to something like 1TB? The vast majority of tables would > fit in one file, and there would be less risk of hitting filesystem > limits. While I tend to agree that 1GB is too small, 1TB seems like it's possibly going to end up on the too big side of things, or at least, if we aren't getting rid of the segment code then it's possibly throwing away the benefits we have from the smaller segments without really giving us all that much. Going from 1G to 10G would reduce the number of open file descriptors by quite a lot without having much of a net change on other things. 50G or 100G would reduce the FD handles further but starts to make us lose out a bit more on some of the nice parts of having multiple segments. Just some thoughts. Thanks, Stephen signature.asc Description: PGP signature
Re: createuser --memeber and PG 16
On 11.05.23 16:07, Robert Haas wrote: On Wed, May 10, 2023 at 1:33 PM Bruce Momjian wrote: This seems like it will be forever confusing to people. I frankly don't know why --role matching CREATE ROLE ... ROLE IN was not already confusing in pre-PG 16. Any new ideas for improvement? Yeah, it's a bad situation. I think --role is basically misnamed. Something like --add-to-group would have been clearer, but that also has the problem of being inconsistent with the SQL command. The whole ROLE vs. IN ROLE thing is inherently quite confusing, I think. It's very easy to get confused about which direction the membership arrows are pointing. It's hard to tell that for the --member option as well. For createuser foo --member bar it's not intuitive whether foo becomes a member of bar or bar becomes a member of foo. Maybe something more verbose like --member-of would help?
Re: psql tests hangs
pá 12. 5. 2023 v 15:26 odesílatel Tom Lane napsal: > Pavel Stehule writes: > > And generally, the root perl should to reset PSQL_WATCH_PAGER env > variable > > before executing psql. Probably it does with PSQL_PAGER, and maybe with > > PAGER. > > Oh! AFAICS, we don't do any of those things, but I agree it seems like > a good idea. Can you confirm that if you unset PSQL_WATCH_PAGER then > the test passes for you? > yes, I tested it now, and unset PSQL_WATCH_PAGER fixed this issue. Regards Pavel > regards, tom lane >
[PATCH] Slight improvement of worker_spi.c example
Hi, Currently the example uses the following order of calls: StartTransactionCommand(); SPI_connect(); PushActiveSnapshot(...); ... SPI_finish(); PopActiveSnapshot(); CommitTransactionCommand(); This could be somewhat misleading. Typically one expects something to be freed in a reverse order compared to initialization. This creates a false impression that PushActiveSnapshot(...) _should_ be called after SPI_connect(). The patch changes the order to: StartTransactionCommand(); PushActiveSnapshot(...); SPI_connect(); ... SPI_finish(); PopActiveSnapshot(); CommitTransactionCommand(); ... and also clarifies that the order of PushActiveSnapshot(...) and SPI_connect() is not important. -- Best regards, Aleksander Alekseev From 9f20a508c3b52d94455d67e3bdf7872787c63255 Mon Sep 17 00:00:00 2001 From: Aleksander Alekseev Date: Fri, 12 May 2023 17:15:04 +0300 Subject: [PATCH v1] Slight improvement of worker_spi.c example Previously the example used the following order of calls: StartTransactionCommand(); SPI_connect(); PushActiveSnapshot(...); ... SPI_finish(); PopActiveSnapshot(); CommitTransactionCommand(); This could be somewhat misleading. Typically one expects something to be freed in a reverse order comparing to initialization. This creates a false impression that PushActiveSnapshot(...) _should_ be called after SPI_connect(). The patch changes the order to: StartTransactionCommand(); PushActiveSnapshot(...); SPI_connect(); ... SPI_finish(); PopActiveSnapshot(); CommitTransactionCommand(); ... and also clarifies that the order of PushActiveSnapshot(...) and SPI_connect() is not important. Aleksander Alekseev, reviewed by TODO FIXME Discussion: TODO FIXME --- src/test/modules/worker_spi/worker_spi.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c index ad491d7722..8045bb3546 100644 --- a/src/test/modules/worker_spi/worker_spi.c +++ b/src/test/modules/worker_spi/worker_spi.c @@ -74,8 +74,8 @@ initialize_worker_spi(worktable *table) SetCurrentStatementStartTimestamp(); StartTransactionCommand(); - SPI_connect(); PushActiveSnapshot(GetTransactionSnapshot()); + SPI_connect(); pgstat_report_activity(STATE_RUNNING, "initializing worker_spi schema"); /* XXX could we use CREATE SCHEMA IF NOT EXISTS? */ @@ -222,17 +222,19 @@ worker_spi_main(Datum main_arg) * preceded by SetCurrentStatementStartTimestamp(), so that statement * start time is always up to date. * - * The SPI_connect() call lets us run queries through the SPI manager, - * and the PushActiveSnapshot() call creates an "active" snapshot + * The PushActiveSnapshot() call creates an "active" snapshot, * which is necessary for queries to have MVCC data to work on. + * The SPI_connect() call lets us run queries through the SPI manager. + * The order of PushActiveSnapshot() and SPI_connect() is not really + * important. * * The pgstat_report_activity() call makes our activity visible * through the pgstat views. */ SetCurrentStatementStartTimestamp(); StartTransactionCommand(); - SPI_connect(); PushActiveSnapshot(GetTransactionSnapshot()); + SPI_connect(); debug_query_string = buf.data; pgstat_report_activity(STATE_RUNNING, buf.data); -- 2.40.1
Re: psql tests hangs
Pavel Stehule writes: > pá 12. 5. 2023 v 15:26 odesílatel Tom Lane napsal: >> Oh! AFAICS, we don't do any of those things, but I agree it seems like >> a good idea. Can you confirm that if you unset PSQL_WATCH_PAGER then >> the test passes for you? > yes, I tested it now, and unset PSQL_WATCH_PAGER fixed this issue. OK. So after looking at this a bit, the reason PAGER and PSQL_PAGER don't cause us any problems in the test environment is that they are not honored unless isatty(fileno(stdin)) && isatty(fileno(stdout)). It seems to me that it's a bug that there is no such check before using PSQL_WATCH_PAGER. Is there actually any defensible reason for that? I think we do need to clear out all three variables in Cluster::interactive_psql. But our regular psql tests shouldn't be at risk here. regards, tom lane
Re: psql tests hangs
pá 12. 5. 2023 v 17:50 odesílatel Tom Lane napsal: > Pavel Stehule writes: > > pá 12. 5. 2023 v 15:26 odesílatel Tom Lane napsal: > >> Oh! AFAICS, we don't do any of those things, but I agree it seems like > >> a good idea. Can you confirm that if you unset PSQL_WATCH_PAGER then > >> the test passes for you? > > > yes, I tested it now, and unset PSQL_WATCH_PAGER fixed this issue. > > OK. So after looking at this a bit, the reason PAGER and PSQL_PAGER > don't cause us any problems in the test environment is that they are > not honored unless isatty(fileno(stdin)) && isatty(fileno(stdout)). > It seems to me that it's a bug that there is no such check before > using PSQL_WATCH_PAGER. Is there actually any defensible reason > for that? > Theoretically, we can write tests for these features, and then stdout, stdin may not be tty. Except for testing, using pager in non-interactive mode makes no sense. Regards Pavel > I think we do need to clear out all three variables in > Cluster::interactive_psql. But our regular psql tests shouldn't > be at risk here. > > regards, tom lane >
Re: Large files for relations
Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read. The workaround for writes was one of: 1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010) 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write I have a vague memory that filesystems have improved in this regard. On Thu, May 11, 2023 at 4:38 PM Thomas Munro wrote: > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski wrote: > > On Mon, May 1, 2023 at 9:29 PM Thomas Munro > wrote: > >> I am not aware of any modern/non-historic filesystem[2] that can't do > >> large files with ease. Anyone know of anything to worry about on that > >> front? > > > > There is some trouble in the ambiguity of what we mean by "modern" and > "large files". There are still a large number of users of ext4 where the > max file size is 16TB. Switching to a single large file per relation would > effectively cut the max table size in half for those users. How would a > user with say a 20TB table running on ext4 be impacted by this change? > > Hrmph. Yeah, that might be a bit of a problem. I see it discussed in > various places that MySQL/InnoDB can't have tables bigger than 16TB on > ext4 because of this, when it's in its default one-file-per-object > mode (as opposed to its big-tablespace-files-to-hold-all-the-objects > mode like DB2, Oracle etc, in which case I think you can have multiple > 16TB segment files and get past that ext4 limit). It's frustrating > because 16TB is still really, really big and you probably should be > using partitions, or more partitions, to avoid all kinds of other > scalability problems at that size. But however hypothetical the > scenario might be, it should work, and this is certainly a plausible > argument against the "aggressive" plan described above with the hard > cut-off where we get to drop the segmented mode. > > Concretely, a 20TB pg_upgrade in copy mode would fail while trying to > concatenate with the above patches, so you'd have to use link or > reflink mode (you'd probably want to use that anyway unless due to > sheer volume of data to copy otherwise, since ext4 is also not capable > of block-range sharing), but then you'd be out of luck after N future > major releases, according to that plan where we start deleting the > code, so you'd need to organise some smaller partitions before that > time comes. Or pg_upgrade to a target on xfs etc. I wonder if a > future version of extN will increase its max file size. > > A less aggressive version of the plan would be that we just keep the > segment code for the foreseeable future with no planned cut off, and > we make all of those "piggy back" transformations that I showed in the > patch set optional. For example, I had it so that CLUSTER would > quietly convert your relation to large format, if it was still in > segmented format (might as well if you're writing all the data out > anyway, right?), but perhaps that could depend on a GUC. Likewise for > base backup. Etc. Then someone concerned about hitting the 16TB > limit on ext4 could opt out. Or something like that. 
It seems funny > though, that's exactly the user who should want this feature (they > have 16,000 relation segment files). > > > -- Mark Callaghan mdcal...@gmail.com
Re: psql tests hangs
Pavel Stehule writes: > pá 12. 5. 2023 v 17:50 odesílatel Tom Lane napsal: >> OK. So after looking at this a bit, the reason PAGER and PSQL_PAGER >> don't cause us any problems in the test environment is that they are >> not honored unless isatty(fileno(stdin)) && isatty(fileno(stdout)). >> It seems to me that it's a bug that there is no such check before >> using PSQL_WATCH_PAGER. Is there actually any defensible reason >> for that? > Theoretically, we can write tests for these features, and then stdout, > stdin may not be tty. Well, you'd test using pty's, so that psql thinks it's talking to a terminal. That's what we're doing now to test tab completion, for example. > Except for testing, using pager in non-interactive mode makes no sense. Agreed. Let's solve this by inserting isatty tests in psql, rather than hacking the test environment. regards, tom lane
Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing
Thanks for the continued work, Peter. I hate to be the guy that starts this way, but this is my first ever response on pgsql-hackers. (insert awkward smile face). Hopefully I've followed etiquette well, but please forgive any missteps, and I'm happy for any help in making better contributions in the future. On Thu, May 11, 2023 at 9:19 PM Peter Geoghegan wrote: > > On Thu, May 4, 2023 at 3:18 PM samay sharma wrote: > > What do you think about the term "Exhaustion"? Maybe something like "XID > > allocation exhaustion" or "Exhaustion of allocatable XIDs"? > > I use the term "transaction ID exhaustion" in the attached revision, > v4. Overall, v4 builds on the work that went into v2 and v3, by > continuing to polish the overhaul of everything related to freezing, > relfrozenxid advancement, and anti-wraparound autovacuum. Just to say on the outset, as has been said earlier in the tread by others, that this is herculean work. Thank you for putting the effort you have thus far. There's a lot of good from where I sit in the modification efforts. It's a heavy, dense topic, so there's probably never going to be a perfect way to get it all in, but some of the context early on, especially, is helpful for framing. > > It would be nice if it was possible to add an animation/diagram a > little like this one: https://tuple-freezing-demo.angusd.com (this is > how I tend to think about the "transaction ID space".) Indeed. With volunteer docs, illustrations/diagrams are hard for sure. But, this or something akin to the "clock" image I've seen elsewhere when describing the transaction ID space would probably be helpful if it were ever possible. In fact, there's just a lot about the MVCC stuff in general that would benefit from diagrams. But alas, I guess that's why we have some good go-to community talks/slide decks. :-) > v4 also limits use of the term "wraparound" to places that directly > discuss anti-wraparound autovacuums (plus one place in xact.sgml, > where discussion of "true unsigned integer wraparound" and related > implementation details has been moved). Otherwise we use the term > "transaction ID exhaustion", which is pretty much the user-facing name > for "xidStopLimit". I feel that this is a huge improvement, for the > reason given to Greg earlier. I'm flexible on the details, but I feel > strongly that we should minimize use of the term wraparound wherever > it might have the connotation of "the past becoming the future". This > is not a case of inventing a new terminology for its own sake. If > anybody is skeptical I ask that they take a look at what I came up > with before declaring it a bad idea. I have made that as easy as > possible, by once again attaching a prebuilt routine-vacuuming.html. Thanks again for doing this. Really helpful for doc newbies like me that want to help but are still working through the process. Really helpful and appreciated. > > > Other changes in v4, compared to v3: > > * Improved discussion of the differences between non-aggressive and > aggressive VACUUM. This was helpful for me and not something I've previously put much thought into. Helpful context that is missing from the current docs. > * Explains "catch-up freezing" performed by aggressive VACUUMs directly. > > "Catch-up" freezing is the really important "consequence" -- something > that emerges from how each type of VACUUM behaves over time. It is an > indirect consequence of the behaviors. 
I would like to counter the > perception that some users have about freezing only happening during > aggressive VACUUMs (or anti-wraparound autovacuums). But more than > that, talking about catch-up freezing seems essential because it is > the single most important difference. > Similarly, this was helpful overall context of various things happening with freezing. > * Much improved handling of the discussion of anti-wraparound > autovacuum, and how it relates to aggressive VACUUMs, following > feedback from Samay. > > There is now only fairly minimal overlap in the discussion of > aggressive VACUUM and anti-wraparound autovacuuming. We finish the > discussion of aggressive VACUUM just after we start discussing > anti-wraparound autovacuum. This transition works well, because it > enforces the idea that anti-wraparound autovacuum isn't really special > compared to any other aggressive autovacuum. This was something that > Samay expressed particularly concern about: making anti-wraparound > autovacuums sound less scary. Though it's also a concern I had from > the outset, based on practical experience and interactions with people > that have much less knowledge of Postgres than I do. Agree. This flows fairly well and helps the user understand that each "next step" in the vacuum/freezing process has a distinct job based on previous work. > > * Anti-wraparound autovacuum is now mostly discussed as something that > happens to static or mostly-static tables > ...This moves discussion of anti-wraparound av in the direction
Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing
And, of course, I forgot that I switch to text-mode after writing most of this, so the carriage returns were unnecessary. (facepalm... sigh) -- Ryan On Fri, May 12, 2023 at 1:36 PM Ryan Booz wrote: > > Thanks for the continued work, Peter. I hate to be the guy that starts this > way, > but this is my first ever response on pgsql-hackers. (insert awkward > smile face). > Hopefully I've followed etiquette well, but please forgive any > missteps, and I'm > happy for any help in making better contributions in the future. > > On Thu, May 11, 2023 at 9:19 PM Peter Geoghegan wrote: > > > > On Thu, May 4, 2023 at 3:18 PM samay sharma wrote: > > > What do you think about the term "Exhaustion"? Maybe something like "XID > > > allocation exhaustion" or "Exhaustion of allocatable XIDs"? > > > > I use the term "transaction ID exhaustion" in the attached revision, > > v4. Overall, v4 builds on the work that went into v2 and v3, by > > continuing to polish the overhaul of everything related to freezing, > > relfrozenxid advancement, and anti-wraparound autovacuum. > > Just to say on the outset, as has been said earlier in the tread by others, > that this is herculean work. Thank you for putting the effort you have thus > far. > There's a lot of good from where I sit in the modification efforts. > It's a heavy, > dense topic, so there's probably never going to be a perfect way to > get it all in, > but some of the context early on, especially, is helpful for framing. > > > > > It would be nice if it was possible to add an animation/diagram a > > little like this one: https://tuple-freezing-demo.angusd.com (this is > > how I tend to think about the "transaction ID space".) > > Indeed. With volunteer docs, illustrations/diagrams are hard for sure. But, > this or something akin to the "clock" image I've seen elsewhere when > describing the transaction ID space would probably be helpful if it were ever > possible. In fact, there's just a lot about the MVCC stuff in general that > would benefit from diagrams. But alas, I guess that's why we have some > good go-to community talks/slide decks. :-) > > > v4 also limits use of the term "wraparound" to places that directly > > discuss anti-wraparound autovacuums (plus one place in xact.sgml, > > where discussion of "true unsigned integer wraparound" and related > > implementation details has been moved). Otherwise we use the term > > "transaction ID exhaustion", which is pretty much the user-facing name > > for "xidStopLimit". I feel that this is a huge improvement, for the > > reason given to Greg earlier. I'm flexible on the details, but I feel > > strongly that we should minimize use of the term wraparound wherever > > it might have the connotation of "the past becoming the future". This > > is not a case of inventing a new terminology for its own sake. If > > anybody is skeptical I ask that they take a look at what I came up > > with before declaring it a bad idea. I have made that as easy as > > possible, by once again attaching a prebuilt routine-vacuuming.html. > > Thanks again for doing this. Really helpful for doc newbies like me that > want to help but are still working through the process. Really helpful > and appreciated. > > > > > > > Other changes in v4, compared to v3: > > > > * Improved discussion of the differences between non-aggressive and > > aggressive VACUUM. > > This was helpful for me and not something I've previously put much thought > into. Helpful context that is missing from the current docs. 
> > > * Explains "catch-up freezing" performed by aggressive VACUUMs directly. > > > > "Catch-up" freezing is the really important "consequence" -- something > > that emerges from how each type of VACUUM behaves over time. It is an > > indirect consequence of the behaviors. I would like to counter the > > perception that some users have about freezing only happening during > > aggressive VACUUMs (or anti-wraparound autovacuums). But more than > > that, talking about catch-up freezing seems essential because it is > > the single most important difference. > > > > Similarly, this was helpful overall context of various things > happening with freezing. > > > * Much improved handling of the discussion of anti-wraparound > > autovacuum, and how it relates to aggressive VACUUMs, following > > feedback from Samay. > > > > There is now only fairly minimal overlap in the discussion of > > aggressive VACUUM and anti-wraparound autovacuuming. We finish the > > discussion of aggressive VACUUM just after we start discussing > > anti-wraparound autovacuum. This transition works well, because it > > enforces the idea that anti-wraparound autovacuum isn't really special > > compared to any other aggressive autovacuum. This was something that > > Samay expressed particularly concern about: making anti-wraparound > > autovacuums sound less scary. Though it's also a concern I had from > > the outset, based on practical experience and interactions with people > > that have much
Re: smgrzeroextend clarification
On Thu, 11 May 2023 at 05:37, Peter Eisentraut wrote: > > Maybe it was never meant that way and only works accidentally? Maybe > hash indexes are broken? It's explicitly documented to be this way. And I think it has to work this way for recovery to work. I think the reason you and Bharath and Andres are talking past each other is that they're thinking about how the implementation works and you're talking about the API definition. If you read the API definition and treat the functions as a black box I think you're right -- those two definitions sound pretty much equivalent to me. They both extend the file, possibly multiple blocks, and zero fill. The only difference is that smgrextend() additionally allows you to provide data. -- greg
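For reference, the two declarations (roughly as they appear in src/include/storage/smgr.h for that release; check the header for the authoritative signatures) make the comparison concrete:

extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
					   BlockNumber blocknum, const void *buffer, bool skipFsync);
extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
						   BlockNumber blocknum, int nblocks, bool skipFsync);

smgrextend() writes one caller-supplied block at the given block number, while smgrzeroextend() appends nblocks of zero-filled blocks; apart from that, as described above, both extend the file and the untouched blocks read back as zeroes.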
Re: smgrzeroextend clarification
On Sat, May 13, 2023 at 6:07 AM Greg Stark wrote: > On Thu, 11 May 2023 at 05:37, Peter Eisentraut > wrote: > > Maybe it was never meant that way and only works accidentally? Maybe > > hash indexes are broken? > > It's explicitly documented to be this way. And I think it has to work > this way for recovery to work. > > I think the reason you and Bharath and Andres are talking past each > other is that they're thinking about how the implementation works and > you're talking about the API definition. > > If you read the API definition and treat the functions as a black box > I think you're right -- those two definitions sound pretty much > equivalent to me. They both extend the file, possibly multiple blocks, > and zero fill. The only difference is that smgrextend() additionally > allows you to provide data. Just a thought: should RelationCopyStorageUsingBuffer(), the new code used by CREATE DATABASE with the default strategy WAL_LOG, use the newer interface so that it creates fully allocated files instead of sparse ones?
Re: Should CSV parsing be stricter about mid-field quotes?
On Thu, 11 May 2023 at 10:04, Joel Jacobson wrote: > > The parser currently accepts quoting within an unquoted field. This can lead > to > data misinterpretation when the quote is part of the field data (e.g., > for inches, like in the example). I think you're thinking about it differently than the parser. I think the parser is treating this the way, say, the shell treats quotes. That is, it sees a quoted "I bought this for my 6" followed by an unquoted " laptop but it didn't fit my 8" followed by a quoted " tablet". So for example, in that world you might only quote commas and newlines so you might print something like 1,2,I bought this for my "6"" laptop " but it "didn't" fit my "8""" laptop The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only allows fully quoted or fully unquoted fields and there can only be escaped double-doublequote characters in quoted fields and no doublequote characters in unquoted fields. But it also says Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files. An attempt at a common definition can be found in Section 2. So the real question is: are there tools out there that generate entries like this, and what are their intentions? > I think we should throw a parsing error for unescaped mid-field quotes, > and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field > quotes are necessary. The error message could suggest this option when it > encounters an unescaped mid-field quote. > > I think the convenience of not having to use an extra option doesn't outweigh > the risk of undetected data integrity issues. It's also a pretty annoying experience to get a message saying "error, turn this option on to not get an error". I get what you're saying too; whether it is more of a risk depends on whether turning off the error is really the right thing most of the time, or is just causing data to be read incorrectly. -- greg
Re: psql tests hangs
I wrote: > Pavel Stehule writes: >> Except for testing, using pager in non-interactive mode makes no sense. > Agreed. Let's solve this by inserting isatty tests in psql, rather > than hacking the test environment. Here's a proposed patch for this. I noticed that another memo the PSQL_WATCH_PAGER patch had not gotten was the lesson learned in commit 18f8f784c, namely that it's a good idea to ignore empty or all-blank settings. regards, tom lane diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 97f7d97220..607a57715a 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -5197,14 +5197,20 @@ do_watch(PQExpBuffer query_buf, double sleep, int iter) /* * For \watch, we ignore the size of the result and always use the pager - * if PSQL_WATCH_PAGER is set. We also ignore the regular PSQL_PAGER or - * PAGER environment variables, because traditional pagers probably won't - * be very useful for showing a stream of results. + * as long as we're talking to a terminal and "\pset pager" is enabled. + * However, we'll only use the pager identified by PSQL_WATCH_PAGER. We + * ignore the regular PSQL_PAGER or PAGER environment variables, because + * traditional pagers probably won't be very useful for showing a stream + * of results. */ #ifndef WIN32 pagerprog = getenv("PSQL_WATCH_PAGER"); + /* if variable is empty or all-white-space, don't use pager */ + if (pagerprog && strspn(pagerprog, " \t\r\n") == strlen(pagerprog)) + pagerprog = NULL; #endif - if (pagerprog && myopt.topt.pager) + if (pagerprog && myopt.topt.pager && + isatty(fileno(stdin)) && isatty(fileno(stdout))) { fflush(NULL); disable_sigpipe_trap();
Re: psql tests hangs
On Fri, May 12, 2023 at 21:08, Tom Lane wrote: > I wrote: > > Pavel Stehule writes: > >> Except for testing, using pager in non-interactive mode makes no sense. > > > Agreed. Let's solve this by inserting isatty tests in psql, rather > > than hacking the test environment. > > Here's a proposed patch for this. I noticed that another memo the > PSQL_WATCH_PAGER patch had not gotten was the lesson learned in > commit 18f8f784c, namely that it's a good idea to ignore empty > or all-blank settings. > +1 Pavel > regards, tom lane
Re: Should CSV parsing be stricter about mid-field quotes?
On 2023-05-11 Th 10:03, Joel Jacobson wrote:

Hi hackers, I've come across an unexpected behavior in our CSV parser that I'd like to bring up for discussion.

% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"
% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;

This gives:

 id | rating |                            review
----+--------+--------------------------------------------------------------
  1 |      5 | Great product, will buy again.
  2 |      3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)

Maybe this is unexpected by you, but it's not by me. What other sane interpretation of that data could there be? And what CSV producer outputs such horrible content? As you've noted, ours certainly does not. Our rules are clear: quotes within quotes must be escaped (default escape is by doubling the quote char). Allowing partial fields to be quoted was a deliberate decision when CSV parsing was implemented, because examples have been seen in the wild. So I don't think our behaviour is broken or needs fixing. As mentioned by Greg, this is an example of the adage about being liberal in what you accept. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
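To make the escaping rule concrete, here is a minimal sketch (the file name and expected output are illustrative assumptions, not from the thread, and assume a fresh reviews table). A producer that wants literal inch marks inside a field doubles the quote characters within a fully quoted field, and COPY ... CSV then reads them back intact:

% cat example_escaped.csv
id,rating,review
2,3,"I bought this for my 6"" laptop but it didn't fit my 8"" tablet"

\COPY reviews FROM example_escaped.csv WITH CSV HEADER;
SELECT review FROM reviews WHERE id = 2;
                            review
----------------------------------------------------------------
 I bought this for my 6" laptop but it didn't fit my 8" tablet
(1 row)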
improve more permissions-related error messages
This is intended as a follow-up to de4d456 [0]. I noticed that c3afe8c introduced another "must have privileges" error message that I think should be adjusted to use the new style introduced in de4d456. І've attached a small patch for this. While looking around for other such error messages, I found a few dozen "must be superuser" errors that might be improved with the new style. If folks feel this is worthwhile, I'll put together a patch. [0] https://postgr.es/m/20230126002251.GA1506128%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 8b3de032ee..e8b288d01c 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -611,7 +611,9 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt, if (!has_privs_of_role(owner, ROLE_PG_CREATE_SUBSCRIPTION)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have privileges of pg_create_subscription to create subscriptions"))); + errmsg("permission denied to create subscription"), + errdetail("Only roles with privileges of the \"%s\" role may create subscriptions.", + "pg_create_subscription"))); /* * Since a subscription is a database object, we also check for CREATE diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out index d736246259..8b5f657897 100644 --- a/src/test/regress/expected/subscription.out +++ b/src/test/regress/expected/subscription.out @@ -79,7 +79,8 @@ ERROR: subscription "regress_testsub" already exists -- fail - must be superuser SET SESSION AUTHORIZATION 'regress_subscription_user2'; CREATE SUBSCRIPTION regress_testsub2 CONNECTION 'dbname=regress_doesnotexist' PUBLICATION foo WITH (connect = false); -ERROR: must have privileges of pg_create_subscription to create subscriptions +ERROR: permission denied to create subscription +DETAIL: Only roles with privileges of the "pg_create_subscription" role may create subscriptions. SET SESSION AUTHORIZATION 'regress_subscription_user'; -- fail - invalid option combinations CREATE SUBSCRIPTION regress_testsub2 CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, copy_data = true);
Re: improve more permissions-related error messages
Nathan Bossart writes: > This is intended as a follow-up to de4d456 [0]. I noticed that c3afe8c > introduced another "must have privileges" error message that I think should > be adjusted to use the new style introduced in de4d456. І've attached a > small patch for this. +1 > While looking around for other such error messages, I found a few dozen > "must be superuser" errors that might be improved with the new style. If > folks feel this is worthwhile, I'll put together a patch. The new style is better for cases where we've broken out a predefined role that has the necessary privilege. I'm not sure it's worth troubling with cases that are still just "must be superuser". It seems like you'd mostly just be creating work for the translation team. regards, tom lane
Re: improve more permissions-related error messages
On Fri, May 12, 2023 at 04:43:08PM -0400, Tom Lane wrote: > Nathan Bossart writes: >> While looking around for other such error messages, I found a few dozen >> "must be superuser" errors that might be improved with the new style. If >> folks feel this is worthwhile, I'll put together a patch. > > The new style is better for cases where we've broken out a predefined role > that has the necessary privilege. I'm not sure it's worth troubling > with cases that are still just "must be superuser". It seems like > you'd mostly just be creating work for the translation team. Makes sense, thanks. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Memory leak from ExecutorState context?
Thanks for continuing to work on this. Are you planning to modify what is displayed for memory usage in EXPLAIN ANALYZE? Also, since that won't help a user who OOMs, I wondered if the spillCxt is helpful on its own or if we need some kind of logging message for users to discover that this is what led them to running out of memory. On Wed, May 10, 2023 at 02:24:19PM +0200, Jehan-Guillaume de Rorthais wrote: > On Mon, 8 May 2023 11:56:48 -0400 Melanie Plageman > wrote: > > > > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable) > > > hashtable, nbatch, hashtable->spaceUsed); > > > #endif > > > > > > - oldcxt = MemoryContextSwitchTo(hashtable->hashCxt); > > > - > > > if (hashtable->innerBatchFile == NULL) > > > { > > > + MemoryContext oldcxt = > > > MemoryContextSwitchTo(hashtable->fileCxt); + > > > /* we had no file arrays before */ > > > hashtable->innerBatchFile = palloc0_array(BufFile *, > > > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch); > > > + > > > + MemoryContextSwitchTo(oldcxt); > > > + > > > /* time to establish the temp tablespaces, too */ > > > PrepareTempTablespaces(); > > > } > > > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable) > > > > I don't see a reason to call repalloc0_array() in a different context > > than the initial palloc0_array(). > > Unless I'm wrong, wrapping the whole if/else blocks in memory context > hashtable->fileCxt seemed useless as repalloc() actually realloc in the > context > the original allocation occurred. So I only wrapped the palloc() calls. My objection is less about correctness and more about the diff. The existing memory context switch is around the whole if/else block. If you want to move it to only wrap the if statement, I would do that in a separate commit with a message describing the rationale. It doesn't seem to save us much and it makes the diff a bit more confusing. I don't feel strongly enough about this to protest much more, though. > > > diff --git a/src/include/executor/hashjoin.h > > > b/src/include/executor/hashjoin.h index 8ee59d2c71..74867c3e40 100644 > > > --- a/src/include/executor/hashjoin.h > > > +++ b/src/include/executor/hashjoin.h > > > @@ -25,10 +25,14 @@ > > > * > > > * Each active hashjoin has a HashJoinTable control block, which is > > > * palloc'd in the executor's per-query context. All other storage > > > needed > > > - * for the hashjoin is kept in private memory contexts, two for each > > > hashjoin. > > > + * for the hashjoin is kept in private memory contexts, three for each > > > + * hashjoin: > > > > Maybe "hash table control block". I know the phrase "control block" is > > used elsewhere in the comments, but it is a bit confusing. Also, I wish > > there was a way to make it clear this is for the hashtable but relevant > > to all batches. > > I tried to reword the comment with this additional info in mind in v6. Does it > match what you had in mind? Review of that below. > > So, if we are going to allocate the array of pointers to the spill files > > in the fileCxt, we should revise this comment. As I mentioned above, I > > agree with you that if the SharedTupleStore-related buffers are also > > allocated in this context, perhaps it shouldn't be called the fileCxt. > > > > One idea I had is calling it the spillCxt. Almost all memory allocated in > > this > > context is related to needing to spill to permanent storage during > > execution. > > Agree > > > The one potential confusing part of this is batch 0 for parallel hash > > join. 
I would have to dig back into it again, but from a cursory look at > > ExecParallelHashJoinSetUpBatches() it seems like batch 0 also gets a > > shared tuplestore with associated accessors and files even if it is a > > single batch parallel hashjoin. > > > > Are the parallel hash join read_buffer and write_chunk also used for a > > single batch parallel hash join? > > I don't think so. > > For the inner side, there's various Assert() around the batchno==0 special > case. Plus, it always has his own block when inserting in a batch, to directly > write in shared memory calling ExecParallelHashPushTuple(). > > The outer side of the join actually creates all batches using shared tuple > storage mechanism, including batch 0, **only** if the number of batch is > greater than 1. See in ExecParallelHashJoinOuterGetTuple: > > /* >* In the Parallel Hash case we only run the outer plan directly for >* single-batch hash joins. Otherwise we have to go to batch files, even >* for batch 0. >*/ > if (curbatch == 0 && hashtable->nbatch == 1) > { > slot = ExecProcNode(outerNode); > > So, for a single batch PHJ, it seems there's no temp files involved. spill context seems appropriate, then. > > > Though, perhaps spillCxt still makes sense? It doesn't specify > > multi-batch... > > I'm not sure the see
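For anyone skimming the diffs above, here is a minimal sketch of the allocation pattern under discussion (names follow the patch, but this is an illustration rather than the committed code). Only the initial allocation needs the context switch, because repalloc() keeps a chunk in the memory context it was originally allocated in:

    if (hashtable->innerBatchFile == NULL)
    {
        /* first time through: allocate the batch-file arrays in the spill context */
        MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->spillCxt);

        hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
        hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);

        MemoryContextSwitchTo(oldcxt);
    }
    else
    {
        /* growing the arrays stays in spillCxt automatically, no switch needed */
        hashtable->innerBatchFile = repalloc0_array(hashtable->innerBatchFile,
                                                    BufFile *, oldnbatch, nbatch);
        hashtable->outerBatchFile = repalloc0_array(hashtable->outerBatchFile,
                                                    BufFile *, oldnbatch, nbatch);
    }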
Re: smgrzeroextend clarification
Hi, On May 11, 2023 2:37:00 AM PDT, Peter Eisentraut wrote: >On 10.05.23 20:10, Andres Freund wrote: >>> Moreover, the text "except the relation can be extended by multiple blocks >>> at once and the added blocks will be filled with zeroes" doesn't make much >>> sense as a differentiation, because smgrextend() does that as well. >> >> Hm? smgrextend() writes a single block, and it's filled with the caller >> provided buffer. > >But there is nothing that says that the block written by smgrextend() has to >be the one right after the last existing block. You can give it any block >number, and it will write there, and the blocks in between that are skipped >over will effectively be filled with zeros. This is because of the way the >POSIX file system APIs work. Sure, but that's pretty much independent of my changes. With the exception of, I believe, hash indexes we are quite careful to never leave holes in files. And not just for performance reasons - it'd make it much more likely to encounter ENOSPC while writing back blocks. Being unable to checkpoint (because the writes fail with ENOSPC) is quite nasty. >Maybe it was never meant that way and only works accidentally? Maybe hash >indexes are broken? It's known behavior I think - but also quite bad. I think it got a good bit worse after WAL support for hash indexes went in. I think during replay we sometimes end up actually allocating the blocks one by one. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: WAL Insertion Lock Improvements
On Fri, May 12, 2023 at 07:35:20AM +0530, Bharath Rupireddy wrote: > --enable-atomics=no, -T60: > --enable-spinlocks=no, -T60: > --enable-atomics=no --enable-spinlocks=no, -T60: Thanks for these extra tests. I have not done these specific cases, but the profiles look similar to what I've seen myself. If I recall correctly, the fallback implementation of atomics just uses spinlocks internally to force the barriers required. -- Michael
Re: Large files for relations
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN wrote: > Repeating what was mentioned on Twitter, because I had some experience with > the topic. With fewer files per table there will be more contention on the > per-inode mutex (which might now be the per-inode rwsem). I haven't read > filesystem source in a long time. Back in the day, and perhaps today, it was > locked for the duration of a write to storage (locked within the kernel) and > was briefly locked while setting up a read. > > The workaround for writes was one of: > 1) enable disk write cache or use battery-backed HW RAID to make writes > faster (yes disks, I encountered this prior to 2010) > 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't > locked for the duration of a write > > I have a vague memory that filesystems have improved in this regard. (I am interpreting your "use XFS" to mean "use XFS instead of ext4".) Right, 80s file systems like UFS (and I suspect ext and ext2, which were probably based on similar ideas and ran on non-SMP machines?) used coarse-grained locking, including at the vnode/inode level. Then over time various OSes and file systems have improved concurrency. Brief digression, as someone who got started on IRIX in the '90s and still thinks those were probably the coolest computers: At SGI, first they replaced SysV UFS with EFS (E for extent-based allocation) and invented O_DIRECT to skip the buffer pool, and then blew the doors off everything with XFS, which maximised I/O concurrency and possibly (I guess, it's not open source so who knows?) involved a revamped VFS to lower stuff like inode locks, motivated by monster IRIX boxes with up to 1024 CPUs and huge storage arrays. In the Linux ext3 era, I remember hearing lots of reports of various kinds of large systems going faster just by switching to XFS and there is lots of writing about that. ext4 certainly changed enormously. One reason back in those days (mid 2000s?) was the old fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file thing, and another was the lack of write concurrency especially for direct I/O, and probably lots more things. But that's all ancient history... As for ext4, we've detected and debugged clues about the gradual weakening of locking over time on this list: we know that concurrent read/write to the same page of a file was previously atomic, but when we switched to pread/pwrite for most data (ie not making use of the current file position), it ceased to be (a concurrent reader can see a mash-up of old and new data with visible cache line-ish stripes in it, so there isn't even a write-lock for the page); then we noticed that in later kernels even read/write ceased to be atomic (implicating a change in file size/file position interlocking, I guess). I also vaguely recall reading on here a long time ago that lseek() performance was dramatically improved with weaker inode interlocking, perhaps even in response to this very program's pathological SEEK_END call frequency (something I hope to fix, but I digress). So I think it's possible that the effect you mentioned is gone? I can think of a few differences compared to those other RDBMSs. There the discussion was about one-file-per-relation vs one-big-file-for-everything, whereas we're talking about one-file-per-relation vs many-files-per-relation (which doesn't change the point much, just making clear that I'm not proposing a 42PB file to hold everything, so you can still partition to get different files).
We also usually call fsync in series in our checkpointer (after first getting the writebacks started with sync_file_range() some time sooner). Currently our code believes that it is not safe to call fdatasync() for files whose size might have changed. There is no basis for that in POSIX or in any system that I currently know of (though I haven't looked into it seriously), but I believe there was a historical file system that at some point in history interpreted "non-essential meta data" (the stuff POSIX allows it not to flush to disk) to include "the size of the file" (whereas POSIX really just meant that you don't have to synchronise the mtime and similar), which is probably why PostgreSQL has some code that calls fsync() on newly created empty WAL segments to "make sure the indirect blocks are down on disk" before allowing itself to use only fdatasync() later to overwrite it with data. The point being that, for the most important kind of interactive/user facing I/O latency, namely WAL flushes, we already use fdatasync(). It's possible that we could use it to flush relation data too (ie the relation files in question here, usually synchronised by the checkpointer) according to POSIX but it doesn't immediately seem like something that should be at all hot and it's background work. But perhaps I lack imagination. Thanks, thought-provoking stuff.
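To make that fsync()-then-fdatasync() reasoning concrete, here is a rough POSIX-level sketch (purely illustrative; it is not PostgreSQL's actual WAL or checkpointer code, and error handling is minimal). The idea is to pay for a full metadata flush once, when the file is created and fully allocated, and then rely on fdatasync() for the data overwrites that follow:

    #include <fcntl.h>
    #include <unistd.h>

    /* Create a fully allocated segment and flush data + metadata once. */
    static int
    create_segment(const char *path, off_t size)
    {
        int fd = open(path, O_CREAT | O_WRONLY, 0600);

        if (fd < 0)
            return -1;
        if (posix_fallocate(fd, 0, size) != 0 || fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        return fd;
    }

    /* Later overwrites need only the file data flushed, not the metadata. */
    static int
    write_block(int fd, const void *buf, size_t len, off_t offset)
    {
        if (pwrite(fd, buf, len, offset) != (ssize_t) len)
            return -1;
        return fdatasync(fd);
    }

Whether that is actually safe for relation files whose size changes over time is exactly the open question described above.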
Re: smgrzeroextend clarification
Hi, On May 12, 2023 11:36:23 AM PDT, Thomas Munro wrote: >Just a thought: should RelationCopyStorageUsingBuffer(), the new code >used by CREATE DATABASE with the default strategy WAL_LOG, use the >newer interface so that it creates fully allocated files instead of >sparse ones? I played with that, but at least on Linux with ext4 and xfs it was hard to find cases where it really was beneficial. That's actually how I ended up finding the issues I'd fixed recently-ish. I think it might be different if we had an option to not use a strategy for the target database - right now we IIRC write back due to ring replacement. Entirely or largely in order, which I think removes most of the issues you could have. One issue is that it'd be worse on platforms / filesystems without fallocate support, because we would write the data back twice (once with zeros, once the real data). Perhaps we should add a separate parameter controlling the fallback behaviour. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCHES] Post-special page storage TDE support
Greetings, * David Christensen (david.christen...@crunchydata.com) wrote: > Refreshing this with HEAD as of today, v4. Thanks for updating this! > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure > > This space is reserved for extended data on the Page structure which will be > ultimately used for > encrypted data, extended checksums, and potentially other things. This data > appears at the end of > the Page, after any `pd_special` area, and will be calculated at runtime > based on specific > ControlFile features. > > No effort is made to ensure this is backwards-compatible with existing > clusters for `pg_upgrade`, as > we will require logical replication to move data into a cluster with > different settings here. This initial patch, at least, does maintain pg_upgrade as the reserved_page_size (maybe not a great name?) is set to 0, right? Basically this is just introducing the concept of a reserved_page_size and adjusting all of the code that currently uses BLCKSZ or PageGetPageSize() to account for this extra space. Looking at the changes to bufpage.h, in particular ... > diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h > @@ -19,6 +19,14 @@ > #include "storage/item.h" > #include "storage/off.h" > > +extern PGDLLIMPORT int reserved_page_size; > + > +#define SizeOfPageReservedSpace() reserved_page_size > +#define MaxSizeOfPageReservedSpace 0 > + > +/* strict upper bound on the amount of space occupied we have reserved on > + * pages in this cluster */ This will eventually be calculated based on what features are supported concurrently? > @@ -36,10 +44,10 @@ > * | v pd_upper > | > * +-++ > * | | tupleN ... | > - * +-+--+-+ > - * |... tuple3 tuple2 tuple1 | "special space" | > - * ++-+ > - * ^ > pd_special > + * +-+-++++ > + * | ... tuple2 tuple1 | "special space" | "reserved" | > + * +---++++ > + * ^ pd_special ^ > reserved_page_space Right, adds a dynamic amount of space 'post-special area'. > @@ -73,6 +81,8 @@ > * stored as the page trailer. an access method should always > * initialize its pages with PageInit and then set its own opaque > * fields. > + * > + * XXX - update more comments here about reserved_page_space > */ Would be good to do. ;) > @@ -325,7 +335,7 @@ static inline void > PageValidateSpecialPointer(Page page) > { > Assert(page); > - Assert(((PageHeader) page)->pd_special <= BLCKSZ); > + AssertPageHeader) page)->pd_special + reserved_page_size) <= > BLCKSZ); > Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData); > } This is just one usage ... but seems like maybe we should be using PageGetPageSize() here instead of BLCKSZ, and that more-or-less throughout? Nearly everywhere we're using BLCKSZ today to give us that compile-time advantage of a fixed block size is going to lose that advantage anyway thanks to reserved_page_size being run-time. Now, one up-side to this is that it'd also get us closer to being able to support dynamic block sizes concurrently which would be quite interesting. That is, a special tablespace with a 32KB block size while the rest are the traditional 8KB. This would likely require multiple shared buffer pools, of course... 
> diff --git a/src/backend/storage/page/bufpage.c > b/src/backend/storage/page/bufpage.c > index 9a302ddc30..a93cd9df9f 100644 > --- a/src/backend/storage/page/bufpage.c > +++ b/src/backend/storage/page/bufpage.c > @@ -26,6 +26,8 @@ > /* GUC variable */ > bool ignore_checksum_failure = false; > > +int reserved_page_size = 0; /* how much page space to > reserve for extended unencrypted metadata */ > + > > /* > * Page support functions > @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize) > { > PageHeader p = (PageHeader) page; > > - specialSize = MAXALIGN(specialSize); > + specialSize = MAXALIGN(specialSize) + reserved_page_size; Rather than make it part of specialSize, I would think we'd be better off just treating them independently. Eg, the later pd_upper setting would be done by: p->pd_upper = pageSize - specialSize - reserved_page_size; etc. > @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int > flags) > * one that is both unused and deallocated. > * > * If flag PAI_IS_HEAP is set, we enforce that there can't be more than > - * MaxHeapTuplesPerPage
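For what it's worth, here is a sketch of what treating the reserved space independently might look like in PageInit() (a hypothetical illustration based on the current function, not the patch's actual code):

    void
    PageInit(Page page, Size pageSize, Size specialSize)
    {
        PageHeader  p = (PageHeader) page;

        specialSize = MAXALIGN(specialSize);

        Assert(pageSize == BLCKSZ);
        Assert(pageSize > specialSize + reserved_page_size + SizeOfPageHeaderData);

        /* Make sure all fields of page are zero, as well as unused space */
        MemSet(p, 0, pageSize);

        p->pd_lower = SizeOfPageHeaderData;
        /* reserved space sits at the very end of the page, past the special space */
        p->pd_upper = pageSize - specialSize - reserved_page_size;
        p->pd_special = pageSize - specialSize - reserved_page_size;
        PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION);
    }

That keeps specialSize meaning exactly what it does today, with the reserved region accounted for separately.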
Re: Large files for relations
On Sat, May 13, 2023 at 11:01 AM Thomas Munro wrote: > On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN wrote: > > use XFS and O_DIRECT As for direct I/O, we're only just getting started on that. We currently can't produce more than one concurrent WAL write, and then for relation data, we just got very basic direct I/O support but we haven't yet got the asynchronous machinery to drive it properly (work in progress, more soon). I was just now trying to find out what the state of parallel direct writes is in ext4, and it looks like it's finally happening: https://www.phoronix.com/news/Linux-6.3-EXT4
Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing
On Fri, May 12, 2023 at 10:36 AM Ryan Booz wrote: > Just to say on the outset, as has been said earlier in the tread by others, > that this is herculean work. Thank you for putting the effort you have thus > far. Thanks! > > It would be nice if it was possible to add an animation/diagram a > > little like this one: https://tuple-freezing-demo.angusd.com (this is > > how I tend to think about the "transaction ID space".) > > Indeed. With volunteer docs, illustrations/diagrams are hard for sure. But, > this or something akin to the "clock" image I've seen elsewhere when > describing the transaction ID space would probably be helpful if it were ever > possible. In fact, there's just a lot about the MVCC stuff in general that > would benefit from diagrams. But alas, I guess that's why we have some > good go-to community talks/slide decks. :-) A picture is worth a thousand words. This particular image may be worth even more, though. It happens to be *exactly* what I'd have done if I was tasked with coming up with an animation that conveys the central ideas. Obviously I brought this image up because I think that it would be great if we could find a way to do something like that directly (not impossible, there are a few images already). However, there is a less obvious reason why I brought it to your attention: it's a very intuitive way of understanding what I actually intend to convey through words -- at least as far as talk about the cluster-wide XID space is concerned. It might better equip you to review the patch series. Sure, the animation will make the general idea clearer to just about anybody -- that's a big part of what I like about it. But it also captures the nuance that might matter to experts (e.g., the oldest XID moves forward in jerky discrete jumps, while the next/unallocated XID moves forward in a smooth, continuous fashion). So it works on multiple levels, for multiple audiences/experience levels, without any conflicts -- which is no small thing. Do my words make you think of something a little like the animation? If so, good. > Thanks again for doing this. Really helpful for doc newbies like me that > want to help but are still working through the process. Really helpful > and appreciated. I think that this is the kind of thing that particularly benefits from diversity in perspectives. > Agree. This flows fairly well and helps the user understand that each > "next step" > in the vacuum/freezing process has a distinct job based on previous work. I'm trying to make it possible to read in short bursts, and to skim. The easiest wins in this area will come from simply having more individual sections/headings, and a more consistent structure. The really difficult part is coming up with prose that can sort of work for all audiences at the same time -- without alienating anybody. Here is an example of what I mean: The general idea of freezing can reasonably be summarized as "a process that VACUUM uses to make pages self-contained (no need to do pg_xact lookups anymore), that also has a role in avoiding transaction ID exhaustion". That is a totally reasonable beginner-level (well, relative-beginner-level) understanding of freezing. It *isn't* dumbed down. You, as a beginner, have a truly useful take-away. At the same time, you have avoided learning anything that you'll need to unlearn some day. If I can succeed in doing that, I'll feel a real sense of accomplishment. > > * Much improved "Truncating Transaction Status Information" subsection. 
> > > > My explanation of the ways in which autovacuum_freeze_max_age can > > affect the storage overhead of commit/abort status in pg_xact is much > > clearer than it was in v3 -- pg_xact truncation is now treated as > > something loosely related to the global config of anti-wraparound > > autovacuum, which makes most sense. > > > This one isn't totally sinking in with me yet. Need another read. "Truncating Transaction Status Information" is explicitly supposed to matter much less than the rest of the stuff on freezing. The main benefit that the DBA can expect from understanding this content is how to save a few GB of disk space for pg_xact, which isn't particularly likely to be possible, and is very very unlikely to be of any real consequence, compared to everything else. If you were reading the revised "Routine Vacuuming" as the average DBA, what you'd probably have ended up doing is just not reading this part at all. And that would probably be the ideal outcome. It's roughly the opposite of what you'll get right now, by the way (bizarrely, the current docs place a great deal of emphasis on this). (Of course I welcome your feedback here too. Just giving you the context.) > > It took a great deal of effort to find a structure that covered > > everything, and that highlighted all of the important relationships > > without going too far, while at the same time not being a huge mess. > > That's what I feel I've arrived at with v4. > > In most respec
Re: [PATCH] Use RelationClose rather than table_close in heap_create_with_catalog
Thanks for all your responses. It seems better to change the comments on the code rather than call RelationClose here. table_close(new_rel_desc, NoLock); /* do not unlock till end of xact */ Do I need to create another patch to fix the comments? Best regards, xiaoran From: tender wang Sent: Thursday, May 11, 2023 3:26 PM To: Tom Lane Cc: Bharath Rupireddy ; Xiaoran Wang ; pgsql-hackers@lists.postgresql.org Subject: Re: [PATCH] Use RelationClose rather than table_close in heap_create_with_catalog Tom Lane <t...@sss.pgh.pa.us> wrote on Thu, May 11, 2023 at 00:32: Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> writes: > And, the /* do not unlock till end of xact */, it looks like it's been > there from day 1. It may be indicating that the ref count of the new > relation created in heap_create_with_catalog() will be decremented at > the end of xact, but I'm not sure what it means. Hmm, I think it's been copied-and-pasted from somewhere. It's quite common for us to not release locks on modified tables until end of transaction. However, that's not what's happening here, because we actually *don't have any such lock* at this point, as you can easily prove by stepping through this code and watching the contents of pg_locks from another session. We do acquire AccessExclusiveLock on the new table eventually, but not till control returns to DefineRelation. I'm not real sure that I like the proposed code change: it's unclear to me whether it's an unwise piercing of a couple of abstraction layers or an okay change because those abstraction layers haven't yet been applied to the new relation at all. However, I think the existing comment is actively misleading and needs to be changed. Maybe something like /* * Close the relcache entry, since we return only an OID not a * relcache reference. Note that we do not yet hold any lockmanager * lock on the new rel, so there's nothing to release. */ table_close(new_rel_desc, NoLock); /* * ok, the relation has been cataloged, so close catalogs and return * the OID of the newly created relation. */ table_close(pg_class_desc, RowExclusiveLock); +1 Personally, I prefer the above code. Given these comments, maybe changing the first call to RelationClose would be sensible, but I'm still not quite convinced. regards, tom lane
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Attached is v9, which is mostly editing the steps for restoring normal operation, which are in 0003 now but will be squashed into 0002. Comments to polish the wording welcome. -- John Naylor EDB: http://www.enterprisedb.com From 0e9d6ea72216b196d37de4629736c0cf34e7cd5c Mon Sep 17 00:00:00 2001 From: John Naylor Date: Sat, 13 May 2023 11:03:40 +0700 Subject: [PATCH v9 3/3] Rough draft of complete steps to recover from (M)XID generation failure TODO: squash with previous --- doc/src/sgml/maintenance.sgml | 61 ++- 1 file changed, 60 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 45d6cd1815..fee88cb647 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -675,7 +675,25 @@ HINT: Execute a database-wide VACUUM in that database. In this condition any transactions already in progress can continue, but only read-only transactions can be started. Operations that modify database records or truncate relations will fail. -The VACUUM command can still be run normally to recover. +The VACUUM command can still be run normally. +To restore normal operation: + + + + Commit or rollback any prepared transactions. + + + Terminate any sessions that might have open transactions. + + + Drop any old replication slots. + + + Ensure autovacuum is running, and execute a database-wide VACUUM. +To reduce the time required, it as also possible to issue manual VACUUM +commands on the tables where relminxid is oldest. + + @@ -761,6 +779,47 @@ HINT: Execute a database-wide VACUUM in that database. have the oldest multixact-age. Both of these kinds of aggressive scans will occur even if autovacuum is nominally disabled. + + +Similar to the XID case, if autovacuum fails to clear old MXIDs from a table, the +system will begin to emit warning messages like this when the database's +oldest MXIDs reach forty million transactions from the wraparound point: + + +WARNING: database "mydb" must be vacuumed within 39985967 transactions +HINT: To prevent MultiXactId generation failure, execute a database-wide VACUUM in that database. + + +(A manual VACUUM should fix the problem, as suggested by the +hint; but note that the VACUUM must be performed by a +superuser, else it will fail to process system catalogs and thus not +be able to advance the database's datfrozenxid.) +If these warnings are ignored, the system will refuse to generate new +MXIDs once there are fewer than three million left until wraparound: + + +ERROR: database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss in database "mydb" +HINT: Execute a database-wide VACUUM in that database. + + + + +To restore normal operation: + + + Commit or rollback each prepared transaction that might appear in a multixact. + + + Resolve each transaction that might appear in a multixact. + + + Ensure autovacuum is running, and execute a database-wide VACUUM. +To reduce the time required, it as also possible to issue manual VACUUM +commands on the tables where relminmxid is oldest. + + + + -- 2.39.2 From 267609882a0be2764bc33fc289c0a962d47643c4 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 28 Apr 2023 16:08:33 +0700 Subject: [PATCH v9 1/3] Correct outdated docs and messages regarding XID/MXID limits Previously, when approaching xidStopLimit or xidWrapLimit, log messages would warn against a "database shutdown", and when it reached those limits claimed that it "is not accepting commands". 
This language originated in commit 60b2444cc when the xidStopLimit was added in 2005. At that time, even a trivial SELECT would have failed. Commit 295e63983d in 2007 introduced virtual transaction IDs, which allowed actual XIDs to be allocated lazily when it is necessary to do so, such as when modifying database records. Since then, the behavior at these limits is merely to refuse to allocate new XIDs, so read-only queries can continue to be initiated. The "database shutdown" message was also copied to the message warning for multiWarnLimit when it was added. This has been wrong for a very long time, so backpatch to all supported branches. Aleksander Alekseev, with some editing by me Reviewed by Pavel Borisov and Peter Geoghegan Discussion: https://postgr.es/m/caj7c6tm2d277u2wh8x78kg8ph3tduqebv3_jcjqakyqfhcf...@mail.gmail.com --- doc/src/sgml/maintenance.sgml | 22 -- src/backend/access/transam/multixact.c | 4 ++-- src/backend/access/transam/varsup.c| 12 ++-- 3 files changed, 20 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 9cf9d030a8..48d43cb339 100644
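As a purely illustrative companion to the recovery steps in the draft above (not part of the patch; the gid, pid, and slot name are made-up placeholders), the checks a DBA would typically run look something like this:

-- 1. Find and resolve old prepared transactions
SELECT gid, prepared, owner, database FROM pg_prepared_xacts ORDER BY prepared;
ROLLBACK PREPARED 'some_old_gid';   -- or COMMIT PREPARED

-- 2. Find and terminate sessions holding back the xmin horizon
SELECT pid, state, backend_xmin, xact_start FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL ORDER BY age(backend_xmin) DESC;
SELECT pg_terminate_backend(12345);

-- 3. Drop stale replication slots
SELECT slot_name, active, xmin, catalog_xmin FROM pg_replication_slots;
SELECT pg_drop_replication_slot('stale_slot');

-- 4. Vacuum, starting with the oldest tables
SELECT oid::regclass, age(relfrozenxid) FROM pg_class
WHERE relkind IN ('r', 'm', 't') ORDER BY age(relfrozenxid) DESC LIMIT 10;
VACUUM;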
Re: cutting down the TODO list thread
On Mon, Feb 6, 2023 at 11:04 AM John Naylor wrote: > I'll try to get back to culling the list items at the end of April. I've made another pass at this. Previously, I went one or two sections at a time, but this time I tried just skimming the whole thing and noting what jumps out at me. Also, I've separated things into three categories: Remove, move to "not wanted list", and revise. Comments and objections welcome, as always. - 1. Remove These are either ill-specified, outdated, possibly done already, or not enough interest. If someone proposed them again, we could consider it, so I propose to just remove these, but not move them to the Not Wanted list. Also some questions: Improve UTF8 combined character handling? -> If this is referring to normalization, we have it now. If not, what needs improving? Improve COPY performance -> What's the criterion for declaring this done? There are many areas that get performance improvements -- why does this need to be called out separately? (There's been some work in the past couple years, so maybe at least we need to find out the current bottlenecks.) Improve line drawing characters -> This links to a proposal with no responses titled "Add a setting in psql that set the linestyle to unicode only if the client encoding was actually UTF8". If we don't drop this, the entry text should at least give an idea what the proposal is. Consider improving the continuation prompt -> A cosmetic proposal that stalled -- time to retire it? SMP scalability improvements -> Both threads are from 2007 Consider GnuTLS if OpenSSL license becomes a problem -> We now have the ability to swap in another implementation during the build process Allow creation of universal binaries for Darwin -> From 2008: Is this still a thing? Allow plug-in modules to emulate features from other databases -> This sounds like a hook or extension. Rethink our type system -> Way out of scope, and short on details. Add support for polymorphic arguments and return types to languages other than PL/PgSQL -> Seems if someone needed this, they would say so (no thread). Add support for OUT and INOUT parameters to languages other than PL/PgSQL -> Ditto 2. Propose to move to the "Not Wanted list": (Anything already at the bottom under the heading "Features We Do Not Want", with the exception of "threads in a single process". I'll just remove that -- if we ever decide that's worth pursuing, it'll be because we decided we can't really avoid it anymore, and in that case we surely don't need to put it here.) Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM ANALYZE, and CLUSTER -> There are external tools that help with this kind of analysis Allow regex operations in PL/Perl using UTF8 characters in non-UTF8 encoded databases -> Seems pie-in-the-sky as well as a niche problem? Add option to print advice for people familiar with other databases -> Doesn't seem relevant anymore? Consider having single-page pruning update the visibility map -> Comment from Heikki in the thread: "I think I was worried about the possible performance impact of having to clear the bit in visibility map again. If you're frequently updating a tuple so that HOT and page pruning is helping you, setting the bit in visibility map seems counter-productive; it's going to be cleared soon again by another UPDATE. That's just a hunch, though. Maybe the overhead is negligible." Consider mmap()'ing entire files into a backend? -> Isn't this a can of worms? 
Consider allowing control of upper/lower case folding of unquoted identifiers -> Would we ever consider this? - Other -- need adjustment or update? Do async I/O for faster random read-ahead of data -> This section needs to be revised, since there is on-going work on AIO. There are a couple other entries that should maybe be put under a different heading? *** The whole section on Windows has lots of old stuff -- which are still relevant? (ECPG) Fix nested C comments -> What needs fixing? It should work fine. Improve speed of tab completion -> Is this still a problem? Testing pgstat via pg_regress is tricky and inefficient. Consider making a dedicated pgstat test-suite. -> This has significantly changed recently -- how are things now? -- John Naylor EDB: http://www.enterprisedb.com
Re: cutting down the TODO list thread
John Naylor writes: > I've made another pass at this. Previously, I went one or two sections at a > time, but this time I tried just skimming the whole thing and noting what > jumps out at me. Also, I've separated things into three categories: Remove, > move to "not wanted list", and revise. Comments and objections welcome, as > always. Generally agree that the items you've listed are obsolete. Some comments: > Allow creation of universal binaries for Darwin > -> From 2008: Is this still a thing? This entry might stem from 4b362c662. It's still something that nobody has bothered to make happen, not even after another architecture transition on Apple's part. And there are things partly outside our control in the area, see d69a419e6. I doubt it will ever happen. > Add support for polymorphic arguments and return types to languages other > than PL/PgSQL > -> Seems if someone needed this, they would say so (no thread). I think this is still an issue. Surprised nobody has yet gotten annoyed enough to do something about it. > Add support for OUT and INOUT parameters to languages other than PL/PgSQL > -> Ditto And ditto. > Consider allowing control of upper/lower case folding of unquoted > identifiers > -> Would we ever consider this? I think that one's dead as a doornail. > (ECPG) Fix nested C comments > -> What needs fixing? It should work fine. I might be mistaken, but I think 8ac5e88f9 may have fixed this. > Improve speed of tab completion > -> Is this still a problem? I keep worrying that tab-complete.c will become so ungainly as to present a human-scale performance problem. But there's been pretty much zero complaints so far. Let's drop this one until some actual issue emerges. regards, tom lane