Re: SQL:2011 application time
On 23.03.24 18:42, Paul Jungwirth wrote: Now this is a long chain of reasoning to say rangetypes are safe. I added a comment. Note that it doesn't apply to arbitrary types, so if we eventually support those we should either always require a recheck or, alternatively, use equals rather than containedby. (That would require storing equals somewhere. It could go in ffeqop, but that feels like a footgun since pfeqop and ppeqop need overlaps.) Ok, this explanation is good enough for now. I have committed the patches v33-0001-Add-temporal-FOREIGN-KEYs.patch and v33-0002-Support-multiranges-in-temporal-FKs.patch (together).
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Sun, Mar 24, 2024 at 10:40 AM Amit Kapila wrote: > > > For instance, setting last_inactive_time_1 to an invalid value fails > > with the following error: > > > > error running SQL: 'psql::1: ERROR: invalid input syntax for > > type timestamp with time zone: "foo" > > LINE 1: SELECT last_inactive_time > 'foo'::timestamptz FROM pg_repli... > > > > It would be found at a later point. It would probably be better to > verify immediately after the test that fetches the last_inactive_time > value. Agree. I've added a few more checks explicitly to verify that last_inactive_time is sane with the following: qq[SELECT '$last_inactive_time'::timestamptz > to_timestamp(0) AND '$last_inactive_time'::timestamptz > '$slot_creation_time'::timestamptz;] I've attached the v18 patch set here. I've also addressed earlier review comments from Amit and Ajin Cherian. Note that I've added the new invalidation mechanism tests in a separate TAP test file, because I don't want to clutter or bloat any of the existing files, or to spread tests for physical and logical slots across separate existing TAP files. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v18-0001-Track-last_inactive_time-in-pg_replication_slots.patch Description: Binary data v18-0002-Allow-setting-inactive_timeout-for-replication-s.patch Description: Binary data v18-0003-Introduce-new-SQL-funtion-pg_alter_replication_s.patch Description: Binary data v18-0004-Allow-setting-inactive_timeout-in-the-replicatio.patch Description: Binary data v18-0005-Add-inactive_timeout-based-replication-slot-inva.patch Description: Binary data
Re: Built-in CTYPE provider
Hello Jeff, 21.03.2024 03:13, Jeff Davis wrote: On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote: * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch Looks ok. Committed. Please look at a Valgrind-detected error caused by the following query (starting from f69319f2f): SELECT lower('Π' COLLATE pg_c_utf8); ==00:00:00:03.487 1429669== Invalid read of size 1 ==00:00:00:03.487 1429669== at 0x7C64A5: convert_case (unicode_case.c:107) ==00:00:00:03.487 1429669== by 0x7C: unicode_strlower (unicode_case.c:70) ==00:00:00:03.487 1429669== by 0x66B218: str_tolower (formatting.c:1698) ==00:00:00:03.488 1429669== by 0x6D6C55: lower (oracle_compat.c:55) Best regards, Alexander
Re: make dist using git archive
On 22.03.24 18:29, Tristan Partin wrote: On Thu Mar 21, 2024 at 3:44 AM CDT, Peter Eisentraut wrote: Here is an updated version of this patch set. You should add 'disabler: true' to the git find_program in Meson. If Git doesn't exist on the system with the way your patch is currently written, the targets would be defined, even though they would never succeed. Ok, added. (I had it in there in an earlier version, but I think I misread one of your earlier messages and removed it.) You may also want to make sure that we are actually in a Git repository. I don't think git-archive works outside one. Then git archive will print an error. That seems ok. Re the autocrlf, is this something we could throw in a .gitattributes file? We don't want to apply it to all git commands, just this one in this context. I would suggest poisoning `meson dist` in the following way: if not meson.is_subproject() [...] meson.add_dist_script(perl, '-e', 'exit 1') endif Good idea, added that. I have extracted the freebsd CI script fix into a separate patch (0002). I think this is useful even if we don't take the full CI patch (0003). 0002 looks pretty reasonable to me. Committed that one in the meantime. From 680967a621083756e1e182df7394a739008a21d6 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sun, 11 Feb 2024 23:58:42 +0100 Subject: [PATCH v5 1/2] make dist uses git archive This changes "make dist" to directly use "git archive", rather than the custom shell script it currently runs. This is to make the creation of the distribution tarball more directly traceable to the git repository. That is why we removed the "make distprep" step. "make dist" continues to produce a .gz and a .bz2 tarball as before. The archives produced this way are deterministic and reproducible, meaning for a given commit the result file should always be bit-for-bit identical. The exception is that if you use a git version older than 2.38.0, gzip records the platform in the archive, so you'd get a different output on Windows vs. macOS vs. "UNIX" (everything else). In git 2.38.0, this was changed so that everything is recorded as "UNIX" now. This is just something to keep in mind. This issue is specific to the gzip format, it does not affect other compression formats. Meson has its own distribution building command (meson dist), but we are not using that at this point. The main problem is that the way they have implemented it, it is not deterministic in the above sense. Also, we want a "make" version for the time being. But the target name "dist" in meson is reserved for that reason, so we call the custom target "pgdist" (so call something like "meson compile -C build pgdist"). Discussion: https://www.postgresql.org/message-id/flat/40e80f77-a294-4f29-a16f-e21bc7bc75fc%40eisentraut.org --- GNUmakefile.in | 35 ++ meson.build | 66 ++ 2 files changed, 79 insertions(+), 22 deletions(-) diff --git a/GNUmakefile.in b/GNUmakefile.in index 4d8fc794bbb..4dc13a3e2ea 100644 --- a/GNUmakefile.in +++ b/GNUmakefile.in @@ -87,29 +87,20 @@ update-unicode: | submake-generated-headers submake-libpgport distdir= postgresql-$(VERSION) dummy = =install= +GIT = git + dist: $(distdir).tar.gz $(distdir).tar.bz2 - rm -rf $(distdir) - -$(distdir).tar: distdir - $(TAR) chf $@ $(distdir) - -.INTERMEDIATE: $(distdir).tar - -distdir-location: - @echo $(distdir) - -distdir: - rm -rf $(distdir)* $(dummy) - for x in `cd $(top_srcdir) && find . 
\( -name CVS -prune \) -o \( -name .git -prune \) -o -print`; do \ - file=`expr X$$x : 'X\./\(.*\)'`; \ - if test -d "$(top_srcdir)/$$file" ; then \ - mkdir "$(distdir)/$$file" && chmod 777 "$(distdir)/$$file"; \ - else \ - ln "$(top_srcdir)/$$file" "$(distdir)/$$file" >/dev/null 2>&1 \ - || cp "$(top_srcdir)/$$file" "$(distdir)/$$file"; \ - fi || exit; \ - done - $(MAKE) -C $(distdir) distclean + +# Note: core.autocrlf=false is needed to avoid line-ending conversion +# in case the environment has a different setting. Without this, a +# tarball created on Windows might be different than on, and unusable +# on, Unix machines. + +$(distdir).tar.gz: + $(GIT) -C $(srcdir) -c core.autocrlf=false archive --format tar.gz --prefix $(distdir)/ HEAD -o $(abs_top_builddir)/$@ + +$(distdir).tar.bz2: + $(GIT) -C $(srcdir) -c core.autocrlf=false -c tar.tar.bz2.command='$(BZIP2) -c' archive --format tar.bz2 --prefix $(distdir)/ HEAD -o $(abs_top_builddir)/$@ distcheck: dist rm -rf $(dummy) diff --git a/meson.build b/meson.build index c8fdfeb0ec3..4981447414f 100644 --- a/meson.build +++ b/meson.build @@ -3359,6 +3359,72 @@ run_target('help', +### +# Distribution archive +###
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/23/24 01:26, Melanie Plageman wrote: > On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote: >> On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote: >>> On 18/03/2024 17:19, Melanie Plageman wrote: I've attached v7 rebased over this commit. >>> >>> If we delayed table_beginscan_bm() call further, after starting the TBM >>> iterator, we could skip it altogether when the iterator is empty. >>> >>> That's a further improvement, doesn't need to be part of this patch set. >>> Just caught my eye while reading this. >> >> Hmm. You mean like until after the first call to tbm_[shared]_iterate()? >> AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or >> not the iterator is "empty". Do you mean cases when the bitmap has no >> blocks in it? It seems like we should be able to tell that from the >> TIDBitmap. >> >>> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch >>> >>> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call the >>> flag e.g. SO_NEED_TUPLE. >> >> Agreed. Done in attached v8. Though I wondered if it was a bit weird >> that the flag is set in the common case and not set in the uncommon >> case... > > v8 actually attached this time I tried to run the benchmarks with v8, but unfortunately it crashes for me very quickly (I've only seen 0015 crash, so I guess the bug is in that patch). The backtrace is attached; this doesn't seem right: (gdb) p hscan->rs_cindex $1 = 543516018 regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Core was generated by `postgres: postgres test-100 [local] SELECT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, exact_pages=0x556f9b9eebf0) at heapam_handler.c:2576 warning: Source file is more recent than executable. 
2576targoffset = hscan->rs_vistuples[hscan->rs_cindex]; (gdb) bt #0 0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, exact_pages=0x556f9b9eebf0) at heapam_handler.c:2576 #1 0x556f99e17adb in table_scan_bitmap_next_tuple (exact_pages=0x556f9b9eebf0, lossy_pages=0x556f9b9eebf8, recheck=0x556f9b9eec10, slot=0x556f9b9ef440, scan=0x556f9b9e6960) at ../../../src/include/access/tableam.h:2003 #2 BitmapHeapNext (node=0x556f9b9eeb00) at nodeBitmapHeapscan.c:227 #3 0x556f99e1a331 in ExecProcNode (node=0x556f9b9eeb00) at ../../../src/include/executor/executor.h:274 #4 gather_getnext (gatherstate=0x556f9b9ee970) at nodeGather.c:287 #5 ExecGather (pstate=0x556f9b9ee970) at nodeGather.c:222 #6 0x556f99e24ac3 in ExecProcNode (node=0x556f9b9ee970) at ../../../src/include/executor/executor.h:274 #7 ExecLimit (pstate=0x556f9b9ee698) at nodeLimit.c:95 #8 0x556f99e0174a in ExecProcNode (node=0x556f9b9ee698) at ../../../src/include/executor/executor.h:274 #9 ExecutePlan (execute_once=, dest=0x556f9b9f2360, direction=, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, planstate=0x556f9b9ee698, estate=0x556f9b9ee458) at execMain.c:1644 #10 standard_ExecutorRun (queryDesc=0x556f9b944f88, direction=, count=0, execute_once=) at execMain.c:363 #11 0x556f99f9cc7f in PortalRunSelect (portal=portal@entry=0x556f9b997008, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x556f9b9f2360) at pquery.c:924 #12 0x556f99f9dffa in PortalRun (portal=portal@entry=0x556f9b997008, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x556f9b9f2360, altdest=altdest@entry=0x556f9b9f2360, qc=0x7ffc123a0b60) at pquery.c:768 #13 0x556f99f9a57c in exec_simple_query (query_string=0x556f9b91ab18 "SELECT * FROM linear WHERE (a BETWEEN 2013 AND 8061) OFFSET 100;") at postgres.c:1274 #14 0x556f99f9c051 in PostgresMain (dbname=, username=) at postgres.c:4680 #15 0x556f99f96def in BackendMain (startup_data=, startup_data_len=) at backend_startup.c:101 #16 0x556f99f0c564 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0x7ffc123a0f90 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0x7ffc123a0fb0) at launch_backend.c:265 #17 0x556f99f0fea9 in BackendStartup (client_sock=0x7ffc123a0fb0) at postmaster.c:3593 #18 ServerLoop () at postmaster.c:1674 #19 0x556f99f11b50 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x556f9b915260) at postmaster.c:1372 #20 0x556f99c5b0c3 in main (argc=3, argv=0x556f9b915260) at main.c:197 (gdb) p hscan->rs_cindex $1 = 543516018 (gdb) p hscan $2 = (HeapScanDesc) 0x556f9b9e6960 (gdb) p *hscan $3 = {rs_base = {rs_rd = 0x7f6
Extension for PostgreSQL WIP
Cast_jsonb_to_hstore WIP v1 This extension adds a function that can cast jsonb to hstore. Here is the link to my GitHub repository where the extension lives: https://github.com/antuanviolin/cast_jsonb_to_hstore
Re: Streaming I/O, vectored I/O (WIP)
On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas wrote: > On 12/03/2024 15:02, Thomas Munro wrote: > > src/backend/storage/aio/streaming_read.c > > src/include/storage/streaming_read.h > > Standard file header comments missing. Fixed. > It would be nice to have a comment at the top of streaming_read.c, > explaining at a high level how the circular buffer, lookahead and all > that works. Maybe even some diagrams. Done. > For example, what is head and what is tail? Before reading the code, I > assumed that 'head' was the next block range to return in > pg_streaming_read_buffer_get_next(). But I think it's actually the other > way round? Yeah. People seem to have different natural intuitions about head vs tail in this sort of thing, so I've switched to descriptive names: stream->{oldest,next}_buffer_index (see also below). > > /* > > * Create a new streaming read object that can be used to perform the > > * equivalent of a series of ReadBuffer() calls for one fork of one > > relation. > > * Internally, it generates larger vectored reads where possible by looking > > * ahead. > > */ > > PgStreamingRead * > > pg_streaming_read_buffer_alloc(int flags, > > void *pgsr_private, > > size_t > > per_buffer_data_size, > > > > BufferAccessStrategy strategy, > > > > BufferManagerRelation bmr, > > ForkNumber forknum, > > > > PgStreamingReadBufferCB next_block_cb) > > I'm not a fan of the name, especially the 'alloc' part. Yeah, most of > the work it does is memory allocation. But I'd suggest something like > 'pg_streaming_read_begin' instead. I like it. Done. > Do we really need the pg_ prefix in these? Good question. My understanding of our convention is that pg_ is needed for local replacements/variants/extensions of things with well known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in a few places where the word is very common/short and we want to avoid collisions and make sure it's obviously ours (pg_popcount?), and I guess places that reflect the name of a SQL identifier with a prefix, but this doesn't seem to qualify for any of those things. It's a new thing, our own thing entirely, and sufficiently distinctive and unconfusable with standard stuff. So, prefix removed. Lots of other patches on top of this one are using "pgsr" as a variable name, ie containing that prefix; perhaps they would use "sr" or "streaming_read" or "stream". I used "stream" in a few places in this version. Other names improved in this version IMHO: pgsr_private -> callback_private. I find it clearer, as a way to indicate that the provider of the callback "owns" it. I also reordered the arguments: now it's streaming_read_buffer_begin(..., callback, callback_private, per_buffer_data_size), to keep those three things together. > > Buffer > > pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void > > **per_buffer_data) > > Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next', > for a shorter name. Hmm. The idea of 'buffer' appearing in a couple of names is that there are conceptually other kinds of I/O that we might want to stream, like raw files or buffers other than the buffer pool, maybe even sockets, so this would be part of a family of similar interfaces. I think it needs to be clear that this variant gives you buffers. I'm OK with removing "get" but I guess it would be better to keep the words in the same order across the three functions? What about these? streaming_read_buffer_begin(); streaming_read_buffer_next(); streaming_read_buffer_end(); Tried like that in this version. 
Other ideas would be to make "stream" the main noun, buffered_read_stream_begin() or something. Ideas welcome. It's also a bit grammatically weird to say StartReadBuffers() and WaitReadBuffers() in the bufmgr API... Hmm. Perhaps we should just call it ReadBuffers() and WaitForBufferIO()? Maybe surprising because the former isn't just like ReadBuffer() ... but on the other hand no one said it has to be, and sometimes it even is (when it gets a hit). I suppose there could even be a flag READ_BUFFERS_WAIT or the opposite to make the asynchrony optional or explicit if someone has a problem with that. (Hmm, that'd be a bit like the Windows native file API, where ReadFile() is synchronous or asynchronous depending on flags.) > > > > /* > >* pgsr->ranges is a circular buffer. When it is empty, head == tail. > >* When it is full, there is an empty element between head and tail. > > Head > >* can also be empty (nblocks == 0), therefore we need two extra > > elements > >* for non-occupied ranges, on top of max_pi
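To make the head/tail convention in that quoted comment concrete, here is a minimal standalone C sketch -- invented names, not the actual streaming read code -- of a circular buffer that sacrifices one slot so that "full" and "empty" are distinguishable:

#include <assert.h>
#include <stdio.h>

#define QSIZE 8				/* capacity is QSIZE - 1 items */

static int	queue[QSIZE];
static int	head;			/* where the next item is written ("next") */
static int	tail;			/* where the oldest item is read ("oldest") */

static int
is_empty(void)
{
	return head == tail;
}

static int
is_full(void)
{
	return (head + 1) % QSIZE == tail;
}

static void
push(int v)
{
	assert(!is_full());
	queue[head] = v;
	head = (head + 1) % QSIZE;
}

static int
pop(void)
{
	int			v;

	assert(!is_empty());
	v = queue[tail];
	tail = (tail + 1) % QSIZE;
	return v;
}

int
main(void)
{
	/* fill to capacity: the unused slot is what keeps full != empty */
	for (int i = 0; i < QSIZE - 1; i++)
		push(i);
	assert(is_full());
	while (!is_empty())
		printf("%d ", pop());
	printf("\n");
	return 0;
}

The sacrificed slot is the "empty element between head and tail" from the quoted comment, and the ambiguity of "head" vs. "tail" in this kind of structure is exactly why the patch switches to the descriptive oldest/next names.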
Re: Extension for PostgreSQL WIP
On Sun, Mar 24, 2024 at 5:49 AM ShadowGhost wrote: > Cast_jsonb_to_hstore WIP > v1 > This extension adds a function that can cast jsonb to hstore. > Here is the link to my GitHub repository where the extension lives: > https://github.com/antuanviolin/cast_jsonb_to_hstore > If you are intending to submit this to the project, you need to follow the correct procedures. https://wiki.postgresql.org/wiki/Submitting_a_Patch Otherwise, this is not the correct place to be promoting your extension. David J. p.s. I would advise including the bit about jsonb and hstore in the email subject line (if you resubmit in the proper format) instead of only in the body of the email. Subject lines are very important on a mailing list such as this and should be fairly precise - not just the word "extension".
Re: pg_stat_statements and "IN" conditions
Thanks for the information. I can apply these 4 patches from 0eb23285a2. I tested this branch from Ruby on Rails and it gets some unexpected behavior from my point of view. Setting pg_stat_statements.query_id_const_merge_threshold = 5 does not normalize SQL queries whose number of IN-clause elements exceeds 5. Here are the test steps. https://gist.github.com/yahonda/825ffccc4dcb58aa60e12ce33d25cd45#expected-behavior I would appreciate it if someone could confirm whether my understanding is correct. -- Yasuo Honda On Sun, Mar 24, 2024 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Sure, I can rebase, give me a moment. If you don't want to wait, there > is a base commit in the patch, against which it should be applied > without issues, 0eb23285a2.
Re: Sync scan & regression tests
Heikki Linnakangas writes: > On 19/09/2023 01:57, Andres Freund wrote: >> On 2023-09-18 13:49:24 +0300, Heikki Linnakangas wrote: >>> d) Copy fewer rows to the table in the test. If we copy only 6 rows, for >>> example, the table will have only two pages, regardless of shared_buffers. >>> >>> I'm leaning towards d). The whole test is a little fragile, it will also >>> fail with a non-default block size, for example. But c) seems like a simple >>> fix and wouldn't look too out of place in the test. >> Hm, what do you mean with the last sentence? Oh, is the test you're >> referencing the relation-extension logic? > Sorry, I said "c) seems like a simple fix ...", but I meant "d) seems > like a simple fix ..." > I meant the attached. This thread stalled out months ago, but chipmunk is still failing in HEAD and v16. Can we please have a fix? I'm good with Heikki's adjustment to the pg_visibility test case. regards, tom lane
Re: make dist using git archive
3 comments left that are inconsequential. Feel free to ignore. +# Meson has its own distribution building command (meson dist), but we +# are not using that at this point. The main problem is that the way +# they have implemented it, it is not deterministic. Also, we want it +# to be equivalent to the "make" version for the time being. But the +# target name "dist" in meson is reserved for that reason, so we call +# the custom target "pgdist". The second sentence is a run-on. +if bzip2.found() + tar_bz2 = custom_target('tar.bz2', +build_always_stale: true, +command: [git, '-C', '@SOURCE_ROOT@', + '-c', 'core.autocrlf=false', + '-c', 'tar.tar.bz2.command="' + bzip2.path() + '" -c', + 'archive', + '--format', 'tar.bz2', + '--prefix', distdir + '/', + '-o', join_paths(meson.build_root(), '@OUTPUT@'), + 'HEAD', '.'], +install: false, +output: distdir + '.tar.bz2', + ) You might find Meson's string formatting syntax creates a more readable command string: 'tar.tar.bz2.command=@0@ -c'.format(bzip2.path()) And then 'install: false' is the default if you feel like leaving it out. Otherwise, let's get this in! -- Tristan Partin Neon (https://neon.tech)
Re: Combine Prune and Freeze records emitted by vacuum
On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas wrote: > > In heap_page_prune_and_freeze(), we now do some extra work on each live > tuple, to set the all_visible_except_removable correctly. And also to > update live_tuples, recently_dead_tuples and hastup. When we're not > freezing, that's a waste of cycles, the caller doesn't care. I hope it's > enough that it doesn't matter, but is it? Last year on an early version of the patch set I did some pgbench tpcb-like benchmarks -- since there is a lot of on-access pruning in that workload -- and I don't remember it being a showstopper. The code has changed a fair bit since then. However, I think it might be safer to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip the rest of the loop after heap_prune_satisfies_vacuum() when on-access pruning invokes it. I had avoided that because it felt ugly and error-prone; however, it addresses a few of your other points as well. For example, I think we should set a bit in the prune/freeze WAL record's flags to indicate if pruning was done by vacuum or on access (mentioned in another of your recent emails). > The first commit (after the WAL format changes) changes the all-visible > check to use GlobalVisTestIsRemovableXid. The commit message says that > it's because we don't have 'cutoffs' available, but we only care about > that when we're freezing, and when we're freezing, we actually do have > 'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems > sensible anyway, because that's what we use in > heap_prune_satisfies_vacuum() too, but just wanted to point that out. Yes, this is true. If we skip this part of the loop when on-access pruning invokes it, we can go back to using the OldestXmin. I have done that as well as some other changes in the attached patch, conflict_horizon_updates.diff. Note that this patch may not apply on top of your latest patch, as I wrote it against an older version. Switching back to using OldestXmin for page visibility determination makes this patch set more similar to master as well. We could keep the alternative check (with GlobalVisState) to maintain the illusion that callers passing by_vacuum as True can pass NULL for pagefrz, but I was worried about the extra overhead. It would be nice to pick a single reasonable visibility horizon (the oldest running xid we compare things to) at the beginning of heap_page_prune_and_freeze() and use it for determining if we can remove tuples, if we can freeze tuples, and if the page is all visible. It makes it confusing that we use OldestXmin for freezing and setting the page visibility in the VM and GlobalVisState for determining prunability. I think using GlobalVisState for freezing has been discussed before and ruled out for a variety of reasons, and I definitely don't suggest making that change in this patch set. > The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily > these patches' fault). This at least is wrong, because Max(a, b) > doesn't handle XID wraparound correctly: > > > if (do_freeze) > > conflict_xid = > > Max(prstate.snapshotConflictHorizon, > > > > presult->frz_conflict_horizon); > > else > > conflict_xid = > > prstate.snapshotConflictHorizon; > > Then there's this in lazy_scan_prune(): > > > /* Using same cutoff when setting VM is now unnecessary */ > > if (presult.all_frozen) > > presult.frz_conflict_horizon = InvalidTransactionId; > This does the right thing in the end, but if all the tuples are frozen > shouldn't frz_conflict_horizon already be InvalidTransactionId? 
The > comment says it's "newest xmin on the page", and if everything was > frozen, all xmins are FrozenTransactionId. In other words, that should > be moved to heap_page_prune_and_freeze() so that it doesn't lie to its > caller. Also, frz_conflict_horizon is only set correctly if > 'all_frozen==true', would be good to mention that in the comments too. Yes, this is a good point. I've spent some time swapping all of this back into my head. I think we should change the names of all these conflict horizon variables and introduce some local variables again. In the attached patch, I've updated the name of the variable in PruneFreezeResult to vm_conflict_horizon, as it is only used for emitting a VM update record. Now, I don't set it until the end of heap_page_prune_and_freeze(). It is only updated from InvalidTransactionId if the page is not all frozen. As you say, if the page is all frozen, there can be no conflict. I've also changed PruneState->snapshotConflictHorizon to PruneState->latest_xid_removed. And I introduced the local variables visibility_cutoff_xid and frz_conflict_horizon. I think it is important we distinguish between the latest xid pruned, the latest xmin of tuples frozen, and the latest
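To make the wraparound hazard Heikki flagged above concrete: XIDs compare in a circular, modulo-2^32 space, so a plain Max() can pick the logically older value near the wraparound boundary. Here is a minimal standalone sketch of the int32-difference idiom that PostgreSQL's TransactionIdFollows() uses (the typedef and helper names are simplified stand-ins, and special XIDs are ignored):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* modulo-2^32 comparison: is id1 logically newer than id2? */
static int
xid_follows(TransactionId id1, TransactionId id2)
{
	return (int32_t) (id1 - id2) > 0;
}

/* wraparound-safe replacement for Max() on XIDs */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	return xid_follows(a, b) ? a : b;
}

int
main(void)
{
	/* near wraparound: XID 3 is logically newer than 4294967290 */
	TransactionId old_xid = 4294967290u;
	TransactionId new_xid = 3;

	assert(xid_newer(old_xid, new_xid) == new_xid);
	/* ...whereas an ordinary unsigned Max() picks the stale value */
	assert((old_xid > new_xid ? old_xid : new_xid) == old_xid);
	printf("Max() picks %u, xid_newer() picks %u\n",
		   (unsigned) old_xid, (unsigned) new_xid);
	return 0;
}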
Re: [PoC] Improve dead tuple storage for lazy vacuum
John Naylor writes: > Done. I pushed this with a few last-minute cosmetic adjustments. This > has been a very long time coming, but we're finally in the home > stretch! I'm not sure why it took a couple weeks for Coverity to notice ee1b30f12, but it saw it today, and it's not happy: /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down() 1615 node = child; 1616 shift -= RT_SPAN; 1617 } 1618 1619 /* Reserve slot for the value. */ 1620 n4 = (RT_NODE_4 *) node.local; >>> CID 1594658: Integer handling issues (BAD_SHIFT) >>> In expression "key >> shift", shifting by a negative amount has >>> undefined behavior. The shift amount, "shift", is as little as -7. 1621 n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); 1622 n4->base.count = 1; 1623 1624 return &n4->children[0]; 1625 } 1626 I think the point here is that if you start with an arbitrary non-negative shift value, the preceding loop may in fact decrement it down to something less than zero before exiting, in which case we would indeed have trouble. I suspect that the code is making undocumented assumptions about the possible initial values of shift. Maybe some Asserts would be good? Also, if we're effectively assuming that shift must be exactly zero here, why not let the compiler hard-code that? - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); regards, tom lane
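A toy standalone model of that descent (not the real radixtree.h code) shows the assumption Tom is pointing at: the loop can only land on shift == 0 if the initial shift is a non-negative multiple of RT_SPAN -- start it at, say, 25 instead of 24 and it exits at -7, exactly the value Coverity reports:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define RT_SPAN 8				/* bits of key consumed per tree level */
#define RT_GET_KEY_CHUNK(key, shift) ((uint8_t) ((key) >> (shift)))

int
main(void)
{
	uint64_t	key = 0xDEADBEEF;
	int			shift = 24;		/* assumed: a multiple of RT_SPAN */

	/* descend one level per iteration, as in local_ts_extend_down() */
	while (shift > 0)
		shift -= RT_SPAN;		/* 24 -> 16 -> 8 -> 0; 25 would end at -7 */

	/* the Assert documents the invariant; the literal 0 hard-codes it */
	assert(shift == 0);
	printf("leaf chunk = %u\n", (unsigned) RT_GET_KEY_CHUNK(key, 0));
	return 0;
}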
Re: Streaming I/O, vectored I/O (WIP)
On Sun, Mar 24, 2024 at 9:02 AM Thomas Munro wrote: > > On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas wrote: > > On 12/03/2024 15:02, Thomas Munro wrote: > > > src/backend/storage/aio/streaming_read.c > > > src/include/storage/streaming_read.h > > > > Standard file header comments missing. > > Fixed. > > > It would be nice to have a comment at the top of streaming_read.c, > > explaining at a high level how the circular buffer, lookahead and all > > that works. Maybe even some diagrams. > > Done. > > > For example, what is head and what is tail? Before reading the code, I > > assumed that 'head' was the next block range to return in > > pg_streaming_read_buffer_get_next(). But I think it's actually the other > > way round? > > Yeah. People seem to have different natural intuitions about head vs > tail in this sort of thing, so I've switched to descriptive names: > stream->{oldest,next}_buffer_index (see also below). > > > > /* > > > * Create a new streaming read object that can be used to perform the > > > * equivalent of a series of ReadBuffer() calls for one fork of one > > > relation. > > > * Internally, it generates larger vectored reads where possible by > > > looking > > > * ahead. > > > */ > > > PgStreamingRead * > > > pg_streaming_read_buffer_alloc(int flags, > > > void > > > *pgsr_private, > > > size_t > > > per_buffer_data_size, > > > > > > BufferAccessStrategy strategy, > > > > > > BufferManagerRelation bmr, > > > ForkNumber > > > forknum, > > > > > > PgStreamingReadBufferCB next_block_cb) > > > > I'm not a fan of the name, especially the 'alloc' part. Yeah, most of > > the work it does is memory allocation. But I'd suggest something like > > 'pg_streaming_read_begin' instead. > > I like it. Done. > > > Do we really need the pg_ prefix in these? > > Good question. My understanding of our convention is that pg_ is > needed for local replacements/variants/extensions of things with well > known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in > a few places where the word is very common/short and we want to avoid > collisions and make sure it's obviously ours (pg_popcount?), and I > guess places that reflect the name of a SQL identifier with a prefix, > but this doesn't seem to qualify for any of those things. It's a new > thing, our own thing entirely, and sufficiently distinctive and > unconfusable with standard stuff. So, prefix removed. > > Lots of other patches on top of this one are using "pgsr" as a > variable name, ie containing that prefix; perhaps they would use "sr" > or "streaming_read" or "stream". I used "stream" in a few places in > this version. > > Other names improved in this version IMHO: pgsr_private -> > callback_private. I find it clearer, as a way to indicate that the > provider of the callback "owns" it. I also reordered the arguments: > now it's streaming_read_buffer_begin(..., callback, callback_private, > per_buffer_data_size), to keep those three things together. I haven't reviewed the whole patch, but as I was rebasing bitmapheapscan streaming read user, I found callback_private confusing because it seems like it is a private callback, not private data belonging to the callback. Perhaps call it callback_private_data? Also maybe mention what it is for in the comment above streaming_read_buffer_begin() and in the StreamingRead structure itself. - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/24/24 18:38, Melanie Plageman wrote: > On Sun, Mar 24, 2024 at 01:36:19PM +0100, Tomas Vondra wrote: >> >> >> On 3/23/24 01:26, Melanie Plageman wrote: >>> On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote: On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote: > On 18/03/2024 17:19, Melanie Plageman wrote: >> I've attached v7 rebased over this commit. > > If we delayed table_beginscan_bm() call further, after starting the TBM > iterator, we could skip it altogether when the iterator is empty. > > That's a further improvement, doesn't need to be part of this patch set. > Just caught my eye while reading this. Hmm. You mean like until after the first call to tbm_[shared]_iterate()? AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or not the iterator is "empty". Do you mean cases when the bitmap has no blocks in it? It seems like we should be able to tell that from the TIDBitmap. > >> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch > > I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call > the > flag e.g. SO_NEED_TUPLE. Agreed. Done in attached v8. Though I wondered if it was a bit weird that the flag is set in the common case and not set in the uncommon case... >>> >>> v8 actually attached this time >> >> I tried to run the benchmarks with v8, but unfortunately it crashes for >> me very quickly (I've only seen 0015 to crash, so I guess the bug is in >> that patch). >> >> The backtrace attached, this doesn't seem right: >> >> (gdb) p hscan->rs_cindex >> $1 = 543516018 > > Thanks for reporting this! I hadn't seen it crash on my machine, so I > didn't realize that I was no longer initializing rs_cindex and > rs_ntuples on the first call to heapam_bitmap_next_tuple() (since > heapam_bitmap_next_block() wasn't being called first). I've done this in > attached v9. > OK, I've restarted the tests with v9. > I haven't had a chance yet to reproduce the regressions you saw in the > streaming read user patch or to look closely at the performance results. So you tried to reproduce it and didn't hit the issue? Or didn't have time to look into that yet? FWIW with v7 it failed almost immediately (only a couple queries until hitting one triggering the issue), but v9 that's not the case (hundreds of queries without an error). > I don't anticipate the streaming read user will have any performance > differences in this v9 from v6, since I haven't yet rebased in Thomas' > latest streaming read API changes nor addressed any other potential > regression sources. > OK, understood. It'll be interesting to see the behavior with the new version of Thomas' patch. I however wonder what the plan with these patches is - do we still plan to get some of this into v17? It seems to me we're getting uncomfortably close to the end of the cycle, with a fairly incomplete idea of how it affects performance. Which is why I've been focusing more on the refactoring patches (up to 0015), to make sure those don't cause regressions if committed. And I think that's generally true. But for the main StreamingRead API the situation is very different. > I tried rebasing in Thomas' latest version today and something is > causing a crash that I have yet to figure out. v10 of this patchset will > have his latest version once I get that fixed. I wanted to share this > version with what I think is a bug fix for the crash you saw first. > Understood. I'll let the tests with v9 run for now. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_dump versus enum types, round N+1
Andrew Dunstan writes: > On Sat, Mar 23, 2024 at 3:00 PM Tom Lane wrote: >> So I'm glad we found that sooner not later, but something needs >> to be done about it if [1] is to get committed. It doesn't seem >> particularly hard to fix though: we just have to track the enum >> type OIDs made in the current transaction, using largely the same >> approach as is already used in pg_enum.c to track enum value OIDs. > Makes sense. Nice clear comments. Thanks for looking. Pushed after a bit more work on the comments. regards, tom lane
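The approach Tom describes reduces to a small amount of per-transaction bookkeeping: remember which OIDs the current transaction minted, so later code can tell "new this transaction" from pre-existing. A standalone toy model (a flat array with invented names; the real pg_enum.c machinery is more elaborate and resets via transaction callbacks):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;

#define MAX_TRACKED 1024

static Oid	created_oids[MAX_TRACKED];
static int	ncreated = 0;

/* call whenever the current transaction creates a new object */
static void
remember_created_oid(Oid oid)
{
	if (ncreated < MAX_TRACKED)
		created_oids[ncreated++] = oid;
}

/* was this OID minted by the current transaction? */
static bool
created_in_current_xact(Oid oid)
{
	for (int i = 0; i < ncreated; i++)
		if (created_oids[i] == oid)
			return true;
	return false;
}

/* at commit or abort, the list is simply forgotten */
static void
at_transaction_end(void)
{
	ncreated = 0;
}

int
main(void)
{
	remember_created_oid(16384);	/* e.g. a new enum type's OID */
	printf("16384 new? %d\n", created_in_current_xact(16384));	/* 1 */
	printf("11000 new? %d\n", created_in_current_xact(11000));	/* 0 */
	at_transaction_end();
	printf("16384 new after commit? %d\n", created_in_current_xact(16384));	/* 0 */
	return 0;
}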
Re: Combine Prune and Freeze records emitted by vacuum
On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote: > On 20/03/2024 21:17, Melanie Plageman wrote: > > Attached patch changes-for-0001.patch has a bunch of updated comments -- > > especially for heapam_xlog.h (including my promised diagram). And a few > > suggestions (mostly changes that I should have made before). > > New version of these WAL format changes attached. Squashed to one patch now. > I spent more time on the comments throughout the patch. And one > notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with > XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a > cleanup lock to replay the record. It must always be set when > XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying those > always needs a cleanup lock. That felt easier to document and understand > than XLHP_LP_TRUNCATE_ONLY. Makes sense to me. > > > From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001 > > > From: Heikki Linnakangas > > > Date: Wed, 20 Mar 2024 14:53:31 +0200 > > > Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze() > > > > > > Mostly to make local variables more tightly-scoped. > > > > So, I don't think you can move those sub-records into the tighter scope. > > If you run tests with this applied, you'll see it crashes and a lot of > > the data in the record is wrong. If you move the sub-record declarations > > out to a wider scope, the tests pass. > > > > The WAL record data isn't actually copied into the buffer until > > > > recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE); > > > > after registering everything. > > So all of those sub-records you made are out of scope the time it tries > > to copy from them. > > > > I spent the better part of a day last week trying to figure out what was > > happening after I did the exact same thing. I must say that I found the > > xloginsert API incredibly unhelpful on this point. > > Oops. I had that in mind and that was actually why I moved the > XLogRegisterData() call to the end of the function, because I found it > confusing to register the struct before filling it in completely, even > though it works perfectly fine. But then I missed it anyway when I moved the > local variables. I added a brief comment on that. Comment was a good idea. > There is another patch in the commitfest that touches this area: "Recording > whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning" > [1]. That actually goes in the opposite direction than this patch. That > patch wants to add more information, to show whether a record was emitted by > VACUUM or on-access pruning, while this patch makes the freezing and > VACUUM's 2nd phase records also look the same. We could easily add more > flags to xl_heap_prune to distinguish them. Or assign different xl_info code > for them, like that other patch proposed. But I don't think that needs to > block this patch, that can be added as a separate patch. > > [1] > https://www.postgresql.org/message-id/cah2-wzmsevhox+hjpfmqgcxwwdgnzp0r9f+vbnpok6tgxmp...@mail.gmail.com I think it would be very helpful to distinguish amongst vacuum pass 1, 2, and on-access pruning. I often want to know what did most of the pruning -- and I could see also wanting to know if the first or second vacuum pass was responsible for removing the tuples. I agree it could be done separately, but it would be very helpful to have as soon as possible now that the record type will be the same for all three. 
> From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Fri, 22 Mar 2024 23:10:22 +0200 > Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats > > /* > - * Handles XLOG_HEAP2_PRUNE record type. > - * > - * Acquires a full cleanup lock. > + * Replay XLOG_HEAP2_PRUNE_FREEZE record. > */ > static void > -heap_xlog_prune(XLogReaderState *record) > +heap_xlog_prune_freeze(XLogReaderState *record) > { > XLogRecPtr lsn = record->EndRecPtr; > - xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record); > + char *ptr; > + xl_heap_prune *xlrec; > Buffer buffer; > RelFileLocator rlocator; > BlockNumber blkno; > XLogRedoAction action; > > XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno); > + ptr = XLogRecGetData(record); I don't love having two different pointers that alias each other and we don't know which one is for what. Perhaps we could memcpy xlrec like in my attached diff (log_updates.diff). It also might perform slightly better than accessing flags through a xl_heap_prune * -- since it wouldn't be doing pointer dereferencing. > + xlrec = (xl_heap_prune *) ptr; > + ptr += SizeOfHeapPrune; > > /* > - * We're about to remove tuples. In Hot Standby mode, ensure that > there's > - * no queries running for which the removed tuples are still visible. > + *
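Here is a small standalone sketch of that memcpy suggestion; the struct is a made-up stand-in for xl_heap_prune, and the point is simply that copying the fixed-size header into a local and advancing a single cursor leaves no second pointer aliasing the record buffer:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct prune_record_model
{
	uint8_t		flags;
	uint16_t	nredirected;
	uint16_t	ndead;
} prune_record_model;

int
main(void)
{
	/* stand-in for the payload returned by XLogRecGetData() */
	char		buf[64];
	prune_record_model src = {.flags = 0x03, .nredirected = 2, .ndead = 5};

	memcpy(buf, &src, sizeof(src));

	/* copy the header out, then keep advancing one cursor past it */
	char	   *ptr = buf;
	prune_record_model xlrec;

	memcpy(&xlrec, ptr, sizeof(xlrec));
	ptr += sizeof(xlrec);

	printf("flags=%#x nredirected=%u ndead=%u\n",
		   (unsigned) xlrec.flags, (unsigned) xlrec.nredirected,
		   (unsigned) xlrec.ndead);
	return 0;
}

Field accesses then hit a plain stack local rather than going through a pointer into the buffer, which is the minor performance point raised above.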
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Mar 24, 2024 at 2:22 PM Tomas Vondra wrote: > > On 3/24/24 18:38, Melanie Plageman wrote: > > I haven't had a chance yet to reproduce the regressions you saw in the > > streaming read user patch or to look closely at the performance results. > > So you tried to reproduce it and didn't hit the issue? Or didn't have > time to look into that yet? FWIW with v7 it failed almost immediately > (only a couple queries until hitting one triggering the issue), but v9 > that's not the case (hundreds of queries without an error). I haven't started trying to reproduce it yet. > I however wonder what the plan with these patches is - do we still plan > to get some of this into v17? It seems to me we're getting uncomfortably > close to the end of the cycle, with a fairly incomplete idea of how it > affects performance. > > Which is why I've been focusing more on the refactoring patches (up to > 0015), to make sure those don't cause regressions if committed. And I > think that's generally true. Thank you for testing the refactoring patches with this in mind! Out of the refactoring patches, I think there is a subset of them that have independent value without the streaming read user. I think it is worth committing the first few patches because they remove a table AM layering violation. IMHO, all of the patches up to "Make table_scan_bitmap_next_block() async friendly" make the code nicer and better. And, if folks like the patch "Remove table_scan_bitmap_next_block()", then I think I could rebase that back in on top of "Make table_scan_bitmap_next_block() async friendly". This would mean table AMs would only have to implement one callback (table_scan_bitmap_next_tuple()) which I also think is a net improvement and simplification. The other refactoring patches may not be interesting without the streaming read user. > But for the main StreamingRead API the situation is very different. My intent for the bitmapheapscan streaming read user was to get it into 17, but I'm not sure that looks likely. The main issues Thomas is looking into right now are related to regressions for a fully cached scan (noticeable with the pg_prewarm streaming read user). With all of these fixed, I anticipate we will still see enough behavioral differences with the bitmapheap scan streaming read user that it may not be committable in time. Though, I have yet to work on reproducing the regressions with the BHS streaming read user mostly because I was focused on getting the refactoring ready and not as much because the streaming read API is unstable. - Melanie
Re: SET ROLE documentation improvement
On Sat, Mar 23, 2024 at 08:37:20AM -0400, Robert Haas wrote: > On Fri, Mar 22, 2024 at 4:51 PM Nathan Bossart > wrote: >> Actually, shouldn't this one be back-patched to v16? If so, I'd do that >> one separately from the other changes we are discussing. > > Sure, that seems fine. Committed that part. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add AVX2 support to simd.h
Here's a new version of 0001 with some added #ifdefs that cfbot revealed were missing. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cc2bc5ca5b49cd8641af8b2231a34a1aa5091bb9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 20 Mar 2024 14:20:24 -0500 Subject: [PATCH v7 1/1] pg_lfind32(): add "overlap" code for remaining elements --- src/include/port/pg_lfind.h | 105 1 file changed, 72 insertions(+), 33 deletions(-) diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h index b8dfa66eef..5830cc7cb3 100644 --- a/src/include/port/pg_lfind.h +++ b/src/include/port/pg_lfind.h @@ -80,6 +80,51 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem) return false; } +#ifndef USE_NO_SIMD +/* + * pg_lfind32_helper + * + * Searches one 4-register-block of integers. The caller is responsible for + * ensuring that there are at least 4-registers-worth of integers remaining. + */ +static inline bool +pg_lfind32_helper(const Vector32 keys, uint32 *base) +{ + const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32); + Vector32 vals1, +vals2, +vals3, +vals4, +result1, +result2, +result3, +result4, +tmp1, +tmp2, +result; + + /* load the next block into 4 registers */ + vector32_load(&vals1, base); + vector32_load(&vals2, &base[nelem_per_vector]); + vector32_load(&vals3, &base[nelem_per_vector * 2]); + vector32_load(&vals4, &base[nelem_per_vector * 3]); + + /* compare each value to the key */ + result1 = vector32_eq(keys, vals1); + result2 = vector32_eq(keys, vals2); + result3 = vector32_eq(keys, vals3); + result4 = vector32_eq(keys, vals4); + + /* combine the results into a single variable */ + tmp1 = vector32_or(result1, result2); + tmp2 = vector32_or(result3, result4); + result = vector32_or(tmp1, tmp2); + + /* return whether there was a match */ + return vector32_is_highbit_set(result); +} +#endif /* ! USE_NO_SIMD */ + /* * pg_lfind32 * @@ -119,46 +164,40 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem) } #endif - for (i = 0; i < tail_idx; i += nelem_per_iteration) + /* + * If there aren't enough elements for the SIMD code, jump to the standard + * one-by-one linear search code. + */ + if (nelem <= nelem_per_iteration) + goto one_by_one; + + /* + * Process as many elements as possible with a block of 4 registers. + */ + do { - Vector32 vals1, - vals2, - vals3, - vals4, - result1, - result2, - result3, - result4, - tmp1, - tmp2, - result; - - /* load the next block into 4 registers */ - vector32_load(&vals1, &base[i]); - vector32_load(&vals2, &base[i + nelem_per_vector]); - vector32_load(&vals3, &base[i + nelem_per_vector * 2]); - vector32_load(&vals4, &base[i + nelem_per_vector * 3]); - - /* compare each value to the key */ - result1 = vector32_eq(keys, vals1); - result2 = vector32_eq(keys, vals2); - result3 = vector32_eq(keys, vals3); - result4 = vector32_eq(keys, vals4); - - /* combine the results into a single variable */ - tmp1 = vector32_or(result1, result2); - tmp2 = vector32_or(result3, result4); - result = vector32_or(tmp1, tmp2); - - /* see if there was a match */ - if (vector32_is_highbit_set(result)) + if (pg_lfind32_helper(keys, &base[i])) { Assert(assert_result == true); return true; } - } + + i += nelem_per_iteration; + + } while (i < tail_idx); + + /* + * Process the last 'nelem_per_iteration' elements in the array with a + * 4-register block. This will cause us to check some of the elements + * more than once, but that won't affect correctness, and testing has + * demonstrated that this helps more cases than it harms. 
+ */ + Assert(assert_result == pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration])); + return pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration]); + #endif /* ! USE_NO_SIMD */ +one_by_one: /* Process the remaining elements one at a time. */ for (; i < nelem; i++) { -- 2.25.1
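The "overlap" trick in this patch is easier to see stripped of the SIMD machinery. In this standalone scalar model (BLOCK stands in for nelem_per_iteration; it is a sketch, not the committed code), the final block is anchored at nelem - BLOCK, so it re-checks up to BLOCK - 1 already-visited elements -- harmless for a membership test, and it replaces an element-at-a-time tail loop:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK 4					/* stands in for nelem_per_iteration */

/* check one block of BLOCK elements; caller guarantees they exist */
static int
block_contains(uint32_t key, const uint32_t *base)
{
	int			hit = 0;

	for (int j = 0; j < BLOCK; j++)
		hit |= (base[j] == key);
	return hit;
}

static int
lfind32_overlap(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	uint32_t	i = 0;

	/* too few elements for even one block: plain linear search */
	if (nelem < BLOCK)
	{
		for (; i < nelem; i++)
			if (base[i] == key)
				return 1;
		return 0;
	}

	/* whole blocks */
	for (; i + BLOCK <= nelem; i += BLOCK)
		if (block_contains(key, &base[i]))
			return 1;

	/* final, possibly overlapping block covering the last BLOCK elements */
	return block_contains(key, &base[nelem - BLOCK]);
}

int
main(void)
{
	uint32_t	a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

	assert(lfind32_overlap(9, a, 10));
	assert(!lfind32_overlap(42, a, 10));
	printf("ok\n");
	return 0;
}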
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/24/24 21:12, Melanie Plageman wrote: > On Sun, Mar 24, 2024 at 2:22 PM Tomas Vondra > wrote: >> >> On 3/24/24 18:38, Melanie Plageman wrote: >>> I haven't had a chance yet to reproduce the regressions you saw in the >>> streaming read user patch or to look closely at the performance results. >> >> So you tried to reproduce it and didn't hit the issue? Or didn't have >> time to look into that yet? FWIW with v7 it failed almost immediately >> (only a couple queries until hitting one triggering the issue), but v9 >> that's not the case (hundreds of queries without an error). > > I haven't started trying to reproduce it yet. > >> I however wonder what the plan with these patches is - do we still plan >> to get some of this into v17? It seems to me we're getting uncomfortably >> close to the end of the cycle, with a fairly incomplete idea of how it >> affects performance. >> >> Which is why I've been focusing more on the refactoring patches (up to >> 0015), to make sure those don't cause regressions if committed. And I >> think that's generally true. > > Thank you for testing the refactoring patches with this in mind! Out > of the refactoring patches, I think there is a subset of them that > have independent value without the streaming read user. I think it is > worth committing the first few patches because they remove a table AM > layering violation. IMHO, all of the patches up to "Make > table_scan_bitmap_next_block() async friendly" make the code nicer and > better. And, if folks like the patch "Remove > table_scan_bitmap_next_block()", then I think I could rebase that back > in on top of "Make table_scan_bitmap_next_block() async friendly". > This would mean table AMs would only have to implement one callback > (table_scan_bitmap_next_tuple()) which I also think is a net > improvement and simplification. > > The other refactoring patches may not be interesting without the > streaming read user. > I admit not reviewing the individual patches very closely yet, but this matches how I understood them - that at least some are likely an improvement on their own, not just as a refactoring preparing for the switch to streaming reads. We only have ~2 weeks left, so it's probably time to focus on getting at least those improvements committed. I see Heikki was paying way more attention to the patches than me, though ... BTW when you say "up to 'Make table_scan_bitmap_next_block() async friendly'" do you mean including that patch, or that this is the first patch that is not one of the independently useful patches? (I took a quick look at the first couple patches and I appreciate that you keep separate patches with small cosmetic changes to keep the actual patch smaller and easier to understand.) >> But for the main StreamingRead API the situation is very different. > > My intent for the bitmapheapscan streaming read user was to get it > into 17, but I'm not sure that looks likely. The main issues Thomas is > looking into right now are related to regressions for a fully cached > scan (noticeable with the pg_prewarm streaming read user). With all of > these fixed, I anticipate we will still see enough behavioral > differences with the bitmapheap scan streaming read user that it may > not be committable in time. Though, I have yet to work on reproducing > the regressions with the BHS streaming read user mostly because I was > focused on getting the refactoring ready and not as much because the > streaming read API is unstable. 
> I don't have a very good intuition regarding the impact of the streaming API patch on performance. I haven't been following that thread very closely, but AFAICS there wasn't much discussion about that - perhaps it happened offlist, not sure. So who knows, really? Which is why I started looking at this patch instead - it seemed easier to benchmark with a somewhat realistic workload. But yeah, there certainly were significant behavior changes, and it's unlikely that whatever Thomas did in v8 made them go away. FWIW I certainly am *not* suggesting there must be no behavior changes, that's simply not possible. I'm not even suggesting no queries must get slower - given the dependence on storage, I think some regressions are pretty much inevitable. But it'd still be good to know the regressions are reasonably rare exceptions rather than the common case, and that's not what I'm seeing ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: add AVX2 support to simd.h
On Sun, Mar 24, 2024 at 03:53:17PM -0500, Nathan Bossart wrote: > Here's a new version of 0001 with some added #ifdefs that cfbot revealed > were missing. Sorry for the noise. cfbot revealed another silly mistake (forgetting to reset the "i" variable in the assertion path). That should be fixed in v8. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From f15d85844370aef8505559fc0f2db629b135a9e8 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 20 Mar 2024 14:20:24 -0500 Subject: [PATCH v8 1/1] pg_lfind32(): add "overlap" code for remaining elements --- src/include/port/pg_lfind.h | 109 1 file changed, 74 insertions(+), 35 deletions(-) diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h index b8dfa66eef..f746aabbf9 100644 --- a/src/include/port/pg_lfind.h +++ b/src/include/port/pg_lfind.h @@ -80,6 +80,51 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem) return false; } +#ifndef USE_NO_SIMD +/* + * pg_lfind32_helper + * + * Searches one 4-register-block of integers. The caller is responsible for + * ensuring that there are at least 4-registers-worth of integers remaining. + */ +static inline bool +pg_lfind32_helper(const Vector32 keys, uint32 *base) +{ + const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32); + Vector32 vals1, +vals2, +vals3, +vals4, +result1, +result2, +result3, +result4, +tmp1, +tmp2, +result; + + /* load the next block into 4 registers */ + vector32_load(&vals1, base); + vector32_load(&vals2, &base[nelem_per_vector]); + vector32_load(&vals3, &base[nelem_per_vector * 2]); + vector32_load(&vals4, &base[nelem_per_vector * 3]); + + /* compare each value to the key */ + result1 = vector32_eq(keys, vals1); + result2 = vector32_eq(keys, vals2); + result3 = vector32_eq(keys, vals3); + result4 = vector32_eq(keys, vals4); + + /* combine the results into a single variable */ + tmp1 = vector32_or(result1, result2); + tmp2 = vector32_or(result3, result4); + result = vector32_or(tmp1, tmp2); + + /* return whether there was a match */ + return vector32_is_highbit_set(result); +} +#endif /* ! USE_NO_SIMD */ + /* * pg_lfind32 * @@ -109,9 +154,9 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem) bool assert_result = false; /* pre-compute the result for assert checking */ - for (i = 0; i < nelem; i++) + for (int j = 0; j < nelem; j++) { - if (key == base[i]) + if (key == base[j]) { assert_result = true; break; @@ -119,46 +164,40 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem) } #endif - for (i = 0; i < tail_idx; i += nelem_per_iteration) + /* + * If there aren't enough elements for the SIMD code, jump to the standard + * one-by-one linear search code. + */ + if (nelem <= nelem_per_iteration) + goto one_by_one; + + /* + * Process as many elements as possible with a block of 4 registers. 
+ */ + do { - Vector32 vals1, - vals2, - vals3, - vals4, - result1, - result2, - result3, - result4, - tmp1, - tmp2, - result; - - /* load the next block into 4 registers */ - vector32_load(&vals1, &base[i]); - vector32_load(&vals2, &base[i + nelem_per_vector]); - vector32_load(&vals3, &base[i + nelem_per_vector * 2]); - vector32_load(&vals4, &base[i + nelem_per_vector * 3]); - - /* compare each value to the key */ - result1 = vector32_eq(keys, vals1); - result2 = vector32_eq(keys, vals2); - result3 = vector32_eq(keys, vals3); - result4 = vector32_eq(keys, vals4); - - /* combine the results into a single variable */ - tmp1 = vector32_or(result1, result2); - tmp2 = vector32_or(result3, result4); - result = vector32_or(tmp1, tmp2); - - /* see if there was a match */ - if (vector32_is_highbit_set(result)) + if (pg_lfind32_helper(keys, &base[i])) { Assert(assert_result == true); return true; } - } + + i += nelem_per_iteration; + + } while (i < tail_idx); + + /* + * Process the last 'nelem_per_iteration' elements in the array with a + * 4-register block. This will cause us to check some of the elements + * more than once, but that won't affect correctness, and testing has + * demonstrated that this helps more cases than it harms. + */ + Assert(assert_result == pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration])); + return pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration]); + #endif /* ! USE_NO_SIMD */ +one_by_one: /* Process the remaining elements one at a time. */ for (; i < nelem; i++) { -- 2.25.1
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Mar 24, 2024 at 5:59 PM Tomas Vondra wrote: > > BTW when you say "up to 'Make table_scan_bitmap_next_block() async > friendly'" do you mean including that patch, or that this is the first > patch that is not one of the independently useful patches? I think the code is easier to understand with "Make table_scan_bitmap_next_block() async friendly". Prior to that commit, table_scan_bitmap_next_block() could return false even when the bitmap had more blocks, expecting the caller to handle this and invoke it again. I think that interface is very confusing. The downside of the code in that state is that the code for prefetching is still in the BitmapHeapNext() code and the code for getting the current block is in the heap AM-specific code. I took a stab at fixing this in v9's 0013, but the outcome wasn't very attractive. What I will do tomorrow is reorder and group the commits such that all of the commits that are useful independent of streaming read are first (I think 0014 and 0015 are independently valuable but they are on top of some things that are only useful to streaming read because they are more recently requested changes). I think I can actually do a bit of simplification in terms of how many commits there are and what is in each. Just to be clear, v9 is still reviewable. I am just going to go back and change what is included in each commit. > (I took a quick look at the first couple patches and I appreciate that > you keep separate patches with small cosmetic changes to keep the actual > patch smaller and easier to understand.) Thanks!
Re: Streaming I/O, vectored I/O (WIP)
On Mon, Mar 25, 2024 at 6:30 AM Melanie Plageman wrote: > I haven't reviewed the whole patch, but as I was rebasing > bitmapheapscan streaming read user, I found callback_private confusing > because it seems like it is a private callback, not private data > belonging to the callback. Perhaps call it callback_private_data? Also WFM. > maybe mention what it is for in the comment above > streaming_read_buffer_begin() and in the StreamingRead structure > itself. Yeah. I've tried to improve the comments on all three public functions. I also moved the three public functions _begin(), _next(), _end() to be next to each other after the static helper functions. Working on perf regression/tuning reports today, more soon... From edd3d078cf8d4b0c2f08df82295825f7107ec62b Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 26 Feb 2024 23:48:31 +1300 Subject: [PATCH v9 1/4] Provide vectored variant of ReadBuffer(). Break ReadBuffer() up into two steps: StartReadBuffers() and WaitReadBuffers(). This has two advantages: 1. Multiple consecutive blocks can be read with one system call. 2. Advice (hints of future reads) can optionally be issued to the kernel. The traditional ReadBuffer() function is now implemented in terms of those functions, to avoid duplication. For now we still only read a block at a time so there is no change to generated system calls yet, but later commits will provide infrastructure to help build up larger calls. Callers should respect the new GUC buffer_io_size, and the limit on per-backend pins which is now exposed as a public interface. With some more infrastructure in later work, StartReadBuffers() could be extended to start real asynchronous I/O instead of advice. Reviewed-by: Melanie Plageman Reviewed-by: Heikki Linnakangas Reviewed-by: Nazir Bilal Yavuz Reviewed-by: Dilip Kumar Reviewed-by: Andres Freund Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com --- doc/src/sgml/config.sgml | 14 + src/backend/storage/buffer/bufmgr.c | 658 -- src/backend/storage/buffer/localbuf.c | 14 +- src/backend/utils/misc/guc_tables.c | 14 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/storage/bufmgr.h | 45 +- src/tools/pgindent/typedefs.list | 1 + 7 files changed, 535 insertions(+), 212 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 65a6e6c4086..3af86c59384 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2719,6 +2719,20 @@ include_dir 'conf.d' + + buffer_io_size (integer) + +buffer_io_size configuration parameter + + + + + Controls the target I/O size in operations that coalesce buffer I/O. + The default is 128kB. + + + + max_worker_processes (integer) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f0f8d4259c5..b5347678726 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -19,6 +19,11 @@ * and pin it so that no one can destroy it while this process * is using it. * + * StartReadBuffers() -- as above, but for multiple contiguous blocks in + * two steps. + * + * WaitReadBuffers() -- second step of StartReadBuffers(). + * * ReleaseBuffer() -- unpin a buffer * * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty". 
@@ -160,6 +165,12 @@ int checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER; int bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER; int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER; +/* + * How many buffers should be coalesced into single I/O operations where + * possible. + */ +int buffer_io_size = DEFAULT_BUFFER_IO_SIZE; + /* local state for LockBufferForCleanup */ static BufferDesc *PinCountWaitBuf = NULL; @@ -471,10 +482,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref) ) -static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence, +static Buffer ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum, BlockNumber blockNum, -ReadBufferMode mode, BufferAccessStrategy strategy, -bool *hit); +ReadBufferMode mode, BufferAccessStrategy strategy); static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr, ForkNumber fork, BufferAccessStrategy strategy, @@ -500,7 +510,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf); static int SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context); static void WaitIO(BufferDesc *buf); -static bool StartBufferIO(BufferDesc *buf, bool forInput); +static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait); static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, ui
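For readers trying to picture the new API from the diff, here is a rough usage sketch of the two-step flow. The function names come from the commit message above, but the exact signatures and the ReadBuffersOperation setup are assumptions based on this in-progress version, not a settled interface:

	Buffer		buffers[8];
	int			nblocks = 8;
	ReadBuffersOperation op;	/* assumed per-operation state struct */

	/*
	 * Step 1: pin up to nblocks consecutive buffers starting at blocknum,
	 * possibly with a single vectored read or kernel advice. Assumed to
	 * return true if I/O was started and must be completed in step 2.
	 */
	if (StartReadBuffers(&op, buffers, blocknum, &nblocks, 0))
		WaitReadBuffers(&op);	/* step 2: wait for the I/O to finish */

	/* buffers[0 .. nblocks - 1] are now pinned and valid */
	for (int i = 0; i < nblocks; i++)
		ReleaseBuffer(buffers[i]);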
Re: Built-in CTYPE provider
On Sun, 2024-03-24 at 14:00 +0300, Alexander Lakhin wrote: > Please look at a Valgrind-detected error caused by the following > query > (starting from f69319f2f): > SELECT lower('Π' COLLATE pg_c_utf8); Thank you for the report! Fixed in 503c0ad976. Valgrind did not detect the problem in my setup, so I added a unit test in case_test.c where it's easier to see the valgrind problem. Regards, Jeff Davis
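For anyone wanting to reproduce this class of bug deliberately, the trick in such a unit test is to allocate the source string with exactly its stated length and no terminating NUL, so that any read past the end is an invalid access a memory checker can flag. A sketch of what a case_test.c-style check might look like; unicode_strlower() is taken from the trace above, but its exact signature is an assumption:

	/* hypothetical boundary check in the style of case_test.c */
	static void
	test_lower_boundary(void)
	{
		/* exactly two bytes: 'Π' (U+03A0) in UTF-8, no trailing NUL */
		char	   *src = malloc(2);
		char		dst[8];
		size_t		needed;

		memcpy(src, "\xCE\xA0", 2);

		/* assumed signature: (dst, dst size, src, src length in bytes) */
		needed = unicode_strlower(dst, sizeof(dst), src, 2);
		if (needed >= sizeof(dst))
		{
			printf("unexpected result length %zu\n", needed);
			exit(1);
		}
		free(src);
	}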
Re: Adding OLD/NEW support to RETURNING
On Mon, Mar 18, 2024 at 6:48 PM Dean Rasheed wrote:
> > On Tue, 12 Mar 2024 at 18:21, Dean Rasheed wrote:
> >
> > Updated version attached tidying up a couple of things and fixing another bug:
>
> Rebased version attached, on top of c649fa24a4 (MERGE ... RETURNING support).

Hi, here are some minor issues I found.

+/*
+ * ReplaceReturningVarsFromTargetList -
+ *	replace RETURNING list Vars with items from a targetlist
+ *
+ * This is equivalent to calling ReplaceVarsFromTargetList() with a
+ * nomatch_option of REPLACEVARS_REPORT_ERROR, but with the added effect of
+ * copying varreturningtype onto any Vars referring to new_result_relation,
+ * allowing RETURNING OLD/NEW to work in the rewritten query.
+ */
+
+typedef struct
+{
+	ReplaceVarsFromTargetList_context rv_con;
+	int			new_result_relation;
+} ReplaceReturningVarsFromTargetList_context;
+
+static Node *
+ReplaceReturningVarsFromTargetList_callback(Var *var,
+						replace_rte_variables_context *context)
+{
+	ReplaceReturningVarsFromTargetList_context *rcon = (ReplaceReturningVarsFromTargetList_context *) context->callback_arg;
+	Node	   *newnode;
+
+	newnode = ReplaceVarsFromTargetList_callback(var, context);
+
+	if (var->varreturningtype != VAR_RETURNING_DEFAULT)
+		SetVarReturningType((Node *) newnode, rcon->new_result_relation,
+							var->varlevelsup, var->varreturningtype);
+
+	return newnode;
+}
+
+Node *
+ReplaceReturningVarsFromTargetList(Node *node,
+								   int target_varno, int sublevels_up,
+								   RangeTblEntry *target_rte,
+								   List *targetlist,
+								   int new_result_relation,
+								   bool *outer_hasSubLinks)
+{
+	ReplaceReturningVarsFromTargetList_context context;
+
+	context.rv_con.target_rte = target_rte;
+	context.rv_con.targetlist = targetlist;
+	context.rv_con.nomatch_option = REPLACEVARS_REPORT_ERROR;
+	context.rv_con.nomatch_varno = 0;
+	context.new_result_relation = new_result_relation;
+
+	return replace_rte_variables(node, target_varno, sublevels_up,
+								 ReplaceReturningVarsFromTargetList_callback,
+								 (void *) &context,
+								 outer_hasSubLinks);
+}

Shouldn't the comment for ReplaceReturningVarsFromTargetList be placed right above the function ReplaceReturningVarsFromTargetList itself, not above ReplaceReturningVarsFromTargetList_context? Also, adding some comments about new_result_relation to struct ReplaceReturningVarsFromTargetList_context would be great.

The comment /* INDEX_VAR is handled by default case */ appears in execExpr.c and execExprInterp.c. Should it be moved into the switch statement's default case?

/*
 * set_deparse_context_plan - Specify Plan node containing expression
 *
 * When deparsing an expression in a Plan tree, we might have to resolve
 * OUTER_VAR, INNER_VAR, or INDEX_VAR references. To do this, the caller must
 * provide the parent Plan node.
...
 */

Does the comment in set_deparse_context_plan need to be updated?

+ * buildNSItemForReturning -
+ *	add a ParseNamespaceItem for the OLD or NEW alias in RETURNING.
+ */
+static void
+addNSItemForReturning(ParseState *pstate, const char *aliasname,
+					  VarReturningType returning_type)

Shouldn't the comment say "addNSItemForReturning" rather than "buildNSItemForReturning"?

 * results. If include_dropped is true then empty strings and NULL constants
 * (not Vars!) are returned for dropped columns.
 *
- * rtindex, sublevels_up, and location are the varno, varlevelsup, and location
- * values to use in the created Vars. Ordinarily rtindex should match the
- * actual position of the RTE in its rangetable.
+ * rtindex, sublevels_up, returning_type, and location are the varno,
+ * varlevelsup, varreturningtype, and location values to use in the created
+ * Vars. Ordinarily rtindex should match the actual position of the RTE in
+ * its rangetable.

We already updated the comment in expandRTE, but it seems we only do this for RTE_RELATION and parts of RTE_FUNCTION. Do we need `varnode->varreturningtype = returning_type;` for the other `rte->rtekind` values wherever there is a makeVar? (I don't understand this part: in the case where rte->rtekind is RTE_SUBQUERY, if I add `varnode->varreturningtype = returning_type;` the tests still pass.)
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > > John Naylor writes: > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > has been a very long time coming, but we're finally in the home > > stretch! Thank you for the report. > > I'm not sure why it took a couple weeks for Coverity to notice > ee1b30f12, but it saw it today, and it's not happy: Hmm, I've also done Coverity Scan in development but I wasn't able to see this one for some reason... > > /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in > local_ts_extend_down() > 1615node = child; > 1616shift -= RT_SPAN; > 1617} > 1618 > 1619/* Reserve slot for the value. */ > 1620n4 = (RT_NODE_4 *) node.local; > >>> CID 1594658: Integer handling issues (BAD_SHIFT) > >>> In expression "key >> shift", shifting by a negative amount has > >>> undefined behavior. The shift amount, "shift", is as little as -7. > 1621n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > 1622n4->base.count = 1; > 1623 > 1624return &n4->children[0]; > 1625 } > 1626 > > I think the point here is that if you start with an arbitrary > non-negative shift value, the preceding loop may in fact decrement it > down to something less than zero before exiting, in which case we > would indeed have trouble. I suspect that the code is making > undocumented assumptions about the possible initial values of shift. > Maybe some Asserts would be good? Also, if we're effectively assuming > that shift must be exactly zero here, why not let the compiler > hard-code that? > > - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); Sounds like a good solution. I've attached the patch for that. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_radixtree.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
Masahiko Sawada writes: > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: >> I think the point here is that if you start with an arbitrary >> non-negative shift value, the preceding loop may in fact decrement it >> down to something less than zero before exiting, in which case we >> would indeed have trouble. I suspect that the code is making >> undocumented assumptions about the possible initial values of shift. >> Maybe some Asserts would be good? Also, if we're effectively assuming >> that shift must be exactly zero here, why not let the compiler >> hard-code that? > Sounds like a good solution. I've attached the patch for that. Personally I'd put the Assert immediately after the loop, because it's not related to the "Reserve slot for the value" comment. Seems reasonable otherwise. regards, tom lane
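Putting the two suggestions together, the tail of local_ts_extend_down() would end up looking roughly like this (a sketch assembled from the fragments quoted above, not the exact committed hunk):

	/* ... the preceding descent loop, which decrements shift by RT_SPAN ... */

	/* The loop must have bottomed out exactly at the leaf level. */
	Assert(shift == 0);

	/* Reserve slot for the value. */
	n4 = (RT_NODE_4 *) node.local;
	n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);
	n4->base.count = 1;

	return &n4->children[0];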
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 8:02 AM Masahiko Sawada wrote:
> On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote:
> >
> > I'm not sure why it took a couple weeks for Coverity to notice
> > ee1b30f12, but it saw it today, and it's not happy:
>
> Hmm, I've also done Coverity Scan in development but I wasn't able to
> see this one for some reason...

Hmm, before 30e144287 this code only ran in a test module; is it possible Coverity would not find it there?
Re: Cannot find a working 64-bit integer type on Illumos
On Sat, 23 Mar 2024 at 01:22, Japin Li wrote:
> On Sat, 23 Mar 2024 at 01:04, Tom Lane wrote:
>> Japin Li writes:
>>> When I try to configure PostgreSQL 16.2 on Illumos using the following command,
>>> it complains $subject.
>>
>>> ./configure --enable-cassert --enable-debug --enable-nls --with-perl \
>>> --with-python --without-tcl --without-gssapi --with-openssl \
>>> --with-ldap --with-libxml --with-libxslt --without-systemd \
>>> --with-readline --enable-thread-safety --enable-dtrace \
>>> DTRACEFLAGS=-64 CFLAGS=-Werror
>>
>>> However, if I remove the `CFLAGS=-Werror`, it works fine.
>>> I'm not sure what happened here.
>>
>> CFLAGS=-Werror breaks a whole lot of configure's tests, not only that
>> one. (We even have this documented, see [1].) So you can't inject
>> -Werror that way. What I do on my buildfarm animals is the equivalent of
>>
>> export COPT='-Werror'
>>
>> after configure and before build. I think configure pays no attention
>> to COPT, so it'd likely be safe to keep that set all the time, but in
>> the buildfarm client it's just as easy to be conservative.
>>
>> regards, tom lane
>>
>> [1] https://www.postgresql.org/docs/devel/install-make.html#CONFIGURE-ENVVARS
>
> Thank you very much! I didn't notice this part before.

I tried the following to compile it; however, the build still fails:

$ ../configure --enable-cassert --enable-debug --enable-nls --with-perl --with-python --without-tcl --without-gssapi --with-openssl --with-ldap --with-libxml --with-libxslt --without-systemd --with-readline --enable-thread-safety --enable-dtrace DTRACEFLAGS=-64
$ make COPT='-Werror' -s
/home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c: In function 'repairDependencyLoop':
/home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c:1276:3: error: format not a string literal and no format arguments [-Werror=format-security]
 1276 | pg_log_warning(ngettext("there are circular foreign-key constraints on this table:",
      | ^~
cc1: all warnings being treated as errors
make[3]: *** [: pg_dump_sort.o] Error 1
make[2]: *** [Makefile:43: all-pg_dump-recurse] Error 2
make[1]: *** [Makefile:42: all-bin-recurse] Error 2
make: *** [GNUmakefile:11: all-src-recurse] Error 2
Re: [PoC] Improve dead tuple storage for lazy vacuum
John Naylor writes: > Hmm, before 30e144287 this code only ran in a test module, is it > possible Coverity would not find it there? That could indeed explain why Coverity didn't see it. I'm not sure how our community run is set up, but it may not build the test modules. regards, tom lane
Re: Cannot find a working 64-bit integer type on Illumos
Japin Li writes: > /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c: In function > 'repairDependencyLoop': > /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c:1276:3: error: > format not a string literal and no format arguments [-Werror=format-security] > 1276 | pg_log_warning(ngettext("there are circular foreign-key constraints > on this table:", > | ^~ Yeah, some of the older buildfarm animals issue that warning too. AFAICS it's a bogus compiler heuristic: there is not anything wrong with the code as given. regards, tom lane
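For context, the usual way to keep such calls quiet under -Wformat-security is to route the translated string through a constant "%s" format, which changes nothing about the output. A sketch; the plural message text and the count variable shown are reconstructed from memory of pg_dump_sort.c, so treat the details as assumptions:

	/* warns: the format string is computed by ngettext(), not a literal */
	pg_log_warning(ngettext("there are circular foreign-key constraints on this table:",
							"there are circular foreign-key constraints among these tables:",
							nLoop));

	/* warning-clean equivalent */
	pg_log_warning("%s",
				   ngettext("there are circular foreign-key constraints on this table:",
							"there are circular foreign-key constraints among these tables:",
							nLoop));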
Re: Cannot find a working 64-bit integer type on Illumos
On Mon, 25 Mar 2024 at 09:32, Tom Lane wrote:
> Japin Li writes:
>> /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c: In function 'repairDependencyLoop':
>> /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c:1276:3: error: format not a string literal and no format arguments [-Werror=format-security]
>>  1276 | pg_log_warning(ngettext("there are circular foreign-key constraints on this table:",
>>       | ^~
>
> Yeah, some of the older buildfarm animals issue that warning too.
> AFAICS it's a bogus compiler heuristic: there is not anything
> wrong with the code as given.

Thanks! It seems I should remove the -Werror option on Illumos.
Re: Properly pathify the union planner
On Tue, 12 Mar 2024 at 23:21, David Rowley wrote:
> I've attached v3.

I spent quite a bit more time looking at this. I discovered that the dNumGroups wasn't being set as it should have been for INTERSECT and EXCEPT queries. There was a plan change as a result of this. I've fixed this by adjusting where dNumGroups is set. It must be delayed until after the setop child paths have been created.

Aside from this, the changes I made were mostly cosmetic. However, I did notice that I wasn't setting the union child RelOptInfo's ec_indexes in add_setop_child_rel_equivalences(). I also discovered that we're not doing that properly for the top-level RelOptInfo for the UNION query prior to this change. The reason is the Var.varno==0 used for the top-level UNION targetlist. The code in get_eclass_for_sort_expr() effectively misses this relation due to "while ((i = bms_next_member(newec->ec_relids, i)) > 0)". This happens to be good because there is no root->simple_rel_array[0] entry, which happens to prevent that code from crashing. It seems ok that the ec_indexes are not set for the top-level set RelOptInfo, as get_eclass_for_sort_expr() does not make use of ec_indexes while searching for an existing EquivalenceClass. Maybe we should fix this varno == 0 hack and adjust get_eclass_for_sort_expr() so that it makes use of the ec_indexes.

It's possible to see this happen with a query such as:

SELECT oid FROM pg_class UNION SELECT oid FROM pg_class ORDER BY oid;

I didn't see that as a reason not to push this patch as this occurs both with and without this change, so I've now pushed this patch.

Thank you and Andy for reviewing this.

David
Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.
On Mon, Oct 30, 2023 at 11:29:04AM +0400, Pavel Borisov wrote: > On Wed, 25 Oct 2023 at 00:13, Alexander Korotkov wrote: > > I think this patch is ready to go. I'm going to push it if there are > > no objections. > It's very good that this long-standing patch is finally committed. Thanks a > lot! Agreed. I gave this feature (commit 5ae2087) a try. Thanks for implementing it. Could I get your input on two topics? 1. Cross-page comparison at "first key on the next page" only Cross-page comparisons got this discussion upthread: On Tue, Mar 02, 2021 at 07:10:32PM -0800, Peter Geoghegan wrote: > On Mon, Feb 8, 2021 at 2:46 AM Pavel Borisov wrote: > > Caveat: if the first entry on the next index page has a key equal to the > > key on a previous page AND all heap tid's corresponding to this entry are > > invisible, currently cross-page check can not detect unique constraint > > violation between previous index page entry and 2nd, 3d and next current > > index page entries. In this case, there would be a message that recommends > > doing VACUUM to remove the invisible entries from the index and repeat the > > check. (Generally, it is recommended to do vacuum before the check, but for > > the testing purpose I'd recommend turning it off to check the detection of > > visible-invisible-visible duplicates scenarios) > You're going to have to "couple" buffer locks in the style of > _bt_check_unique() (as well as keeping a buffer lock on "the first > leaf page a duplicate might be on" throughout) if you need the test to > work reliably. The amcheck feature has no lock coupling at its "first key on the next page" check. I think that's fine, because amcheck takes one snapshot at the beginning and looks for pairs of visible-to-that-snapshot heap tuples with the same scan key. _bt_check_unique(), unlike amcheck, must catch concurrent inserts. If amcheck "checkunique" wanted to detect duplicates that would appear when all transactions commit, it would need lock coupling. (I'm not suggesting it do that.) Do you see a problem with the lack of lock coupling at "first key on the next page"? > But why bother with that? The tool doesn't have to be > 100% perfect at detecting corruption (nothing can be), and it's rather > unlikely that it will matter for this test. A simple test that doesn't > handle cross-page duplicates is still going to be very effective. I agree, but perhaps the "first key on the next page" code is more complex than general-case code would be. If the lack of lock coupling is fine, then I think memory context lifecycle is the only obstacle making index page boundaries special. Are there factors beyond that? We already have state->lowkey kept across pages via MemoryContextAlloc(). Similar lines of code could preserve the scan key for checkunique, making the "first key on the next page" code unnecessary. 2. Raises runtime by 476% despite no dead tuples I used the following to create a table larger than RAM, 17GB table and 10GB index on a system with 12GB RAM: \set count 5 begin; set maintenance_work_mem = '1GB'; set client_min_messages = debug1; -- debug2 is per-block spam create temp table t as select n from generate_series(1,:count) t(n); create unique index t_idx on t(n); \dt+ t \di+ t_idx create extension amcheck; select bt_index_check('t_idx', heapallindexed => false, checkunique => false); select bt_index_check('t_idx', heapallindexed => false, checkunique => true); Adding checkunique raised runtime from 58s to 276s, because it checks visibility for every heap tuple. 
It could do the heap fetch and visibility check lazily, when the index yields two heap TIDs for one scan key. That should give zero visibility checks for this particular test case, and it doesn't add visibility checks to bloated-table cases. Pseudo-code: /*--- * scan_key is the last uniqueness-relevant scan key observed as * bt_check_level_from_leftmost() moves right to traverse the leaf level. * Will be NULL if the next tuple can't be the second tuple of a * uniqueness violation, because any of the following apply: * - we're evaluating the first leaf tuple of the entire index * - last scan key had anynullkeys (never forms a uniqueness violation w/ * any other scan key) */ scan_key = NULL; /* * scan_key_known_visible==true indicates that scan_key_heap_tid is the * last _visible_ heap TID observed for scan_key. Otherwise, * scan_key_heap_tid is the last heap TID observed for scan_key, and we've * not yet checked its visibility. */ bool scan_key_known_visible; scan_key_heap_tid; foreach itup (leftmost_leaf_level_tup .. rightmost_leaf_level_tup) { if (itup.anynullkeys) scan_key = NULL; else if (scan_key != NULL && _bt_compare(scan_k
Re: Add Index-level REINDEX with multiple jobs
Alexander Korotkov writes: > Here goes the revised patch. I'm going to push this if there are no > objections. Quite a lot of the buildfarm is complaining about this: reindexdb.c: In function 'reindex_one_database': reindexdb.c:434:54: error: 'indices_tables_cell' may be used uninitialized in this function [-Werror=maybe-uninitialized] 434 | strcmp(prev_index_table_name, indices_tables_cell->val) == 0) | ~~~^ I noticed it first on mamba, which is set up with -Werror, but a scrape of the buildfarm logs shows many other animals reporting this as a warning. I think they are right. Even granting that the compiler realizes that "parallel && process_type == REINDEX_INDEX" is enough to reach the one place where indices_tables_cell is initialized, that's not really enough, because that place is if (indices_tables_list) indices_tables_cell = indices_tables_list->head; So I believe this code will crash if get_parallel_object_list returns an empty list. Initializing indices_tables_cell to NULL in its declaration would stop the compiler warning, but if I'm right it will do nothing to prevent that crash. This needs a bit more effort. regards, tom lane
Re: Add Index-level REINDEX with multiple jobs
On Mon, Mar 25, 2024 at 10:07 AM Tom Lane wrote: > Alexander Korotkov writes: > > Here goes the revised patch. I'm going to push this if there are no > objections. > > Quite a lot of the buildfarm is complaining about this: > > reindexdb.c: In function 'reindex_one_database': > reindexdb.c:434:54: error: 'indices_tables_cell' may be used uninitialized > in this function [-Werror=maybe-uninitialized] > 434 | strcmp(prev_index_table_name, indices_tables_cell->val) == 0) > | ~~~^ > > I noticed it first on mamba, which is set up with -Werror, but a > scrape of the buildfarm logs shows many other animals reporting this > as a warning. I noticed the similar warning on cfbot: https://cirrus-ci.com/task/6298504306360320?logs=gcc_warning#L448 reindexdb.c: In function ‘reindex_one_database’: reindexdb.c:437:24: error: ‘indices_tables_cell’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 437 |indices_tables_cell = indices_tables_cell->next; |^~~ Although it's complaining on line 437 not 434, I think they are the same issue. > I think they are right. Even granting that the > compiler realizes that "parallel && process_type == REINDEX_INDEX" is > enough to reach the one place where indices_tables_cell is > initialized, that's not really enough, because that place is > > if (indices_tables_list) > indices_tables_cell = indices_tables_list->head; > > So I believe this code will crash if get_parallel_object_list returns > an empty list. Initializing indices_tables_cell to NULL in its > declaration would stop the compiler warning, but if I'm right it > will do nothing to prevent that crash. This needs a bit more effort. Agreed. And the comment of get_parallel_object_list() says that it may indeed return NULL. BTW, on line 373, it checks 'process_list' and bails out if this list is NULL. But it seems to me that 'process_list' cannot be NULL in this case, because it's initialized to be 'user_list' and we have asserted that user_list is not NULL on line 360. I wonder if we should check indices_tables_list instead of process_list on line 373. Thanks Richard
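To illustrate the shape of the fix being discussed: initializing the cell only silences the compiler, while the empty-list case needs an explicit escape of its own. A sketch with the surrounding reindexdb logic elided (SimpleStringListCell is the list type these scripts already use; the exact recovery action is a placeholder):

	SimpleStringListCell *indices_tables_cell = NULL;	/* silences the warning */

	if (parallel && process_type == REINDEX_INDEX)
	{
		indices_tables_list = get_parallel_object_list( /* ... */ );

		/* handle an empty result instead of merely quieting the compiler */
		if (indices_tables_list == NULL)
			return;				/* or fall back to non-parallel processing */

		indices_tables_cell = indices_tables_list->head;
	}

	/* ... */

	/* advance only while a cell remains */
	if (indices_tables_cell != NULL)
		indices_tables_cell = indices_tables_cell->next;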
Re: add AVX2 support to simd.h
On Fri, Mar 22, 2024 at 12:09 AM Nathan Bossart wrote: > > On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote: > > If this were "<=" then the for long arrays we could assume there is > > always more than one block, and wouldn't need to check if any elements > > remain -- first block, then a single loop and it's done. > > > > The loop could also then be a "do while" since it doesn't have to > > check the exit condition up front. > > Good idea. That causes us to re-check all of the tail elements when the > number of elements is evenly divisible by nelem_per_iteration, but that > might be worth the trade-off. Yeah, if there's no easy way to avoid that it's probably fine. I wonder if we can subtract one first to force even multiples to round down, although I admit I haven't thought through the consequences of that. > [v8] Seems pretty good. It'd be good to see the results of 2- vs. 4-register before committing, because that might lead to some restructuring, but maybe it won't, and v8 is already an improvement over HEAD. /* Process the remaining elements one at a time. */ This now does all of them if that path is taken, so "remaining" can be removed.
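On the subtract-one idea, the arithmetic would be as below (a sketch; BLOCK stands for nelem_per_iteration, and the consequences for the rest of the loop are exactly what would need thinking through):

	/* plain round-down: nelem = 32, BLOCK = 16 gives tail_idx = 32, so the
	 * main loop consumes everything and the final overlapping block
	 * re-checks the last 16 elements */
	tail_idx = (nelem / BLOCK) * BLOCK;

	/* subtracting one first pulls exact multiples down one block:
	 * nelem = 32 gives tail_idx = 16, so the loop stops early and the
	 * final overlapping block covers [16, 32) with no duplicated work */
	tail_idx = ((nelem - 1) / BLOCK) * BLOCK;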
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 10:13 AM Tom Lane wrote: > > Masahiko Sawada writes: > > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > >> I think the point here is that if you start with an arbitrary > >> non-negative shift value, the preceding loop may in fact decrement it > >> down to something less than zero before exiting, in which case we > >> would indeed have trouble. I suspect that the code is making > >> undocumented assumptions about the possible initial values of shift. > >> Maybe some Asserts would be good? Also, if we're effectively assuming > >> that shift must be exactly zero here, why not let the compiler > >> hard-code that? > > > Sounds like a good solution. I've attached the patch for that. > > Personally I'd put the Assert immediately after the loop, because > it's not related to the "Reserve slot for the value" comment. > Seems reasonable otherwise. > Thanks. Pushed the fix after moving the Assert. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: A problem about partitionwise join
On Tue, Mar 19, 2024 at 3:40 PM Ashutosh Bapat wrote:
> On Tue, Mar 19, 2024 at 8:18 AM Richard Guo wrote:
>> On Thu, Mar 7, 2024 at 7:13 PM Ashutosh Bapat <ashutosh.bapat@gmail.com> wrote:
>>> Approach
>>>
>>> The equijoin condition between partition keys doesn't appear in the
>>> join's restrictlist because of the 'best_score' strategy, as you explained well
>>> in [2]. What if we add an extra score for clauses between partition keys
>>> and give preference to equijoin between partition keys? Have you given it a
>>> thought? I feel that having an equijoin clause involving partition keys has
>>> more usages compared to a clause with any random column. E.g. nestloop may
>>> be able to prune partitions from the inner relation if the clause contains a
>>> partition key.
>>
>> Hmm, I think this approach won't work in cases where one certain pair of
>> partition keys has formed an EC that contains pseudoconstants. In such
>> cases, the EC machinery will generate restriction clauses like 'pk = const'
>> rather than any join clauses.
>
> That should be ok and more desirable. Clauses like pk = const will leave
> only one partition around in each of the joining relations thus PWJ won't
> be required OR it will be automatic - whichever way you see it.

No, that's not true. There could be multiple partition keys, and the particular key involved in the pushed-down restriction 'pk = const' may not be able to prune away any partitions. To be concrete, consider the query:

create table p (k1 int, k2 int, val int) partition by range(k1, k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);

set enable_partitionwise_join to on;

explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2 and foo.k2 = 5;

            QUERY PLAN
-----------------------------------
 Hash Join
   Hash Cond: (foo.k1 = bar.k1)
   ->  Append
         ->  Seq Scan on p_1 foo_1
               Filter: (k2 = 5)
         ->  Seq Scan on p_2 foo_2
               Filter: (k2 = 5)
   ->  Hash
         ->  Append
               ->  Seq Scan on p_1 bar_1
                     Filter: (k2 = 5)
               ->  Seq Scan on p_2 bar_2
                     Filter: (k2 = 5)
(13 rows)

Thanks
Richard
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Sun, Mar 24, 2024 at 3:05 PM Bharath Rupireddy wrote:
> On Sun, Mar 24, 2024 at 10:40 AM Amit Kapila wrote:
> >
> > > For instance, setting last_inactive_time_1 to an invalid value fails
> > > with the following error:
> > >
> > > error running SQL: 'psql::1: ERROR: invalid input syntax for
> > > type timestamp with time zone: "foo"
> > > LINE 1: SELECT last_inactive_time > 'foo'::timestamptz FROM pg_repli...
> >
> > It would be found at a later point. It would be probably better to
> > verify immediately after the test that fetches the last_inactive_time
> > value.
>
> Agree. I've added a few more checks explicitly to verify the
> last_inactive_time is sane with the following:
>
> qq[SELECT '$last_inactive_time'::timestamptz > to_timestamp(0)
> AND '$last_inactive_time'::timestamptz >
> '$slot_creation_time'::timestamptz;]

Such a test looks reasonable, but shall we add equal-to in the second part of the test (like '$last_inactive_time'::timestamptz >= '$slot_creation_time'::timestamptz)? This is just to be sure that even if the test ran fast enough to give the same time, the test shouldn't fail. I think it won't matter for correctness as well.

--
With Regards,
Amit Kapila.
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Mar 25, 2024 at 9:48 AM Amit Kapila wrote: > > > Such a test looks reasonable but shall we add equal to in the second > part of the test (like '$last_inactive_time'::timestamptz >= > > '$slot_creation_time'::timestamptz;). This is just to be sure that even if > > the test ran fast enough to give the same time, the test shouldn't fail. I > > think it won't matter for correctness as well. > Apart from this, I have made minor changes in the comments. See and let me know what you think of attached. -- With Regards, Amit Kapila. diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 2b36b5fef1..5f4165a945 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2529,8 +2529,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx The time at which the slot became inactive. -NULL if the slot is currently actively being -used. +NULL if the slot is currently being used. diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 0f48d6dc7c..77cb633812 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -623,7 +623,7 @@ retry: if (SlotIsLogical(s)) pgstat_acquire_replslot(s); - /* The slot is active by now, so reset the last inactive time. */ + /* Reset the last inactive time as the slot is active now. */ SpinLockAcquire(&s->mutex); s->last_inactive_time = 0; SpinLockRelease(&s->mutex); @@ -687,8 +687,8 @@ ReplicationSlotRelease(void) } /* -* Set the last inactive time after marking slot inactive. We get current -* time beforehand to avoid system call while holding the lock. +* Set the last inactive time after marking the slot inactive. We get the +* current time beforehand to avoid a system call while holding the lock. */ now = GetCurrentTimestamp(); @@ -2363,9 +2363,9 @@ RestoreSlotFromDisk(const char *name) slot->active_pid = 0; /* -* We set last inactive time after loading the slot from the disk into -* memory. Whoever acquires the slot i.e. makes the slot active will -* anyway reset it. +* We set the last inactive time after loading the slot from the disk +* into memory. Whoever acquires the slot i.e. makes the slot active +* will reset it. */ slot->last_inactive_time = GetCurrentTimestamp(); diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 2f18433ecc..eefd7abd39 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -202,7 +202,7 @@ typedef struct ReplicationSlot */ XLogRecPtr last_saved_confirmed_flush; - /* The time at which this slot become inactive */ + /* The time at which this slot becomes inactive */ TimestampTz last_inactive_time; } ReplicationSlot; diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl index bff84cc9c4..81bd36f5d8 100644 --- a/src/test/recovery/t/019_replslot_limit.pl +++ b/src/test/recovery/t/019_replslot_limit.pl @@ -411,7 +411,7 @@ $node_primary3->stop; $node_standby3->stop; # = -# Testcase start: Check last_inactive_time property of streaming standby's slot +# Testcase start: Check last_inactive_time property of the streaming standby's slot # # Initialize primary node @@ -440,8 +440,8 @@ $primary4->safe_psql( SELECT pg_create_physical_replication_slot(slot_name := '$sb4_slot'); ]); -# Get last_inactive_time value after slot's creation. Note that the slot is still -# inactive unless it's used by the standby below. +# Get last_inactive_time value after the slot's creation. 
Note that the slot +# is still inactive till it's used by the standby below. my $last_inactive_time = $primary4->safe_psql('postgres', qq(SELECT last_inactive_time FROM pg_replication_slots WHERE slot_name = '$sb4_slot' AND last_inactive_time IS NOT NULL;) ); @@ -470,8 +470,8 @@ is( $primary4->safe_psql( # Stop the standby to check its last_inactive_time value is updated $standby4->stop; -# Let's also restart the primary so that the last_inactive_time is set upon -# loading the slot from disk. +# Let's restart the primary so that the last_inactive_time is set upon +# loading the slot from the disk. $primary4->restart; is( $primary4->safe_psql( @@ -483,11 +483,11 @@ is( $primary4->safe_psql( $standby4->stop; -# Testcase end: Check last_inactive_time property of streaming standby's slot +# Testcase end: Check last_inactive_time property of the streaming standby's slot # = # ===
Re: Improve readability by using designated initializers when possible
Looking through v4 again. v4 looks good to me.
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy wrote:
>
> I've attached the v18 patch set here.

Thanks for the patches. Please find a few comments:

patch 001:

1) slot.h:
+ /* The time at which this slot become inactive */
+ TimestampTz last_inactive_time;

become --> became
-
patch 002:

2) slotsync.c:

  ReplicationSlotCreate(remote_slot->name, true, RS_TEMPORARY,
                        remote_slot->two_phase,
                        remote_slot->failover,
-                       true);
+                       true, 0);

+ slot->data.inactive_timeout = remote_slot->inactive_timeout;

Is there a reason we are not passing 'remote_slot->inactive_timeout' to ReplicationSlotCreate() directly?
-
3) slotfuncs.c

pg_create_logical_replication_slot():
+ int inactive_timeout = PG_GETARG_INT32(5);

Can we mention here that the timeout is in seconds, either in a comment or by renaming the variable to inactive_timeout_secs? Please do this for create_physical_replication_slot(), create_logical_replication_slot(), pg_create_physical_replication_slot() as well.
-
4)
+ int inactive_timeout; /* The amount of time in seconds the slot
+                        * is allowed to be inactive. */
} LogicalSlotInfo;

Do we need to mention "before getting invalidated" like other places (in the last patch)?
--
5) Same at these two places. "before getting invalidated" should be added in the last patch; otherwise the info is incomplete.

+
+ /* The amount of time in seconds the slot is allowed to be inactive */
+ int inactive_timeout;
} ReplicationSlotPersistentData;

+ * inactive_timeout: The amount of time in seconds the slot is allowed to be
+ * inactive.
  */
void ReplicationSlotCreate(const char *name, bool db_specific,

Same here. "before getting invalidated"?

Reviewing more..

thanks
Shveta
Re: Add new error_action COPY ON_ERROR "log"
On Wed, Mar 13, 2024 at 11:02 PM Bharath Rupireddy wrote: > > On Wed, Mar 13, 2024 at 11:09 AM Michael Paquier wrote: > > > > Hmm. This NOTICE is really bugging me. It is true that the clients > > would get more information, but the information is duplicated on the > > server side because the error context provides the same information as > > the NOTICE itself: > > NOTICE: data type incompatibility at line 1 for column "a" > > CONTEXT: COPY aa, line 1, column a: "a" > > STATEMENT: copy aa from stdin with (on_error ignore, log_verbosity > > verbose); > > Yes, if wanted, clients can also get the CONTEXT - for instance, using > '\set SHOW_CONTEXT always' in psql. > > I think we can enhance the NOTICE message to include the column value > (just like CONTEXT message is showing) and leverage relname_only to > emit only the relation name in the CONTEXT message. > > /* > * We suppress error context information other than the relation name, > * if one of the operations below fails. > */ > Assert(!cstate->relname_only); > cstate->relname_only = true; > > I'm attaching the v8 patch set implementing the above idea. With this, > [1] is sent to the client, [2] is sent to the server log. This > approach not only reduces the duplicate info in the NOTICE and CONTEXT > messages, but also makes it easy for users to get all the necessary > info in the NOTICE message without having to set extra parameters to > get CONTEXT message. > > Another idea is to move even the table name to NOTICE message and hide > the context with errhidecontext when we emit the new NOTICE messages. > > Thoughts? > The current approach, eliminating the duplicated information in CONTEXT, seems good to me. One question about the latest (v8) patch: + else + ereport(NOTICE, + errmsg("data type incompatibility at line %llu for column %s: null input", + (unsigned long long) cstate->cur_lineno, + cstate->cur_attname)); + How can we reach this path? It seems we don't cover this path by the tests. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: make dist using git archive
On 24.03.24 16:42, Tristan Partin wrote: You might find Meson's string formatting syntax creates a more readable command string: 'tar.tar.bz2.command=@0@ -c'.format(bzip2.path()) And then 'install: false' is the default if you feel like leaving it out. Otherwise, let's get this in! Done and committed.
RE: speed up a logical replica setup
Dear Amit, Euler, I also want to share my opinion, just in case. > On Thu, Mar 21, 2024 at 8:00 PM Euler Taveira wrote: > > > > On Thu, Mar 21, 2024, at 10:33 AM, vignesh C wrote: > > > > If we commit this we might not be able to change the way the option > > behaves once the customers starts using it. How about removing these > > options in the first version and adding it in the next version after > > more discussion. > > > > > > We don't need to redesign this one if we want to add a format string in a > > next > > version. A long time ago, pg_dump started to accept pattern for tables > > without > > breaking or deprecating the -t option. If you have 100 databases and you > > don't > > want to specify the options or use a script to generate it for you, you also > > have the option to let pg_createsubscriber generate the object names for > > you. > > Per my experience, it will be a rare case. > > > > But, why go with some UI in the first place which we don't think is a > good one, or at least don't have a broader agreement that it is a good > one? The problem with self-generated names for users could be that > they won't be able to make much out of it. If one has to always use > those internally then probably that would be acceptable. I would > prefer what Tomas proposed a few emails ago as compared to this one as > that has fewer options to be provided by users but still, they can > later identify objects. But surely, we should discuss if you or others > have better alternatives. IIUC, added options were inspired by Tomas. And he said the limitation (pub/sub/slot name cannot be specified) was acceptable for the first version. I agree with him. (To be honest, I feel that options should be fewer for the first release) > The user choosing to convert a physical standby to a subscriber would > in some cases be interested in converting it for all the databases > (say for the case of upgrade [1]) and giving options for each database > would be cumbersome for her. +1 for the primary use case. > > Currently dry-run will do the check and might fail on identifying a > > few failures like after checking subscriber configurations. Then the > > user will have to correct the configuration and re-run then fix the > > next set of failures. Whereas the suggest-config will display all the > > optimal configuration for both the primary and the standby in a single > > shot. This is not a must in the first version, it can be done as a > > subsequent enhancement. > > > > > > Do you meant publisher, right? Per order, check_subscriber is done before > > check_publisher and it checks all settings on the subscriber before > > exiting. In > > v30, I changed the way it provides the required settings. In a previous > > version, > > it fails when it found a wrong setting; the current version, check all > > settings > > from that server before providing a suitable error. > > > > pg_createsubscriber: checking settings on publisher > > pg_createsubscriber: primary has replication slot "physical_slot" > > pg_createsubscriber: error: publisher requires wal_level >= logical > > pg_createsubscriber: error: publisher requires 2 replication slots, but > > only 0 > remain > > pg_createsubscriber: hint: Consider increasing max_replication_slots to at > > least > 3. > > pg_createsubscriber: error: publisher requires 2 wal sender processes, but > > only > 0 remain > > pg_createsubscriber: hint: Consider increasing max_wal_senders to at least > > 3. 
> > If you have such an error, you will fix them all and rerun using dry run mode
> > again to verify everything is ok. I don't have a strong preference about it. It
> > can be changed easily (unifying the check functions or providing a return for
> > each of the check functions).
>
> We can unify the checks but not sure if it is worth it. I am fine
> either way. It would have been better if we provided a way for a user
> to know the tool's requirement rather than letting him know via errors
> but I think it should be okay to extend it later as well.

Both ways are OK, but I prefer unifying the checks a bit. The number of working modes in the same executable should be reduced as much as possible. Also, I feel the current specification is acceptable. pg_upgrade checks one by one and exits early in bad cases. If both the old and new clusters have issues, the first run only reports the old one's issues. After the DBA fixes those and runs it again, the issues on the new cluster are reported.

[1]: https://www.postgresql.org/message-id/8d52c226-7e34-44f7-a919-759bf8d81541%40enterprisedb.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Properly pathify the union planner
On Mon, Mar 25, 2024 at 9:44 AM David Rowley wrote:
> It seems ok that the ec_indexes are not set for the top-level set RelOptInfo as
> get_eclass_for_sort_expr() does not make use of ec_indexes while
> searching for an existing EquivalenceClass. Maybe we should fix this
> varno == 0 hack and adjust get_eclass_for_sort_expr() so that it makes
> use of the ec_indexes.
>
> It's possible to see this happen with a query such as:
>
> SELECT oid FROM pg_class UNION SELECT oid FROM pg_class ORDER BY oid;

I see what you mean. Yeah, there might be some optimization possibilities in this area. And I agree that this should not be a blocker for pushing this patch.

> I didn't see that as a reason not to push this patch as this occurs
> both with and without this change, so I've now pushed this patch.

Great to see this patch has been pushed!

Thanks
Richard
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Mar 25, 2024 at 10:33 AM shveta malik wrote:
> On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy wrote:
> >
> > I've attached the v18 patch set here.

I have a question. Shouldn't creating a subscription on an existing slot with a non-null 'inactive_timeout' retain the slot's 'inactive_timeout' even after the subscription is created? I tried this:

===
--On publisher, create slot with 120sec inactive_timeout:
SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'pgoutput', false, true, true, 120);

--On subscriber, create sub using logical_slot1
create subscription mysubnew1_1 connection 'dbname=newdb1 host=localhost user=shveta port=5433' publication mypubnew1_1 WITH (failover = true, create_slot=false, slot_name='logical_slot1');

--Before creating sub, pg_replication_slots output:

 slot_name     | failover | synced | active | temp | conf | lat                              | inactive_timeout
---------------+----------+--------+--------+------+------+----------------------------------+------------------
 logical_slot1 | t        | f      | f      | f    | f    | 2024-03-25 11:11:55.375736+05:30 | 120

--After creating sub, pg_replication_slots output (inactive_timeout is 0 now):

 slot_name     | failover | synced | active | temp | conf | lat | inactive_timeout
---------------+----------+--------+--------+------+------+-----+------------------
 logical_slot1 | t        | f      | t      | f    | f    |     | 0
===

In CreateSubscription, we call 'walrcv_alter_slot()' / 'ReplicationSlotAlter()' when create_slot is false. This call ends up resetting the slot's inactive_timeout from 120sec to 0. Is that intentional?

thanks
Shveta
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada wrote: > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor wrote: > > v77-0001 > > > > - dead_items = (VacDeadItems *) > > palloc(vac_max_items_to_alloc_size(max_items)); > > - dead_items->max_items = max_items; > > - dead_items->num_items = 0; > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > + > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > This is confusing enough that it looks like a bug: > > > > [inside TidStoreCreate()] > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > while (16 * maxBlockSize > max_bytes * 1024L) > > maxBlockSize >>= 1; > > > > This was copied from CreateWorkExprContext, which operates directly on > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > here. If we're passing something measured in kilobytes, the parameter > > is badly named. Let's use convert once and use bytes everywhere. > > True. The attached 0001 patch fixes it. v78-0001 and 02 are fine, but for 0003 there is a consequence that I didn't see mentioned: vac_work_mem now refers to bytes, where before it referred to kilobytes. It seems pretty confusing to use a different convention from elsewhere, especially if it has the same name but different meaning across versions. Worse, this change is buried inside a moving-stuff-around diff, making it hard to see. Maybe "convert only once" is still possible, but I was actually thinking of + dead_items_info->max_bytes = vac_work_mem * 1024L; + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); That way it's pretty obvious that it's correct. That may require a bit of duplication and moving around for shmem, but there is some of that already. More on 0003: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TidStore, a storage for dead TIDs + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum I wonder if the comments here should refer to it using a more natural spelling, like "TID store". - * items in the dead_items array for later vacuuming, count live and + * items in the dead_items for later vacuuming, count live and Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? - * remaining LP_DEAD line pointers on the page in the dead_items - * array. These dead items include those pruned by lazy_scan_prune() - * as well we line pointers previously marked LP_DEAD. + * remaining LP_DEAD line pointers on the page in the dead_items. + * These dead items include those pruned by lazy_scan_prune() as well + * we line pointers previously marked LP_DEAD. Here maybe "into dead_items". Also, "we line pointers" seems to be a pre-existing typo. - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); This is a translated message, so let's keep the message the same. /* * Allocate dead_items (either using palloc, or in dynamic shared memory). * Sets dead_items in vacrel for caller. * * Also handles parallel initialization as part of allocating dead_items in * DSM when required. 
 */
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)

This comment didn't change at all. It's not wrong, but let's consider updating the specifics.

v78-0004:

> > +#define dsa_create(tranch_id) \
> > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
> >
> > Since these macros are now referring to defaults, maybe their name
> > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
> > (*_MAX_*)
>
> It makes sense to rename DSA_INITIAL_SEGMENT_SIZE, but I think that
> the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current
> name also makes sense to me.

Right, that makes sense.

v78-0005:

"Although commit XXX allowed specifying the initial and maximum DSA segment sizes, callers still needed to clamp their own limits, which was not consistent and user-friendly."

Perhaps s/still needed/would have needed/ ..., since we're preventing that necessity.

> > Did you try it with 1MB m_w_m?
>
> I've incorporated the above comments and test results look good to me.

Could you be more specific about what the test was? Does it work with 1MB m_w_m?

+ /*
+  * Choose the initial and maximum DSA segment sizes to be no longer
+  * than 1/16 and 1/8 of max_bytes, respectively. If the initial
+  * segment size is low, we end up having many segments, which risks
+  * exceeding the total number of segments the platform can have.

The second sentence is technically correct, but I'm not sure how it relates to the code.
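Returning to the kilobytes-versus-bytes point at the top: "convert once" just means doing the multiplication in exactly one place and passing bytes everywhere downstream, so the units can't silently diverge again. A sketch, with names from the patch and surrounding code elided:

	/* convert the GUC's kilobytes to bytes exactly once ... */
	dead_items_info->max_bytes = (size_t) vac_work_mem * 1024;

	/* ... and everything downstream deals in bytes only */
	vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);

	/* inside TidStoreCreate(), the clamp then needs no second "* 1024" */
	while (16 * maxBlockSize > max_bytes)
		maxBlockSize >>= 1;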
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Hi Tomas,

On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> Hi David,
>
> Do you plan to continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.
>
> I think there are two things that may need an improvement:
>
> 1) Storing variable-length data in ParallelBitmapHeapState
>
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
>
> - snapshot_off
> - stats_off
>
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
>
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.

This issue is now gone after Heikki's fix.

> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
> Heap Blocks: exact=102 lossy=10995
> Worker 0: actual time=50.559..209.773 rows=215253 loops=1
>   Heap Blocks: exact=207 lossy=19354
> Worker 1: actual time=50.543..211.387 rows=162934 loops=1
>   Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> where we *do* add the values to the leader:
>
> Heap Blocks: exact=125 lossy=10789
> Buffers: shared hit=11 read=45420
> Worker 0: actual time=51.419..221.904 rows=150437 loops=1
>   Heap Blocks: exact=136 lossy=13541
>   Buffers: shared hit=4 read=13541
> Worker 1: actual time=56.610..222.469 rows=229738 loops=1
>   Heap Blocks: exact=209 lossy=20655
>   Buffers: shared hit=4 read=20655
>
> Here it's not entirely obvious, because leader participates in the
> execution, but once we disable leader participation, it's clearer:
>
> Buffers: shared hit=7 read=45421
> Worker 0: actual time=28.540..247.683 rows=309112 loops=1
>   Heap Blocks: exact=282 lossy=27806
>   Buffers: shared hit=4 read=28241
> Worker 1: actual time=24.290..251.993 rows=190815 loops=1
>   Heap Blocks: exact=188 lossy=17179
>   Buffers: shared hit=3 read=17180
>
> Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> Blocks" simply disappeared because the leader does nothing and we don't
> print zeros.

Heap Blocks is specific to Bitmap Heap Scan. It seems that, for some existing nodes, node-specific stats are not aggregated from the workers into the leader.
For example, the Memoize node with Hits, Misses, etc.:

 ->  Nested Loop (actual rows=17 loops=3)
       ->  Parallel Seq Scan on t (actual rows=3 loops=3)
       ->  Memoize (actual rows=5 loops=10)
             Cache Key: t.j
             Cache Mode: logical
             Hits: 32991  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
             Worker 0:  Hits: 33551  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
             Worker 1:  Hits: 33443  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
             ->  Index Scan using uj on u (actual rows=5 loops=15)
                   Index Cond: (j = t.j)

Sort and HashAggregate do the same.

> 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> correct. Consider this:
>
> EXPLAIN (ANALYZE):
>
> -> Parallel Bitmap Heap Scan on t (...)
>    Recheck Cond: (a < 5000)
>    Rows Removed by Index Recheck: 246882
>    Worker 0: Heap Blocks: exact=168 lossy=15648
>    Worker 1: Heap Blocks: exact=302 lossy=29337
>
> EXPLAIN (ANALYZE, VERBOSE):
>
> -> Parallel Bitmap Heap Scan on public.t (...)
>    Recheck Cond: (t.a < 5000)
>    Rows Removed by Index Recheck: 246882
>    Worker 0: actual time=35.067..300.882 rows=282108 loops=1
>      Heap Blocks: exact=257 lossy=25358
>    Worker 1: actual time=32.827..302.224 rows=217819 loops=1
>      Heap Blocks: exact=213 lossy=19627
>
> EXPLAIN (ANALYZE, BUFFERS):
>
> -> Parallel Bitmap Heap Scan on t (...)
>    Recheck Cond: (a < 5000)
>    Rows Removed by Index Recheck: 246882
>    Buffers: shared hit=7 read=45421
>    Worker 0: Heap Blocks: exact=236 lossy=21870
>    Worker 1: Heap Blocks: exact=234 lossy=231
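For what it's worth, if the conclusion is that "Heap Blocks" should be summed into the leader's line after all, the mechanics would presumably mirror the retrieve-instrumentation pattern other parallel nodes use. A hedged sketch with assumed struct and field names, not code from the patch:

	/* fold per-worker counters into the leader's totals */
	for (int n = 0; n < shared_info->num_workers; n++)
	{
		BitmapHeapScanInstrumentation *si = &shared_info->sinstrument[n];

		node->stats.exact_pages += si->exact_pages;
		node->stats.lossy_pages += si->lossy_pages;
	}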
Re: altering a column's collation leaves an invalid foreign key
On 3/23/24 10:04, Paul Jungwirth wrote:
> Perhaps if the previous collation was nondeterministic we should force
> a re-check.

Here is a patch implementing this. It was a bit more fuss than I
expected, so maybe someone has a better way.

We have had nondeterministic collations since v12, so perhaps it is
something to back-patch?

Yours,

-- 
Paul ~{:-)
p...@illuminatedcomputing.com

From a8078f85859400f4e4cffb57d8ec1b069e46bfb8 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth"
Date: Sun, 24 Mar 2024 23:37:48 -0700
Subject: [PATCH v1] Re-check foreign key if referenced collation was
 nondeterministic

With deterministic collations, we break ties by looking at binary
equality. But nondeterministic collations can allow non-identical values
to be considered equal, e.g. 'a' and 'A' when case-insensitive. So some
references that used to be valid may not be anymore.
---
 src/backend/commands/tablecmds.c           | 96 ++++++++++++++++------
 src/include/nodes/parsenodes.h             |  2 +
 .../regress/expected/collate.icu.utf8.out  |  9 ++
 src/test/regress/sql/collate.icu.utf8.sql  |  8 ++
 4 files changed, 103 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 71740984f33..940b1845edb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -194,6 +194,7 @@ typedef struct AlteredTableInfo
 	/* Objects to rebuild after completing ALTER TYPE operations */
 	List	   *changedConstraintOids;	/* OIDs of constraints to rebuild */
 	List	   *changedConstraintDefs;	/* string definitions of same */
+	List	   *changedCollationOids;	/* collation of each attnum */
 	List	   *changedIndexOids;	/* OIDs of indexes to rebuild */
 	List	   *changedIndexDefs;	/* string definitions of same */
 	char	   *replicaIdentityIndex;	/* index to reset as REPLICA IDENTITY */
@@ -572,6 +573,7 @@ static ObjectAddress ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
 										   AlterTableCmd *cmd, LOCKMODE lockmode);
 static void RememberAllDependentForRebuilding(AlteredTableInfo *tab, AlterTableType subtype,
 											  Relation rel, AttrNumber attnum, const char *colName);
+static void RememberCollationForRebuilding(AttrNumber attnum, AlteredTableInfo *tab);
 static void RememberConstraintForRebuilding(Oid conoid, AlteredTableInfo *tab);
 static void RememberIndexForRebuilding(Oid indoid, AlteredTableInfo *tab);
 static void RememberStatisticsForRebuilding(Oid stxoid, AlteredTableInfo *tab);
@@ -579,12 +581,12 @@ static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab,
 								   LOCKMODE lockmode);
 static void ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId, char *cmd,
 								 List **wqueue, LOCKMODE lockmode,
-								 bool rewrite);
+								 bool rewrite, List *collationOids);
 static void RebuildConstraintComment(AlteredTableInfo *tab, AlterTablePass pass,
 									 Oid objid, Relation rel, List *domname,
 									 const char *conname);
 static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
-static void TryReuseForeignKey(Oid oldId, Constraint *con);
+static void TryReuseForeignKey(Oid oldId, Constraint *con, List *changedCollationOids);
 static ObjectAddress ATExecAlterColumnGenericOptions(Relation rel, const char *colName,
 													 List *options, LOCKMODE lockmode);
 static void change_owner_fix_column_acls(Oid relationOid,
@@ -9826,6 +9828,7 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	bool		old_check_ok;
 	ObjectAddress address;
 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);
+	ListCell   *old_collation_item = list_head(fkconstraint->old_collations);
 
 	/*
 	 * Grab ShareRowExclusiveLock on the pk table, so that someone doesn't
@@ -10212,9 +10215,15 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 		CoercionPathType new_pathtype;
 		Oid			old_castfunc;
 		Oid			new_castfunc;
+		Oid			old_collation;
 		Form_pg_attribute attr = TupleDescAttr(tab->oldDesc,
 											   fkattnum[i] - 1);
 
+		/* Get the collation for this key part. */
+		old_collation = lfirst_oid(old_collation_item);
+		old_collation_item = lnext(fkconstraint->old_collations,
+								   old_collation_item);
+
 		/*
 		 * Identify coercion pathways from each of the old and new FK-side
 		 * column types to the right (foreign) operand type of the pfeqop.
@@ -10250,9 +10259,12 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 		 * turn conform to the domain. Consequently, we need not treat
 		 * domains specially here.
 		 *
-		 * Since we require that all collations share the same notion of
-		 * equality (which they do, because texteq reduces to bitwise
-		 * equality), we don't compare collation here.
+		 * All deterministic collations use bitwise equality to resolve
+		 * ties, but if the previous collation was nondeterministic,
+		 * we mu
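To make the failure mode the patch guards against concrete, here is a minimal sketch. All object names are hypothetical, and the ICU locale string is just one way to get case-insensitive comparison:

  CREATE COLLATION case_insensitive
    (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

  CREATE TABLE parent (t text COLLATE case_insensitive PRIMARY KEY);
  CREATE TABLE child  (t text COLLATE case_insensitive REFERENCES parent);

  INSERT INTO parent VALUES ('a');
  INSERT INTO child VALUES ('A');  -- valid: 'A' = 'a' under case_insensitive

  -- Switching the referenced column to a deterministic collation means
  -- 'A' no longer matches 'a', so the foreign key must be re-checked:
  ALTER TABLE parent ALTER COLUMN t TYPE text COLLATE "C";

Without the forced re-check, the ALTER succeeds and leaves child referencing a row that no longer exists under the new collation's notion of equality - the invalid foreign key this thread's subject describes.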
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Mar 25, 2024 at 10:28 AM Amit Kapila wrote:
>
> On Mon, Mar 25, 2024 at 9:48 AM Amit Kapila wrote:
> >
> > Such a test looks reasonable, but shall we add equal-to in the second
> > part of the test (like '$last_inactive_time'::timestamptz >=
> > '$slot_creation_time'::timestamptz)? This is just to be sure that even
> > if the test ran fast enough to give the same time, the test shouldn't
> > fail. I think it won't matter for correctness as well.

Agree. I added that in the v19 patch. I had the same concern in mind;
that's why I didn't capture the current time together with the slot
creation, worrying that current_timestamp might come out the same (or
nearly the same) as the slot creation time. In other words, I avoided
something like:

    SELECT current_timestamp FROM pg_create_physical_replication_slot('foo');

and instead captured current_timestamp in a separate query.

> Apart from this, I have made minor changes in the comments. See and
> let me know what you think of the attached.

LGTM. I've merged the diff into the v19 patch. Please find the attached
v19 patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v19-0001-Track-last_inactive_time-in-pg_replication_slots.patch
Description: Binary data
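The shape of the resulting check, as a rough psql sketch: the slot name is a placeholder, last_inactive_time is the column the 0001 patch adds, and the exact TAP-test plumbing differs from this interactive version:

  -- Capture the reference time in its own query, before creating the slot,
  -- so it cannot coincide with timestamps recorded for the slot itself:
  SELECT current_timestamp AS slot_creation_time \gset
  SELECT pg_create_physical_replication_slot('test_slot');

  -- Sanity checks: the recorded time is a real timestamp and does not
  -- precede the captured creation time (>= tolerates exact ties on
  -- fast runs, per the suggestion above):
  SELECT last_inactive_time > to_timestamp(0) AND
         last_inactive_time >= :'slot_creation_time'
    FROM pg_replication_slots
   WHERE slot_name = 'test_slot';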