Re: SQL:2011 application time

2024-03-24 Thread Peter Eisentraut

On 23.03.24 18:42, Paul Jungwirth wrote:
Now this is a long chain of reasoning to say rangetypes are safe. I 
added a comment. Note it doesn't apply to arbitrary types, so if we 
support those eventually we should always require a recheck, or 
alternatively use equals, not containedby. (That would require storing 
equals somewhere. It could go in ffeqop, but that feels like a footgun 
since pfeqop and ppeqop need overlaps.)


Ok, this explanation is good enough for now.  I have committed the 
patches v33-0001-Add-temporal-FOREIGN-KEYs.patch and 
v33-0002-Support-multiranges-in-temporal-FKs.patch (together).
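
For readers following the thread, here is a minimal sketch of what the
committed feature enables (table and column names are illustrative;
note that WITHOUT OVERLAPS primary keys currently rely on the
btree_gist extension):

CREATE EXTENSION btree_gist;

CREATE TABLE parent (
    id int,
    valid_at daterange,
    PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
);

CREATE TABLE child (
    id int,
    valid_at daterange,
    parent_id int,
    FOREIGN KEY (parent_id, PERIOD valid_at)
        REFERENCES parent (id, PERIOD valid_at)
);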






Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-24 Thread Bharath Rupireddy
On Sun, Mar 24, 2024 at 10:40 AM Amit Kapila  wrote:
>
> > For instance, setting last_inactive_time_1 to an invalid value fails
> > with the following error:
> >
> > error running SQL: 'psql::1: ERROR:  invalid input syntax for
> > type timestamp with time zone: "foo"
> > LINE 1: SELECT last_inactive_time > 'foo'::timestamptz FROM pg_repli...
> >
>
> It would be found at a later point. It would probably be better to
> verify it immediately after the test that fetches the
> last_inactive_time value.

Agreed. I've added a few more explicit checks to verify that
last_inactive_time is sane, using the following:

qq[SELECT '$last_inactive_time'::timestamptz > to_timestamp(0)
AND '$last_inactive_time'::timestamptz >
'$slot_creation_time'::timestamptz;]
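
(For context, a minimal sketch of how such a check typically looks in a
TAP test, assuming $node, $last_inactive_time, and $slot_creation_time
were captured earlier in the test; safe_psql() and is() are the standard
PostgreSQL::Test::Cluster and Test::More helpers:

my $result = $node->safe_psql(
    'postgres', qq[
        SELECT '$last_inactive_time'::timestamptz > to_timestamp(0)
        AND '$last_inactive_time'::timestamptz >
            '$slot_creation_time'::timestamptz;]);
is($result, 't', 'last_inactive_time is sane');
)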

I've attached the v18 patch set here. It also addresses earlier review
comments from Amit and Ajin Cherian. Note that I've put the new
invalidation mechanism tests in a separate TAP test file, because I
didn't want to clutter or bloat any of the existing files, or spread
tests for physical and logical slots across separate existing TAP
files.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v18-0001-Track-last_inactive_time-in-pg_replication_slots.patch
Description: Binary data


v18-0002-Allow-setting-inactive_timeout-for-replication-s.patch
Description: Binary data


v18-0003-Introduce-new-SQL-funtion-pg_alter_replication_s.patch
Description: Binary data


v18-0004-Allow-setting-inactive_timeout-in-the-replicatio.patch
Description: Binary data


v18-0005-Add-inactive_timeout-based-replication-slot-inva.patch
Description: Binary data


Re: Built-in CTYPE provider

2024-03-24 Thread Alexander Lakhin

Hello Jeff,

21.03.2024 03:13, Jeff Davis wrote:

On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:

* v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch

Looks ok.

Committed.


Please look at a Valgrind-detected error caused by the following query
(starting from f69319f2f):
SELECT lower('Π' COLLATE pg_c_utf8);

==00:00:00:03.487 1429669== Invalid read of size 1
==00:00:00:03.487 1429669==    at 0x7C64A5: convert_case (unicode_case.c:107)
==00:00:00:03.487 1429669==    by 0x7C: unicode_strlower (unicode_case.c:70)
==00:00:00:03.487 1429669==    by 0x66B218: str_tolower (formatting.c:1698)
==00:00:00:03.488 1429669==    by 0x6D6C55: lower (oracle_compat.c:55)

Best regards,
Alexander




Re: make dist using git archive

2024-03-24 Thread Peter Eisentraut

On 22.03.24 18:29, Tristan Partin wrote:

On Thu Mar 21, 2024 at 3:44 AM CDT, Peter Eisentraut wrote:

Here is an updated version of this patch set.


You should add 'disabler: true' to the git find_program() call in 
Meson. With the way your patch is currently written, if Git doesn't 
exist on the system, the targets would still be defined even though 
they could never succeed.
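
For reference, a sketch of the suggested change (the variable name is
illustrative):

# With 'disabler: true', a missing git disables every target that uses
# it, instead of defining targets that can never succeed.
git = find_program('git', required: false, disabler: true)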


Ok, added.  (I had it in there in an earlier version, but I think I 
misread one of your earlier messages and removed it.)


You may also want to make sure that we are actually in a Git repository. 
I don't think git-archive works outside one.


Then git archive will print an error.  That seems ok.

Re the autocrlf, is this something we could throw in a .gitattributes 
file?


We don't want to apply it to all git commands, just this one in this 
context.



I would suggest poisoning `meson dist` in the following way:

if not meson.is_subproject()

[...]

 meson.add_dist_script(perl, '-e', 'exit 1')
endif


Good idea, added that.

I have extracted the FreeBSD CI script fix into a separate patch 
(0002).  I think this is useful even if we don't take the full CI 
patch (0003).


0002 looks pretty reasonable to me.


Committed that one in the meantime.
From 680967a621083756e1e182df7394a739008a21d6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 11 Feb 2024 23:58:42 +0100
Subject: [PATCH v5 1/2] make dist uses git archive

This changes "make dist" to directly use "git archive", rather than
the custom shell script it currently runs.

This is to make the creation of the distribution tarball more directly
traceable to the git repository.  That is why we removed the "make
distprep" step.

"make dist" continues to produce a .gz and a .bz2 tarball as before.

The archives produced this way are deterministic and reproducible,
meaning for a given commit the result file should always be
bit-for-bit identical.  The exception is that if you use a git version
older than 2.38.0, gzip records the platform in the archive, so you'd
get a different output on Windows vs. macOS vs. "UNIX" (everything
else).  In git 2.38.0, this was changed so that everything is recorded
as "UNIX" now.  This is just something to keep in mind.  This issue is
specific to the gzip format, it does not affect other compression
formats.

Meson has its own distribution building command (meson dist), but we
are not using that at this point.  The main problem is that the way
they have implemented it, it is not deterministic in the above sense.
Also, we want a "make" version for the time being.  But the target
name "dist" in meson is reserved for that reason, so we call the
custom target "pgdist" (so call something like "meson compile -C build
pgdist").

Discussion: 
https://www.postgresql.org/message-id/flat/40e80f77-a294-4f29-a16f-e21bc7bc75fc%40eisentraut.org
---
 GNUmakefile.in | 35 ++
 meson.build    | 66 ++
 2 files changed, 79 insertions(+), 22 deletions(-)

diff --git a/GNUmakefile.in b/GNUmakefile.in
index 4d8fc794bbb..4dc13a3e2ea 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -87,29 +87,20 @@ update-unicode: | submake-generated-headers submake-libpgport
 distdir= postgresql-$(VERSION)
 dummy  = =install=
 
+GIT = git
+
 dist: $(distdir).tar.gz $(distdir).tar.bz2
-   rm -rf $(distdir)
-
-$(distdir).tar: distdir
-   $(TAR) chf $@ $(distdir)
-
-.INTERMEDIATE: $(distdir).tar
-
-distdir-location:
-   @echo $(distdir)
-
-distdir:
-   rm -rf $(distdir)* $(dummy)
-   for x in `cd $(top_srcdir) && find . \( -name CVS -prune \) -o \( -name .git -prune \) -o -print`; do \
- file=`expr X$$x : 'X\./\(.*\)'`; \
- if test -d "$(top_srcdir)/$$file" ; then \
-   mkdir "$(distdir)/$$file" && chmod 777 "$(distdir)/$$file"; \
- else \
-   ln "$(top_srcdir)/$$file" "$(distdir)/$$file" >/dev/null 2>&1 \
- || cp "$(top_srcdir)/$$file" "$(distdir)/$$file"; \
- fi || exit; \
-   done
-   $(MAKE) -C $(distdir) distclean
+
+# Note: core.autocrlf=false is needed to avoid line-ending conversion
+# in case the environment has a different setting.  Without this, a
+# tarball created on Windows might be different than on, and unusable
+# on, Unix machines.
+
+$(distdir).tar.gz:
+   $(GIT) -C $(srcdir) -c core.autocrlf=false archive --format tar.gz --prefix $(distdir)/ HEAD -o $(abs_top_builddir)/$@
+
+$(distdir).tar.bz2:
+   $(GIT) -C $(srcdir) -c core.autocrlf=false -c tar.tar.bz2.command='$(BZIP2) -c' archive --format tar.bz2 --prefix $(distdir)/ HEAD -o $(abs_top_builddir)/$@
 
 distcheck: dist
rm -rf $(dummy)
diff --git a/meson.build b/meson.build
index c8fdfeb0ec3..4981447414f 100644
--- a/meson.build
+++ b/meson.build
@@ -3359,6 +3359,72 @@ run_target('help',
 
 
 
+###
+# Distribution archive
+###

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Tomas Vondra


On 3/23/24 01:26, Melanie Plageman wrote:
> On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote:
>> On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote:
>>> On 18/03/2024 17:19, Melanie Plageman wrote:
 I've attached v7 rebased over this commit.
>>>
>>> If we delayed table_beginscan_bm() call further, after starting the TBM
>>> iterator, we could skip it altogether when the iterator is empty.
>>>
>>> That's a further improvement, doesn't need to be part of this patch set.
>>> Just caught my eye while reading this.
>>
>> Hmm. You mean like until after the first call to tbm_[shared]_iterate()?
>> AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or
>> not the iterator is "empty". Do you mean cases when the bitmap has no
>> blocks in it? It seems like we should be able to tell that from the
>> TIDBitmap.
>>
>>>
 v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch
>>>
>>> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call the
>>> flag e.g. SO_NEED_TUPLE.
>>
>> Agreed. Done in attached v8. Though I wondered if it was a bit weird
>> that the flag is set in the common case and not set in the uncommon
>> case...
> 
> v8 actually attached this time

I tried to run the benchmarks with v8, but unfortunately it crashes for
me very quickly (I've only seen 0015 crash, so I guess the bug is in
that patch).

The backtrace is attached; this doesn't seem right:

(gdb) p hscan->rs_cindex
$1 = 543516018


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Core was generated by `postgres: postgres test-100 [local] SELECT'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, 
slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, 
exact_pages=0x556f9b9eebf0)
at heapam_handler.c:2576

warning: Source file is more recent than executable.
2576            targoffset = hscan->rs_vistuples[hscan->rs_cindex];
(gdb) bt
#0  0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, 
slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, 
exact_pages=0x556f9b9eebf0)
at heapam_handler.c:2576
#1  0x556f99e17adb in table_scan_bitmap_next_tuple 
(exact_pages=0x556f9b9eebf0, lossy_pages=0x556f9b9eebf8, 
recheck=0x556f9b9eec10, slot=0x556f9b9ef440, scan=0x556f9b9e6960)
at ../../../src/include/access/tableam.h:2003
#2  BitmapHeapNext (node=0x556f9b9eeb00) at nodeBitmapHeapscan.c:227
#3  0x556f99e1a331 in ExecProcNode (node=0x556f9b9eeb00) at 
../../../src/include/executor/executor.h:274
#4  gather_getnext (gatherstate=0x556f9b9ee970) at nodeGather.c:287
#5  ExecGather (pstate=0x556f9b9ee970) at nodeGather.c:222
#6  0x556f99e24ac3 in ExecProcNode (node=0x556f9b9ee970) at 
../../../src/include/executor/executor.h:274
#7  ExecLimit (pstate=0x556f9b9ee698) at nodeLimit.c:95
#8  0x556f99e0174a in ExecProcNode (node=0x556f9b9ee698) at 
../../../src/include/executor/executor.h:274
#9  ExecutePlan (execute_once=, dest=0x556f9b9f2360, 
direction=, numberTuples=0, sendTuples=true, 
operation=CMD_SELECT, use_parallel_mode=, 
planstate=0x556f9b9ee698, estate=0x556f9b9ee458) at execMain.c:1644
#10 standard_ExecutorRun (queryDesc=0x556f9b944f88, direction=, 
count=0, execute_once=) at execMain.c:363
#11 0x556f99f9cc7f in PortalRunSelect (portal=portal@entry=0x556f9b997008, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x556f9b9f2360)
at pquery.c:924
#12 0x556f99f9dffa in PortalRun (portal=portal@entry=0x556f9b997008, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=run_once@entry=true, 
dest=dest@entry=0x556f9b9f2360, altdest=altdest@entry=0x556f9b9f2360, 
qc=0x7ffc123a0b60) at pquery.c:768
#13 0x556f99f9a57c in exec_simple_query (query_string=0x556f9b91ab18 
"SELECT * FROM linear WHERE (a BETWEEN 2013 AND 8061) OFFSET 100;") at 
postgres.c:1274
#14 0x556f99f9c051 in PostgresMain (dbname=, 
username=) at postgres.c:4680
#15 0x556f99f96def in BackendMain (startup_data=, 
startup_data_len=) at backend_startup.c:101
#16 0x556f99f0c564 in postmaster_child_launch 
(child_type=child_type@entry=B_BACKEND, 
startup_data=startup_data@entry=0x7ffc123a0f90 "", 
startup_data_len=startup_data_len@entry=4, 
client_sock=client_sock@entry=0x7ffc123a0fb0) at launch_backend.c:265
#17 0x556f99f0fea9 in BackendStartup (client_sock=0x7ffc123a0fb0) at 
postmaster.c:3593
#18 ServerLoop () at postmaster.c:1674
#19 0x556f99f11b50 in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x556f9b915260) at postmaster.c:1372
#20 0x556f99c5b0c3 in main (argc=3, argv=0x556f9b915260) at main.c:197
(gdb) p hscan->rs_cindex
$1 = 543516018
(gdb) p hscan
$2 = (HeapScanDesc) 0x556f9b9e6960
(gdb) p *hscan
$3 = {rs_base = {rs_rd = 0x7f6

Extension for PostgreSQL WIP

2024-03-24 Thread ShadowGhost
Cast_jsonb_to_hstore WIP
v1
This extension adds a function that can cast jsonb to hstore.
Here is the link to my GitHub repository, where the extension lives:
https://github.com/antuanviolin/cast_jsonb_to_hstore


Re: Streaming I/O, vectored I/O (WIP)

2024-03-24 Thread Thomas Munro
On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:
> On 12/03/2024 15:02, Thomas Munro wrote:
> > src/backend/storage/aio/streaming_read.c
> > src/include/storage/streaming_read.h
>
> Standard file header comments missing.

Fixed.

> It would be nice to have a comment at the top of streaming_read.c,
> explaining at a high level how the circular buffer, lookahead and all
> that works. Maybe even some diagrams.

Done.

> For example, what is head and what is tail? Before reading the code, I
> assumed that 'head' was the next block range to return in
> pg_streaming_read_buffer_get_next(). But I think it's actually the other
> way round?

Yeah.  People seem to have different natural intuitions about head vs
tail in this sort of thing, so I've switched to descriptive names:
stream->{oldest,next}_buffer_index (see also below).

> > /*
> >  * Create a new streaming read object that can be used to perform the
> >  * equivalent of a series of ReadBuffer() calls for one fork of one 
> > relation.
> >  * Internally, it generates larger vectored reads where possible by looking
> >  * ahead.
> >  */
> > PgStreamingRead *
> > pg_streaming_read_buffer_alloc(int flags,
> >  void *pgsr_private,
> >  size_t 
> > per_buffer_data_size,
> >  
> > BufferAccessStrategy strategy,
> >  
> > BufferManagerRelation bmr,
> >  ForkNumber forknum,
> >  
> > PgStreamingReadBufferCB next_block_cb)
>
> I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> the work it does is memory allocation. But I'd suggest something like
> 'pg_streaming_read_begin' instead.

I like it.  Done.

> Do we really need the pg_ prefix in these?

Good question.  My understanding of our convention is that pg_ is
needed for local replacements/variants/extensions of things with well
known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
a few places where the word is very common/short and we want to avoid
collisions and make sure it's obviously ours (pg_popcount?), and I
guess places that reflect the name of a SQL identifier with a prefix,
but this doesn't seem to qualify for any of those things.  It's a new
thing, our own thing entirely, and sufficiently distinctive and
unconfusable with standard stuff.  So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a
variable name, ie containing that prefix; perhaps they would use  "sr"
or "streaming_read" or "stream".  I used "stream" in a few places in
this version.

Other names improved in this version IMHO: pgsr_private ->
callback_private.  I find it clearer, as a way to indicate that the
provider of the callback "owns" it.  I also reordered the arguments:
now it's streaming_read_buffer_begin(..., callback, callback_private,
per_buffer_data_size), to keep those three things together.

> > Buffer
> > pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void 
> > **per_buffer_data)
>
> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
> for a shorter name.

Hmm.  The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers.  I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions?  What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.  Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.
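
To make the proposed shape concrete, here is a minimal usage sketch
with the names from this version (the exact signatures, and the
declarations around them, are assumptions based on this discussion, not
a final API):

/* Sketch only: callback and begin() signatures are assumed. */
static BlockNumber
my_next_block(StreamingRead *stream, void *callback_private,
              void *per_buffer_data)
{
    /* Return the next block to read, or InvalidBlockNumber at end. */
    return InvalidBlockNumber;
}

    StreamingRead *stream;
    Buffer         buf;
    void          *per_buffer_data;

    /* 'rel' is the Relation being scanned. */
    stream = streaming_read_buffer_begin(0 /* flags */,
                                         NULL /* strategy */,
                                         BMR_REL(rel), MAIN_FORKNUM,
                                         my_next_block,
                                         NULL /* callback_private */,
                                         0 /* per_buffer_data_size */);
    while ((buf = streaming_read_buffer_next(stream,
                                             &per_buffer_data)) != InvalidBuffer)
    {
        /* ... consume the buffer ... */
        ReleaseBuffer(buf);
    }
    streaming_read_buffer_end(stream);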

It's also a bit grammatically weird to say StartReadBuffers() and
WaitReadBuffers() in the bufmgr API...  Hmm.  Perhaps we should just
call it ReadBuffers() and WaitForBufferIO()?  Maybe surprising because
the former isn't just like ReadBuffer() ... but on the other hand no
one said it has to be, and sometimes it even is (when it gets a hit).
I suppose there could even be a flag READ_BUFFERS_WAIT or the opposite
to make the asynchrony optional or explicit if someone has a problem
with that.

(Hmm, that'd be a bit like the Windows native file API, where
ReadFile() is synchronous or asynchronous depending on flags.)

> >
> >   /*
> >* pgsr->ranges is a circular buffer.  When it is empty, head == tail.
> >* When it is full, there is an empty element between head and tail.  
> > Head
> >* can also be empty (nblocks == 0), therefore we need two extra 
> > elements
> >* for non-occupied ranges, on top of max_pi

Re: Extension for PostgreSQL WIP

2024-03-24 Thread David G. Johnston
On Sun, Mar 24, 2024 at 5:49 AM ShadowGhost 
wrote:

> Cast_jsonb_to_hstore WIP
> v1
> This extension adds a function that can cast jsonb to hstore.
> Here is the link to my GitHub repository, where the extension lives:
> https://github.com/antuanviolin/cast_jsonb_to_hstore
>

If you are intending to submit this to the project you need to follow the
correct procedures.

https://wiki.postgresql.org/wiki/Submitting_a_Patch

Otherwise, this is not the correct place to be promoting your extension.

David J.

p.s. I would advise including the whole bit about jsonb and hstore in
the email subject line (if you resubmit in a proper format) instead of
only in the body of the email.  Subject lines are very important on a
mailing list such as this and should be fairly precise - not just the
word "extension".


Re: pg_stat_statements and "IN" conditions

2024-03-24 Thread Yasuo Honda
Thanks for the information. I can apply these 4 patches on top of
0eb23285a2. I tested this branch from Ruby on Rails, and it shows some
unexpected behavior from my point of view: setting
pg_stat_statements.query_id_const_merge_threshold = 5 does not
normalize SQL queries whose IN clauses contain more than 5 constants.
Here are test steps.
https://gist.github.com/yahonda/825ffccc4dcb58aa60e12ce33d25cd45#expected-behavior

I would appreciate it if someone could confirm whether my understanding is correct.
--
Yasuo Honda

On Sun, Mar 24, 2024 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> Sure, I can rebase, give me a moment. If you don't want to wait, there
> is a base commit in the patch, against which it should be applied
> without issues, 0eb23285a2.




Re: Sync scan & regression tests

2024-03-24 Thread Tom Lane
Heikki Linnakangas  writes:
> On 19/09/2023 01:57, Andres Freund wrote:
>> On 2023-09-18 13:49:24 +0300, Heikki Linnakangas wrote:
>>> d) Copy fewer rows to the table in the test. If we copy only 6 rows, for
>>> example, the table will have only two pages, regardless of shared_buffers.
>>> 
>>> I'm leaning towards d). The whole test is a little fragile, it will also
>>> fail with a non-default block size, for example. But c) seems like a simple
>>> fix and wouldn't look too out of place in the test.

>> Hm, what do you mean with the last sentence? Oh, is the test you're
>> referencing the relation-extension logic?

> Sorry, I said "c) seems like a simple fix ...", but I meant "d) seems 
> like a simple fix ..."
> I meant the attached.

This thread stalled out months ago, but chipmunk is still failing in
HEAD and v16.  Can we please have a fix?  I'm good with Heikki's
adjustment to the pg_visibility test case.

regards, tom lane




Re: make dist using git archive

2024-03-24 Thread Tristan Partin

3 comments left that are inconsequential. Feel free to ignore.


+# Meson has its own distribution building command (meson dist), but we
+# are not using that at this point.  The main problem is that the way
+# they have implemented it, it is not deterministic.  Also, we want it
+# to be equivalent to the "make" version for the time being.  But the
+# target name "dist" in meson is reserved for that reason, so we call
+# the custom target "pgdist".


The second sentence is a run-on.


+if bzip2.found()
+  tar_bz2 = custom_target('tar.bz2',
+build_always_stale: true,
+command: [git, '-C', '@SOURCE_ROOT@',
+  '-c', 'core.autocrlf=false',
+  '-c', 'tar.tar.bz2.command="' + bzip2.path() + '" -c',
+  'archive',
+  '--format', 'tar.bz2',
+  '--prefix', distdir + '/',
+  '-o', join_paths(meson.build_root(), '@OUTPUT@'),
+  'HEAD', '.'],
+install: false,
+output: distdir + '.tar.bz2',
+  )


You might find Meson's string formatting syntax creates a more readable 
command string:


'tar.tar.bz2.command=@0@ -c'.format(bzip2.path())

And then 'install: false' is the default if you feel like leaving it 
out.


Otherwise, let's get this in!

--
Tristan Partin
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-24 Thread Melanie Plageman
On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  wrote:
>
> In heap_page_prune_and_freeze(), we now do some extra work on each live
> tuple, to set the all_visible_except_removable correctly. And also to
> update live_tuples, recently_dead_tuples and hastup. When we're not
> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> enough that it doesn't matter, but is it?

Last year on an early version of the patch set I did some pgbench
tpcb-like benchmarks -- since there is a lot of on-access pruning in
that workload -- and I don't remember it being a showstopper. The code
has changed a fair bit since then. However, I think it might be safer
to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
the rest of the loop after heap_prune_satisfies_vacuum() when
on-access pruning invokes it. I had avoided that because it felt ugly
and error-prone, however it addresses a few other of your points as
well.

For example, I think we should set a bit in the prune/freeze WAL
record's flags to indicate if pruning was done by vacuum or on access
(mentioned in another of your recent emails).

> The first commit (after the WAL format changes) changes the all-visible
> check to use GlobalVisTestIsRemovableXid. The commit message says that
> it's because we don't have 'cutoffs' available, but we only care about
> that when we're freezing, and when we're freezing, we actually do have
> 'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems
> sensible anyway, because that's what we use in
> heap_prune_satisfies_vacuum() too, but just wanted to point that out.

Yes, this is true. If we skip this part of the loop when on-access
pruning invokes it, we can go back to using the OldestXmin. I have
done that as well as some other changes in the attached patch,
conflict_horizon_updates.diff. Note that this patch may not apply on
your latest patch as I wrote it on top of an older version. Switching
back to using OldestXmin for page visibility determination makes this
patch set more similar to master as well. We could keep the
alternative check (with GlobalVisState) to maintain the illusion that
callers passing by_vacuum as True can pass NULL for pagefrz, but I was
worried about the extra overhead.

It would be nice to pick a single reasonable visibility horizon (the
oldest running xid we compare things to) at the beginning of
heap_page_prune_and_freeze() and use it for determining if we can
remove tuples, if we can freeze tuples, and if the page is all
visible. It makes it confusing that we use OldestXmin for freezing and
setting the page visibility in the VM and GlobalVisState for
determining prunability. I think using GlobalVisState for freezing has
been discussed before and ruled out for a variety of reasons, and I
definitely don't suggest making that change in this patch set.

> The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
> these patches's fault). This at least is wrong, because Max(a, b)
> doesn't handle XID wraparound correctly:
>
> >   if (do_freeze)
> >   conflict_xid = 
> > Max(prstate.snapshotConflictHorizon,
> >  
> > presult->frz_conflict_horizon);
> >   else
> >   conflict_xid = 
> > prstate.snapshotConflictHorizon;
>
> Then there's this in lazy_scan_prune():
>
> >   /* Using same cutoff when setting VM is now unnecessary */
> >   if (presult.all_frozen)
> >   presult.frz_conflict_horizon = InvalidTransactionId;
> This does the right thing in the end, but if all the tuples are frozen
> shouldn't frz_conflict_horizon already be InvalidTransactionId? The
> comment says it's "newest xmin on the page", and if everything was
> frozen, all xmins are FrozenTransactionId. In other words, that should
> be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
> caller. Also, frz_conflict_horizon is only set correctly if
> 'all_frozen==true', would be good to mention that in the comments too.
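
On the Max(a, b) point above: TransactionIds wrap around, so a plain
Max() can pick the logically older xid. A minimal sketch of a
wraparound-safe version, using the existing TransactionIdFollows()
helper (and assuming both xids are valid, normal xids):

if (do_freeze)
{
    /* wraparound-safe "Max": compare in modulo-2^32 xid space */
    if (TransactionIdFollows(prstate.snapshotConflictHorizon,
                             presult->frz_conflict_horizon))
        conflict_xid = prstate.snapshotConflictHorizon;
    else
        conflict_xid = presult->frz_conflict_horizon;
}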

Yes, this is a good point. I've spent some time swapping all of this
back into my head. I think we should change the names of all these
conflict horizon variables and introduce some local variables again.
In the attached patch, I've updated the name of the variable in
PruneFreezeResult to vm_conflict_horizon, as it is only used for
emitting a VM update record. Now, I don't set it until the end of
heap_page_prune_and_freeze(). It is only updated from
InvalidTransactionId if the page is not all frozen. As you say, if the
page is all frozen, there can be no conflict.

I've also changed PruneState->snapshotConflictHorizon to
PruneState->latest_xid_removed.

And I introduced the local variables visibility_cutoff_xid and
frz_conflict_horizon. I think it is important we distinguish between
the latest xid pruned, the latest xmin of tuples frozen, and the
latest

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread Tom Lane
John Naylor  writes:
> Done. I pushed this with a few last-minute cosmetic adjustments. This
> has been a very long time coming, but we're finally in the home
> stretch!

I'm not sure why it took a couple weeks for Coverity to notice
ee1b30f12, but it saw it today, and it's not happy:

/srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in 
local_ts_extend_down()
1615node = child;
1616shift -= RT_SPAN;
1617}
1618 
1619/* Reserve slot for the value. */
1620n4 = (RT_NODE_4 *) node.local;
>>> CID 1594658:  Integer handling issues  (BAD_SHIFT)
>>> In expression "key >> shift", shifting by a negative amount has 
>>> undefined behavior.  The shift amount, "shift", is as little as -7.
1621n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
1622n4->base.count = 1;
1623 
1624return &n4->children[0];
1625 }
1626 

I think the point here is that if you start with an arbitrary
non-negative shift value, the preceding loop may in fact decrement it
down to something less than zero before exiting, in which case we
would indeed have trouble.  I suspect that the code is making
undocumented assumptions about the possible initial values of shift.
Maybe some Asserts would be good?  Also, if we're effectively assuming
that shift must be exactly zero here, why not let the compiler
hard-code that?

-   n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
+   n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);
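
Concretely, the suggested hardening might look like this (a sketch,
assuming the loop really is expected to leave shift at exactly zero):

    /* The loop above must have consumed the full key by now. */
    Assert(shift == 0);
    n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);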

regards, tom lane




Re: Streaming I/O, vectored I/O (WIP)

2024-03-24 Thread Melanie Plageman
On Sun, Mar 24, 2024 at 9:02 AM Thomas Munro  wrote:
>
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:
> > On 12/03/2024 15:02, Thomas Munro wrote:
> > > src/backend/storage/aio/streaming_read.c
> > > src/include/storage/streaming_read.h
> >
> > Standard file header comments missing.
>
> Fixed.
>
> > It would be nice to have a comment at the top of streaming_read.c,
> > explaining at a high level how the circular buffer, lookahead and all
> > that works. Maybe even some diagrams.
>
> Done.
>
> > For example, what is head and what is tail? Before reading the code, I
> > assumed that 'head' was the next block range to return in
> > pg_streaming_read_buffer_get_next(). But I think it's actually the other
> > way round?
>
> Yeah.  People seem to have different natural intuitions about head vs
> tail in this sort of thing, so I've switched to descriptive names:
> stream->{oldest,next}_buffer_index (see also below).
>
> > > /*
> > >  * Create a new streaming read object that can be used to perform the
> > >  * equivalent of a series of ReadBuffer() calls for one fork of one 
> > > relation.
> > >  * Internally, it generates larger vectored reads where possible by 
> > > looking
> > >  * ahead.
> > >  */
> > > PgStreamingRead *
> > > pg_streaming_read_buffer_alloc(int flags,
> > >  void 
> > > *pgsr_private,
> > >  size_t 
> > > per_buffer_data_size,
> > >  
> > > BufferAccessStrategy strategy,
> > >  
> > > BufferManagerRelation bmr,
> > >  ForkNumber 
> > > forknum,
> > >  
> > > PgStreamingReadBufferCB next_block_cb)
> >
> > I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> > the work it does is memory allocation. But I'd suggest something like
> > 'pg_streaming_read_begin' instead.
>
> I like it.  Done.
>
> > Do we really need the pg_ prefix in these?
>
> Good question.  My understanding of our convention is that pg_ is
> needed for local replacements/variants/extensions of things with well
> known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
> a few places where the word is very common/short and we want to avoid
> collisions and make sure it's obviously ours (pg_popcount?), and I
> guess places that reflect the name of a SQL identifier with a prefix,
> but this doesn't seem to qualify for any of those things.  It's a new
> thing, our own thing entirely, and sufficiently distinctive and
> unconfusable with standard stuff.  So, prefix removed.
>
> Lots of other patches on top of this one are using "pgsr" as a
> variable name, ie containing that prefix; perhaps they would use  "sr"
> or "streaming_read" or "stream".  I used "stream" in a few places in
> this version.
>
> Other names improved in this version IMHO: pgsr_private ->
> callback_private.  I find it clearer, as a way to indicate that the
> provider of the callback "owns" it.  I also reordered the arguments:
> now it's streaming_read_buffer_begin(..., callback, callback_private,
> per_buffer_data_size), to keep those three things together.

I haven't reviewed the whole patch, but as I was rebasing
bitmapheapscan streaming read user, I found callback_private confusing
because it seems like it is a private callback, not private data
belonging to the callback. Perhaps call it callback_private_data? Also
maybe mention what it is for in the comment above
streaming_read_buffer_begin() and in the StreamingRead structure
itself.

- Melanie




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Tomas Vondra



On 3/24/24 18:38, Melanie Plageman wrote:
> On Sun, Mar 24, 2024 at 01:36:19PM +0100, Tomas Vondra wrote:
>>
>>
>> On 3/23/24 01:26, Melanie Plageman wrote:
>>> On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote:
 On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote:
> On 18/03/2024 17:19, Melanie Plageman wrote:
>> I've attached v7 rebased over this commit.
>
> If we delayed table_beginscan_bm() call further, after starting the TBM
> iterator, we could skip it altogether when the iterator is empty.
>
> That's a further improvement, doesn't need to be part of this patch set.
> Just caught my eye while reading this.

 Hmm. You mean like until after the first call to tbm_[shared]_iterate()?
 AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or
 not the iterator is "empty". Do you mean cases when the bitmap has no
 blocks in it? It seems like we should be able to tell that from the
 TIDBitmap.

>
>> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch
>
> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call 
> the
> flag e.g. SO_NEED_TUPLE.

 Agreed. Done in attached v8. Though I wondered if it was a bit weird
 that the flag is set in the common case and not set in the uncommon
 case...
>>>
>>> v8 actually attached this time
>>
>> I tried to run the benchmarks with v8, but unfortunately it crashes for
>> me very quickly (I've only seen 0015 to crash, so I guess the bug is in
>> that patch).
>>
>> The backtrace attached, this doesn't seem right:
>>
>> (gdb) p hscan->rs_cindex
>> $1 = 543516018
> 
> Thanks for reporting this! I hadn't seen it crash on my machine, so I
> didn't realize that I was no longer initializing rs_cindex and
> rs_ntuples on the first call to heapam_bitmap_next_tuple() (since
> heapam_bitmap_next_block() wasn't being called first). I've done this in
> attached v9.
> 

OK, I've restarted the tests with v9.

> I haven't had a chance yet to reproduce the regressions you saw in the
> streaming read user patch or to look closely at the performance results.

So you tried to reproduce it and didn't hit the issue? Or didn't have
time to look into that yet? FWIW with v7 it failed almost immediately
(only a couple queries until hitting one triggering the issue), but v9
that's not the case (hundreds of queries without an error).

> I don't anticipate the streaming read user will have any performance
> differences in this v9 from v6, since I haven't yet rebased in Thomas'
> latest streaming read API changes nor addressed any other potential
> regression sources.
> 

OK, understood. It'll be interesting to see the behavior with the new
version of Thomas' patch.

I however wonder what the plan with these patches is - do we still plan
to get some of this into v17? It seems to me we're getting uncomfortably
close to the end of the cycle, with a fairly incomplete idea of how it
affects performance.

Which is why I've been focusing more on the refactoring patches (up to
0015), to make sure those don't cause regressions if committed. And I
think that's generally true.

But for the main StreamingRead API the situation is very different.

> I tried rebasing in Thomas' latest version today and something is
> causing a crash that I have yet to figure out. v10 of this patchset will
> have his latest version once I get that fixed. I wanted to share this
> version with what I think is a bug fix for the crash you saw first.
> 

Understood. I'll let the tests with v9 run for now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_dump versus enum types, round N+1

2024-03-24 Thread Tom Lane
Andrew Dunstan  writes:
> On Sat, Mar 23, 2024 at 3:00 PM Tom Lane  wrote:
>> So I'm glad we found that sooner not later, but something needs
>> to be done about it if [1] is to get committed.  It doesn't seem
>> particularly hard to fix though: we just have to track the enum
>> type OIDs made in the current transaction, using largely the same
>> approach as is already used in pg_enum.c to track enum value OIDs.

> Makes sense. Nice clear comments.

Thanks for looking.  Pushed after a bit more work on the comments.

regards, tom lane




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-24 Thread Melanie Plageman
On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote:
> On 20/03/2024 21:17, Melanie Plageman wrote:
> > Attached patch changes-for-0001.patch has a bunch of updated comments --
> > especially for heapam_xlog.h (including my promised diagram). And a few
> > suggestions (mostly changes that I should have made before).
> 
> New version of these WAL format changes attached. Squashed to one patch now.
> I spent more time on the comments throughout the patch. And one
> notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with
> XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a
> cleanup lock to replay the record. It must always be set when
> XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying those
> always needs a cleanup lock. That felt easier to document and understand
> than XLHP_LP_TRUNCATE_ONLY.

Makes sense to me.

> > >  From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
> > > From: Heikki Linnakangas 
> > > Date: Wed, 20 Mar 2024 14:53:31 +0200
> > > Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()
> > > 
> > > Mostly to make local variables more tightly-scoped.
> > 
> > So, I don't think you can move those sub-records into the tighter scope.
> > If you run tests with this applied, you'll see it crashes and a lot of
> > the data in the record is wrong. If you move the sub-record declarations
> > out to a wider scope, the tests pass.
> > 
> > The WAL record data isn't actually copied into the buffer until
> > 
> > recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);
> > 
> > after registering everything.
> > So all of those sub-records you made are out of scope the time it tries
> > to copy from them.
> > 
> > I spent the better part of a day last week trying to figure out what was
> > happening after I did the exact same thing. I must say that I found the
> > xloginsert API incredibly unhelpful on this point.
> 
> Oops. I had that in mind and that was actually why I moved the
> XLogRegisterData() call to the end of the function, because I found it
> confusing to register the struct before filling it in completely, even
> though it works perfectly fine. But then I missed it anyway when I moved the
> local variables. I added a brief comment on that.

Comment was a good idea.
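
For anyone unfamiliar with the pitfall: XLogRegisterData() only records
a pointer, and the bytes are not copied until XLogInsert() runs. A
minimal sketch of the bug pattern (the sub-record type here is
hypothetical):

    {
        SomeSubRecord subrec;           /* hypothetical sub-record */

        /* registers only the pointer, not the bytes */
        XLogRegisterData((char *) &subrec, sizeof(subrec));
    }                                   /* subrec goes out of scope... */

    /* ...but its bytes are copied only now, reading dead stack memory */
    recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);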

> There is another patch in the commitfest that touches this area: "Recording
> whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning"
> [1]. That actually goes in the opposite direction than this patch. That
> patch wants to add more information, to show whether a record was emitted by
> VACUUM or on-access pruning, while this patch makes the freezing and
> VACUUM's 2nd phase records also look the same. We could easily add more
> flags to xl_heap_prune to distinguish them. Or assign different xl_info code
> for them, like that other patch proposed. But I don't think that needs to
> block this patch, that can be added as a separate patch.
> 
> [1] 
> https://www.postgresql.org/message-id/cah2-wzmsevhox+hjpfmqgcxwwdgnzp0r9f+vbnpok6tgxmp...@mail.gmail.com

I think it would be very helpful to distinguish amongst vacuum pass 1,
2, and on-access pruning. I often want to know what did most of the
pruning -- and I could see also wanting to know if the first or second
vacuum pass was responsible for removing the tuples. I agree it could be
done separately, but it would be very helpful to have as soon as
possible now that the record type will be the same for all three.

> From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Fri, 22 Mar 2024 23:10:22 +0200
> Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats
>  
>  /*
> - * Handles XLOG_HEAP2_PRUNE record type.
> - *
> - * Acquires a full cleanup lock.
> + * Replay XLOG_HEAP2_PRUNE_FREEZE record.
>   */
>  static void
> -heap_xlog_prune(XLogReaderState *record)
> +heap_xlog_prune_freeze(XLogReaderState *record)
>  {
>   XLogRecPtr  lsn = record->EndRecPtr;
> - xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
> + char   *ptr;
> + xl_heap_prune *xlrec;
>   Buffer  buffer;
>   RelFileLocator rlocator;
>   BlockNumber blkno;
>   XLogRedoAction action;
>  
>   XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno);
> + ptr = XLogRecGetData(record);

I don't love having two different pointers that alias each other when
we don't know which one is for what. Perhaps we could memcpy xlrec, as
in my attached diff (log_updates.diff). It might also perform slightly
better than accessing flags through an xl_heap_prune * -- since it
wouldn't be doing pointer dereferencing.

> + xlrec = (xl_heap_prune *) ptr;
> + ptr += SizeOfHeapPrune;
>  
>   /*
> -  * We're about to remove tuples. In Hot Standby mode, ensure that 
> there's
> -  * no queries running for which the removed tuples are still visible.
> +  * 

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Melanie Plageman
On Sun, Mar 24, 2024 at 2:22 PM Tomas Vondra
 wrote:
>
> On 3/24/24 18:38, Melanie Plageman wrote:
> > I haven't had a chance yet to reproduce the regressions you saw in the
> > streaming read user patch or to look closely at the performance results.
>
> So you tried to reproduce it and didn't hit the issue? Or didn't have
> time to look into that yet? FWIW with v7 it failed almost immediately
> (only a couple queries until hitting one triggering the issue), but v9
> that's not the case (hundreds of queries without an error).

I haven't started trying to reproduce it yet.

> I however wonder what the plan with these patches is - do we still plan
> to get some of this into v17? It seems to me we're getting uncomfortably
> close to the end of the cycle, with a fairly incomplete idea of how it
> affects performance.
>
> Which is why I've been focusing more on the refactoring patches (up to
> 0015), to make sure those don't cause regressions if committed. And I
> think that's generally true.

Thank you for testing the refactoring patches with this in mind! Out
of the refactoring patches, I think there is a subset of them that
have independent value without the streaming read user. I think it is
worth committing the first few patches because they remove a table AM
layering violation. IMHO, all of the patches up to "Make
table_scan_bitmap_next_block() async friendly" make the code nicer and
better. And, if folks like the patch "Remove
table_scan_bitmap_next_block()", then I think I could rebase that back
in on top of "Make table_scan_bitmap_next_block() async friendly".
This would mean table AMs would only have to implement one callback
(table_scan_bitmap_next_tuple()) which I also think is a net
improvement and simplification.

The other refactoring patches may not be interesting without the
streaming read user.

> But for the main StreamingRead API the situation is very different.

My intent for the bitmapheapscan streaming read user was to get it
into 17, but I'm not sure that looks likely. The main issues Thomas is
looking into right now are related to regressions for a fully cached
scan (noticeable with the pg_prewarm streaming read user). With all of
these fixed, I anticipate we will still see enough behavioral
differences with the bitmapheap scan streaming read user that it may
not be committable in time. Though, I have yet to work on reproducing
the regressions with the BHS streaming read user - mostly because I was
focused on getting the refactoring ready, not so much because the
streaming read API is unstable.

- Melanie




Re: SET ROLE documentation improvement

2024-03-24 Thread Nathan Bossart
On Sat, Mar 23, 2024 at 08:37:20AM -0400, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 4:51 PM Nathan Bossart  
> wrote:
>> Actually, shouldn't this one be back-patched to v16?  If so, I'd do that
>> one separately from the other changes we are discussing.
> 
> Sure, that seems fine.

Committed that part.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add AVX2 support to simd.h

2024-03-24 Thread Nathan Bossart
Here's a new version of 0001 with some added #ifdefs that cfbot revealed
were missing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From cc2bc5ca5b49cd8641af8b2231a34a1aa5091bb9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 20 Mar 2024 14:20:24 -0500
Subject: [PATCH v7 1/1] pg_lfind32(): add "overlap" code for remaining
 elements

---
 src/include/port/pg_lfind.h | 105 
 1 file changed, 72 insertions(+), 33 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..5830cc7cb3 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,51 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+#ifndef USE_NO_SIMD
+/*
+ * pg_lfind32_helper
+ *
+ * Searches one 4-register-block of integers.  The caller is responsible for
+ * ensuring that there are at least 4-registers-worth of integers remaining.
+ */
+static inline bool
+pg_lfind32_helper(const Vector32 keys, uint32 *base)
+{
+	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	Vector32	vals1,
+vals2,
+vals3,
+vals4,
+result1,
+result2,
+result3,
+result4,
+tmp1,
+tmp2,
+result;
+
+	/* load the next block into 4 registers */
+	vector32_load(&vals1, base);
+	vector32_load(&vals2, &base[nelem_per_vector]);
+	vector32_load(&vals3, &base[nelem_per_vector * 2]);
+	vector32_load(&vals4, &base[nelem_per_vector * 3]);
+
+	/* compare each value to the key */
+	result1 = vector32_eq(keys, vals1);
+	result2 = vector32_eq(keys, vals2);
+	result3 = vector32_eq(keys, vals3);
+	result4 = vector32_eq(keys, vals4);
+
+	/* combine the results into a single variable */
+	tmp1 = vector32_or(result1, result2);
+	tmp2 = vector32_or(result3, result4);
+	result = vector32_or(tmp1, tmp2);
+
+	/* return whether there was a match */
+	return vector32_is_highbit_set(result);
+}
+#endif			/* ! USE_NO_SIMD */
+
 /*
  * pg_lfind32
  *
@@ -119,46 +164,40 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
-	for (i = 0; i < tail_idx; i += nelem_per_iteration)
+	/*
+	 * If there aren't enough elements for the SIMD code, jump to the standard
+	 * one-by-one linear search code.
+	 */
+	if (nelem <= nelem_per_iteration)
+		goto one_by_one;
+
+	/*
+	 * Process as many elements as possible with a block of 4 registers.
+	 */
+	do
 	{
-		Vector32	vals1,
-	vals2,
-	vals3,
-	vals4,
-	result1,
-	result2,
-	result3,
-	result4,
-	tmp1,
-	tmp2,
-	result;
-
-		/* load the next block into 4 registers */
-		vector32_load(&vals1, &base[i]);
-		vector32_load(&vals2, &base[i + nelem_per_vector]);
-		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
-		vector32_load(&vals4, &base[i + nelem_per_vector * 3]);
-
-		/* compare each value to the key */
-		result1 = vector32_eq(keys, vals1);
-		result2 = vector32_eq(keys, vals2);
-		result3 = vector32_eq(keys, vals3);
-		result4 = vector32_eq(keys, vals4);
-
-		/* combine the results into a single variable */
-		tmp1 = vector32_or(result1, result2);
-		tmp2 = vector32_or(result3, result4);
-		result = vector32_or(tmp1, tmp2);
-
-		/* see if there was a match */
-		if (vector32_is_highbit_set(result))
+		if (pg_lfind32_helper(keys, &base[i]))
 		{
 			Assert(assert_result == true);
 			return true;
 		}
-	}
+
+		i += nelem_per_iteration;
+
+	} while (i < tail_idx);
+
+	/*
+	 * Process the last 'nelem_per_iteration' elements in the array with a
+	 * 4-register block.  This will cause us to check some of the elements
+	 * more than once, but that won't affect correctness, and testing has
+	 * demonstrated that this helps more cases than it harms.
+	 */
+	Assert(assert_result == pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration]));
+	return pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration]);
+
 #endif			/* ! USE_NO_SIMD */
 
+one_by_one:
 	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{
-- 
2.25.1
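
To make the "overlap" trick concrete: with SSE2, a Vector32 holds four
uint32s, so one iteration covers 16 elements. A standalone sketch of
the resulting access pattern (illustrative only, not part of the
patch):

#include <stdio.h>

int
main(void)
{
	unsigned	nelem = 20;		/* example array length */
	unsigned	per_iter = 16;	/* 4 registers x 4 uint32s */
	unsigned	i;

	/* full blocks */
	for (i = 0; i + per_iter <= nelem; i += per_iter)
		printf("block at [%u..%u]\n", i, i + per_iter - 1);
	/* overlapped tail: elements 4..15 get checked twice, harmlessly */
	if (i < nelem)
		printf("tail  at [%u..%u]\n", nelem - per_iter, nelem - 1);
	return 0;
}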



Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Tomas Vondra
On 3/24/24 21:12, Melanie Plageman wrote:
> On Sun, Mar 24, 2024 at 2:22 PM Tomas Vondra
>  wrote:
>>
>> On 3/24/24 18:38, Melanie Plageman wrote:
>>> I haven't had a chance yet to reproduce the regressions you saw in the
>>> streaming read user patch or to look closely at the performance results.
>>
>> So you tried to reproduce it and didn't hit the issue? Or didn't have
>> time to look into that yet? FWIW with v7 it failed almost immediately
>> (only a couple queries until hitting one triggering the issue), but v9
>> that's not the case (hundreds of queries without an error).
> 
> I haven't started trying to reproduce it yet.
> 
>> I however wonder what the plan with these patches is - do we still plan
>> to get some of this into v17? It seems to me we're getting uncomfortably
>> close to the end of the cycle, with a fairly incomplete idea of how it
>> affects performance.
>>
>> Which is why I've been focusing more on the refactoring patches (up to
>> 0015), to make sure those don't cause regressions if committed. And I
>> think that's generally true.
> 
> Thank you for testing the refactoring patches with this in mind! Out
> of the refactoring patches, I think there is a subset of them that
> have independent value without the streaming read user. I think it is
> worth committing the first few patches because they remove a table AM
> layering violation. IMHO, all of the patches up to "Make
> table_scan_bitmap_next_block() async friendly" make the code nicer and
> better. And, if folks like the patch "Remove
> table_scan_bitmap_next_block()", then I think I could rebase that back
> in on top of "Make table_scan_bitmap_next_block() async friendly".
> This would mean table AMs would only have to implement one callback
> (table_scan_bitmap_next_tuple()) which I also think is a net
> improvement and simplification.
> 
> The other refactoring patches may not be interesting without the
> streaming read user.
> 

I admit not reviewing the individual patches very closely yet, but this
matches how I understood them - that at least some are likely an
improvement on their own, not just as a refactoring preparing for the
switch to streaming reads.

We only have ~2 weeks left, so it's probably time to focus on getting at
least those improvements committed. I see Heikki was paying way more
attention to the patches than me, though ...

BTW when you say "up to 'Make table_scan_bitmap_next_block() async
friendly'", do you mean including that patch, or is that the first
patch that is not one of the independently useful ones?

(I took a quick look at the first couple patches and I appreciate that
you keep separate patches with small cosmetic changes to keep the actual
patch smaller and easier to understand.)

>> But for the main StreamingRead API the situation is very different.
> 
> My intent for the bitmapheapscan streaming read user was to get it
> into 17, but I'm not sure that looks likely. The main issues Thomas is
> looking into right now are related to regressions for a fully cached
> scan (noticeable with the pg_prewarm streaming read user). With all of
> these fixed, I anticipate we will still see enough behavioral
> differences with the bitmapheap scan streaming read user that it may
> not be committable in time. Though, I have yet to work on reproducing
> the regressions with the BHS streaming read user - mostly because I was
> focused on getting the refactoring ready, not so much because the
> streaming read API is unstable.
> 

I don't have a very good intuition regarding the impact of the streaming
API patch on performance. I haven't been following that thread very
closely, but AFAICS there wasn't much discussion about that - perhaps it
happened off-list, not sure. So who knows, really?

Which is why I started looking at this patch instead - it seemed easier
to benchmark with a somewhat realistic workload.

But yeah, there certainly were significant behavior changes, and it's
unlikely that whatever Thomas did in v8 made them go away.

FWIW I certainly am *not* suggesting there must be no behavior changes,
that's simply not possible. I'm not even suggesting no queries must get
slower - given the dependence on storage, I think some regressions are
pretty much inevitable. But it'd still be good to know the regressions
are reasonably rare exceptions rather than the common case, and that's
not what I'm seeing ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: add AVX2 support to simd.h

2024-03-24 Thread Nathan Bossart
On Sun, Mar 24, 2024 at 03:53:17PM -0500, Nathan Bossart wrote:
> Here's a new version of 0001 with some added #ifdefs that cfbot revealed
> were missing.

Sorry for the noise.  cfbot revealed another silly mistake (forgetting to
reset the "i" variable in the assertion path).  That should be fixed in v8.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From f15d85844370aef8505559fc0f2db629b135a9e8 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 20 Mar 2024 14:20:24 -0500
Subject: [PATCH v8 1/1] pg_lfind32(): add "overlap" code for remaining
 elements

---
 src/include/port/pg_lfind.h | 109 
 1 file changed, 74 insertions(+), 35 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..f746aabbf9 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,51 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+#ifndef USE_NO_SIMD
+/*
+ * pg_lfind32_helper
+ *
+ * Searches one 4-register-block of integers.  The caller is responsible for
+ * ensuring that there are at least 4-registers-worth of integers remaining.
+ */
+static inline bool
+pg_lfind32_helper(const Vector32 keys, uint32 *base)
+{
+	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	Vector32	vals1,
+vals2,
+vals3,
+vals4,
+result1,
+result2,
+result3,
+result4,
+tmp1,
+tmp2,
+result;
+
+	/* load the next block into 4 registers */
+	vector32_load(&vals1, base);
+	vector32_load(&vals2, &base[nelem_per_vector]);
+	vector32_load(&vals3, &base[nelem_per_vector * 2]);
+	vector32_load(&vals4, &base[nelem_per_vector * 3]);
+
+	/* compare each value to the key */
+	result1 = vector32_eq(keys, vals1);
+	result2 = vector32_eq(keys, vals2);
+	result3 = vector32_eq(keys, vals3);
+	result4 = vector32_eq(keys, vals4);
+
+	/* combine the results into a single variable */
+	tmp1 = vector32_or(result1, result2);
+	tmp2 = vector32_or(result3, result4);
+	result = vector32_or(tmp1, tmp2);
+
+	/* return whether there was a match */
+	return vector32_is_highbit_set(result);
+}
+#endif			/* ! USE_NO_SIMD */
+
 /*
  * pg_lfind32
  *
@@ -109,9 +154,9 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	bool		assert_result = false;
 
 	/* pre-compute the result for assert checking */
-	for (i = 0; i < nelem; i++)
+	for (int j = 0; j < nelem; j++)
 	{
-		if (key == base[i])
+		if (key == base[j])
 		{
 			assert_result = true;
 			break;
@@ -119,46 +164,40 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
-	for (i = 0; i < tail_idx; i += nelem_per_iteration)
+	/*
+	 * If there aren't enough elements for the SIMD code, jump to the standard
+	 * one-by-one linear search code.
+	 */
+	if (nelem <= nelem_per_iteration)
+		goto one_by_one;
+
+	/*
+	 * Process as many elements as possible with a block of 4 registers.
+	 */
+	do
 	{
-		Vector32	vals1,
-	vals2,
-	vals3,
-	vals4,
-	result1,
-	result2,
-	result3,
-	result4,
-	tmp1,
-	tmp2,
-	result;
-
-		/* load the next block into 4 registers */
-		vector32_load(&vals1, &base[i]);
-		vector32_load(&vals2, &base[i + nelem_per_vector]);
-		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
-		vector32_load(&vals4, &base[i + nelem_per_vector * 3]);
-
-		/* compare each value to the key */
-		result1 = vector32_eq(keys, vals1);
-		result2 = vector32_eq(keys, vals2);
-		result3 = vector32_eq(keys, vals3);
-		result4 = vector32_eq(keys, vals4);
-
-		/* combine the results into a single variable */
-		tmp1 = vector32_or(result1, result2);
-		tmp2 = vector32_or(result3, result4);
-		result = vector32_or(tmp1, tmp2);
-
-		/* see if there was a match */
-		if (vector32_is_highbit_set(result))
+		if (pg_lfind32_helper(keys, &base[i]))
 		{
 			Assert(assert_result == true);
 			return true;
 		}
-	}
+
+		i += nelem_per_iteration;
+
+	} while (i < tail_idx);
+
+	/*
+	 * Process the last 'nelem_per_iteration' elements in the array with a
+	 * 4-register block.  This will cause us to check some of the elements
+	 * more than once, but that won't affect correctness, and testing has
+	 * demonstrated that this helps more cases than it harms.
+	 */
+	Assert(assert_result == pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration]));
+	return pg_lfind32_helper(keys, &base[nelem - nelem_per_iteration]);
+
 #endif			/* ! USE_NO_SIMD */
 
+one_by_one:
 	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{
-- 
2.25.1



Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Melanie Plageman
On Sun, Mar 24, 2024 at 5:59 PM Tomas Vondra
 wrote:
>
> BTW when you say "up to 'Make table_scan_bitmap_next_block() async
> friendly'" do you mean including that patch, or that this is the first
> patch that is not one of the independently useful patches.

I think the code is easier to understand with "Make
table_scan_bitmap_next_block() async friendly". Prior to that commit,
table_scan_bitmap_next_block() could return false even when the bitmap
has more blocks, expecting the caller to handle this and invoke it
again. I think that interface is very confusing. The downside of the
code in that state is that the code for prefetching is still in the
BitmapHeapNext() code and the code for getting the current block is in
the heap AM-specific code. I took a stab at fixing this in v9's 0013,
but the outcome wasn't very attractive.
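To illustrate the confusion, here is a minimal sketch of the old calling
convention (control flow and helper names are illustrative only, not the
actual executor code):

    /* Illustrative only: a false return did not mean "bitmap exhausted",
     * so the caller had to distinguish the two cases and call again. */
    while (!bitmap_exhausted(tbm))      /* hypothetical helper */
    {
        if (!table_scan_bitmap_next_block(scan, tbmres))
            continue;                   /* false, yet more blocks remain */

        /* ... fetch and return tuples from the current block ... */
    }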

What I will do tomorrow is reorder and group the commits such that all
of the commits that are useful independent of streaming read are first
(I think 0014 and 0015 are independently valuable but they are on top
of some things that are only useful to streaming read because they are
more recently requested changes). I think I can actually do a bit of
simplification in terms of how many commits there are and what is in
each. Just to be clear, v9 is still reviewable. I am just going to go
back and change what is included in each commit.

> (I took a quick look at the first couple patches and I appreciate that
> you keep separate patches with small cosmetic changes to keep the actual
> patch smaller and easier to understand.)

Thanks!




Re: Streaming I/O, vectored I/O (WIP)

2024-03-24 Thread Thomas Munro
On Mon, Mar 25, 2024 at 6:30 AM Melanie Plageman
 wrote:
> I haven't reviewed the whole patch, but as I was rebasing
> bitmapheapscan streaming read user, I found callback_private confusing
> because it seems like it is a private callback, not private data
> belonging to the callback. Perhaps call it callback_private_data? Also

WFM.
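For context, the shape of the signature under discussion might look like
this (a sketch based on the thread, not the committed API; the typedef
name is assumed):

    /* Sketch: "callback_private_data" makes clear this is private data
     * passed to the callback, not a private callback. */
    typedef BlockNumber (*StreamingReadBufferCB) (StreamingRead *stream,
                                                  void *callback_private_data,
                                                  void *per_buffer_data);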

> maybe mention what it is for in the comment above
> streaming_read_buffer_begin() and in the StreamingRead structure
> itself.

Yeah.  I've tried to improve the comments on all three public
functions.  I also moved the three public functions _begin(), _next(),
_end() to be next to each other after the static helper functions.

Working on perf regression/tuning reports today, more soon...
From edd3d078cf8d4b0c2f08df82295825f7107ec62b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v9 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC buffer_io_size, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Nazir Bilal Yavuz 
Reviewed-by: Dilip Kumar 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  14 +
 src/backend/storage/buffer/bufmgr.c   | 658 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h  |  45 +-
 src/tools/pgindent/typedefs.list  |   1 +
 7 files changed, 535 insertions(+), 212 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c4086..3af86c59384 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'

   
 
+  
+   buffer_io_size (integer)
+   
+buffer_io_size configuration parameter
+   
+   
+   
+
+ Controls the target I/O size in operations that coalesce buffer I/O.
+ The default is 128kB.
+
+   
+  
+
   
max_worker_processes (integer)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..b5347678726 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,12 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/*
+ * How many buffers should be coalesced into single I/O operations where
+ * possible.
+ */
+int			buffer_io_size = DEFAULT_BUFFER_IO_SIZE;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +482,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 ForkNumber forkNum, BlockNumber blockNum,
-ReadBufferMode mode, BufferAccessStrategy strategy,
-bool *hit);
+ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 		   ForkNumber fork,
 		   BufferAccessStrategy strategy,
@@ -500,7 +510,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 		  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 			  ui

Re: Built-in CTYPE provider

2024-03-24 Thread Jeff Davis
On Sun, 2024-03-24 at 14:00 +0300, Alexander Lakhin wrote:
> Please look at a Valgrind-detected error caused by the following
> query
> (starting from f69319f2f):
> SELECT lower('Π' COLLATE pg_c_utf8);

Thank you for the report!

Fixed in 503c0ad976.

Valgrind did not detect the problem in my setup, so I added a unit test
in case_test.c where it's easier to see the valgrind problem.
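A minimal sketch of such a unit check follows; the exact
unicode_strlower() signature is assumed from the backtrace above and may
differ in detail:

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    /* assumed signature, inferred from the backtrace above */
    extern size_t unicode_strlower(char *dst, size_t dstsize,
                                   const char *src, ssize_t srclen);

    int
    main(void)
    {
        /* heap-allocate with no trailing NUL so valgrind flags any overread */
        char   *src = malloc(2);
        char    dst[16];
        size_t  needed;

        memcpy(src, "\xCE\xA0", 2);     /* U+03A0 GREEK CAPITAL LETTER PI */
        needed = unicode_strlower(dst, sizeof(dst), src, 2);
        assert(needed == 2 && memcmp(dst, "\xCF\x80", 2) == 0);   /* U+03C0 */
        free(src);
        return 0;
    }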

Regards,
Jeff Davis







Re: Adding OLD/NEW support to RETURNING

2024-03-24 Thread jian he
On Mon, Mar 18, 2024 at 6:48 PM Dean Rasheed  wrote:
>
> On Tue, 12 Mar 2024 at 18:21, Dean Rasheed  wrote:
> >
> > Updated version attached tidying up a couple of things and fixing another 
> > bug:
> >
>
> Rebased version attached, on top of c649fa24a4 (MERGE ... RETURNING support).
>


hi, some minor issues I found out.

+/*
+ * ReplaceReturningVarsFromTargetList -
+ * replace RETURNING list Vars with items from a targetlist
+ *
+ * This is equivalent to calling ReplaceVarsFromTargetList() with a
+ * nomatch_option of REPLACEVARS_REPORT_ERROR, but with the added effect of
+ * copying varreturningtype onto any Vars referring to new_result_relation,
+ * allowing RETURNING OLD/NEW to work in the rewritten query.
+ */
+
+typedef struct
+{
+ ReplaceVarsFromTargetList_context rv_con;
+ int new_result_relation;
+} ReplaceReturningVarsFromTargetList_context;
+
+static Node *
+ReplaceReturningVarsFromTargetList_callback(Var *var,
+ replace_rte_variables_context *context)
+{
+ ReplaceReturningVarsFromTargetList_context *rcon =
(ReplaceReturningVarsFromTargetList_context *) context->callback_arg;
+ Node   *newnode;
+
+ newnode = ReplaceVarsFromTargetList_callback(var, context);
+
+ if (var->varreturningtype != VAR_RETURNING_DEFAULT)
+ SetVarReturningType((Node *) newnode, rcon->new_result_relation,
+ var->varlevelsup, var->varreturningtype);
+
+ return newnode;
+}
+
+Node *
+ReplaceReturningVarsFromTargetList(Node *node,
+   int target_varno, int sublevels_up,
+   RangeTblEntry *target_rte,
+   List *targetlist,
+   int new_result_relation,
+   bool *outer_hasSubLinks)
+{
+ ReplaceReturningVarsFromTargetList_context context;
+
+ context.rv_con.target_rte = target_rte;
+ context.rv_con.targetlist = targetlist;
+ context.rv_con.nomatch_option = REPLACEVARS_REPORT_ERROR;
+ context.rv_con.nomatch_varno = 0;
+ context.new_result_relation = new_result_relation;
+
+ return replace_rte_variables(node, target_varno, sublevels_up,
+ ReplaceReturningVarsFromTargetList_callback,
+ (void *) &context,
+ outer_hasSubLinks);
+}

Shouldn't the comment for ReplaceReturningVarsFromTargetList be placed
right above the function itself, rather than above
ReplaceReturningVarsFromTargetList_context?

Also, adding a comment about new_result_relation to struct
ReplaceReturningVarsFromTargetList_context would be great.


The comment /* INDEX_VAR is handled by default case */
appears in execExpr.c and execExprInterp.c.
Should it be moved into the switch's default case?


/*
 * set_deparse_context_plan - Specify Plan node containing expression
 *
 * When deparsing an expression in a Plan tree, we might have to resolve
 * OUTER_VAR, INNER_VAR, or INDEX_VAR references.  To do this, the caller must
 * provide the parent Plan node.
...
*/
does the comment in set_deparse_context_plan need to be updated?

+ * buildNSItemForReturning -
+ * add a ParseNamespaceItem for the OLD or NEW alias in RETURNING.
+ */
+static void
+addNSItemForReturning(ParseState *pstate, const char *aliasname,
+  VarReturningType returning_type)
comment "buildNSItemForReturning" should be "addNSItemForReturning"?


  * results.  If include_dropped is true then empty strings and NULL constants
  * (not Vars!) are returned for dropped columns.
  *
- * rtindex, sublevels_up, and location are the varno, varlevelsup, and location
- * values to use in the created Vars.  Ordinarily rtindex should match the
- * actual position of the RTE in its rangetable.
+ * rtindex, sublevels_up, returning_type, and location are the varno,
+ * varlevelsup, varreturningtype, and location values to use in the created
+ * Vars.  Ordinarily rtindex should match the actual position of the RTE in
+ * its rangetable.
We already updated the comment in expandRTE,
but it seems we only handle RTE_RELATION and some part of RTE_FUNCTION.
Do we need
`
varnode->varreturningtype = returning_type;
`
for the other `rte->rtekind` cases wherever there is a makeVar?

(I don't understand this part: in the case where rte->rtekind is
RTE_SUBQUERY, if I add `varnode->varreturningtype = returning_type;`,
the tests still pass.)
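A sketch of the kind of change being asked about (hypothetical; whether
the other rtekinds need it is exactly the open question):

    /* inside expandRTE()'s per-column loop for some other rtekind */
    varnode = makeVar(rtindex, varattno, vartype, vartypmod,
                      varcollid, sublevels_up);
    varnode->varreturningtype = returning_type;   /* currently only set for
                                                   * RTE_RELATION and parts
                                                   * of RTE_FUNCTION */
    varnode->location = location;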




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 1:53 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > has been a very long time coming, but we're finally in the home
> > stretch!

Thank you for the report.

>
> I'm not sure why it took a couple weeks for Coverity to notice
> ee1b30f12, but it saw it today, and it's not happy:

Hmm, I've also done Coverity Scan in development but I wasn't able to
see this one for some reason...

>
> /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down()
> 1615node = child;
> 1616shift -= RT_SPAN;
> 1617}
> 1618
> 1619/* Reserve slot for the value. */
> 1620n4 = (RT_NODE_4 *) node.local;
> >>> CID 1594658:  Integer handling issues  (BAD_SHIFT)
> >>> In expression "key >> shift", shifting by a negative amount has 
> >>> undefined behavior.  The shift amount, "shift", is as little as -7.
> 1621n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
> 1622n4->base.count = 1;
> 1623
> 1624return &n4->children[0];
> 1625 }
> 1626
>
> I think the point here is that if you start with an arbitrary
> non-negative shift value, the preceding loop may in fact decrement it
> down to something less than zero before exiting, in which case we
> would indeed have trouble.  I suspect that the code is making
> undocumented assumptions about the possible initial values of shift.
> Maybe some Asserts would be good?  Also, if we're effectively assuming
> that shift must be exactly zero here, why not let the compiler
> hard-code that?
>
> -   n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
> +   n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);

Sounds like a good solution. I've attached the patch for that.
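The fix then looks roughly like this (a sketch, with the Assert placed
right after the loop as suggested downthread; the loop condition is
elided):

    while (/* ... not yet at the leaf level ... */)
    {
        node = child;
        shift -= RT_SPAN;
    }

    Assert(shift == 0);

    /* Reserve slot for the value. */
    n4 = (RT_NODE_4 *) node.local;
    n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);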

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_radixtree.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread Tom Lane
Masahiko Sawada  writes:
> On Mon, Mar 25, 2024 at 1:53 AM Tom Lane  wrote:
>> I think the point here is that if you start with an arbitrary
>> non-negative shift value, the preceding loop may in fact decrement it
>> down to something less than zero before exiting, in which case we
>> would indeed have trouble.  I suspect that the code is making
>> undocumented assumptions about the possible initial values of shift.
>> Maybe some Asserts would be good?  Also, if we're effectively assuming
>> that shift must be exactly zero here, why not let the compiler
>> hard-code that?

> Sounds like a good solution. I've attached the patch for that.

Personally I'd put the Assert immediately after the loop, because
it's not related to the "Reserve slot for the value" comment.
Seems reasonable otherwise.

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread John Naylor
On Mon, Mar 25, 2024 at 8:02 AM Masahiko Sawada  wrote:
>
> On Mon, Mar 25, 2024 at 1:53 AM Tom Lane  wrote:
> >
> > I'm not sure why it took a couple weeks for Coverity to notice
> > ee1b30f12, but it saw it today, and it's not happy:
>
> Hmm, I've also done Coverity Scan in development but I wasn't able to
> see this one for some reason...

Hmm, before 30e144287 this code only ran in a test module, is it
possible Coverity would not find it there?




Re: Cannot find a working 64-bit integer type on Illumos

2024-03-24 Thread Japin Li


On Sat, 23 Mar 2024 at 01:22, Japin Li  wrote:
> On Sat, 23 Mar 2024 at 01:04, Tom Lane  wrote:
>> Japin Li  writes:
>>> When I try to configure PostgreSQL 16.2 on Illumos using the following command,
>>> it complains $subject.
>>
>>> ./configure --enable-cassert --enable-debug --enable-nls --with-perl \
>>>   --with-python --without-tcl --without-gssapi --with-openssl \
>>>   --with-ldap --with-libxml --with-libxslt --without-systemd \
>>>   --with-readline --enable-thread-safety --enable-dtrace \
>>>   DTRACEFLAGS=-64 CFLAGS=-Werror
>>
>>> However, if I remove the `CFLAGS=-Werror`, it works fine.
>>> I'm not sure what happened here.
>>
>> CFLAGS=-Werror breaks a whole lot of configure's tests, not only that
>> one.  (We even have this documented, see [1].)  So you can't inject
>> -Werror that way.  What I do on my buildfarm animals is the equivalent
>> of
>>
>>  export COPT='-Werror'
>>
>> after configure and before build.  I think configure pays no attention
>> to COPT, so it'd likely be safe to keep that set all the time, but in
>> the buildfarm client it's just as easy to be conservative.
>>
>>  regards, tom lane
>>
>> [1] https://www.postgresql.org/docs/devel/install-make.html#CONFIGURE-ENVVARS
>
> Thank you very much!  I didn't notice this part before.

I tried to use the following to compile it; however, the build fails:

$ ../configure --enable-cassert --enable-debug --enable-nls --with-perl 
--with-python --without-tcl --without-gssapi --with-openssl --with-ldap 
--with-libxml --with-libxslt --without-systemd --with-readline 
--enable-thread-safety --enable-dtrace DTRACEFLAGS=-64
$ make COPT='-Werror' -s
/home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c: In function 'repairDependencyLoop':
/home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c:1276:3: error: format not a string literal and no format arguments [-Werror=format-security]
 1276 |   pg_log_warning(ngettext("there are circular foreign-key constraints on this table:",
      |   ^~
cc1: all warnings being treated as errors
make[3]: *** [: pg_dump_sort.o] Error 1
make[2]: *** [Makefile:43: all-pg_dump-recurse] Error 2
make[1]: *** [Makefile:42: all-bin-recurse] Error 2
make: *** [GNUmakefile:11: all-src-recurse] Error 2




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread Tom Lane
John Naylor  writes:
> Hmm, before 30e144287 this code only ran in a test module, is it
> possible Coverity would not find it there?

That could indeed explain why Coverity didn't see it.  I'm not
sure how our community run is set up, but it may not build the
test modules.

regards, tom lane




Re: Cannot find a working 64-bit integer type on Illumos

2024-03-24 Thread Tom Lane
Japin Li  writes:
> /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c: In function 
> 'repairDependencyLoop':
> /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c:1276:3: error: 
> format not a string literal and no format arguments [-Werror=format-security]
>  1276 |   pg_log_warning(ngettext("there are circular foreign-key constraints 
> on this table:",
>   |   ^~

Yeah, some of the older buildfarm animals issue that warning too.
AFAICS it's a bogus compiler heuristic: there is not anything
wrong with the code as given.
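(For reference, the conventional way to placate that heuristic, should
one want to, is to route the translated string through a constant
format; a sketch, with the plural message text and the count variable
assumed:

    pg_log_warning("%s",
                   ngettext("there are circular foreign-key constraints on this table:",
                            "there are circular foreign-key constraints among these tables:",
                            ntables));   /* count variable assumed */
)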

regards, tom lane




Re: Cannot find a working 64-bit integer type on Illumos

2024-03-24 Thread Japin Li


On Mon, 25 Mar 2024 at 09:32, Tom Lane  wrote:
> Japin Li  writes:
>> /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c: In function 
>> 'repairDependencyLoop':
>> /home/japin/postgres/debug/../src/bin/pg_dump/pg_dump_sort.c:1276:3: error: 
>> format not a string literal and no format arguments [-Werror=format-security]
>>  1276 |   pg_log_warning(ngettext("there are circular foreign-key 
>> constraints on this table:",
>>   |   ^~
>
> Yeah, some of the older buildfarm animals issue that warning too.
> AFAICS it's a bogus compiler heuristic: there is not anything
> wrong with the code as given.
>

Thanks!  It seems I should remove the -Werror option on Illumos.




Re: Properly pathify the union planner

2024-03-24 Thread David Rowley
On Tue, 12 Mar 2024 at 23:21, David Rowley  wrote:
> I've attached v3.

I spent quite a bit more time looking at this.

I discovered that the dNumGroups wasn't being set as it should have
been for INTERSECT and EXCEPT queries.  There was a plan change as a
result of this.  I've fixed this by adjusting where dNumGroups is set.
It must be delayed until after the setop child paths have been
created.

Aside from this, the changes I made were mostly cosmetic.  However, I
did notice that I wasn't setting the union child RelOptInfo's
ec_indexes in add_setop_child_rel_equivalences().  I also discovered
that we're not doing that properly for the top-level RelOptInfo for
the UNION query prior to this change.  The reason is that
Var.varno==0 for the top-level UNION targetlist. The code in
get_eclass_for_sort_expr() effectively misses this relation due to
"while ((i = bms_next_member(newec->ec_relids, i)) > 0)".  This
happens to be good because there is no root->simple_rel_array[0]
entry, so it happens to prevent that code from crashing.   It seems ok that
the ec_indexes are not set for the top-level set RelOptInfo as
get_eclass_for_sort_expr() does not make use of ec_indexes while
searching for an existing EquivalenceClass.  Maybe we should fix this
varno == 0 hack and adjust get_eclass_for_sort_expr() so that it makes
use of the ec_indexes.

It's possible to see this happen with a query such as:

SELECT oid FROM pg_class UNION SELECT oid FROM pg_class ORDER BY oid;

I didn't see that as a reason not to push this patch as this occurs
both with and without this change, so I've now pushed this patch.

Thank you and Andy for reviewing this.

David




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-24 Thread Noah Misch
On Mon, Oct 30, 2023 at 11:29:04AM +0400, Pavel Borisov wrote:
> On Wed, 25 Oct 2023 at 00:13, Alexander Korotkov  wrote:
> > I think this patch is ready to go.  I'm going to push it if there are
> > no objections.

> It's very good that this long-standing patch is finally committed. Thanks a 
> lot!

Agreed.  I gave this feature (commit 5ae2087) a try.  Thanks for implementing
it.  Could I get your input on two topics?


 1. Cross-page comparison at "first key on the next page" only

Cross-page comparisons got this discussion upthread:

On Tue, Mar 02, 2021 at 07:10:32PM -0800, Peter Geoghegan wrote:
> On Mon, Feb 8, 2021 at 2:46 AM Pavel Borisov  wrote:
> > Caveat: if the first entry on the next index page has a key equal to the 
> > key on a previous page AND all heap tid's corresponding to this entry are 
> > invisible, currently cross-page check can not detect unique constraint 
> > violation between previous index page entry and 2nd, 3d and next current 
> > index page entries. In this case, there would be a message that recommends 
> > doing VACUUM to remove the invisible entries from the index and repeat the 
> > check. (Generally, it is recommended to do vacuum before the check, but for 
> > the testing purpose I'd recommend turning it off to check the detection of 
> > visible-invisible-visible duplicates scenarios)

> You're going to have to "couple" buffer locks in the style of
> _bt_check_unique() (as well as keeping a buffer lock on "the first
> leaf page a duplicate might be on" throughout) if you need the test to
> work reliably.

The amcheck feature has no lock coupling at its "first key on the next page"
check.  I think that's fine, because amcheck takes one snapshot at the
beginning and looks for pairs of visible-to-that-snapshot heap tuples with the
same scan key.  _bt_check_unique(), unlike amcheck, must catch concurrent
inserts.  If amcheck "checkunique" wanted to detect duplicates that would
appear when all transactions commit, it would need lock coupling.  (I'm not
suggesting it do that.)  Do you see a problem with the lack of lock coupling
at "first key on the next page"?

> But why bother with that? The tool doesn't have to be
> 100% perfect at detecting corruption (nothing can be), and it's rather
> unlikely that it will matter for this test. A simple test that doesn't
> handle cross-page duplicates is still going to be very effective.

I agree, but perhaps the "first key on the next page" code is more complex
than general-case code would be.  If the lack of lock coupling is fine, then I
think memory context lifecycle is the only obstacle making index page
boundaries special.  Are there factors beyond that?  We already have
state->lowkey kept across pages via MemoryContextAlloc().  Similar lines of
code could preserve the scan key for checkunique, making the "first key on the
next page" code unnecessary.


 2. Raises runtime by 476% despite no dead tuples

I used the following to create a table larger than RAM, 17GB table and 10GB
index on a system with 12GB RAM:

\set count 5
begin;
set maintenance_work_mem = '1GB';
set client_min_messages = debug1;  -- debug2 is per-block spam
create temp table t as select n from generate_series(1,:count) t(n);
create unique index t_idx on t(n);
\dt+ t
\di+ t_idx
create extension amcheck;
select bt_index_check('t_idx', heapallindexed => false, checkunique => false);
select bt_index_check('t_idx', heapallindexed => false, checkunique => true);

Adding checkunique raised runtime from 58s to 276s, because it checks
visibility for every heap tuple.  It could do the heap fetch and visibility
check lazily, when the index yields two heap TIDs for one scan key.  That
should give zero visibility checks for this particular test case, and it
doesn't add visibility checks to bloated-table cases.  Pseudo-code:

/*---
 * scan_key is the last uniqueness-relevant scan key observed as
 * bt_check_level_from_leftmost() moves right to traverse the leaf 
level.
 * Will be NULL if the next tuple can't be the second tuple of a
 * uniqueness violation, because any of the following apply:
 * - we're evaluating the first leaf tuple of the entire index
 * - last scan key had anynullkeys (never forms a uniqueness violation 
w/
 *   any other scan key)
 */
scan_key = NULL;
/*
 * scan_key_known_visible==true indicates that scan_key_heap_tid is the
 * last _visible_ heap TID observed for scan_key.  Otherwise,
 * scan_key_heap_tid is the last heap TID observed for scan_key, and 
we've
 * not yet checked its visibility.
 */
bool scan_key_known_visible;
scan_key_heap_tid;
foreach itup (leftmost_leaf_level_tup .. rightmost_leaf_level_tup) {
if (itup.anynullkeys)
scan_key = NULL;
else if (scan_key != NULL &&
 _bt_compare(scan_k

Re: Add Index-level REINDEX with multiple jobs

2024-03-24 Thread Tom Lane
Alexander Korotkov  writes:
> Here goes the revised patch.  I'm going to push this if there are no 
> objections.

Quite a lot of the buildfarm is complaining about this:

reindexdb.c: In function 'reindex_one_database':
reindexdb.c:434:54: error: 'indices_tables_cell' may be used uninitialized in 
this function [-Werror=maybe-uninitialized]
  434 | strcmp(prev_index_table_name, indices_tables_cell->val) == 0)
  |   ~~~^

I noticed it first on mamba, which is set up with -Werror, but a
scrape of the buildfarm logs shows many other animals reporting this
as a warning.  I think they are right.  Even granting that the
compiler realizes that "parallel && process_type == REINDEX_INDEX" is
enough to reach the one place where indices_tables_cell is
initialized, that's not really enough, because that place is

if (indices_tables_list)
indices_tables_cell = indices_tables_list->head;

So I believe this code will crash if get_parallel_object_list returns
an empty list.  Initializing indices_tables_cell to NULL in its
declaration would stop the compiler warning, but if I'm right it
will do nothing to prevent that crash.  This needs a bit more effort.
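Concretely, that extra effort might look like this sketch (variable
names follow the quoted code; the early-return handling is
illustrative):

    SimpleStringListCell *indices_tables_cell = NULL;   /* quiets the warning */

    if (parallel && process_type == REINDEX_INDEX)
    {
        indices_tables_list = get_parallel_object_list(/* ... */);

        /* guard the crash: the list may come back NULL or empty */
        if (indices_tables_list == NULL || indices_tables_list->head == NULL)
            return true;                /* nothing to reindex -- illustrative */

        indices_tables_cell = indices_tables_list->head;
    }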

regards, tom lane




Re: Add Index-level REINDEX with multiple jobs

2024-03-24 Thread Richard Guo
On Mon, Mar 25, 2024 at 10:07 AM Tom Lane  wrote:

> Alexander Korotkov  writes:
> > Here goes the revised patch.  I'm going to push this if there are no
> objections.
>
> Quite a lot of the buildfarm is complaining about this:
>
> reindexdb.c: In function 'reindex_one_database':
> reindexdb.c:434:54: error: 'indices_tables_cell' may be used uninitialized
> in this function [-Werror=maybe-uninitialized]
>   434 | strcmp(prev_index_table_name, indices_tables_cell->val) == 0)
>   |   ~~~^
>
> I noticed it first on mamba, which is set up with -Werror, but a
> scrape of the buildfarm logs shows many other animals reporting this
> as a warning.


I noticed the similar warning on cfbot:
https://cirrus-ci.com/task/6298504306360320?logs=gcc_warning#L448

reindexdb.c: In function ‘reindex_one_database’:
reindexdb.c:437:24: error: ‘indices_tables_cell’ may be used uninitialized
in this function [-Werror=maybe-uninitialized]
  437 |indices_tables_cell = indices_tables_cell->next;
  |^~~

Although it's complaining on line 437 not 434, I think they are the same
issue.


> I think they are right.  Even granting that the
> compiler realizes that "parallel && process_type == REINDEX_INDEX" is
> enough to reach the one place where indices_tables_cell is
> initialized, that's not really enough, because that place is
>
> if (indices_tables_list)
> indices_tables_cell = indices_tables_list->head;
>
> So I believe this code will crash if get_parallel_object_list returns
> an empty list.  Initializing indices_tables_cell to NULL in its
> declaration would stop the compiler warning, but if I'm right it
> will do nothing to prevent that crash.  This needs a bit more effort.


Agreed.  And the comment of get_parallel_object_list() says that it may
indeed return NULL.

BTW, on line 373, it checks 'process_list' and bails out if this list is
NULL.  But it seems to me that 'process_list' cannot be NULL in this
case, because it's initialized to be 'user_list' and we have asserted
that user_list is not NULL on line 360.  I wonder if we should check
indices_tables_list instead of process_list on line 373.

Thanks
Richard


Re: add AVX2 support to simd.h

2024-03-24 Thread John Naylor
On Fri, Mar 22, 2024 at 12:09 AM Nathan Bossart
 wrote:
>
> On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:

> > If this were "<=" then the for long arrays we could assume there is
> > always more than one block, and wouldn't need to check if any elements
> > remain -- first block, then a single loop and it's done.
> >
> > The loop could also then be a "do while" since it doesn't have to
> > check the exit condition up front.
>
> Good idea.  That causes us to re-check all of the tail elements when the
> number of elements is evenly divisible by nelem_per_iteration, but that
> might be worth the trade-off.

Yeah, if there's no easy way to avoid that it's probably fine. I
wonder if we can subtract one first to force even multiples to round
down, although I admit I haven't thought through the consequences of
that.
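For what it's worth, the subtraction idea could be as simple as this
sketch (untested against the patch):

    /* Align down, but make exact multiples land one block earlier, so the
     * final overlapping check never re-reads an entire iteration's worth:
     * nelem = 64, nelem_per_iteration = 16  =>  tail_idx = 48 (not 64). */
    uint32 tail_idx = ((nelem - 1) / nelem_per_iteration) * nelem_per_iteration;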

> [v8]

Seems pretty good. It'd be good to see the results of 2- vs.
4-register before committing, because that might lead to some
restructuring, but maybe it won't, and v8 is already an improvement
over HEAD.

/* Process the remaining elements one at a time. */

This now does all of them if that path is taken, so "remaining" can be removed.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 10:13 AM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane  wrote:
> >> I think the point here is that if you start with an arbitrary
> >> non-negative shift value, the preceding loop may in fact decrement it
> >> down to something less than zero before exiting, in which case we
> >> would indeed have trouble.  I suspect that the code is making
> >> undocumented assumptions about the possible initial values of shift.
> >> Maybe some Asserts would be good?  Also, if we're effectively assuming
> >> that shift must be exactly zero here, why not let the compiler
> >> hard-code that?
>
> > Sounds like a good solution. I've attached the patch for that.
>
> Personally I'd put the Assert immediately after the loop, because
> it's not related to the "Reserve slot for the value" comment.
> Seems reasonable otherwise.
>

Thanks. Pushed the fix after moving the Assert.


Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: A problem about partitionwise join

2024-03-24 Thread Richard Guo
On Tue, Mar 19, 2024 at 3:40 PM Ashutosh Bapat 
wrote:

> On Tue, Mar 19, 2024 at 8:18 AM Richard Guo 
> wrote:
>
>> On Thu, Mar 7, 2024 at 7:13 PM Ashutosh Bapat <
>> ashutosh.bapat@gmail.com> wrote:
>>
>>> Approach
>>> 
>>> The equijoin condition between partition keys doesn't appear in the
>>> join's restrictilist because of 'best_score' strategy as you explained well
>>> in [2]. What if we add an extra score for clauses between partition keys
>>> and give preference to equijoin between partition keys? Have you given it a
>>> thought? I feel that having an equijoin clause involving partition keys has
>>> more usages compared to a clause with any random column. E.g. nextloop may
>>> be able to prune partitions from inner relation if the clause contains a
>>> partition key.
>>>
>>
>> Hmm, I think this approach won't work in cases where one certain pair of
>> partition keys has formed an EC that contains pseudoconstants.  In such
>> cases, the EC machinery will generate restriction clauses like 'pk =
>> const' rather than any join clauses.
>>
>
> That should be ok and more desirable. Clauses like pk = const will leave
> only one partition around in each of the joining relations thus PWJ won't
> be required OR it will be automatic - whichever way you see it.
>

No, that's not true.  There could be multiple partition keys, and the
particular key involved in the pushed-down restriction 'pk = const' may
not be able to prune away any partitions.  To be concrete, consider the
query:

create table p (k1 int, k2 int, val int) partition by range(k1, k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);

set enable_partitionwise_join to on;

explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2
and foo.k2 = 5;
   QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (foo.k1 = bar.k1)
   ->  Append
 ->  Seq Scan on p_1 foo_1
   Filter: (k2 = 5)
 ->  Seq Scan on p_2 foo_2
   Filter: (k2 = 5)
   ->  Hash
 ->  Append
   ->  Seq Scan on p_1 bar_1
 Filter: (k2 = 5)
   ->  Seq Scan on p_2 bar_2
 Filter: (k2 = 5)
(13 rows)

Thanks
Richard


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-24 Thread Amit Kapila
On Sun, Mar 24, 2024 at 3:05 PM Bharath Rupireddy
 wrote:
>
> On Sun, Mar 24, 2024 at 10:40 AM Amit Kapila  wrote:
> >
> > > For instance, setting last_inactive_time_1 to an invalid value fails
> > > with the following error:
> > >
> > > error running SQL: 'psql::1: ERROR:  invalid input syntax for
> > > type timestamp with time zone: "foo"
> > > LINE 1: SELECT last_inactive_time > 'foo'::timestamptz FROM pg_repli...
> > >
> >
> > It would be found at a later point. It would be probably better to
> > verify immediately after the test that fetches the last_inactive_time
> > value.
>
> Agree. I've added a few more checks explicitly to verify the
> last_inactive_time is sane with the following:
>
> qq[SELECT '$last_inactive_time'::timestamptz > to_timestamp(0)
> AND '$last_inactive_time'::timestamptz >
> '$slot_creation_time'::timestamptz;]
>

Such a test looks reasonable, but shall we add an equality to the second
part of the test (like '$last_inactive_time'::timestamptz >=
'$slot_creation_time'::timestamptz;)? This is just to be sure that even if
the test ran fast enough to give the same time, the test shouldn't fail. I
think it won't matter for correctness as well.


-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-24 Thread Amit Kapila
On Mon, Mar 25, 2024 at 9:48 AM Amit Kapila  wrote:
>
>
> Such a test looks reasonable, but shall we add an equality to the second
> part of the test (like '$last_inactive_time'::timestamptz >=
> '$slot_creation_time'::timestamptz;)? This is just to be sure that even if
> the test ran fast enough to give the same time, the test shouldn't fail. I
> think it won't matter for correctness as well.
>

Apart from this, I have made minor changes in the comments. See and
let me know what you think of attached.

-- 
With Regards,
Amit Kapila.
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 2b36b5fef1..5f4165a945 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2529,8 +2529,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
   
 The time at which the slot became inactive.
-NULL if the slot is currently actively being
-used.
+NULL if the slot is currently being used.
   
  
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 0f48d6dc7c..77cb633812 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -623,7 +623,7 @@ retry:
if (SlotIsLogical(s))
pgstat_acquire_replslot(s);
 
-   /* The slot is active by now, so reset the last inactive time. */
+   /* Reset the last inactive time as the slot is active now. */
SpinLockAcquire(&s->mutex);
s->last_inactive_time = 0;
SpinLockRelease(&s->mutex);
@@ -687,8 +687,8 @@ ReplicationSlotRelease(void)
}
 
/*
-* Set the last inactive time after marking slot inactive. We get current
-* time beforehand to avoid system call while holding the lock.
+* Set the last inactive time after marking the slot inactive. We get the
+* current time beforehand to avoid a system call while holding the lock.
 */
now = GetCurrentTimestamp();
 
@@ -2363,9 +2363,9 @@ RestoreSlotFromDisk(const char *name)
slot->active_pid = 0;
 
/*
-* We set last inactive time after loading the slot from the disk into
-* memory. Whoever acquires the slot i.e. makes the slot active will
-* anyway reset it.
+* We set the last inactive time after loading the slot from the disk
+* into memory. Whoever acquires the slot i.e. makes the slot active
+* will reset it.
 */
slot->last_inactive_time = GetCurrentTimestamp();
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2f18433ecc..eefd7abd39 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -202,7 +202,7 @@ typedef struct ReplicationSlot
 */
XLogRecPtr  last_saved_confirmed_flush;
 
-   /* The time at which this slot become inactive */
+   /* The time at which this slot becomes inactive */
TimestampTz last_inactive_time;
 } ReplicationSlot;
 
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index bff84cc9c4..81bd36f5d8 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -411,7 +411,7 @@ $node_primary3->stop;
 $node_standby3->stop;
 
 # =
-# Testcase start: Check last_inactive_time property of streaming standby's slot
+# Testcase start: Check last_inactive_time property of the streaming standby's 
slot
 #
 
 # Initialize primary node
@@ -440,8 +440,8 @@ $primary4->safe_psql(
 SELECT pg_create_physical_replication_slot(slot_name := '$sb4_slot');
 ]);
 
-# Get last_inactive_time value after slot's creation. Note that the slot is still
-# inactive unless it's used by the standby below.
+# Get last_inactive_time value after the slot's creation. Note that the slot
+# is still inactive till it's used by the standby below.
 my $last_inactive_time = $primary4->safe_psql('postgres',
	qq(SELECT last_inactive_time FROM pg_replication_slots WHERE slot_name = '$sb4_slot' AND last_inactive_time IS NOT NULL;)
 );
@@ -470,8 +470,8 @@ is( $primary4->safe_psql(
 # Stop the standby to check its last_inactive_time value is updated
 $standby4->stop;
 
-# Let's also restart the primary so that the last_inactive_time is set upon
-# loading the slot from disk.
+# Let's restart the primary so that the last_inactive_time is set upon
+# loading the slot from the disk.
 $primary4->restart;
 
 is( $primary4->safe_psql(
@@ -483,11 +483,11 @@ is( $primary4->safe_psql(
 
 $standby4->stop;
 
-# Testcase end: Check last_inactive_time property of streaming standby's slot
+# Testcase end: Check last_inactive_time property of the streaming standby's slot
 # =
 
 # ===

Re: Improve readability by using designated initializers when possible

2024-03-24 Thread jian he
looking through v4 again.
v4 looks good to me.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-24 Thread shveta malik
On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy
 wrote:
>
> I've attached the v18 patch set here.

Thanks for the patches. Please find few comments:

patch 001:


1)
slot.h:

+ /* The time at which this slot become inactive */
+ TimestampTz last_inactive_time;

become -->became

-
patch 002:

2)
slotsync.c:

  ReplicationSlotCreate(remote_slot->name, true, RS_TEMPORARY,
remote_slot->two_phase,
remote_slot->failover,
-   true);
+   true, 0);

+ slot->data.inactive_timeout = remote_slot->inactive_timeout;

Is there a reason we are not passing 'remote_slot->inactive_timeout'
to ReplicationSlotCreate() directly?

-

3)
slotfuncs.c
pg_create_logical_replication_slot():
+ int inactive_timeout = PG_GETARG_INT32(5);

Can we mention here that timeout is in seconds either in comment or
rename variable to inactive_timeout_secs?

Please do this for create_physical_replication_slot(),
create_logical_replication_slot(),
pg_create_physical_replication_slot() as well.

-
4)
+ int inactive_timeout; /* The amount of time in seconds the slot
+ * is allowed to be inactive. */
 } LogicalSlotInfo;

Do we need to mention "before getting invalidated" like in other places
(in the last patch)?

--

 5)
Same at these two places: "before getting invalidated" should be added in
the last patch, otherwise the info is incomplete.

+
+ /* The amount of time in seconds the slot is allowed to be inactive */
+ int inactive_timeout;
 } ReplicationSlotPersistentData;


+ * inactive_timeout: The amount of time in seconds the slot is allowed to be
+ * inactive.
  */
 void
 ReplicationSlotCreate(const char *name, bool db_specific,
 Same here. "before getting invalidated" ?



Reviewing more..

thanks
Shveta




Re: Add new error_action COPY ON_ERROR "log"

2024-03-24 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 11:02 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 13, 2024 at 11:09 AM Michael Paquier  wrote:
> >
> > Hmm.  This NOTICE is really bugging me.  It is true that the clients
> > would get more information, but the information is duplicated on the
> > server side because the error context provides the same information as
> > the NOTICE itself:
> > NOTICE:  data type incompatibility at line 1 for column "a"
> > CONTEXT:  COPY aa, line 1, column a: "a"
> > STATEMENT:  copy aa from stdin with (on_error ignore, log_verbosity 
> > verbose);
>
> Yes, if wanted, clients can also get the CONTEXT - for instance, using
> '\set SHOW_CONTEXT always' in psql.
>
> I think we can enhance the NOTICE message to include the column value
> (just like CONTEXT message is showing) and leverage relname_only to
> emit only the relation name in the CONTEXT message.
>
> /*
>  * We suppress error context information other than the relation name,
>  * if one of the operations below fails.
>  */
> Assert(!cstate->relname_only);
> cstate->relname_only = true;
>
> I'm attaching the v8 patch set implementing the above idea. With this,
> [1] is sent to the client, [2] is sent to the server log. This
> approach not only reduces the duplicate info in the NOTICE and CONTEXT
> messages, but also makes it easy for users to get all the necessary
> info in the NOTICE message without having to set extra parameters to
> get CONTEXT message.
>
> Another idea is to move even the table name to NOTICE message and hide
> the context with errhidecontext when we emit the new NOTICE messages.
>
> Thoughts?
>

The current approach, eliminating the duplicated information in
CONTEXT, seems good to me.

One question about the latest (v8) patch:

+   else
+   ereport(NOTICE,
+   errmsg("data type incompatibility at
line %llu for column %s: null input",
+  (unsigned long long) cstate->cur_lineno,
+  cstate->cur_attname));
+

How can we reach this path? It seems we don't cover this path by the tests.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: make dist using git archive

2024-03-24 Thread Peter Eisentraut

On 24.03.24 16:42, Tristan Partin wrote:
You might find Meson's string formatting syntax creates a more readable 
command string:


'tar.tar.bz2.command=@0@ -c'.format(bzip2.path())

And then 'install: false' is the default if you feel like leaving it out.

Otherwise, let's get this in!


Done and committed.





RE: speed up a logical replica setup

2024-03-24 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

I also want to share my opinion, just in case.

> On Thu, Mar 21, 2024 at 8:00 PM Euler Taveira  wrote:
> >
> > On Thu, Mar 21, 2024, at 10:33 AM, vignesh C wrote:
> >
> > If we commit this we might not be able to change the way the option
> > behaves once the customers starts using it. How about removing these
> > options in the first version and adding it in the next version after
> > more discussion.
> >
> >
> > We don't need to redesign this one if we want to add a format string in a 
> > next
> > version. A long time ago, pg_dump started to accept pattern for tables 
> > without
> > breaking or deprecating the -t option. If you have 100 databases and you 
> > don't
> > want to specify the options or use a script to generate it for you, you also
> > have the option to let pg_createsubscriber generate the object names for 
> > you.
> > Per my experience, it will be a rare case.
> >
> 
> But, why go with some UI in the first place which we don't think is a
> good one, or at least don't have a broader agreement that it is a good
> one? The problem with self-generated names for users could be that
> they won't be able to make much out of it. If one has to always use
> those internally then probably that would be acceptable. I would
> prefer what Tomas proposed a few emails ago as compared to this one as
> that has fewer options to be provided by users but still, they can
> later identify objects. But surely, we should discuss if you or others
> have better alternatives.

IIUC, the added options were inspired by Tomas. And he said the limitation
(pub/sub/slot name cannot be specified) was acceptable for the first version.
I agree with him. (To be honest, I feel that options should be fewer for the
first release.)

> The user choosing to convert a physical standby to a subscriber would
> in some cases be interested in converting it for all the databases
> (say for the case of upgrade [1]) and giving options for each database
> would be cumbersome for her.

+1 for the primary use case.

> > Currently dry-run will do the check and might fail on identifying a
> > few failures like after checking subscriber configurations. Then the
> > user will have to correct the configuration and re-run then fix the
> > next set of failures. Whereas the suggest-config will display all the
> > optimal configuration for both the primary and the standby in a single
> > shot. This is not a must in the first version, it can be done as a
> > subsequent enhancement.
> >
> >
> > Do you meant publisher, right? Per order, check_subscriber is done before
> > check_publisher and it checks all settings on the subscriber before 
> > exiting. In
> > v30, I changed the way it provides the required settings. In a previous 
> > version,
> > it fails when it found a wrong setting; the current version, check all 
> > settings
> > from that server before providing a suitable error.
> >
> > pg_createsubscriber: checking settings on publisher
> > pg_createsubscriber: primary has replication slot "physical_slot"
> > pg_createsubscriber: error: publisher requires wal_level >= logical
> > pg_createsubscriber: error: publisher requires 2 replication slots, but only 0 remain
> > pg_createsubscriber: hint: Consider increasing max_replication_slots to at least 3.
> > pg_createsubscriber: error: publisher requires 2 wal sender processes, but only 0 remain
> > pg_createsubscriber: hint: Consider increasing max_wal_senders to at least 3.
> >
> > If you have such an error, you will fix them all and rerun using dry run 
> > mode
> > again to verify everything is ok. I don't have a strong preference about 
> > it. It
> > can be changed easily (unifying the check functions or providing a return 
> > for
> > each of the check functions).
> >
> 
> We can unify the checks but not sure if it is worth it. I am fine
> either way. It would have been better if we provided a way for a user
> to know the tool's requirement rather than letting him know via errors
> but I think it should be okay to extend it later as well.

Both ways are OK, but I prefer to unify the checks a bit. The number of
working modes in the same executable should be reduced as much as possible.

Also, I feel the current specification is acceptable. pg_upgrade checks one by
one and exits early in bad cases. If both the old and new clusters have issues,
the first run only reports the old one's issues. After the DBA fixes them and
runs again, issues on the new cluster are reported.

[1]: https://www.postgresql.org/message-id/8d52c226-7e34-44f7-a919-759bf8d81541%40enterprisedb.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Properly pathify the union planner

2024-03-24 Thread Richard Guo
On Mon, Mar 25, 2024 at 9:44 AM David Rowley  wrote:

> It seems ok that
> the ec_indexes are not set for the top-level set RelOptInfo as
> get_eclass_for_sort_expr() does not make use of ec_indexes while
> searching for an existing EquivalenceClass.  Maybe we should fix this
> varno == 0 hack and adjust get_eclass_for_sort_expr() so that it makes
> use of the ec_indexes.
>
> It's possible to see this happen with a query such as:
>
> SELECT oid FROM pg_class UNION SELECT oid FROM pg_class ORDER BY oid;


I see what you said.  Yeah, there might be some optimization
possibilities in this area.  And I agree that this should not be a
blocker in pushing this patch.


> I didn't see that as a reason not to push this patch as this occurs
> both with and without this change, so I've now pushed this patch.


Great to see this patch has been pushed!

Thanks
Richard


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-24 Thread shveta malik
On Mon, Mar 25, 2024 at 10:33 AM shveta malik  wrote:
>
> On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy
>  wrote:
> >
> > I've attached the v18 patch set here.
>

I have a question. When we create a subscription on an existing slot
that has a non-null 'inactive_timeout' set, shouldn't the slot's
'inactive_timeout' be retained even after the subscription is created?

I tried this:

===
--On publisher, create slot with 120sec inactive_timeout:
SELECT * FROM pg_create_logical_replication_slot('logical_slot1',
'pgoutput', false, true, true, 120);

--On subscriber, create sub using logical_slot1
create subscription mysubnew1_1  connection 'dbname=newdb1
host=localhost user=shveta port=5433' publication mypubnew1_1 WITH
(failover = true, create_slot=false, slot_name='logical_slot1');

--Before creating sub, pg_replication_slots output:
   slot_name   | failover | synced | active | temp | conf |               lat                | inactive_timeout
---------------+----------+--------+--------+------+------+----------------------------------+------------------
 logical_slot1 | t        | f      | f      | f    | f    | 2024-03-25 11:11:55.375736+05:30 |              120

--After creating sub pg_replication_slots output:  (inactive_timeout is 0 now):
   slot_name   | failover | synced | active | temp | conf | lat | inactive_timeout
---------------+----------+--------+--------+------+------+-----+------------------
 logical_slot1 | t        | f      | t      | f    | f    |     |                0
===

In CreateSubscription, we call 'walrcv_alter_slot()' /
'ReplicationSlotAlter()' when create_slot is false. This call ends up
setting inactive_timeout from 120sec to 0. Is it intentional?

thanks
Shveta




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread John Naylor
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 21, 2024 at 7:48 PM John Naylor  wrote:

> > v77-0001
> >
> > - dead_items = (VacDeadItems *) 
> > palloc(vac_max_items_to_alloc_size(max_items));
> > - dead_items->max_items = max_items;
> > - dead_items->num_items = 0;
> > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> > +
> > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> > + dead_items_info->max_bytes = vac_work_mem * 1024L;
> >
> > This is confusing enough that it looks like a bug:
> >
> > [inside TidStoreCreate()]
> > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> > while (16 * maxBlockSize > max_bytes * 1024L)
> > maxBlockSize >>= 1;
> >
> > This was copied from CreateWorkExprContext, which operates directly on
> > work_mem -- if the parameter is actually bytes, we can't "* 1024"
> > here. If we're passing something measured in kilobytes, the parameter
> > is badly named. Let's use convert once and use bytes everywhere.
>
> True. The attached 0001 patch fixes it.

v78-0001 and 02 are fine, but for 0003 there is a consequence that I
didn't see mentioned: vac_work_mem now refers to bytes, where before
it referred to kilobytes. It seems pretty confusing to use a different
convention from elsewhere, especially if it has the same name but
different meaning across versions. Worse, this change is buried inside
a moving-stuff-around diff, making it hard to see. Maybe "convert only
once" is still possible, but I was actually thinking of

+ dead_items_info->max_bytes = vac_work_mem * 1024L;
+ vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);

That way it's pretty obvious that it's correct. That may require a bit
of duplication and moving around for shmem, but there is some of that
already.
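That is, convert at the boundary and keep bytes below; a sketch
combining the two quoted fragments:

    /* vac_work_mem is in kilobytes (GUC convention); convert exactly once */
    dead_items_info->max_bytes = vac_work_mem * 1024L;
    vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);

    /* ... and inside TidStoreCreate(), max_bytes is already in bytes: */
    while (16 * maxBlockSize > max_bytes)
        maxBlockSize >>= 1;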

More on 0003:

- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs

+ * autovacuum_work_mem) memory space to keep track of dead TIDs.  If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum

I wonder if the comments here should refer to it using a more natural
spelling, like "TID store".

- * items in the dead_items array for later vacuuming, count live and
+ * items in the dead_items for later vacuuming, count live and

Maybe "the dead_items area", or "the dead_items store" or "in dead_items"?

- * remaining LP_DEAD line pointers on the page in the dead_items
- * array. These dead items include those pruned by lazy_scan_prune()
- * as well we line pointers previously marked LP_DEAD.
+ * remaining LP_DEAD line pointers on the page in the dead_items.
+ * These dead items include those pruned by lazy_scan_prune() as well
+ * we line pointers previously marked LP_DEAD.

Here maybe "into dead_items".

Also, "we line pointers" seems to be a pre-existing typo.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
in %u pages",
+ vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages)));

This is a translated message, so let's keep the message the same.

/*
 * Allocate dead_items (either using palloc, or in dynamic shared memory).
 * Sets dead_items in vacrel for caller.
 *
 * Also handles parallel initialization as part of allocating dead_items in
 * DSM when required.
 */
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)

This comment didn't change at all. It's not wrong, but let's consider
updating the specifics.

v78-0004:

> > +#define dsa_create(tranch_id) \
> > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
> >
> > Since these macros are now referring to defaults, maybe their name
> > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
> > (*_MAX_*)
>
> It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that
> the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current
> name also makes sense to me.

Right, that makes sense.

v78-0005:

"Although commit XXX
allowed specifying the initial and maximum DSA segment sizes, callers
still needed to clamp their own limits, which was not consistent and
user-friendly."

Perhaps s/still needed/would have needed/ ..., since we're preventing
that necessity.

> > Did you try it with 1MB m_w_m?
>
> I've incorporated the above comments and test results look good to me.

Could you be more specific about what the test was?
Does it work with 1MB m_w_m?

+ /*
+ * Choose the initial and maximum DSA segment sizes to be no longer
+ * than 1/16 and 1/8 of max_bytes, respectively. If the initial
+ * segment size is low, we end up having many segments, which risks
+ * exceeding the total number of segments the platform can have.

The second sentence is technically correct, but I'm not sure how it
relates to the code.

Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2024-03-24 Thread Donghang Lin
Hi Tomas,
>
> On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra  wrote:
> Hi David,
>
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.
>
> I think there are two things that may need an improvement:
>
> 1) Storing variable-length data in ParallelBitmapHeapState
>
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
>
>  - snapshot_off
>  - stats_off
>
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
>
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.

This issue is now gone after Heikki's fix.

> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
> Heap Blocks: exact=102 lossy=10995
> Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
>Heap Blocks: exact=207 lossy=19354
> Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
>Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> where we *do* add the values to the leader:
>
> Heap Blocks: exact=125 lossy=10789
> Buffers: shared hit=11 read=45420
> Worker 0:  actual time=51.419..221.904 rows=150437 loops=1
>   Heap Blocks: exact=136 lossy=13541
>   Buffers: shared hit=4 read=13541
> Worker 1:  actual time=56.610..222.469 rows=229738 loops=1
>   Heap Blocks: exact=209 lossy=20655
>   Buffers: shared hit=4 read=20655
>
> Here it's not entirely obvious, because leader participates in the
> execution, but once we disable leader participation, it's clearer:
>
> Buffers: shared hit=7 read=45421
> Worker 0:  actual time=28.540..247.683 rows=309112 loops=1
>   Heap Blocks: exact=282 lossy=27806
>   Buffers: shared hit=4 read=28241
> Worker 1:  actual time=24.290..251.993 rows=190815 loops=1
>   Heap Blocks: exact=188 lossy=17179
>   Buffers: shared hit=3 read=17180
>
> Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> Blocks" simply disappeared because the leader does nothing and we don't
> print zeros.

Heap Blocks is specific to Bitmap Heap Scan. It seems that node-specific
stats are not aggregated from the workers into the leader for some
existing nodes either. For example, the Memoize node does this for Hits,
Misses, etc.:

   ->  Nested Loop (actual rows=17 loops=3)
 ->  Parallel Seq Scan on t (actual rows=3 loops=3)
 ->  Memoize (actual rows=5 loops=10)
   Cache Key: t.j
   Cache Mode: logical
   Hits: 32991  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
   Worker 0:  Hits: 33551  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
   Worker 1:  Hits: 33443  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
   ->  Index Scan using uj on u (actual rows=5 loops=15)
 Index Cond: (j = t.j)

Sort and HashAggregate behave the same way.
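For anyone who wants to reproduce plans like the ones quoted here, a
sketch along these lines should do it (table name, data volume, and
settings are illustrative; the small work_mem is what makes the bitmap go
lossy, and disabling leader participation makes the per-worker lines easy
to read):

CREATE TABLE t (a int, b text);
INSERT INTO t SELECT g % 100000, md5(g::text)
FROM generate_series(1, 5000000) g;
CREATE INDEX ON t (a);
SET enable_seqscan = off;
SET enable_indexscan = off;               -- leave only the bitmap path
SET enable_indexonlyscan = off;
SET work_mem = '64kB';                    -- force lossy bitmap pages
SET max_parallel_workers_per_gather = 2;
SET parallel_leader_participation = off;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT count(*) FROM t WHERE a < 5000;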

> 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> correct. Consider this:
>
> EXPLAIN (ANALYZE):
>
>->  Parallel Bitmap Heap Scan on t  (...)
>  Recheck Cond: (a < 5000)
>  Rows Removed by Index Recheck: 246882
>  Worker 0:  Heap Blocks: exact=168 lossy=15648
>  Worker 1:  Heap Blocks: exact=302 lossy=29337
>
> EXPLAIN (ANALYZE, VERBOSE):
>
>->  Parallel Bitmap Heap Scan on public.t  (...)
>  Recheck Cond: (t.a < 5000)
>  Rows Removed by Index Recheck: 246882
>  Worker 0:  actual time=35.067..300.882 rows=282108 loops=1
>Heap Blocks: exact=257 lossy=25358
>  Worker 1:  actual time=32.827..302.224 rows=217819 loops=1
>Heap Blocks: exact=213 lossy=19627
>
> EXPLAIN (ANALYZE, BUFFERS):
>
>->  Parallel Bitmap Heap Scan on t  (...)
>  Recheck Cond: (a < 5000)
>  Rows Removed by Index Recheck: 246882
>  Buffers: shared hit=7 read=45421
>  Worker 0:  Heap Blocks: exact=236 lossy=21870
>  Worker 1:  Heap Blocks: exact=234 lossy=231

Re: altering a column's collation leaves an invalid foreign key

2024-03-24 Thread Paul Jungwirth

On 3/23/24 10:04, Paul Jungwirth wrote:
> Perhaps if the previous collation was nondeterministic we should force
> a re-check.

Here is a patch implementing this. It was a bit more fuss than I
expected, so maybe someone has a better way.

We have had nondeterministic collations since v12, so perhaps it is
something to back-patch?
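To make the failure mode concrete, here is a sketch of the scenario the
patch addresses (collation and table names are made up, and it assumes an
ICU build):

CREATE COLLATION case_insensitive (
    provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE TABLE parent (t text COLLATE case_insensitive PRIMARY KEY);
CREATE TABLE child  (t text COLLATE case_insensitive REFERENCES parent);
INSERT INTO parent VALUES ('a');
INSERT INTO child  VALUES ('A');  -- ok: 'A' = 'a' under case_insensitive
-- Without a re-check, this succeeds and leaves child('A') with no
-- matching parent row under the new deterministic collation:
ALTER TABLE parent ALTER COLUMN t TYPE text COLLATE "C";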

Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com

From a8078f85859400f4e4cffb57d8ec1b069e46bfb8 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" 
Date: Sun, 24 Mar 2024 23:37:48 -0700
Subject: [PATCH v1] Re-check foreign key if referenced collation was
 nondeterministic

With deterministic collations, we break ties by looking at binary
equality. But nondeterministic collations can allow non-identical values
to be considered equal, e.g. 'a' and 'A' when case-insensitive. So some
references that used to be valid may not be anymore.
---
 src/backend/commands/tablecmds.c           | 96 ++++++++++++++++++++----
 src/include/nodes/parsenodes.h             |  2 +
 .../regress/expected/collate.icu.utf8.out  |  9 ++
 src/test/regress/sql/collate.icu.utf8.sql  |  8 ++
 4 files changed, 103 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 71740984f33..940b1845edb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -194,6 +194,7 @@ typedef struct AlteredTableInfo
 	/* Objects to rebuild after completing ALTER TYPE operations */
 	List	   *changedConstraintOids;	/* OIDs of constraints to rebuild */
 	List	   *changedConstraintDefs;	/* string definitions of same */
+	List	   *changedCollationOids;	/* collation of each attnum */
 	List	   *changedIndexOids;	/* OIDs of indexes to rebuild */
 	List	   *changedIndexDefs;	/* string definitions of same */
 	char	   *replicaIdentityIndex;	/* index to reset as REPLICA IDENTITY */
@@ -572,6 +573,7 @@ static ObjectAddress ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
 		   AlterTableCmd *cmd, LOCKMODE lockmode);
 static void RememberAllDependentForRebuilding(AlteredTableInfo *tab, AlterTableType subtype,
 			  Relation rel, AttrNumber attnum, const char *colName);
+static void RememberCollationForRebuilding(AttrNumber attnum, AlteredTableInfo *tab);
 static void RememberConstraintForRebuilding(Oid conoid, AlteredTableInfo *tab);
 static void RememberIndexForRebuilding(Oid indoid, AlteredTableInfo *tab);
 static void RememberStatisticsForRebuilding(Oid stxoid, AlteredTableInfo *tab);
@@ -579,12 +581,12 @@ static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab,
    LOCKMODE lockmode);
 static void ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId,
  char *cmd, List **wqueue, LOCKMODE lockmode,
- bool rewrite);
+ bool rewrite, List *collationOids);
 static void RebuildConstraintComment(AlteredTableInfo *tab, AlterTablePass pass,
 	 Oid objid, Relation rel, List *domname,
 	 const char *conname);
 static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
-static void TryReuseForeignKey(Oid oldId, Constraint *con);
+static void TryReuseForeignKey(Oid oldId, Constraint *con, List *changedCollationOids);
 static ObjectAddress ATExecAlterColumnGenericOptions(Relation rel, const char *colName,
 	 List *options, LOCKMODE lockmode);
 static void change_owner_fix_column_acls(Oid relationOid,
@@ -9826,6 +9828,7 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	bool		old_check_ok;
 	ObjectAddress address;
 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);
+	ListCell   *old_collation_item = list_head(fkconstraint->old_collations);
 
 	/*
 	 * Grab ShareRowExclusiveLock on the pk table, so that someone doesn't
@@ -10212,9 +10215,15 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 			CoercionPathType new_pathtype;
 			Oid			old_castfunc;
 			Oid			new_castfunc;
+			Oid			old_collation;
 			Form_pg_attribute attr = TupleDescAttr(tab->oldDesc,
    fkattnum[i] - 1);
 
+			/* Get the collation for this key part. */
+			old_collation = lfirst_oid(old_collation_item);
+			old_collation_item = lnext(fkconstraint->old_collations,
+	   old_collation_item);
+
 			/*
 			 * Identify coercion pathways from each of the old and new FK-side
 			 * column types to the right (foreign) operand type of the pfeqop.
@@ -10250,9 +10259,12 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 			 * turn conform to the domain.  Consequently, we need not treat
 			 * domains specially here.
 			 *
-			 * Since we require that all collations share the same notion of
-			 * equality (which they do, because texteq reduces to bitwise
-			 * equality), we don't compare collation here.
+			 * All deterministic collations use bitwise equality to resolve
+			 * ties, but if the previous collation was nondeterministic,
+			 * we must force a re-check.

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-24 Thread Bharath Rupireddy
On Mon, Mar 25, 2024 at 10:28 AM Amit Kapila  wrote:
>
> On Mon, Mar 25, 2024 at 9:48 AM Amit Kapila  wrote:
> >
> >
> > Such a test looks reasonable, but shall we add an equals to the second
> > part of the test (like '$last_inactive_time'::timestamptz >=
> > '$slot_creation_time'::timestamptz;)? This is just to be sure that even
> > if the test ran fast enough to give the same time, the test shouldn't
> > fail. I think it won't matter for correctness either way.

Agreed. I added that in the v19 patch. I had the same concern, which is
why I didn't capture current_timestamp together with the slot creation,
as in the statement below: the two timestamps could come out the same
(or nearly the same). Instead, I capture current_timestamp in a separate
query rather than combining it with pg_create_physical_replication_slot.

SELECT current_timestamp FROM pg_create_physical_replication_slot('foo');
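What the test does instead is the equivalent of this (a sketch; the
placeholder stands for the timestamp captured by the first query):

SELECT current_timestamp;                           -- capture the bound first
SELECT pg_create_physical_replication_slot('foo');  -- then create the slot
-- later, the >= comparison is safe even on a fast machine:
SELECT last_inactive_time >= '<captured timestamp>'::timestamptz
FROM pg_replication_slots WHERE slot_name = 'foo';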

> Apart from this, I have made minor changes in the comments. See and
> let me know what you think of attached.

LGTM. I've merged the diff into v19 patch.

Please find the attached v19 patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v19-0001-Track-last_inactive_time-in-pg_replication_slots.patch
Description: Binary data