Re: Sort optimizations: Making in-memory sort cache-aware

2023-02-12 Thread Ankit Kumar Pandey




On 12/02/23 01:59, Andres Freund wrote:



However, tuplesort.c completely breaks that for > 1 column
sorts. While spatial locality for accesses to the ->memtuples array is decent
during sorting, due to qsort's subdividing of the problem, the locality for
access to the tuples is *awful*.



The memtuples array is reordered while sorting, but the tuples themselves
aren't. Unless the input data is vaguely presorted, the access pattern for the
tuples has practically zero locality.


This probably explains the issue.



One idea is to keep track of the distinctness of the first column sorted and
to behave differently if it's significantly lower than the number of to be
sorted tuples.  E.g. by doing a first sort solely on the first column, then
reorder the MinimalTuples in memory, and then continue normally.



There's two main problems with that idea:
1) It's hard to re-order the tuples in memory, without needing substantial
  amounts of additional memory
2) If the second column also is not very distinct, it doesn't buy you much, if
  anything.



But it might provide sufficient benefits regardless. And a naive version,
requiring additional memory, should be quick to hack up.


I get the second point (to reorder MinimalTuples by sorting on the first column), but
I am not sure how we can keep `track of the distinctness of the first column`?
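
One precedent I can see in the tree: the abbreviated-key support estimates the
cardinality of the leading key with a HyperLogLog sketch to decide whether to
abort abbreviation, so perhaps something similar could be reused here. A rough
sketch (the hyperloglog calls exist in src/include/lib/hyperloglog.h; the hook
points into tuplesort are invented, and hashing the raw Datum is only valid for
by-value types):

#include "common/hashfn.h"
#include "lib/hyperloglog.h"

/*
 * Hypothetical sketch: track the cardinality of the leading sort key with
 * a HyperLogLog state, the same trick the abbreviated-key abort logic uses.
 */
static hyperLogLogState leading_key_card;

static void
leading_key_track_init(void)
{
	initHyperLogLog(&leading_key_card, 10);	/* 2^10 registers */
}

static void
leading_key_track_add(Datum leading_datum)
{
	uint32		h = DatumGetUInt32(hash_any((const unsigned char *) &leading_datum,
											sizeof(Datum)));

	addHyperLogLog(&leading_key_card, h);
}

static bool
leading_key_is_low_cardinality(double ntuples)
{
	/* the "significantly lower" threshold is picked arbitrarily here */
	return estimateHyperLogLog(&leading_key_card) < ntuples / 2.0;
}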



Most sorting papers don't discuss
variable-width data, nor a substantial amount of cache-polluting work while
gathering the data that needs to be sorted.


I definitely agree with this.



My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
just those using the existing code, allocate new memory for all the
corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
that, obviously in ->memtuples order.


This should be an easy hack, and we can easily profile the benefits from it.


It's very likely we can do better than just doing a plain sort of everything
after that.
You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a 
heapsort between the presorted blocks.


This is very interesting. It is actually what a few papers have suggested: do the
sort in blocks and then merge (while sorting) the presorted blocks.
I am a bit fuzzy on the implementation of this (whether it is the same as what I am
thinking), but this is definitely what I was looking for.
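
To make sure I understand the shape of it, here is a generic sketch of a
heap-based merge of presorted blocks (plain C on int arrays, nothing
tuplesort-specific; it assumes every block is non-empty):

#include <stddef.h>

typedef struct BlockCursor
{
	const int  *elems;			/* one presorted block */
	size_t		pos;			/* next element to emit */
	size_t		len;
} BlockCursor;

static void
sift_down(BlockCursor *heap, size_t n, size_t i)
{
	for (;;)
	{
		size_t		l = 2 * i + 1,
					r = l + 1,
					s = i;

		if (l < n && heap[l].elems[heap[l].pos] < heap[s].elems[heap[s].pos])
			s = l;
		if (r < n && heap[r].elems[heap[r].pos] < heap[s].elems[heap[s].pos])
			s = r;
		if (s == i)
			return;

		BlockCursor tmp = heap[i];

		heap[i] = heap[s];
		heap[s] = tmp;
		i = s;
	}
}

/* Merge nblocks non-empty presorted blocks into out[]. */
static void
merge_presorted(BlockCursor *heap, size_t nblocks, int *out)
{
	size_t		n = nblocks;

	for (size_t i = n / 2; i-- > 0;)	/* heapify */
		sift_down(heap, n, i);

	while (n > 0)
	{
		*out++ = heap[0].elems[heap[0].pos++];	/* pop smallest */
		if (heap[0].pos == heap[0].len)
			heap[0] = heap[--n];	/* block exhausted: drop its cursor */
		sift_down(heap, n, 0);
	}
}

Each pop is O(log nblocks), and consecutive reads come from a bounded set of
presorted runs, which is where the locality win should come from.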



A related, but separate, improvement is to reduce / remove the memory
allocation overhead. The switch to GenerationContext helped some, but still
leaves a bunch of overhead. And it's not used for bounded sorts right now.
We don't palloc/pfree individual tuples during normal sorts, but we do have
some, for bounded sorts.  I think with a reasonable amount of work we could
avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
allocator, getting rid of all allocator overhead.



The biggest source of individual pfree()s in the bounded case is that we
unconditionally copy the tuple into base->tuplecontext during puttuple. Even
though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.



We could instead pre-check that the tuple won't immediately be discarded,
before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
course.


This looks doable; I will try this.
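
My reading of the suggestion, as a hypothetical sketch against
puttuple_common()'s dispatch ("stup" stands for the incoming SortTuple before
anything has been copied into base->tuplecontext; the plumbing is glossed
over):

case TSS_BOUNDED:
	if (COMPARETUP(state, &stup, &state->memtuples[0]) <= 0)
	{
		/*
		 * The bounded heap uses a reversed comparator, so memtuples[0]
		 * is the "worst" tuple currently kept.  A tuple comparing <= it
		 * would be discarded immediately anyway, so we can skip the
		 * copy into tuplecontext (and the later pfree) entirely.
		 */
		break;
	}
	/* otherwise copy the tuple and replace the heap top, as today */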


I think we also can replace the individual freeing of tuples in
tuplesort_heap_replace_top(), by allowing a number of dead tuples to
accumulate (up to work_mem/2 maybe), and then copying the still living tuples
into new memory context, freeing the old one.



While that doesn't sound cheap, in bounded sorts the number of tuples commonly
is quite limited, the pre-check before copying the tuple will prevent this
from occurring too often, the copy will result in higher locality, and, most
importantly, the reduced palloc() overhead (~25% or so with aset.c) will
result in a considerably higher cache hit ratio / lower memory usage.


I will try this as well.
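
For the trivial bump allocator mentioned earlier, a generic sketch of the idea
(not wired into MemoryContext; just to illustrate why it removes the per-chunk
overhead):

#include <stddef.h>

typedef struct BumpContext
{
	char	   *base;			/* one big block */
	size_t		used;			/* bytes handed out so far */
	size_t		capacity;
} BumpContext;

/* Allocation is a pointer bump: no chunk header, no free list. */
static void *
bump_alloc(BumpContext *cxt, size_t size)
{
	size = (size + 7) & ~(size_t) 7;	/* preserve 8-byte alignment */
	if (cxt->used + size > cxt->capacity)
		return NULL;			/* caller decides: grow, spill, or fail */

	void	   *p = cxt->base + cxt->used;

	cxt->used += size;
	return p;
}

/* Individual chunks are never freed; the whole context resets at once. */
static void
bump_reset(BumpContext *cxt)
{
	cxt->used = 0;
}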


There are things we could do to improve upon this that don't require swapping
out our sorting implementation wholesale.


Thanks a lot Andres, these are lots of pointers to work on (without a major
overhaul of the sort implementation and with a potentially good amount of
improvement). I will give these a try and see if I can get some performance
gains.

Thanks,
Ankit





Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-02-12 Thread David Rowley
On Sun, 12 Feb 2023 at 19:39, Tom Lane  wrote:
> I find this patch horribly dangerous.

I see LogicalRepApplyLoop() does something similar with a
StringInfoData. Maybe it's just scarier having an external function in
stringinfo.c which does this, as that increases the chances of
someone using it for the wrong thing.

> It could maybe be okay if we added the capability for StringInfoData
> to understand (and enforce) that its "data" buffer is read-only.
> However, that'd add overhead to every existing use-case.

I'm not very excited by that.  I considered just setting maxlen = -1
in the new function and adding Asserts to check for that in each of
the appendStringInfo* functions. However, since the performance gains
are not so great, I'll probably just drop the whole thing given
there's resistance.
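
For the archives, the maxlen = -1 idea would have looked roughly like this (a
sketch; the function name is made up):

#include "lib/stringinfo.h"

/*
 * Hypothetical sketch: wrap an existing buffer in a StringInfoData
 * without copying, using maxlen == -1 as a "read-only, not palloc'd"
 * marker.
 */
static inline void
initStringInfoReadOnly(StringInfo str, char *data, int len)
{
	str->data = data;
	str->len = len;
	str->maxlen = -1;			/* sentinel: appending is not allowed */
	str->cursor = 0;
}

/* ... and each appendStringInfo* variant would start with:
 *     Assert(str->maxlen != -1);
 */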

David




Re: run pgindent on a regular basis / scripted manner

2023-02-12 Thread Andrew Dunstan


On 2023-02-10 Fr 10:21, Andrew Dunstan wrote:



On 2023-02-10 Fr 04:25, Jelte Fennema wrote:

Ah yes, I had seen that when I read the initial --commit patch but
then forgot about it when the flag didn't work at all when I tried it.

Attached is a patch that fixes the issue. And also implements the
--dirty and --staged flags in pgindent that Robert Haas requested.




I don't think just adding a diff filter is really a sufficient fix. 
The file might have been deleted since the commit(s) in question. 
Here's a more general fix for missing files.




OK, I've pushed this along with a check to make sure we only process 
each file once.



I'm not sure how much more I really want to do here. Given the way 
pgindent now processes command line arguments, maybe the best thing is 
for people to use that. Use of git aliases can help; something like 
these, for example:



[alias]

    dirty = diff --name-only --diff-filter=ACMU -- .
    staged = diff --name-only --cached --diff-filter=ACMU -- .
    dstaged = diff --name-only --diff-filter=ACMU HEAD -- .


and then you could do

    pgindent `git dirty`


The only danger would be if there were no dirty files. Maybe we need a 
switch to inhibit using the current directory if there are no command 
line files.



Thoughts?


cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Bharath Rupireddy
On Sun, Feb 12, 2023 at 4:14 AM Andres Freund  wrote:
>
> > +ssize_t
> > +pg_pwrite_zeros(int fd, size_t size)
> > +{
> > + PGAlignedBlock  zbuffer;
> > + size_t  zbuffer_sz;
> > + struct ioveciov[PG_IOV_MAX];
> > + int blocks;
> > + size_t  remaining_size = 0;
> > + int i;
> > + ssize_t written;
> > + ssize_t total_written = 0;
> > +
> > + zbuffer_sz = sizeof(zbuffer.data);
> > +
> > + /* Zero-fill the buffer. */
> > + memset(zbuffer.data, 0, zbuffer_sz);
>
> I previously commented on this - why are we memseting a buffer on every call
> to this?  That's not at all free.
>
> Something like
> static const PGAlignedBlock zerobuf = {0};
> would do the trick.  You do need to cast the const away, to assign to
> iov_base, but that's not too ugly.

Thanks for looking at it. We know that we don't change the zbuffer in
the function, so can we avoid static const and have just a static
variable, like the attached
v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
you see any problem with it?

FWIW, it comes out like the attached
v1-0001-Use-static-const-variable-to-avoid-memset-calls-i.patch with
static const.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 2527c6083bbdf006900be1a1dbf15ed815184d2b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 12 Feb 2023 10:58:33 +
Subject: [PATCH v1] Use static variable to avoid memset() calls in
 pg_pwrite_zeros()

---
 src/common/file_utils.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..0d59488f60 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	/* Zero-filled buffer worth BLCKSZ. */
+	static PGAlignedBlock zbuffer = {0};
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
@@ -548,10 +549,7 @@ pg_pwrite_zeros(int fd, size_t size)
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
+	zbuffer_sz = BLCKSZ;
 
 	/* Prepare to write out a lot of copies of our zero buffer at once. */
 	for (i = 0; i < lengthof(iov); ++i)
-- 
2.34.1

From adc5ea83fb6726a5d8a1f3b87d6b3b03550c7fcb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 12 Feb 2023 10:55:09 +
Subject: [PATCH v1] Use static const variable to avoid memset calls in
 pg_pwrite_zeros

---
 src/common/file_utils.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..cd15cc2fdb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	/* Zero-filled buffer worth BLCKSZ. */
+	static const PGAlignedBlock zbuffer = {0};
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
@@ -548,15 +549,12 @@ pg_pwrite_zeros(int fd, size_t size)
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
+	zbuffer_sz = BLCKSZ;
 
 	/* Prepare to write out a lot of copies of our zero buffer at once. */
 	for (i = 0; i < lengthof(iov); ++i)
 	{
-		iov[i].iov_base = zbuffer.data;
+		iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
 		iov[i].iov_len = zbuffer_sz;
 	}
 
-- 
2.34.1



Re: Transparent column encryption

2023-02-12 Thread Mark Dilger



> On Feb 11, 2023, at 1:54 PM, Mark Dilger  wrote:
> 
> Here are some observations 

I should mention, src/sgml/html/libpq-exec.html needs clarification:

> paramFormats[]: Specifies whether parameters are text (put a zero in the array 
> entry for the corresponding parameter) or binary (put a one in the array 
> entry for the corresponding parameter). If the array pointer is null then all 
> parameters are presumed to be text strings.

Perhaps you should edit this last sentence to say that all parameters are 
presumed to be text strings without forced encryption.

> Values passed in binary format require knowledge of the internal 
> representation expected by the backend. For example, integers must be passed 
> in network byte order. Passing numeric values requires knowledge of the 
> server storage format, as implemented in 
> src/backend/utils/adt/numeric.c::numeric_send() and 
> src/backend/utils/adt/numeric.c::numeric_recv().

> When column encryption is enabled, the second-least-significant byte of this 
> parameter specifies whether encryption should be forced for a parameter.

The value 0x10 has a one as its second-least-significant *nibble*, but that's a 
really strange way to describe the high-order nibble, and perhaps not what you 
mean.  Could you clarify?

> Set this byte to one to force encryption.

I think setting the byte to one (0x01) means "binary unencrypted", not "force 
encryption".  Don't you mean to set this byte to 16? 

> For example, use the C code literal 0x10 to specify text format with forced 
> encryption. If the array pointer is null then encryption is not forced for 
> any parameter.
> If encryption is forced for a parameter but the parameter does not correspond 
> to an encrypted column on the server, then the call will fail and the 
> parameter will not be sent. This can be used for additional security against 
> a compromised server. (The drawback is that application code then needs to be 
> kept up to date with knowledge about which columns are encrypted rather than 
> letting the server specify this.)

I think you should say something about how specifying 0x11 will behave -- in 
other words, asking for encrypted binary data. I believe that this will 
draw a "format must be text for encrypted parameter" error, and that the docs 
should clearly say so.
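
To make the review concrete, here is how I read the proposed semantics (a
sketch; the 0x10/0x01/0x11 byte values come from the patch under review, not
from released libpq):

#include <libpq-fe.h>

/*
 * Low byte = format (0 text, 1 binary); second-lowest byte = force
 * encryption, per the proposal.
 */
static PGresult *
insert_encrypted(PGconn *conn, const char *secret)
{
	const char *const values[1] = {secret};
	int			formats[1] = {0x10};	/* text format + forced encryption */

	/*
	 * 0x01 would mean binary, unencrypted; 0x11 (binary + forced
	 * encryption) is the combination I expect to draw the "format must
	 * be text for encrypted parameter" error.
	 */
	return PQexecParams(conn,
						"INSERT INTO t (enc_col) VALUES ($1)",
						1,		/* nParams */
						NULL,	/* paramTypes: let the server infer */
						values,
						NULL,	/* paramLengths: text is NUL-terminated */
						formats,
						0);		/* text result format */
}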

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: run pgindent on a regular basis / scripted manner

2023-02-12 Thread Tom Lane
Andrew Dunstan  writes:
> ... then you could do
>      pgindent `git dirty`
> The only danger would be if there were no dirty files. Maybe we need a 
> switch to inhibit using the current directory if there are no command 
> line files.

It seems like "indent the whole tree" is about to become a minority
use-case.  Maybe instead of continuing to privilege that case, we
should say that it's invoked by some new switch like --all-files,
and without that only the stuff identified by command-line arguments
gets processed.

regards, tom lane




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Andres Freund
Hi,

On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> Thanks for looking at it. We know that we don't change the zbuffer in
> the function, so can we avoid static const and have just a static
> variable, like the attached
> v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> you see any problem with it?

Making it static const is better, because it allows the memory for the
variable to be put in a readonly section.


>   /* Prepare to write out a lot of copies of our zero buffer at once. */
>   for (i = 0; i < lengthof(iov); ++i)
>   {
> - iov[i].iov_base = zbuffer.data;
> + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, 
> &zbuffer)->data);
>   iov[i].iov_len = zbuffer_sz;
>   }

Another thing: I think we should either avoid iterating over all the IOVs if
we don't need them, or, even better, initialize the array as a constant, once.
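
For the first option, roughly this (a sketch reusing pg_pwrite_zeros()'s
existing variables):

	/* set up only as many iovecs as this request can actually use */
	int			needed = Min(lengthof(iov),
							 (size + BLCKSZ - 1) / BLCKSZ);	/* round up */

	for (i = 0; i < needed; ++i)
	{
		iov[i].iov_base = unconstify(PGAlignedBlock *, &zbuffer)->data;
		iov[i].iov_len = BLCKSZ;
	}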

Greetings,

Andres Freund




Re: pgindent vs variable declaration across multiple lines

2023-02-12 Thread Tom Lane
Now that pg_bsd_indent is in our tree, we can format this as a
patch against Postgres sources.  I'll stick it in the March CF
so we don't forget about it.

regards, tom lane

diff --git a/src/tools/pg_bsd_indent/args.c b/src/tools/pg_bsd_indent/args.c
index d08b086a88..38eaa5a5bf 100644
--- a/src/tools/pg_bsd_indent/args.c
+++ b/src/tools/pg_bsd_indent/args.c
@@ -51,7 +51,7 @@ static char sccsid[] = "@(#)args.c	8.1 (Berkeley) 6/6/93";
 #include "indent_globs.h"
 #include "indent.h"
 
-#define INDENT_VERSION	"2.1.1"
+#define INDENT_VERSION	"2.1.2"
 
 /* profile types */
 #define	PRO_SPECIAL	1	/* special case */
diff --git a/src/tools/pg_bsd_indent/io.c b/src/tools/pg_bsd_indent/io.c
index 4149424294..9d64ca1ee5 100644
--- a/src/tools/pg_bsd_indent/io.c
+++ b/src/tools/pg_bsd_indent/io.c
@@ -201,11 +201,12 @@ dump_line(void)
 ps.decl_on_line = ps.in_decl;	/* if we are in the middle of a
 	 * declaration, remember that fact for
 	 * proper comment indentation */
-ps.ind_stmt = ps.in_stmt & ~ps.in_decl;	/* next line should be
-		 * indented if we have not
-		 * completed this stmt and if
-		 * we are not in the middle of
-		 * a declaration */
+/* next line should be indented if we have not completed this stmt, and
+ * either we are not in a declaration or we are in an initialization
+ * assignment; but not if we're within braces in an initialization,
+ * because that scenario is handled by other rules. */
+ps.ind_stmt = ps.in_stmt &&
+	(!ps.in_decl || (ps.block_init && ps.block_init_level <= 0));
 ps.use_ff = false;
 ps.dumped_decl_indent = 0;
 *(e_lab = s_lab) = '\0';	/* reset buffers */
diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent
index a793971e07..3398d62133 100755
--- a/src/tools/pgindent/pgindent
+++ b/src/tools/pgindent/pgindent
@@ -13,7 +13,7 @@ use IO::Handle;
 use Getopt::Long;
 
 # Update for pg_bsd_indent version
-my $INDENT_VERSION = "2.1.1";
+my $INDENT_VERSION = "2.1.2";
 
 # Our standard indent settings
 my $indent_opts =


Re: run pgindent on a regular basis / scripted manner

2023-02-12 Thread Andrew Dunstan


On 2023-02-12 Su 11:24, Tom Lane wrote:

Andrew Dunstan  writes:

... then you could do
      pgindent `git dirty`
The only danger would be if there were no dirty files. Maybe we need a
switch to inhibit using the current directory if there are no command
line files.

It seems like "indent the whole tree" is about to become a minority
use-case.  Maybe instead of continuing to privilege that case, we
should say that it's invoked by some new switch like --all-files,
and without that only the stuff identified by command-line arguments
gets processed.





I don't think we need --all-files. The attached gets rid of the build 
and code-base cruft, which is now in any case obsolete given we've put 
pg_bsd_indent in our code base. So the way to spell this instead of 
"pgindent --all-files" would be "pgindent ."


I added a warning if there are no files at all specified.


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com
diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent
index a793971e07..b098cef02a 100755
--- a/src/tools/pgindent/pgindent
+++ b/src/tools/pgindent/pgindent
@@ -21,7 +21,7 @@ my $indent_opts =
 
 my $devnull = File::Spec->devnull;
 
-my ($typedefs_file, $typedef_str, $code_base,
+my ($typedefs_file, $typedef_str,
 	@excludes,  $indent,  $build,
 	$show_diff, $silent_diff, $help,
 	@commits,);
@@ -33,10 +33,8 @@ my %options = (
 	"commit=s"   => \@commits,
 	"typedefs=s" => \$typedefs_file,
 	"list-of-typedefs=s" => \$typedef_str,
-	"code-base=s"=> \$code_base,
 	"excludes=s" => \@excludes,
 	"indent=s"   => \$indent,
-	"build"  => \$build,
 	"show-diff"  => \$show_diff,
 	"silent-diff"=> \$silent_diff,);
 GetOptions(%options) || usage("bad command line argument");
@@ -46,22 +44,16 @@ usage() if $help;
 usage("Cannot have both --silent-diff and --show-diff")
   if $silent_diff && $show_diff;
 
-usage("Cannot use --commit with --code-base or command line file list")
-  if (@commits && ($code_base || @ARGV));
+usage("Cannot use --commit with command line file list")
+  if (@commits && @ARGV);
 
-run_build($code_base) if ($build);
-
-# command line option wins, then environment (which is how --build sets it) ,
-# then locations. based on current dir, then default location
+# command line option wins, then environment, then locations based on current
+# dir, then default location
 $typedefs_file ||= $ENV{PGTYPEDEFS};
 
-# build mode sets PGINDENT
+# get indent location for environment or default
 $indent ||= $ENV{PGINDENT} || $ENV{INDENT} || "pg_bsd_indent";
 
-# if no non-option arguments or commits are given, default to looking in the
-# current directory
-$code_base ||= '.' unless (@ARGV || @commits);
-
 my $sourcedir = locate_sourcedir();
 
 # if it's the base of a postgres tree, we will exclude the files
@@ -121,8 +113,7 @@ sub check_indent
 sub locate_sourcedir
 {
 	# try fairly hard to locate the sourcedir
-	my $where = $code_base || '.';
-	my $sub = "$where/src/tools/pgindent";
+	my $sub = "./src/tools/pgindent";
 	return $sub if -d $sub;
 	# try to find it from an ancestor directory
 	$sub = "../src/tools/pgindent";
@@ -320,72 +311,6 @@ sub show_diff
 	return $diff;
 }
 
-sub run_build
-{
-	eval "use LWP::Simple;";## no critic (ProhibitStringyEval);
-
-	my $code_base = shift || '.';
-	my $save_dir = getcwd();
-
-	# look for the code root
-	foreach (1 .. 5)
-	{
-		last if -d "$code_base/src/tools/pgindent";
-		$code_base = "$code_base/..";
-	}
-
-	die "cannot locate src/tools/pgindent directory in \"$code_base\"\n"
-	  unless -d "$code_base/src/tools/pgindent";
-
-	chdir "$code_base/src/tools/pgindent";
-
-	my $typedefs_list_url =
-	  "https://buildfarm.postgresql.org/cgi-bin/typedefs.pl";;
-
-	my $rv = getstore($typedefs_list_url, "tmp_typedefs.list");
-
-	die "cannot fetch typedefs list from $typedefs_list_url\n"
-	  unless is_success($rv);
-
-	$ENV{PGTYPEDEFS} = abs_path('tmp_typedefs.list');
-
-	my $indentrepo = "https://git.postgresql.org/git/pg_bsd_indent.git";;
-	system("git clone $indentrepo >$devnull 2>&1");
-	die "could not fetch pg_bsd_indent sources from $indentrepo\n"
-	  unless $? == 0;
-
-	chdir "pg_bsd_indent" || die;
-	system("make all check >$devnull");
-	die "could not build pg_bsd_indent from source\n"
-	  unless $? == 0;
-
-	$ENV{PGINDENT} = abs_path('pg_bsd_indent');
-
-	chdir $save_dir;
-	return;
-}
-
-sub build_clean
-{
-	my $code_base = shift || '.';
-
-	# look for the code root
-	foreach (1 .. 5)
-	{
-		last if -d "$code_base/src/tools/pgindent";
-		$code_base = "$code_base/..";
-	}
-
-	die "cannot locate src/tools/pgindent directory in \"$code_base\"\n"
-	  unless -d "$code_base/src/tools/pgindent";
-
-	chdir "$code_base";
-
-	system("rm -rf src/tools/pgindent/pg_bsd_indent");
-	system("rm -f src/tools/pgindent/tmp_typedefs.list");
-	return;
-}
-
 sub usage
 {
 	my $message  = shift;
@@ -397,10 +322,8 @@ Options:
 	--commit

Re: run pgindent on a regular basis / scripted manner

2023-02-12 Thread Justin Pryzby
On Sun, Feb 12, 2023 at 11:24:14AM -0500, Tom Lane wrote:
> Andrew Dunstan  writes:
> > ... then you could do
> >      pgindent `git dirty`
> > The only danger would be if there were no dirty files. Maybe we need a 
> > switch to inhibit using the current directory if there are no command 
> > line files.
> 
> It seems like "indent the whole tree" is about to become a minority
> use-case.  Maybe instead of continuing to privilege that case, we
> should say that it's invoked by some new switch like --all-files,
> and without that only the stuff identified by command-line arguments
> gets processed.

It seems like if pgindent knows about git, it ought to process only
tracked files.  Then, it wouldn't need to manually exclude generated
files, and it wouldn't process vpath builds and who-knows-what else it
finds in CWD.

At least --commit doesn't seem to work when run outside of the root
source dir.

-- 
Justin




Re: run pgindent on a regular basis / scripted manner

2023-02-12 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-02-12 Su 11:24, Tom Lane wrote:
>> It seems like "indent the whole tree" is about to become a minority
>> use-case.  Maybe instead of continuing to privilege that case, we
>> should say that it's invoked by some new switch like --all-files,
>> and without that only the stuff identified by command-line arguments
>> gets processed.

> I don't think we need --all-files. The attached gets rid of the build 
> and code-base cruft, which is now in any case obsolete given we've put 
> pg_bsd_indent in our code base. So the way to spell this instead of 
> "pgindent --all-files" would be "pgindent ."

Ah, of course.

> I added a warning if there are no files at all specified.

LGTM.

regards, tom lane




Re: ExecRTCheckPerms() and many prunable partitions (sqlsmith)

2023-02-12 Thread Justin Pryzby
On Tue, Nov 29, 2022 at 10:37:56PM +0900, Amit Langote wrote:
> 0002 contains changes that has to do with changing how we access
> checkAsUser in some foreign table planning/execution code sites.
> Thought it might be better to describe it separately too.

This was committed as 599b33b94:
Stop accessing checkAsUser via RTE in some cases

That seems to add various elog()s which are hit frequently by sqlsmith:

postgres=# select from
(select transaction
from pg_prepared_xacts
right join pg_available_extensions
on false limit 0) where false;
ERROR:  permission info at index 2 (with relid=1262) does not match provided 
RTE (with relid=12081)

postgres=# select from (select confl_tablespace
from pg_stat_database_conflicts
where datname <> (select 'af')
limit 1) where false;
ERROR:  invalid perminfoindex 1 in RTE with relid 12271





Re: Making Vars outer-join aware

2023-02-12 Thread Justin Pryzby
On Mon, Jan 23, 2023 at 03:38:06PM -0500, Tom Lane wrote:
> Richard, are you planning to review this any more?  I'm getting
> a little antsy to get it committed.  For such a large patch,
> it's surprising it's had so few conflicts to date.

The patch broke this query:

select from pg_inherits inner join information_schema.element_types
right join (select from pg_constraint as sample_2) on true
on false, lateral (select scope_catalog, inhdetachpending from 
pg_publication_namespace limit 3);
ERROR:  could not devise a query plan for the given query





Re: Generating code for query jumbling through gen_node_support.pl

2023-02-12 Thread Michael Paquier
On Fri, Feb 10, 2023 at 04:40:08PM -0500, Tom Lane wrote:
> v2 looks good to me as far as it goes.

Thanks.  I have applied that after an extra lookup.

> I agree these other questions deserve a separate look.

Okay, I may be able to come back to that.  Another point is that we
need to do a better job in forcing the execution of the query jumbling
in one of the TAP tests running pg_regress, outside
pg_stat_statements, to maximize coverage.  Will see to that on a
separate thread.
--
Michael




Re: [PATCH] Add pretty-printed XML output option

2023-02-12 Thread Peter Smith
Something is misbehaving.

Using the latest HEAD, and before applying the v6 patch, 'make check'
is working OK.

But after applying the v6 patch, some XML regression tests are failing
because the DETAIL part of the message indicating the wrong syntax
position is not getting displayed. Not only for your new tests -- but
for other XML tests too.

My ./configure looks like this:
./configure --prefix=/usr/local/pg_oss --with-libxml --enable-debug
--enable-cassert --enable-tap-tests  CFLAGS="-ggdb -O0 -g3
-fno-omit-frame-pointer"

resulting in:
checking whether to build with XML support... yes
checking for libxml-2.0 >= 2.6.23... yes

~

e.g.(these are a sample of errors)

xml  ... FAILED 2561 ms

@@ -344,8 +326,6 @@
 &idontexist;
^
 line 1: Opening and ending tag mismatch: twoerrors line 1 and unbalanced
-&idontexist;
-^
 SELECT xmlparse(document '');
   xmlparse
 -
@@ -1696,14 +1676,8 @@
 SELECT xmlformat('');
 ERROR:  invalid XML document
 DETAIL:  line 1: switching encoding : no input
-
-^
 line 1: Document is empty
-
-^
 -- XML format: invalid string (whitespaces)
 SELECT xmlformat('   ');
 ERROR:  invalid XML document
 DETAIL:  line 1: Start tag expected, '<' not found
-
-   ^

~~

Separately (but maybe it's related?), the CF-bot test also reported a
failure [1] with strange error detail differences.

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/xml.out
/tmp/cirrus-ci-build/build/testrun/regress/regress/results/xml.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/xml.out 2023-02-12
09:02:57.077569000 +
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/xml.out
2023-02-12 09:05:45.14810 +
@@ -1695,10 +1695,7 @@
 -- XML format: empty string
 SELECT xmlformat('');
 ERROR:  invalid XML document
-DETAIL:  line 1: switching encoding : no input
-
-^
-line 1: Document is empty
+DETAIL:  line 1: Document is empty

 ^
 -- XML format: invalid string (whitespaces)

--
[1] 
https://api.cirrus-ci.com/v1/artifact/task/4802219812323328/testrun/build/testrun/regress/regress/regression.diffs

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-12 Thread Kyotaro Horiguchi
At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila  wrote 
in 
> On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila  wrote:
> >
> > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
> >  wrote:
> > > We have suffered this kind of feedback silence many times. Thus I
> > > don't want to rely on luck here. I had in mind exposing last_send
> > > itself or providing an interval-calculation function to the logic.
> >
> > I think we have last_send time in send_feedback(), so we can expose it
> > if we want but how would that solve the problem you are worried about?

Wal receiver can avoid a too-long sleep by knowing when to wake up for
the next feedback.

> I have an idea to use last_send time to avoid walsenders being
> timeout. Instead of directly using wal_receiver_status_interval as a
> minimum interval to send the feedback, we should check if it is
> greater than last_send time then we should send the feedback after
> (wal_receiver_status_interval - last_send). I think they can probably
> be different only on the very first time. Any better ideas?

If it means MyLogicalRepWorker->last_send_time, it is not the last
time when walreceiver sent a feedback but the last time when
wal*sender* sent data. So I'm not sure that works.

We could use the variable that way, but AFAIU, when so many changes
have been spooled that control doesn't return to LogicalRepApplyLoop
for longer than wal_receiver_status_interval, maybe_apply_delay()
starts calling send_feedback() on every call after the first feedback
timing.  Even then, send_feedback() won't actually send anything until
the next feedback timing; I don't think that behavior is great.

The only packets walreceiver sends back are the feedback packets, and
currently only send_feedback() knows the last feedback time.
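
To spell out the timing I have in mind (a sketch; it assumes send_feedback()
exposes its last send time as last_feedback):

	TimestampTz	now = GetCurrentTimestamp();
	long		since_last = TimestampDifferenceMilliseconds(last_feedback, now);
	long		sleep_ms = wal_receiver_status_interval * 1000L - since_last;

	if (sleep_ms <= 0)
		send_feedback(last_received, true, false);	/* a feedback is due now */
	else
	{
		/* cap the WaitLatch timeout at sleep_ms so we wake up in time */
	}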

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-02-12 Thread Kyotaro Horiguchi
At Fri, 10 Feb 2023 12:40:43 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Amit,
> 
> > Can't we have this option just as a bool (like shutdown_immediate)?
> > Why do we want to keep multiple modes?
> 
> Of course we can use boolean instead, but current style is motivated by the 
> post[1].
> This allows to add another option in future, whereas I do not have idea now.
> 
> I want to ask other reviewers which one is better...
> 
> [1]: 
> https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com

IMHO I vaguely don't like that we lose a means to specify the default
behavior here. And I'm not sure we definitely won't need anything other
than flush and immediate, for both physical and logical replication. If
that's not the case, I don't object to making it a Boolean.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-02-12 Thread Kyotaro Horiguchi
At Fri, 10 Feb 2023 16:43:15 +0900, Michael Paquier  wrote 
in 
> On Wed, Feb 08, 2023 at 09:42:13PM -0500, Tom Lane wrote:
> > Hm.  On the one hand, if it is in fact not in postgresql.conf.sample,
> > then that flag should be set for sure.  OTOH I see that that flag
> > isn't purely documentation: help_config.c thinks it should hide
> > GUCs that are marked that way.  Do we really want that behavior?
> > Not sure.  I can see an argument that you might want --describe-config
> > to tell you that, but there are a lot of other GUC_NOT_IN_SAMPLE
> > GUCs that maybe do indeed deserve to be left out.
> 
> I am not sure to follow.  help_config() won't show something that's
> either marked NO_SHOW_ALL, NOT_IN_SAMPLE or DISALLOW_IN_FILE, hence
> config_file does not show up already in what postgres
> --describe-config prints, because it has DISALLOW_IN_FILE, so adding
> NOT_IN_SAMPLE changes nothing.

I think currently the output of --describe-config can be used only for
consulting while editing a (possibly broken) config file.  Thus I
think there is no use showing GUC_DISALLOW_IN_FILE items there unless we
use help_config() for an on-session use.

On the other hand, don't we need to remove the condition
GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config()
should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if
it is GUC_NOT_IN_SAMPLE.  I'm not sure whether there are any variables
that are marked that way, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Rework LogicalOutputPluginWriterUpdateProgress

2023-02-12 Thread Amit Kapila
On Sat, Feb 11, 2023 at 3:04 AM Andres Freund  wrote:
>
> > One difference I see with the patch is that I think we will end up
> > sending keepalive for empty prepared transactions even though we don't
> > skip sending begin/prepare messages for those.
>
> With the proposed approach we reliably know whether a callback wrote
> something, so we can tune the behaviour here fairly easily.
>

I would like to clarify a few things about the proposed approach. In
commit_cb_wrapper()/prepare_cb_wrapper(), the patch first sets
ctx->did_write = false;, then calls the commit/prepare callback (which
will call pgoutput_commit_txn()/pgoutput_prepare_txn()), and then calls
update_progress(), which makes its decisions based on the ctx->did_write
flag. Now, for this to work, pgoutput_commit_txn/pgoutput_prepare_txn
must know whether the transaction has performed any writes before that
call, which currently works because pgoutput tracks this via
sent_begin_txn. Is the intention here that we still track whether
BEGIN has been sent via pgoutput?
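
For reference, the wrapper pattern as I understand it (a sketch: did_write and
the update_progress() call follow my reading of the patch; the surrounding
shape follows logical.c's existing wrappers):

static void
commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
				  XLogRecPtr commit_lsn)
{
	LogicalDecodingContext *ctx = cache->private_data;

	ctx->did_write = false;		/* reset before the output plugin runs */
	ctx->callbacks.commit_cb(ctx, txn, commit_lsn);

	/* if the plugin wrote nothing, treat the transaction as skipped */
	update_progress(ctx, commit_lsn, !ctx->did_write);
}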

-- 
With Regards,
Amit Kapila.




Re: Exit walsender before confirming remote flush in logical replication

2023-02-12 Thread Amit Kapila
On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi
 wrote:
>
> At Fri, 10 Feb 2023 12:40:43 +, "Hayato Kuroda (Fujitsu)" 
>  wrote in
> > Dear Amit,
> >
> > > Can't we have this option just as a bool (like shutdown_immediate)?
> > > Why do we want to keep multiple modes?
> >
> > Of course we can use boolean instead, but current style is motivated by the 
> > post[1].
> > This allows to add another option in future, whereas I do not have idea now.
> >
> > I want to ask other reviewers which one is better...
> >
> > [1]: 
> > https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com
>
> IMHO I vaguely don't like that we lose a means to specify the default
> behavior here. And I'm not sure we definitely won't need anything other
> than flush and immediate, for both physical and logical replication.
>

If we can think of any use case that requires its extension then it
makes sense to make it a non-boolean option but otherwise, let's keep
things simple by having a boolean option.

> If it's
> not the case, I don't object to make it a Boolean.
>

Thanks.

-- 
With Regards,
Amit Kapila.




Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-02-12 Thread Michael Paquier
On Mon, Feb 13, 2023 at 11:27:58AM +0900, Kyotaro Horiguchi wrote:
> I think currently the output by --describe-config can be used only for
> consulting while editing a (possiblly broken) config file.  Thus I
> think it's no use showing GIC_DISALLOW_IN_FILE items there unless we
> use help_config() for an on-session use.
> 
> On the other hand, don't we need to remove the condition
> GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config()
> should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if
> it is GUC_NOT_IN_SAMPLE.  I'm not sure whether there's any variable
> that are marked that way, though.

As in, marked with GUC_NOT_IN_SAMPLE but not GUC_DISALLOW_IN_FILE?
There are quite a lot, developer GUCs being one category (say
ignore_invalid_pages).  We don't want to list them in the sample file
so that common users don't play with them, but they still make sense if
listed in a file.

If you add a check meaning that GUC_DISALLOW_IN_FILE implies
GUC_NOT_IN_SAMPLE (where one change would need to be applied to
config_file, as all the other GUC_DISALLOW_IN_FILE GUCs already do
that), you could remove GUC_DISALLOW_IN_FILE from the check.  However,
GUC_NOT_IN_SAMPLE should remain, so as not to expose options we don't
want common users to know too much about.

The question about how much people rely on --describe-config these
days is a good one, so perhaps there could be an argument for removing
GUC_NOT_IN_SAMPLE from the set.  TBH, I would be really surprised if
anybody able to use a developer option wrote a configuration file in
an incorrect format and needed to use this option, though :)
--
Michael




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Michael Paquier
On Sun, Feb 12, 2023 at 09:31:36AM -0800, Andres Freund wrote:
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

By which you imply that we'd still use memset() once on iov[PG_IOV_MAX],
right?
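
(Noting in passing that if the array becomes static, even that single memset
would be redundant, since C zero-initializes static storage:)

	/*
	 * Static storage is zero-initialized per the C standard, so neither an
	 * explicit memset() nor an "= {0}" initializer would be needed:
	 */
	static struct iovec iov[PG_IOV_MAX];	/* iov_base/iov_len all start at 0 */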
--
Michael




Re: Exit walsender before confirming remote flush in logical replication

2023-02-12 Thread Peter Smith
Here are some comments for patch v7-0002.

==
Commit Message

1.
This commit extends START_REPLICATION to accept SHUTDOWN_MODE clause. It is
currently implemented only for logical replication.

~

"to accept SHUTDOWN_MODE clause." --> "to accept a SHUTDOWN_MODE clause."

==
doc/src/sgml/protocol.sgml

2.
START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE {
'wait_flush' | 'immediate' } ] [ ( option_name [ option_value ] [,
...] ) ]

~

IMO this should say shutdown_mode as it did before:
START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE
shutdown_mode ] [ ( option_name [ option_value ] [, ...] ) ]

~~~

3.
+   
+shutdown_mode
+
+ 
+  Determines the behavior of the walsender process at shutdown. If
+  shutdown_mode is 'wait_flush', the walsender waits
+  for all the sent WALs to be flushed on the subscriber side. This is
+  the default when SHUTDOWN_MODE is not specified. If shutdown_mode is
+  'immediate', the walsender exits without
+  confirming the remote flush.
+ 
+
+   

Is the font of the "shutdown_mode" correct? I expected it to be like
the others (e.g. slot_name)

==
src/backend/replication/walsender.c

4.
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+ if (cmd->shutdownmode)
+ {
+ char*mode = cmd->shutdownmode;
+
+ if (pg_strcasecmp(mode, "wait_flush") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+ else if (pg_strcasecmp(mode, "immediate") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE;
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid value for shutdown mode: \"%s\"", mode),
+ errhint("Available values: wait_flush, immediate."));
+ }
+
+}

Unnecessary extra whitespace at end of the function.

==
src/include/nodes/replnodes.

5.
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd
  char*slotname;
  TimeLineID timeline;
  XLogRecPtr startpoint;
+ char*shutdownmode;
  List*options;
 } StartReplicationCmd;

IMO the last 2 members should have a comment, something like:
/* Only for logical replication */

because that will make it clearer why sometimes they are assigned
and sometimes they are not.

==
src/include/replication/walreceiver.h

6.
Should the protocol version be bumped (and documented) now that the
START REPLICATION supports a new extended syntax? Or is that done only
for messages sent by pgoutput?

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-12 Thread Takamichi Osumi (Fujitsu)
Hi, Horiguchi-san


On Monday, February 13, 2023 10:26 AM Kyotaro Horiguchi 
 wrote:
> At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila 
> wrote in
> > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila 
> wrote:
> > >
> > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
> > >  wrote:
> > > > We have suffered this kind of feedback silence many times. Thus I
> > > > don't want to rely on luck here. I had in mind exposing
> > > > last_send itself or providing an interval-calculation function to the logic.
> > >
> > > I think we have last_send time in send_feedback(), so we can expose
> > > it if we want but how would that solve the problem you are worried
> about?
> 
> Wal receiver can avoid a too-long sleep by knowing when to wake up for the
> next feedback.
> 
> > I have an idea to use last_send time to avoid walsenders being
> > timeout. Instead of directly using wal_receiver_status_interval as a
> > minimum interval to send the feedback, we should check if it is
> > greater than last_send time then we should send the feedback after
> > (wal_receiver_status_interval - last_send). I think they can probably
> > be different only on the very first time. Any better ideas?
> 
> If it means MyLogicalRepWorker->last_send_time, it is not the last time
> when walreceiver sent a feedback but the last time when
> wal*sender* sent data. So I'm not sure that works.
> 
> We could use the variable that way, but AFAIU, when so many changes
> have been spooled that control doesn't return to LogicalRepApplyLoop
> for longer than wal_receiver_status_interval, maybe_apply_delay() starts
> calling send_feedback() on every call after the first feedback timing.
> Even then, send_feedback() won't actually send anything until the next
> feedback timing; I don't think that behavior is great.
> 
> The only packets walreceiver sends back is the feedback packets and
> currently only send_feedback knows the last feedback time.
Thanks for your comments!

As described in your last sentence, in the latest patch v34 [1],
we use the last time set in send_feedback(), and based on it we
calculate and adjust the first timing of the feedback message in
maybe_apply_delay(), so that we can send the feedback message
following the wal_receiver_status_interval interval. I wasn't sure if
the above concern is still valid for this implementation.

Could you please have a look at the latest patch and share your opinion?


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83736C50C98CB2153728A7A8EDDE9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi





Re: Support logical replication of DDLs

2023-02-12 Thread Zheng Li
On Fri, Feb 10, 2023 at 11:31 AM vignesh C  wrote:
>
> On Fri, 10 Feb 2023 at 21:50, vignesh C  wrote:
> > The attached v68 version patch has the changes for the same.
>
> I was not sure if we should support ddl replication of
> create/alter/drop subscription commands as there might be some data
> inconsistency issues in the following cases:
> #node1 who is running in port 5432
> create publication pub_node1 for all tables with ( PUBLISH = 'insert,
> update, delete, truncate');
>
> #node2 who is running in port 5433
> create publication pub_node2 for all tables with(PUBLISH = 'insert,
> update, delete, truncate', ddl = 'all');
> create subscription sub_node2 connection 'dbname=postgres host=node1
> port=5432' publication pub_node1;
>
> #node3
> create subscription sub_node3 connection 'dbname=postgres host=node2
> port=5433 publication pub_node2;
>
> #node1
> create table t1(c1 int );
>
> #node2
> create table t1(c1 int);
> alter subscription sub_node2 refresh publication;
>
> # Additionally this command will be replicated to node3, creating a
> subscription sub2_node2 in node3 which will subscribe data from node1
> create subscription sub2_node2 connection 'dbname=postgres host=node1
> port=5432' publication pub_node1;
>
> After this any insert into t1 from node1 will be replicated to node2
> and node3; additionally, node2's replicated data (which was replicated
> from node1) will also be sent to node3, causing inconsistency. If the
> table has unique or primary key constraints, it will lead to an error.
>
> Another option would be to replicate the create subscription in a
> disabled state and not support DDL replication of those alter
> subscription commands which connect to the publisher, like:
> 1) Alter subscription sub1 enable;
> 2) Alter subscription sub1 refresh publication;

I think it will also be error-prone when the user tries to enable the
replicated subscription on node3 later on, for example when switching
over from node2 to node3.
There is a risk of duplicate or missing data on node3 if the switchover
isn't done right.

> But in this case also, we would be able to support some alter
> subscription commands and not others.
> I feel it is better that we do not support ddl replication of
> create/alter/drop subscription commands and let users handle the
> subscription commands.

+1 for not supporting subscription commands in the first version and
letting users handle them.

Regards,
Zane




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Bharath Rupireddy
On Sun, Feb 12, 2023 at 11:01 PM Andres Freund  wrote:
>
> On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > Thanks for looking at it. We know that we don't change the zbuffer in
> > the function, so can we avoid static const and have just a static
> > variable, like the attached
> > v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> > you see any problem with it?
>
> Making it static const is better, because it allows the memory for the
> variable to be put in a readonly section.

Okay.

> >   /* Prepare to write out a lot of copies of our zero buffer at once. */
> >   for (i = 0; i < lengthof(iov); ++i)
> >   {
> > - iov[i].iov_base = zbuffer.data;
> > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, 
> > &zbuffer)->data);
> >   iov[i].iov_len = zbuffer_sz;
> >   }
>
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

How about the attached patch? It makes the iovec a static variable and
points it at the zero buffer only once, the first time through. This
avoids the for-loop on every call.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 550b606affb614980a4b814d9e44e73d5489 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 13 Feb 2023 04:25:34 +
Subject: [PATCH v1] Optimize zero buffer handling in pg_pwrite_zeros()

1. Use static const zero-filled buffer to avoid memset-ing on
every call.
2. Use static iovec and point the zero-filled buffer for the
first time through to avoid for loop initializing iovec on every
call.

Reported-by: Andres Freund
Author: Bharath Rupireddy
Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.de
---
 src/common/file_utils.c | 34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..08688eb1b1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,25 +539,35 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
-	size_t		zbuffer_sz;
-	struct iovec iov[PG_IOV_MAX];
+	/*
+	 * Zero-filled buffer worth BLCKSZ. We use static variable to avoid zeroing
+	 * the buffer on every call. Also, with the const qualifier, it is ensured
+	 * that the buffer is placed in read-only section of the process's memory.
+	 */
+	static const PGAlignedBlock zbuffer = {0};
+	static size_t	zbuffer_sz = BLCKSZ;
+	static bool	first_time = true;
+	static struct iovec	iov[PG_IOV_MAX] = {0};
 	int			blocks;
 	size_t		remaining_size = 0;
 	int			i;
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
-
-	/* Prepare to write out a lot of copies of our zero buffer at once. */
-	for (i = 0; i < lengthof(iov); ++i)
+	/*
+	 * If first time through, point the zero buffer to iovec structure. This
+	 * avoids for loop on every call.
+	 */
+	if (first_time)
 	{
-		iov[i].iov_base = zbuffer.data;
-		iov[i].iov_len = zbuffer_sz;
+		/* Prepare to write out a lot of copies of our zero buffer at once. */
+		for (i = 0; i < lengthof(iov); ++i)
+		{
+			iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
+			iov[i].iov_len = zbuffer_sz;
+		}
+
+		first_time = false;
 	}
 
 	/* Loop, writing as many blocks as we can for each system call. */
-- 
2.34.1



Re: Support logical replication of DDLs

2023-02-12 Thread Peter Smith
On Sat, Feb 11, 2023 at 3:31 AM vignesh C  wrote:
>
> On Fri, 10 Feb 2023 at 21:50, vignesh C  wrote:
> > The attached v68 version patch has the changes for the same.
>
> I was not sure if we should support ddl replication of
> create/alter/drop subscription commands as there might be some data
> inconsistency issues in the following cases:
> #node1 who is running in port 5432
> create publication pub_node1 for all tables with ( PUBLISH = 'insert,
> update, delete, truncate');
>
> #node2 who is running in port 5433
> create publication pub_node2 for all tables with(PUBLISH = 'insert,
> update, delete, truncate', ddl = 'all');
> create subscription sub_node2 connection 'dbname=postgres host=node1
> port=5432' publication pub_node1;
>
> #node3
> create subscription sub_node3 connection 'dbname=postgres host=node2
> port=5433 publication pub_node2;
>
> #node1
> create table t1(c1 int );
>
> #node2
> create table t1(c1 int);
> alter subscription sub_node2 refresh publication;
>
> # Additionally this command will be replicated to node3, creating a
> subscription sub2_node2 in node3 which will subscribe data from node1
> create subscription sub2_node2 connection 'dbname=postgres host=node1
> port=5432' publication pub_node1;
>
> After this any insert into t1 from node1 will be replicated to node2
> and node3; additionally, node2's replicated data (which was replicated
> from node1) will also be sent to node3, causing inconsistency. If the
> table has unique or primary key constraints, it will lead to an error.
>
> Another option would be to replicate the create subscription in a
> disabled state and not support DDL replication of those alter
> subscription commands which connect to the publisher, like:
> 1) Alter subscription sub1 enable;
> 2) Alter subscription sub1 refresh publication;
>
> But in this case also, we would be able to support some alter
> subscription commands and not others.
> I feel it is better that we do not support ddl replication of
> create/alter/drop subscription commands and let users handle the
> subscription commands.
> Thoughts?
>

So essentially, node3 is subscribed 2x from the same table in node1

node1 --> node2
  |          |  ddl
  |          V
  +------> node3

I think the suggested options are:

option #1. If you support CREATE SUBSCRIPTION DDL replication then you
can end up with the conflict troubles you described above

option #2. If you support CREATE SUBSCRIPTION DDL replication but only
make sure it is disabled first then your above scenario might be OK
but you will also need to *not* allow DDL replication of the ALTER
SUBSCRIPTION which might cause it to become re-enabled. IMO adding
tricky rules is just inviting more problems.

option #3. Do nothing, don't support it. +1 but see below for a
variation of this

~

Actually, I am sort of expecting lots of potential difficulties with
DDL replication and this CREATE SUBSCRIPTION problem is just one of
them. IMO it would be a mistake to try and make the first version of
these patches try to do *everything*. E.g. Why invent tricky solutions
to problems without yet knowing user expectations/requirements?

Therefore, my first impression is to do a generic option #4:

option #4. This is a variation of "do nothing". My suggestion is you
can still replicate every DDL via logical replication messages but
just don't actually *apply* anything on the subscriber side for the
commands which are problematic (like this one is). Instead of
applying, the subscriber can just log a NOTICE about each command that
was skipped. This will make it easier for the user to know what didn’t
happen, but they can just cut/paste that command from the NOTICE if
they really want to. Also, this option #4 is deliberately generic,
which means you can do the same for every kind of DDL that proves too
difficult to automate in the first version (I doubt CREATE
SUBSCRIPTION will be the only example).

The option #4 result might look like this:


test_sub=# create subscription sub1 connection 'dbname=test_pub'
publication pub1;
NOTICE: created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
NOTICE: subscription "sub1" skipping DDL: create subscription
sub_node2 connection 'dbname=postgres host=node1 port=5432'
publication pub_node1;
...


(And if it turns out users really do want this then it can be
revisited in later patch versions)

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Force testing of query jumbling code in TAP tests

2023-02-12 Thread Michael Paquier
Hi all,

As mentioned two times on this thread, there is not much coverage for
the query jumbling code, even if it is in core:
https://www.postgresql.org/message-id/Y5BHOUhX3zTH/i...@paquier.xyz

The issue is that we have the options to enable it, but only
pg_stat_statements is able to enable and stress it.  This causes
coverage to be missed for all query patterns that are not covered
directly by pg_stat_statements, like XML expressions, various DML
patterns, etc.  More aggressive testing would also ensure that no
nodes are marked as no_query_jumble while they should be included in a
computation.

Attached is a patch to improve that.  The main regression database is
able to cover everything, basically, so I'd like to propose the
addition of some extra configuration in 027_stream_regress.pl to
enable pg_stat_statements.  This could be added in the pg_upgrade
tests, but that felt a bit less adapted here.  Or can people think
about cases where checking pg_stat_statements makes more sense after
an upgrade or on a standby?  One thing that makes sense for a standby
is to check that the contents of pg_stat_statements are empty?

With this addition, the query jumbling gets covered at 95%~, while
https://coverage.postgresql.org/src/backend/nodes/index.html reports
currently 35%.

Thoughts or comments?
--
Michael
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 570bf42b58..c60314d195 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-
 
-EXTRA_INSTALL=contrib/test_decoding contrib/pg_prewarm
+EXTRA_INSTALL=contrib/pg_prewarm contrib/pg_stat_statements contrib/test_decoding
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl
index 69d6ddf281..9fb565f33c 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -14,6 +14,16 @@ $node_primary->init(allows_streaming => 1);
 $node_primary->adjust_conf('postgresql.conf', 'max_connections', '25');
 $node_primary->append_conf('postgresql.conf',
 	'max_prepared_transactions = 10');
+
+# Enable pg_stat_statements to force tests with do query jumbling.
+# pg_stat_statements.max should be large enough to hold all the entries
+# of the regression database.
+$node_primary->append_conf('postgresql.conf',
+			   qq{shared_preload_libraries = 'pg_stat_statements'
+pg_stat_statements.max = 5
+compute_query_id = 'regress'
+});
+
 # We'll stick with Cluster->new's small default shared_buffers, but since that
 # makes synchronized seqscans more probable, it risks changing the results of
 # some test queries.  Disable synchronized seqscans to prevent that.




Re: Introduce a new view for checkpointer related stats

2023-02-12 Thread Bharath Rupireddy
On Fri, Feb 10, 2023 at 10:00 PM Bharath Rupireddy
 wrote:
>
> Thanks. FWIW, I've attached the patch introducing pg_stat_checkpointer
> as 0003 here.
>
> Please review the attached v7 patch set.

Needed a rebase. Please review the attached v8 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From c7d1a10b113c90e3712e5f836c21b5a40f909a24 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 13 Feb 2023 05:01:52 +
Subject: [PATCH v8] Drop buffers_backend column from pg_stat_bgwriter view

---
 doc/src/sgml/monitoring.sgml  |  9 -
 src/backend/catalog/system_views.sql  |  1 -
 src/backend/postmaster/checkpointer.c | 20 +--
 .../utils/activity/pgstat_checkpointer.c  |  2 --
 src/backend/utils/adt/pgstatfuncs.c   |  6 --
 src/include/catalog/pg_proc.dat   |  4 
 src/include/pgstat.h  |  1 -
 src/test/regress/expected/rules.out   |  1 -
 8 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dca50707ad..c0cae0f4f3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4059,15 +4059,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   </para></entry>
  </row>
 
- <row>
-  <entry role="catalog_table_entry"><para role="column_definition">
-   <structfield>buffers_backend</structfield> <type>bigint</type>
-  </para>
-  <para>
-   Number of buffers written directly by a backend
-  </para></entry>
- </row>
-
  <row>
   <entry role="catalog_table_entry"><para role="column_definition">
    <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..9bb96e27aa 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1112,7 +1112,6 @@ CREATE VIEW pg_stat_bgwriter AS
 pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
 pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
 pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-pg_stat_get_buf_written_backend() AS buffers_backend,
 pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
 pg_stat_get_buf_alloc() AS buffers_alloc,
 pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index aaad5c5228..baadd08044 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -91,17 +91,15 @@
  * requesting backends since the last checkpoint start.  The flags are
  * chosen so that OR'ing is the correct way to combine multiple requests.
  *
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes.  This counter should be wide enough that it
- * can't overflow during a single processing cycle.  num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
+ * num_backend_fsync counts the subset of buffer writes performed by user
+ * backend processes that also had to do their own fsync, because the
+ * checkpointer failed to absorb their request.
  *
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the checkpointer.
  *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
+ * Unlike the checkpoint fields, num_backend_fsync and the requests fields are
+ * protected by CheckpointerCommLock.
  *--
  */
 typedef struct
@@ -125,7 +123,6 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
 
-	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
 	int			num_requests;	/* current # of requests */
@@ -1096,10 +1093,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
-	/* Count all backend writes regardless of if they fit in the queue */
-	if (!AmBackgroundWriterProcess())
-		CheckpointerShmem->num_backend_writes++;
-
 	/*
 	 * If the checkpointer isn't running or the request queue is full, the
 	 * backend will have to perform its own fsync request.  But before forcing
@@ -1272,12 +1265,9 @@ AbsorbSyncRequests(void)
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
 	/* Transfer stats counts into pending pgstats message */
-	PendingCheckpointerStats.buf_written_backend
-		+= CheckpointerShmem->num_backend_writes;
 	PendingCheckpointerStats.buf_fsync_backend
 		+= CheckpointerShmem->num_backend_fsync;
 
-	CheckpointerShmem->num_backend_writes = 0;
 	CheckpointerShmem->num_backend_fsync = 0;
 
 	/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c

Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-02-12 Thread Drouvot, Bertrand

Hi,

On 2/10/23 10:46 PM, Andres Freund wrote:

Hi,

On 2023-02-09 11:38:18 +0100, Drouvot, Bertrand wrote:

Please find attached a patch proposal for $SUBJECT.

The idea has been raised in [1] by Andres: it would allow simplifying even
further the work done to generate the pg_stat_get_xact*() functions with
macros.


Thanks!



Thanks for looking at it!


I think this is useful beyond being able to generate those functions with
macros. The fact that we had to deal with transactional code in pgstatfuncs.c
meant that a lot of the relevant internals had to be exposed "outside" pgstat,
which seems architecturally not great.

 
Right, good point.



Indeed, with the reconciliation done in find_tabstat_entry(), all the
pg_stat_get_xact*() functions (making use of find_tabstat_entry()) now "look
the same", whether or not they take live subtransactions into account.


I'm not bothered by making all of pg_stat_get_xact* functions more expensive,
they're not a hot code path. But if we need to, we could just add a parameter
to find_tabstat_entry() indicating whether we need to reconcile or not.



I think that's a good idea to avoid doing extra work if not needed.
V2 adds such a bool.


 	/* save stats for this function, later used to compensate for recursion */
-	fcu->save_f_total_time = pending->f_counts.f_total_time;
+	fcu->save_f_total_time = pending->f_total_time;
 
 	/* save current backend-wide total time */
 	fcu->save_total = total_func_time;


The diff is noisy due to all the mechanical changes like the above. Could that
be split into a separate commit?



Fully agree, the PgStat_BackendFunctionEntry stuff will be done in a separate 
patch.




 find_tabstat_entry(Oid rel_id)
 {
 	PgStat_EntryRef *entry_ref;
+	PgStat_TableStatus *tablestatus = NULL;
 
 	entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, MyDatabaseId, rel_id);
 	if (!entry_ref)
 		entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id);
 
 	if (entry_ref)
-		return entry_ref->pending;
-	return NULL;
+	{
+		PgStat_TableStatus *tabentry = (PgStat_TableStatus *) entry_ref->pending;

I'd add an early return for the !entry_ref case, that way you don't need to
indent the bulk of the function.



Good point, done in V2.




+   PgStat_TableXactStatus *trans;
+
+   tablestatus = palloc(sizeof(PgStat_TableStatus));
+   memcpy(tablestatus, tabentry, sizeof(PgStat_TableStatus));


For things like this I'd use
   *tablestatus = *tabentry;

that way the compiler will warn you about mismatching types, and you don't
need the sizeof().




Good point, done in V2.


+		/* live subtransactions' counts aren't in t_counts yet */
+		for (trans = tabentry->trans; trans != NULL; trans = trans->upper)
+		{
+			tablestatus->t_counts.t_tuples_inserted += trans->tuples_inserted;
+			tablestatus->t_counts.t_tuples_updated += trans->tuples_updated;
+			tablestatus->t_counts.t_tuples_deleted += trans->tuples_deleted;
+		}
+	}


Hm, why do we end up with t_counts still being used here, but removed in
other places?



t_counts is not removed; maybe you are thinking of the "f_counts" that
were removed in V1 as part of the PgStat_BackendFunctionEntry related
changes?




diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6737493402..40a6fbf871 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1366,7 +1366,10 @@ pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 	if ((tabentry = find_tabstat_entry(relid)) == NULL)
 		result = 0;
 	else
+	{
 		result = (int64) (tabentry->t_counts.t_numscans);
+		pfree(tabentry);
+	}
 
 	PG_RETURN_INT64(result);
 }


I don't think we need to bother with individual pfrees in this path. The
caller will call the function in a dedicated memory context, that'll be reset
very soon after this.


Oh right, the palloc is done in the ExprContext memory context, which is
reset soon after.  Removing the pfrees in V2 attached.
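
Assembled from the hunks above, the V2 version of find_tabstat_entry() should
end up looking roughly like this (a sketch pieced together from the posted
fragments, not the attached patch verbatim):

PgStat_TableStatus *
find_tabstat_entry(Oid rel_id, bool add_subtrans)
{
	PgStat_EntryRef *entry_ref;
	PgStat_TableStatus *tablestatus;
	PgStat_TableStatus *tabentry;
	PgStat_TableXactStatus *trans;

	entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, MyDatabaseId, rel_id);
	if (!entry_ref)
		entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id);
	if (!entry_ref)
		return NULL;

	tabentry = (PgStat_TableStatus *) entry_ref->pending;
	tablestatus = palloc(sizeof(PgStat_TableStatus));
	*tablestatus = *tabentry;	/* struct assignment, as reviewed */

	if (add_subtrans)
	{
		/* live subtransactions' counts aren't in t_counts yet */
		for (trans = tabentry->trans; trans != NULL; trans = trans->upper)
		{
			tablestatus->t_counts.t_tuples_inserted += trans->tuples_inserted;
			tablestatus->t_counts.t_tuples_updated += trans->tuples_updated;
			tablestatus->t_counts.t_tuples_deleted += trans->tuples_deleted;
		}
	}

	return tablestatus;
}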

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f793ac1516..1d3f068a1c 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -474,17 +474,33 @@ pgstat_fetch_stat_tabentry_ext(bool shared, Oid reloid)
  * If no entry found, return NULL, don't create a new one
  */
 PgStat_TableStatus *
-find_tabstat_entry(Oid rel_id)
+find_tabstat_entry(Oid rel_id, bool add_subtrans)
 {
 	PgStat_EntryRef *entry_ref;
+	PgStat_TableXactStatus *trans;
+

Re: Making Vars outer-join aware

2023-02-12 Thread Richard Guo
On Mon, Feb 13, 2023 at 7:58 AM Justin Pryzby  wrote:

> The patch broke this query:
>
> select from pg_inherits inner join information_schema.element_types
> right join (select from pg_constraint as sample_2) on true
> on false, lateral (select scope_catalog, inhdetachpending from
> pg_publication_namespace limit 3);
> ERROR:  could not devise a query plan for the given query


Thanks for the report!  I've looked at it a little bit and traced it
down to the function have_unsafe_outer_join_ref().  The comment there says

 * In practice, this test never finds a problem ...
 * ...
 * It still seems worth checking
 * as a backstop, but we don't go to a lot of trouble: just reject if the
 * unsatisfied part includes any outer-join relids at all.

This seems not to be correct, as shown by the counterexample.  ISTM that
we need to do the check honestly, as the other comment says:

 * If the parameterization is only partly satisfied by the outer rel,
 * the unsatisfied part can't include any outer-join relids that could
 * null rels of the satisfied part.

The NOT_USED part of the code is doing this check.  But I think we need
a little tweak: we should check the nullable side of the related outer
joins against the satisfied part, rather than against inner_paramrels.
Maybe something like the attached.

However, this test seems to cost some cycles after the change.  So I
wonder if it's worthwhile to perform it, considering that join order
restrictions should be able to guarantee there is no problem here.

BTW, here is a simplified query that can trigger this issue on HEAD.

select * from t1 inner join t2 left join (select null as c from t3 left
join t4 on true) as sub on true on true, lateral (select c, t1.a from t5
offset 0 ) ss;

Thanks
Richard


v1-0001-Fix-for-have_unsafe_outer_join_ref.patch
Description: Binary data


Re: recovery modules

2023-02-12 Thread Michael Paquier
On Thu, Feb 09, 2023 at 02:48:26PM -0800, Nathan Bossart wrote:
> On Thu, Feb 09, 2023 at 12:18:55PM -0800, Andres Freund wrote:
>>>  <programlisting>
>>> -typedef bool (*ArchiveCheckConfiguredCB) (void);
>>> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
>>>  </programlisting>
>>>  
>>>  If true is returned, the server will proceed with
>> 
>> Hm. I wonder if ArchiveCheckConfiguredCB() should actually work without the
>> state. We're not really doing anything yet, at that point, so it shouldn't
>> really need state?
>> 
>> The reason I'm wondering is that I think we should consider calling this from
>> the GUC assignment hook, at least in postmaster. Which would make it more
>> convenient to not have state, I think?
> 
> I agree that this callback should typically not need the state, but I'm not
> sure whether it fits into the assignment hook for archive_library.  This
> callback is primarily meant for situations when you have archiving enabled,
> but your module isn't configured yet (e.g., archive_command is empty).  In
> this case, we keep the WAL around, but we don't try to archive it until
> this hook returns true.  It's up to each module to define that criteria.  I
> can imagine someone introducing a GUC in their archive module that
> temporarily halts archiving via this callback.  In that case, calling it
> via a GUC assignment hook probably won't work.  In fact, checking whether
> archive_command is empty in that hook might not work either.
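
To illustrate the kind of module-side control described here, a check
callback gating archiving on a module GUC might look like this (a hedged
sketch; the GUC and names are hypothetical, only the callback signature
comes from the patch):

/* Wired to a custom GUC such as my_module.pause_archiving (hypothetical). */
static bool archiving_paused = false;

static bool
my_check_configured(ArchiveModuleState *state)
{
	/*
	 * Returning false makes the server keep WAL around instead of
	 * archiving it, until this callback returns true again.
	 */
	return !archiving_paused;
}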

Keeping the state in the configure check callback does not strike me
as a problem, FWIW.

>>>  <programlisting>
>>> -typedef void (*ArchiveShutdownCB) (void);
>>> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);
>>>  </programlisting>
>>> 
>> 
>> Perhaps mention that this needs to free state it allocated in the
>> ArchiveModuleState, or it'll be leaked?
> 
> done
> 
> I left this out originally because the archiver exits shortly after calling
> this.  However, if you have DSM segments or something, it's probably wise
> to make sure those are cleaned up.  And I suppose we might not always exit
> immediately after this callback, so establishing the habit of freeing the
> state could be a good idea.  In addition to updating this part of the docs,
> I wrote a shutdown callback for basic_archive that frees its state.

This makes sense to me.  Still, DSM segments had better be cleaned up
with dsm_backend_shutdown().

+	basic_archive_context = data->context;
+	if (CurrentMemoryContext == basic_archive_context)
+		MemoryContextSwitchTo(TopMemoryContext);
+
+	if (MemoryContextIsValid(basic_archive_context))
+		MemoryContextDelete(basic_archive_context);

This is a bit confusing, because it means that we enter the
shutdown callback with one context, but exit it under
TopMemoryContext.  Are you sure that this will be OK when there could
be multiple callbacks piled up with before_shmem_exit()?  shmem_exit()
has nothing specific to memory contexts.

Is putting the new headers in src/include/postmaster/ the best move in
the long term?  Perhaps that's fine, but I was wondering whether a new
location like archive/ would make more sense.  pg_arch.h being in the
postmaster section is fine.
--
Michael




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-02-12 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 08:09:50 +0100, "Drouvot, Bertrand" 
 wrote in 
>> I think this is useful beyond being able to generate those functions
>> with
>> macros. The fact that we had to deal with transactional code in
>> pgstatfuncs.c
>> meant that a lot of the relevant internals had to be exposed "outside"
>> pgstat,
>> which seems architecturally not great.
>> 
> Right, good point.

Agreed.

> Removing the pfrees in V2 attached.

Ah, that sounds good.

 	if (!entry_ref)
+	{
 		entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id);
+		return tablestatus;
+	}

Shouldn't we return something if that call comes back with a non-NULL
value?
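
In other words, the result of the second lookup needs to be checked before
bailing out; something like (a sketch):

	if (!entry_ref)
	{
		entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id);
		if (!entry_ref)
			return NULL;	/* give up only if the shared lookup also fails */
	}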

So, since we want to hide the internals from pgstatfuncs, the
additional flag should be gone.  If the additional cost doesn't bother
anyone, I don't mind removing the flag.  The patch will get far
simpler that way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-12 Thread Masahiko Sawada
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
 wrote:
>
> I didn't get any closer to radix-tree regression,

Me neither. It seems that in v26, inserting chunks into node-32 is
slow, but that needs more analysis. I'll share if I find something
interesting.

> but I did find some inefficiencies in tidstore_add_tids() that are worth 
> talking about first, addressed in a rough fashion in the attached .txt 
> addendums that I can clean up and incorporate later.
>
> To start, I can reproduce the regression with this test as well:
>
> select * from bench_tidstore_load(0, 10 * 1000 * 1000);
>
> v15 + v26 store + adjustments:
>  mem_allocated | load_ms
> ---+-
>   98202152 |1676
>
> v26 0001-0008
>  mem_allocated | load_ms
> ---+-
>   98202032 |1826
>
> ...and reverting to the alternate way to update the parent didn't help:
>
> v26 0001-6, 0008, insert_inner w/ null parent
>
>  mem_allocated | load_ms
> ---+-
>   98202032 |1825
>
> ...and I'm kind of glad that wasn't the problem, because going back to that 
> would be a pain for the shmem case.
>
> Running perf doesn't show anything much different in the proportions (note 
> that rt_set must have been inlined when declared locally in v26):
>
> v15 + v26 store + adjustments:
>   65.88%  postgres  postgres [.] tidstore_add_tids
>   10.74%  postgres  postgres [.] rt_set
>9.20%  postgres  postgres [.] palloc0
>6.49%  postgres  postgres [.] rt_node_insert_leaf
>
> v26 0001-0008
>   78.50%  postgres  postgres [.] tidstore_add_tids
>8.88%  postgres  postgres [.] palloc0
>6.24%  postgres  postgres [.] local_rt_node_insert_leaf
>
> v2699-0001: The first thing I noticed is that palloc0 is taking way more time 
> than it should, and it's because the compiler doesn't know the values[] array 
> is small. One reason we need to zero the array is to make the algorithm 
> agnostic about what order the offsets come in, as I requested in a previous 
> review. Thinking some more, I was way too paranoid about that. As long as 
> access methods scan the line pointer array in the usual way, maybe we can 
> just assert that the keys we create are in order, and zero any unused array 
> entries as we find them. (I admit I can't actually think of a reason we would 
> ever encounter offsets out of order.)

I can think of something like traversing a HOT chain visiting
offsets out of order. But fortunately, in the heap case, we prune such
collected TIDs before heap vacuum.
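
For reference, the "assert ordered, zero as we go" idea could look roughly
like this (the array bound and layout here are assumptions, not the posted
patch):

/* values[] no longer needs palloc0: words are zeroed on first touch */
uint64		values[MAX_WORDS_PER_PAGE];	/* assumed constant */
int			last_idx = -1;

for (int i = 0; i < num_offsets; i++)
{
	int			idx = offsets[i] / 64;	/* which bitmap word */
	int			bit = offsets[i] % 64;	/* which bit in that word */

	/* rely on access methods scanning the line pointer array in order */
	Assert(i == 0 || offsets[i] > offsets[i - 1]);

	/* zero any words not touched yet, including ones we skipped over */
	for (int j = last_idx + 1; j <= idx; j++)
		values[j] = 0;

	values[idx] |= UINT64CONST(1) << bit;
	last_idx = idx;
}
/* only words 0 .. last_idx need to be considered for insertion */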

> Also, we can keep track of the last key we need to consider for insertion 
> into the radix tree, and ignore the rest. That might shave a few cycles 
> during the exclusive lock when the max offset of an LP_DEAD item < 64 on a 
> given page, which I think would be common in the wild. I also got rid of the 
> special case for non-encoding, since shifting by zero should work the same 
> way. These together led to a nice speedup on the v26 branch:
>
>  mem_allocated | load_ms
> ---+-
>   98202032 |1386
>
> v2699-0002: The next thing I noticed is forming a full ItemIdPointer to pass 
> to tid_to_key_off(). That's bad for tidstore_add_tids() because 
> ItemPointerSetBlockNumber() must do this in order to allow the struct to be 
> SHORTALIGN'd:
>
> static inline void
> BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
> {
> blockId->bi_hi = blockNumber >> 16;
> blockId->bi_lo = blockNumber & 0x;
> }
>
> Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse 
> the above process:
>
> static inline BlockNumber
> BlockIdGetBlockNumber(const BlockIdData *blockId)
> {
> return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) 
> blockId->bi_lo);
> }
>
> There is no reason to do any of this if we're not reading/writing directly 
> to/from an on-disk tid etc. To avoid this, I created a new function 
> encode_key_off() [name could be better], which deals with the raw block 
> number that we already have. Then turn tid_to_key_off() into a wrapper around 
> that, since we still need the full conversion for tidstore_lookup_tid().
>
> v2699-0003: Get rid of all the remaining special cases for encoding/or not. I 
> am unaware of the need to optimize that case or treat it in any way 
> differently. I haven't tested this on an installation with non-default 
> blocksize and didn't measure this separately, but 0002+0003 gives:
>
>  mem_allocated | load_ms
> ---+-
>   98202032 |1259
>
> If these are acceptable, I can incorporate them into a later patchset.

These are nice improvements! I agree with all changes.
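
For readers following along, the refactoring described in 0002 has roughly
this shape (the encoding constants below are assumptions, not the posted
code):

/* assumed: low 6 bits select a bit in a 64-bit bitmap word */
#define TIDSTORE_VALUE_NBITS	6
/* assumed: enough bits to hold any offset number on a page */
#define TIDSTORE_OFFSET_NBITS	11

/* Build a key straight from a raw block/offset pair, with no round
 * trip through BlockIdData. */
static inline uint64
encode_key_off(BlockNumber block, OffsetNumber offset, uint32 *off)
{
	uint64		tid_i = ((uint64) block << TIDSTORE_OFFSET_NBITS) | offset;

	*off = (uint32) (tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1));
	return tid_i >> TIDSTORE_VALUE_NBITS;
}

/* Full conversion, still needed for tidstore_lookup_tid(). */
static inline uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
{
	return encode_key_off(ItemPointerGetBlockNumber(tid),
						  ItemPointerGetOffsetNumber(tid), off);
}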

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com