Re: Sort optimizations: Making in-memory sort cache-aware
On 12/02/23 01:59, Andres Freund wrote:
> However, tuplesort.c completely breaks that for > 1 column sorts. While
> spatial locality for accesses to the ->memtuples array is decent during
> sorting, due to qsort's subdividing of the problem, the locality for access
> to the tuples is *awful*. The memtuples array is reordered while sorting, but
> the tuples themselves aren't. Unless the input data is vaguely presorted, the
> access pattern for the tuples has practically zero locality.

This probably explains the issue.

> One idea is to keep track of the distinctness of the first column sorted and
> to behave differently if it's significantly lower than the number of to be
> sorted tuples. E.g. by doing a first sort solely on the first column, then
> reorder the MinimalTuples in memory, and then continue normally.
>
> There's two main problems with that idea:
> 1) It's hard to re-order the tuples in memory, without needing substantial
>    amounts of additional memory
> 2) If the second column also is not very distinct, it doesn't buy you much,
>    if anything.
>
> But it might provide sufficient benefits regardless. And a naive version,
> requiring additional memory, should be quick to hack up.

I get the second point (to reorder MinimalTuples by sorting on the first column), but I am not sure how we can keep `track of the distinctness of the first column`?

> Most sorting papers don't discuss variable-width data, nor a substantial
> amount of cache-polluting work while gathering the data that needs to be
> sorted.

I definitely agree with this.

> My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
> just those using the existing code, allocate new memory for all the
> corresponding MinimalTuples in one allocation, and copy the MinimalTuples
> into that, obviously in ->memtuples order.

This should be an easy hack, and we can easily profile the benefits from it.

> It's very likely we can do better than just doing a plain sort of everything
> after that. You effectively end up with a bounded number of pre-sorted
> blocks, so the most obvious thing to try is to build a heap of those blocks
> and effectively do a heapsort between the presorted blocks.

This is very interesting. It is actually what a few papers have suggested: do the sort in blocks and then merge (while sorting) the presorted blocks. I am a bit fuzzy on the implementation of this (if it is the same as what I am thinking), but this is definitely what I was looking for.

> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead. The switch to GenerationContext helped some, but still
> leaves a bunch of overhead. And it's not used for bounded sorts right now.
>
> We don't palloc/pfree individual tuples during a normal sort, but we do have
> some, for bounded sorts. I think with a reasonable amount of work we could
> avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
> allocator, getting rid of all allocator overhead.
>
> The biggest source of individual pfree()s in the bounded case is that we
> unconditionally copy the tuple into base->tuplecontext during puttuple, even
> though quite likely we'll immediately free it in the "case TSS_BOUNDED"
> block. We could instead pre-check that the tuple won't immediately be
> discarded, before copying it into tuplecontext. Only in the TSS_BOUNDED
> case, of course.

This looks doable; I will try this.

> I think we also can replace the individual freeing of tuples in
> tuplesort_heap_replace_top(), by allowing a number of dead tuples to
> accumulate (up to work_mem/2 maybe), and then copying the still-living
> tuples into a new memory context, freeing the old one. While that doesn't
> sound cheap, in bounded sorts the number of tuples commonly is quite
> limited, the pre-check before copying the tuple will prevent this from
> occurring too often, the copy will result in higher locality, and, most
> importantly, the reduced palloc() overhead (~25% or so with aset.c) will
> result in a considerably higher cache hit ratio / lower memory usage.

I would try this as well.

> There are things we could do to improve upon this that don't require
> swapping out our sorting implementation wholesale.

Thanks a lot Andres, these are lots of pointers to work on (without a major overhaul of the sort implementation and with a potentially good amount of improvement). I will give these a try and see if I can get some performance gains.

Thanks,
Ankit
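To make the block-heapsort idea concrete, here is a minimal sketch (purely illustrative: invented names, plain int elements instead of tuplesort's SortTuples). Each ~L3-sized block is first sorted with the existing qsort path; a small binary heap of block cursors then merges the presorted blocks into the final order.

/*
 * Illustrative sketch only (not tuplesort.c code): k-way merge of
 * presorted blocks via a binary min-heap of per-block cursors.
 */
#include <stddef.h>

typedef struct BlockCursor
{
	const int  *elems;			/* one presorted block */
	size_t		pos;			/* next unread element */
	size_t		nelems;
} BlockCursor;

static void
sift_down(BlockCursor *heap, size_t nblocks, size_t i)
{
	for (;;)
	{
		size_t		l = 2 * i + 1,
					r = 2 * i + 2,
					min = i;

		if (l < nblocks && heap[l].elems[heap[l].pos] < heap[min].elems[heap[min].pos])
			min = l;
		if (r < nblocks && heap[r].elems[heap[r].pos] < heap[min].elems[heap[min].pos])
			min = r;
		if (min == i)
			return;
		BlockCursor tmp = heap[i];

		heap[i] = heap[min];
		heap[min] = tmp;
		i = min;
	}
}

/*
 * Merge presorted blocks into 'out'. heap[] must already satisfy the heap
 * property (e.g. built by calling sift_down for i from nblocks/2 down to 0).
 */
static void
merge_blocks(BlockCursor *heap, size_t nblocks, int *out)
{
	size_t		n = 0;

	while (nblocks > 0)
	{
		BlockCursor *top = &heap[0];

		out[n++] = top->elems[top->pos++];
		if (top->pos == top->nelems)
			heap[0] = heap[--nblocks];	/* block exhausted: drop it */
		sift_down(heap, nblocks, 0);
	}
}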
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
On Sun, 12 Feb 2023 at 19:39, Tom Lane wrote:
> I find this patch horribly dangerous.

I see LogicalRepApplyLoop() does something similar with a StringInfoData. Maybe it's just scarier having an external function in stringinfo.c which does this, as having it increases the chances of someone using it for the wrong thing.

> It could maybe be okay if we added the capability for StringInfoData
> to understand (and enforce) that its "data" buffer is read-only.
> However, that'd add overhead to every existing use-case.

I'm not very excited by that. I considered just setting maxlen = -1 in the new function and adding Asserts to check for that in each of the appendStringInfo* functions. However, since the performance gains are not so great, I'll probably just drop the whole thing given there's resistance.

David
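For reference, the maxlen = -1 idea would look roughly like this (a hypothetical sketch; the function name and Assert placement are illustrative, not an actual committed API):

/*
 * Hypothetical sketch: wrap an existing buffer in a StringInfoData
 * without copying, and mark it read-only by setting maxlen to -1 so
 * the append functions can Assert against it.
 */
#include "lib/stringinfo.h"

static inline void
initReadOnlyStringInfo(StringInfo str, char *data, int len)
{
	str->data = data;
	str->len = len;
	str->maxlen = -1;			/* sentinel: buffer is not owned/growable */
	str->cursor = 0;
}

/* Each appendStringInfo* function would then start with: */
/*     Assert(str->maxlen != -1); */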
Re: run pgindent on a regular basis / scripted manner
On 2023-02-10 Fr 10:21, Andrew Dunstan wrote:
> On 2023-02-10 Fr 04:25, Jelte Fennema wrote:
>> Ah yes, I had seen that when I read the initial --commit patch but then
>> forgot about it when the flag didn't work at all when I tried it.
>> Attached is a patch that fixes the issue. And also implements the --dirty
>> and --staged flags in pgindent that Robert Haas requested.
>
> I don't think just adding a diff filter is really a sufficient fix. The
> file might have been deleted since the commit(s) in question. Here's a
> more general fix for missing files.

OK, I've pushed this along with a check to make sure we only process each file once.

I'm not sure how much more I really want to do here. Given the way pgindent now processes command line arguments, maybe the best thing is for people to use that. Use of git aliases can help. Something like these for example:

[alias]
	dirty = diff --name-only --diff-filter=ACMU -- .
	staged = diff --name-only --cached --diff-filter=ACMU -- .
	dstaged = diff --name-only --diff-filter=ACMU HEAD -- .

and then you could do

	pgindent `git dirty`

The only danger would be if there were no dirty files. Maybe we need a switch to inhibit using the current directory if there are no command line files.

Thoughts?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Feb 12, 2023 at 4:14 AM Andres Freund wrote:
> > +ssize_t
> > +pg_pwrite_zeros(int fd, size_t size)
> > +{
> > +	PGAlignedBlock zbuffer;
> > +	size_t		zbuffer_sz;
> > +	struct iovec iov[PG_IOV_MAX];
> > +	int			blocks;
> > +	size_t		remaining_size = 0;
> > +	int			i;
> > +	ssize_t		written;
> > +	ssize_t		total_written = 0;
> > +
> > +	zbuffer_sz = sizeof(zbuffer.data);
> > +
> > +	/* Zero-fill the buffer. */
> > +	memset(zbuffer.data, 0, zbuffer_sz);
>
> I previously commented on this - why are we memseting a buffer on every call
> to this? That's not at all free.
>
> Something like
>   static const PGAlignedBlock zerobuf = {0};
> would do the trick. You do need to cast the const away, to assign to
> iov_base, but that's not too ugly.

Thanks for looking at it. We know that we don't change the zbuffer in the function, so can we avoid static const and have just a static variable, like the attached v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do you see any problem with it?

FWIW, it comes out like the attached v1-0001-Use-static-const-variable-to-avoid-memset-calls-i.patch with static const.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 2527c6083bbdf006900be1a1dbf15ed815184d2b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Sun, 12 Feb 2023 10:58:33 +
Subject: [PATCH v1] Use static variable to avoid memset() calls in
 pg_pwrite_zeros()

---
 src/common/file_utils.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..0d59488f60 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	/* Zero-filled buffer worth BLCKSZ. */
+	static PGAlignedBlock zbuffer = {0};
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
@@ -548,10 +549,7 @@ pg_pwrite_zeros(int fd, size_t size)
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
+	zbuffer_sz = BLCKSZ;
 
 	/* Prepare to write out a lot of copies of our zero buffer at once. */
 	for (i = 0; i < lengthof(iov); ++i)
-- 
2.34.1

From adc5ea83fb6726a5d8a1f3b87d6b3b03550c7fcb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Sun, 12 Feb 2023 10:55:09 +
Subject: [PATCH v1] Use static const variable to avoid memset calls in
 pg_pwrite_zeros

---
 src/common/file_utils.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..cd15cc2fdb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	/* Zero-filled buffer worth BLCKSZ. */
+	static const PGAlignedBlock zbuffer = {0};
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
@@ -548,15 +549,12 @@ pg_pwrite_zeros(int fd, size_t size)
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
+	zbuffer_sz = BLCKSZ;
 
 	/* Prepare to write out a lot of copies of our zero buffer at once. */
 	for (i = 0; i < lengthof(iov); ++i)
 	{
-		iov[i].iov_base = zbuffer.data;
+		iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
 		iov[i].iov_len = zbuffer_sz;
 	}
 
-- 
2.34.1
Re: Transparent column encryption
> On Feb 11, 2023, at 1:54 PM, Mark Dilger wrote:
>
> Here are some observations I should mention.

src/sgml/html/libpq-exec.html needs clarification:

> paramFormats[] Specifies whether parameters are text (put a zero in the array
> entry for the corresponding parameter) or binary (put a one in the array
> entry for the corresponding parameter). If the array pointer is null then all
> parameters are presumed to be text strings.

Perhaps you should edit this last sentence to say that all parameters are presumed to be text strings without forced encryption.

> Values passed in binary format require knowledge of the internal
> representation expected by the backend. For example, integers must be passed
> in network byte order. Passing numeric values requires knowledge of the
> server storage format, as implemented in
> src/backend/utils/adt/numeric.c::numeric_send() and
> src/backend/utils/adt/numeric.c::numeric_recv().

> When column encryption is enabled, the second-least-significant byte of this
> parameter specifies whether encryption should be forced for a parameter.

The value 0x10 has a one as its second-least-significant *nibble*, but that's a really strange way to describe the high-order nibble, and perhaps not what you mean. Could you clarify?

> Set this byte to one to force encryption.

I think setting the byte to one (0x01) means "binary unencrypted", not "force encryption". Don't you mean to set this byte to 16?

> For example, use the C code literal 0x10 to specify text format with forced
> encryption. If the array pointer is null then encryption is not forced for
> any parameter.

> If encryption is forced for a parameter but the parameter does not correspond
> to an encrypted column on the server, then the call will fail and the
> parameter will not be sent. This can be used for additional security against
> a compromised server. (The drawback is that application code then needs to be
> kept up to date with knowledge about which columns are encrypted rather than
> letting the server specify this.)

I think you should say something about how specifying 0x11 will behave -- in other words, asking for encrypted binary data. I believe that this will draw a "format must be text for encrypted parameter" error, and that the docs should clearly say so.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
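For readers following the thread, here is a sketch of how a caller would use the proposed flag from libpq (the 0x10 force-encryption semantics are the patch's proposal, not released libpq behavior; the table and column names are invented):

/*
 * Sketch of the proposed paramFormats semantics under the column
 * encryption patch. The low byte keeps its existing meaning
 * (0 = text, 1 = binary); under the proposal, 0x10 would mean
 * "text format, force encryption".
 */
#include <stdio.h>
#include <libpq-fe.h>

void
insert_encrypted(PGconn *conn)
{
	const char *values[1] = {"secret payload"};
	int			formats[1] = {0x10};	/* proposed: text + forced encryption */

	/* "t" and column "secret" are invented names for the example. */
	PGresult   *res = PQexecParams(conn,
								   "INSERT INTO t (secret) VALUES ($1)",
								   1,		/* one parameter */
								   NULL,	/* let the server infer types */
								   values,
								   NULL,	/* lengths: not needed for text */
								   formats,
								   0);		/* text results */

	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "%s", PQerrorMessage(conn));
	PQclear(res);
}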
Re: run pgindent on a regular basis / scripted manner
Andrew Dunstan writes: > ... then you could do > pgindent `git dirty` > The only danger would be if there were no dirty files. Maybe we need a > switch to inhibit using the current directory if there are no command > line files. It seems like "indent the whole tree" is about to become a minority use-case. Maybe instead of continuing to privilege that case, we should say that it's invoked by some new switch like --all-files, and without that only the stuff identified by command-line arguments gets processed. regards, tom lane
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi,

On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> Thanks for looking at it. We know that we don't change the zbuffer in
> the function, so can we avoid static const and have just a static
> variable, like the attached
> v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> you see any problem with it?

Making it static const is better, because it allows the memory for the variable to be put in a readonly section.

> 	/* Prepare to write out a lot of copies of our zero buffer at once. */
> 	for (i = 0; i < lengthof(iov); ++i)
> 	{
> -		iov[i].iov_base = zbuffer.data;
> +		iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
> 		iov[i].iov_len = zbuffer_sz;
> 	}

Another thing: I think we should either avoid iterating over all the IOVs if we don't need them, or, even better, initialize the array as a constant, once.

Greetings,

Andres Freund
Re: pgindent vs variable declaration across multiple lines
Now that pg_bsd_indent is in our tree, we can format this as a patch against Postgres sources. I'll stick it in the March CF so we don't forget about it.

			regards, tom lane

diff --git a/src/tools/pg_bsd_indent/args.c b/src/tools/pg_bsd_indent/args.c
index d08b086a88..38eaa5a5bf 100644
--- a/src/tools/pg_bsd_indent/args.c
+++ b/src/tools/pg_bsd_indent/args.c
@@ -51,7 +51,7 @@ static char sccsid[] = "@(#)args.c	8.1 (Berkeley) 6/6/93";
 #include "indent_globs.h"
 #include "indent.h"
 
-#define INDENT_VERSION	"2.1.1"
+#define INDENT_VERSION	"2.1.2"
 
 /* profile types */
 #define	PRO_SPECIAL	1	/* special case */
diff --git a/src/tools/pg_bsd_indent/io.c b/src/tools/pg_bsd_indent/io.c
index 4149424294..9d64ca1ee5 100644
--- a/src/tools/pg_bsd_indent/io.c
+++ b/src/tools/pg_bsd_indent/io.c
@@ -201,11 +201,12 @@ dump_line(void)
     ps.decl_on_line = ps.in_decl;	/* if we are in the middle of a
 					 * declaration, remember that fact for
 					 * proper comment indentation */
-    ps.ind_stmt = ps.in_stmt & ~ps.in_decl;	/* next line should be
-						 * indented if we have not
-						 * completed this stmt and if
-						 * we are not in the middle of
-						 * a declaration */
+    /* next line should be indented if we have not completed this stmt, and
+     * either we are not in a declaration or we are in an initialization
+     * assignment; but not if we're within braces in an initialization,
+     * because that scenario is handled by other rules. */
+    ps.ind_stmt = ps.in_stmt &&
+	(!ps.in_decl || (ps.block_init && ps.block_init_level <= 0));
     ps.use_ff = false;
     ps.dumped_decl_indent = 0;
     *(e_lab = s_lab) = '\0';	/* reset buffers */
diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent
index a793971e07..3398d62133 100755
--- a/src/tools/pgindent/pgindent
+++ b/src/tools/pgindent/pgindent
@@ -13,7 +13,7 @@ use IO::Handle;
 use Getopt::Long;
 
 # Update for pg_bsd_indent version
-my $INDENT_VERSION = "2.1.1";
+my $INDENT_VERSION = "2.1.2";
 
 # Our standard indent settings
 my $indent_opts =
Re: run pgindent on a regular basis / scripted manner
On 2023-02-12 Su 11:24, Tom Lane wrote:
> Andrew Dunstan writes:
>> ... then you could do
>> pgindent `git dirty`
>> The only danger would be if there were no dirty files. Maybe we need a
>> switch to inhibit using the current directory if there are no command
>> line files.
>
> It seems like "indent the whole tree" is about to become a minority
> use-case. Maybe instead of continuing to privilege that case, we
> should say that it's invoked by some new switch like --all-files,
> and without that only the stuff identified by command-line arguments
> gets processed.

I don't think we need --all-files. The attached gets rid of the build and code-base cruft, which is now in any case obsolete given we've put pg_bsd_indent in our code base. So the way to spell this instead of "pgindent --all-files" would be "pgindent ."

I added a warning if there are no files at all specified.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent
index a793971e07..b098cef02a 100755
--- a/src/tools/pgindent/pgindent
+++ b/src/tools/pgindent/pgindent
@@ -21,7 +21,7 @@ my $indent_opts =
 
 my $devnull = File::Spec->devnull;
 
-my ($typedefs_file, $typedef_str, $code_base,
+my ($typedefs_file, $typedef_str,
 	@excludes, $indent, $build, $show_diff,
 	$silent_diff, $help, @commits,);
 
@@ -33,10 +33,8 @@ my %options = (
 	"commit=s" => \@commits,
 	"typedefs=s" => \$typedefs_file,
 	"list-of-typedefs=s" => \$typedef_str,
-	"code-base=s" => \$code_base,
 	"excludes=s" => \@excludes,
 	"indent=s" => \$indent,
-	"build" => \$build,
 	"show-diff" => \$show_diff,
 	"silent-diff" => \$silent_diff,);
 GetOptions(%options) || usage("bad command line argument");
@@ -46,22 +44,16 @@ usage() if $help;
 usage("Cannot have both --silent-diff and --show-diff")
   if $silent_diff && $show_diff;
 
-usage("Cannot use --commit with --code-base or command line file list")
-  if (@commits && ($code_base || @ARGV));
+usage("Cannot use --commit with command line file list")
+  if (@commits && @ARGV);
 
-run_build($code_base) if ($build);
-
-# command line option wins, then environment (which is how --build sets it),
-# then locations based on current dir, then default location
+# command line option wins, then environment, then locations based on current
+# dir, then default location
 $typedefs_file ||= $ENV{PGTYPEDEFS};
 
-# build mode sets PGINDENT
+# get indent location for environment or default
 $indent ||= $ENV{PGINDENT} || $ENV{INDENT} || "pg_bsd_indent";
 
-# if no non-option arguments or commits are given, default to looking in the
-# current directory
-$code_base ||= '.' unless (@ARGV || @commits);
-
 my $sourcedir = locate_sourcedir();
 
 # if it's the base of a postgres tree, we will exclude the files
@@ -121,8 +113,7 @@ sub check_indent
 sub locate_sourcedir
 {
 	# try fairly hard to locate the sourcedir
-	my $where = $code_base || '.';
-	my $sub = "$where/src/tools/pgindent";
+	my $sub = "./src/tools/pgindent";
 	return $sub if -d $sub;
 	# try to find it from an ancestor directory
 	$sub = "../src/tools/pgindent";
@@ -320,72 +311,6 @@ sub show_diff
 	return $diff;
 }
 
-sub run_build
-{
-	eval "use LWP::Simple;";    ## no critic (ProhibitStringyEval);
-
-	my $code_base = shift || '.';
-	my $save_dir = getcwd();
-
-	# look for the code root
-	foreach (1 .. 5)
-	{
-		last if -d "$code_base/src/tools/pgindent";
-		$code_base = "$code_base/..";
-	}
-
-	die "cannot locate src/tools/pgindent directory in \"$code_base\"\n"
-	  unless -d "$code_base/src/tools/pgindent";
-
-	chdir "$code_base/src/tools/pgindent";
-
-	my $typedefs_list_url =
-	  "https://buildfarm.postgresql.org/cgi-bin/typedefs.pl";
-
-	my $rv = getstore($typedefs_list_url, "tmp_typedefs.list");
-
-	die "cannot fetch typedefs list from $typedefs_list_url\n"
-	  unless is_success($rv);
-
-	$ENV{PGTYPEDEFS} = abs_path('tmp_typedefs.list');
-
-	my $indentrepo = "https://git.postgresql.org/git/pg_bsd_indent.git";
-	system("git clone $indentrepo >$devnull 2>&1");
-	die "could not fetch pg_bsd_indent sources from $indentrepo\n"
-	  unless $? == 0;
-
-	chdir "pg_bsd_indent" || die;
-	system("make all check >$devnull");
-	die "could not build pg_bsd_indent from source\n"
-	  unless $? == 0;
-
-	$ENV{PGINDENT} = abs_path('pg_bsd_indent');
-
-	chdir $save_dir;
-	return;
-}
-
-sub build_clean
-{
-	my $code_base = shift || '.';
-
-	# look for the code root
-	foreach (1 .. 5)
-	{
-		last if -d "$code_base/src/tools/pgindent";
-		$code_base = "$code_base/..";
-	}
-
-	die "cannot locate src/tools/pgindent directory in \"$code_base\"\n"
-	  unless -d "$code_base/src/tools/pgindent";
-
-	chdir "$code_base";
-
-	system("rm -rf src/tools/pgindent/pg_bsd_indent");
-	system("rm -f src/tools/pgindent/tmp_typedefs.list");
-	return;
-}
-
 sub usage
 {
 	my $message = shift;
@@ -397,10 +322,8 @@
 Options:
 	--commit
Re: run pgindent on a regular basis / scripted manner
On Sun, Feb 12, 2023 at 11:24:14AM -0500, Tom Lane wrote: > Andrew Dunstan writes: > > ... then you could do > > pgindent `git dirty` > > The only danger would be if there were no dirty files. Maybe we need a > > switch to inhibit using the current directory if there are no command > > line files. > > It seems like "indent the whole tree" is about to become a minority > use-case. Maybe instead of continuing to privilege that case, we > should say that it's invoked by some new switch like --all-files, > and without that only the stuff identified by command-line arguments > gets processed. It seems like if pgindent knows about git, it ought to process only tracked files. Then, it wouldn't need to manually exclude generated files, and it wouldn't process vpath builds and who-knows-what else it finds in CWD. At least --commit doesn't seem to work when run outside of the root source dir. -- Justin
Re: run pgindent on a regular basis / scripted manner
Andrew Dunstan writes: > On 2023-02-12 Su 11:24, Tom Lane wrote: >> It seems like "indent the whole tree" is about to become a minority >> use-case. Maybe instead of continuing to privilege that case, we >> should say that it's invoked by some new switch like --all-files, >> and without that only the stuff identified by command-line arguments >> gets processed. > I don't think we need --all-files. The attached gets rid of the build > and code-base cruft, which is now in any case obsolete given we've put > pg_bsd_indent in our code base. So the way to spell this instead of > "pgindent --all-files" would be "pgindent ." Ah, of course. > I added a warning if there are no files at all specified. LGTM. regards, tom lane
Re: ExecRTCheckPerms() and many prunable partitions (sqlsmith)
On Tue, Nov 29, 2022 at 10:37:56PM +0900, Amit Langote wrote:
> 0002 contains changes that has to do with changing how we access
> checkAsUser in some foreign table planning/execution code sites.
> Thought it might be better to describe it separately too.

This was committed as 599b33b94:

	Stop accessing checkAsUser via RTE in some cases

That seems to add various elog()s which are hit frequently by sqlsmith:

postgres=# select from (select transaction from pg_prepared_xacts right join pg_available_extensions on false limit 0) where false;
ERROR:  permission info at index 2 (with relid=1262) does not match provided RTE (with relid=12081)

postgres=# select from (select confl_tablespace from pg_stat_database_conflicts where datname <> (select 'af') limit 1) where false;
ERROR:  invalid perminfoindex 1 in RTE with relid 12271
Re: Making Vars outer-join aware
On Mon, Jan 23, 2023 at 03:38:06PM -0500, Tom Lane wrote:
> Richard, are you planning to review this any more? I'm getting
> a little antsy to get it committed. For such a large patch,
> it's surprising it's had so few conflicts to date.

The patch broke this query:

select from pg_inherits inner join information_schema.element_types right join (select from pg_constraint as sample_2) on true on false, lateral (select scope_catalog, inhdetachpending from pg_publication_namespace limit 3);
ERROR:  could not devise a query plan for the given query
Re: Generating code for query jumbling through gen_node_support.pl
On Fri, Feb 10, 2023 at 04:40:08PM -0500, Tom Lane wrote:
> v2 looks good to me as far as it goes.

Thanks. I have applied that after an extra lookup.

> I agree these other questions deserve a separate look.

Okay, I may be able to come back to that. Another point is that we need to do a better job in forcing the execution of the query jumbling in one of the TAP tests running pg_regress, outside pg_stat_statements, to maximize coverage. Will see to that on a separate thread.
--
Michael
Re: [PATCH] Add pretty-printed XML output option
Something is misbehaving. Using the latest HEAD, and before applying the v6 patch, 'make check' is working OK. But after applying the v6 patch, some XML regression tests are failing because the DETAIL part of the message indicating the wrong syntax position is not getting displayed. Not only for your new tests -- but for other XML tests too.

My ./configure looks like this:

./configure --prefix=/usr/local/pg_oss --with-libxml --enable-debug --enable-cassert --enable-tap-tests CFLAGS="-ggdb -O0 -g3 -fno-omit-frame-pointer"

resulting in:

checking whether to build with XML support... yes
checking for libxml-2.0 >= 2.6.23... yes

~

e.g. (these are a sample of errors)

     xml                          ... FAILED     2561 ms

@@ -344,8 +326,6 @@
 &idontexist;
 ^
 line 1: Opening and ending tag mismatch: twoerrors line 1 and unbalanced
-&idontexist;
-^
 SELECT xmlparse(document '');
  xmlparse
 -
@@ -1696,14 +1676,8 @@
 SELECT xmlformat('');
 ERROR:  invalid XML document
 DETAIL:  line 1: switching encoding : no input
-
-^
 line 1: Document is empty
-
-^
 -- XML format: invalid string (whitespaces)
 SELECT xmlformat(' ');
 ERROR:  invalid XML document
 DETAIL:  line 1: Start tag expected, '<' not found
-
- ^

~~

Separately (but maybe it's related?), the CF-bot test also reported a failure [1] with strange error detail differences.

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/xml.out /tmp/cirrus-ci-build/build/testrun/regress/regress/results/xml.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/xml.out	2023-02-12 09:02:57.077569000 +
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/xml.out	2023-02-12 09:05:45.14810 +
@@ -1695,10 +1695,7 @@
 -- XML format: empty string
 SELECT xmlformat('');
 ERROR:  invalid XML document
-DETAIL:  line 1: switching encoding : no input
-
-^
-line 1: Document is empty
+DETAIL:  line 1: Document is empty
 ^
 -- XML format: invalid string (whitespaces)

--
[1] https://api.cirrus-ci.com/v1/artifact/task/4802219812323328/testrun/build/testrun/regress/regress/regression.diffs

Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila wrote in
> On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila wrote:
> >
> > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi wrote:
> > > We have suffered this kind of feedback silence many times. Thus I
> > > don't want to rely on luck here. I had in mind exposing last_send
> > > itself or providing an interval-calculation function to the logic.
> >
> > I think we have last_send time in send_feedback(), so we can expose it
> > if we want but how would that solve the problem you are worried about?

The walreceiver can avoid a too-long sleep by knowing when to wake up for the next feedback.

> I have an idea to use last_send time to avoid walsenders being timed
> out. Instead of directly using wal_receiver_status_interval as a
> minimum interval to send the feedback, we should check if it is
> greater than last_send time then we should send the feedback after
> (wal_receiver_status_interval - last_send). I think they can probably
> be different only on the very first time. Any better ideas?

If it means MyLogicalRepWorker->last_send_time, it is not the last time the walreceiver sent a feedback message but the last time the wal*sender* sent data, so I'm not sure that works.

We could use the variable that way, but AFAIU, when so many changes have been spooled that control doesn't return to LogicalRepApplyLoop for longer than wal_receiver_status_interval, maybe_apply_delay() starts calling send_feedback() at every call after the first feedback timing. Even in that case, send_feedback() won't actually send one until the next feedback timing; I don't think that behavior is great. The only packets the walreceiver sends back are the feedback packets, and currently only send_feedback() knows the last feedback time.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
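For illustration, the adjustment being discussed amounts to something like this (invented helper name, not actual walreceiver code): sleep only until the next feedback is due instead of a full wal_receiver_status_interval.

#include <time.h>

/* Milliseconds until the next status message should be sent. */
static long
next_feedback_due_in_ms(time_t now, time_t last_feedback_sent,
						int wal_receiver_status_interval_secs)
{
	long		elapsed_ms = (long) (now - last_feedback_sent) * 1000;
	long		interval_ms = (long) wal_receiver_status_interval_secs * 1000;

	/* Due now if the interval has already elapsed; otherwise the remainder. */
	return (elapsed_ms >= interval_ms) ? 0 : interval_ms - elapsed_ms;
}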
Re: Exit walsender before confirming remote flush in logical replication
At Fri, 10 Feb 2023 12:40:43 +, "Hayato Kuroda (Fujitsu)" wrote in
> Dear Amit,
>
> > Can't we have this option just as a bool (like shutdown_immediate)?
> > Why do we want to keep multiple modes?
>
> Of course we can use boolean instead, but the current style is motivated by the post [1].
> This allows adding another option in future, whereas I do not have an idea now.
>
> I want to ask other reviewers which one is better...
>
> [1]: https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com

IMHO I vaguely don't like that we lose a means to specify the default behavior here. And I'm not sure we definitely won't need modes other than flush and immediate for both physical and logical replication. If that's not the case, I don't object to making it a Boolean.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl
At Fri, 10 Feb 2023 16:43:15 +0900, Michael Paquier wrote in
> On Wed, Feb 08, 2023 at 09:42:13PM -0500, Tom Lane wrote:
> > Hm. On the one hand, if it is in fact not in postgresql.conf.sample,
> > then that flag should be set for sure. OTOH I see that that flag
> > isn't purely documentation: help_config.c thinks it should hide
> > GUCs that are marked that way. Do we really want that behavior?
> > Not sure. I can see an argument that you might want --describe-config
> > to tell you that, but there are a lot of other GUC_NOT_IN_SAMPLE
> > GUCs that maybe do indeed deserve to be left out.
>
> I am not sure to follow. help_config() won't show something that's
> either marked NO_SHOW_ALL, NOT_IN_SAMPLE or DISALLOW_IN_FILE, hence
> config_file does not show up already in what postgres
> --describe-config prints, because it has DISALLOW_IN_FILE, so adding
> NOT_IN_SAMPLE changes nothing.

I think the output of --describe-config can currently be used only for consulting while editing a (possibly broken) config file. Thus I think there is no use in showing GUC_DISALLOW_IN_FILE items there unless we use help_config() for an on-session use.

On the other hand, don't we need to remove the condition GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config() should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if it is GUC_NOT_IN_SAMPLE. I'm not sure whether there's any variable that is marked that way, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Rework LogicalOutputPluginWriterUpdateProgress
On Sat, Feb 11, 2023 at 3:04 AM Andres Freund wrote:
> > One difference I see with the patch is that I think we will end up
> > sending keepalive for empty prepared transactions even though we don't
> > skip sending begin/prepare messages for those.
>
> With the proposed approach we reliably know whether a callback wrote
> something, so we can tune the behaviour here fairly easily.

I would like to clarify a few things about the proposed approach. In commit_cb_wrapper()/prepare_cb_wrapper(), the patch first sets ctx->did_write = false;, then calls the commit/prepare callback (which will call pgoutput_commit_txn()/pgoutput_prepare_txn()), and then calls update_progress(), which will make decisions based on the ctx->did_write flag. Now, for this to work, pgoutput_commit_txn()/pgoutput_prepare_txn() should know whether the transaction has performed some writes before that call, which currently works because pgoutput is tracking that via sent_begin_txn. Is the intention here that we still track whether BEGIN has been sent via pgoutput?

--
With Regards,
Amit Kapila.
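For illustration, the wrapper pattern being described is roughly the following (a self-contained sketch with invented types; the real LogicalDecodingContext and callback signatures differ):

#include <stdbool.h>
#include <stdio.h>

typedef struct DecodingCtx DecodingCtx;
struct DecodingCtx
{
	bool		did_write;		/* set when the plugin writes out data */
	void		(*commit_cb) (DecodingCtx *ctx);
};

/* Stand-in for the real update-progress logic. */
static void
update_progress(DecodingCtx *ctx)
{
	if (!ctx->did_write)
		printf("nothing written: candidate for keepalive handling\n");
}

static void
commit_cb_wrapper(DecodingCtx *ctx)
{
	ctx->did_write = false;		/* reset before invoking the callback */

	/*
	 * The output plugin (pgoutput_commit_txn() in the real code) sets
	 * did_write if it actually sends anything -- which it only does once
	 * a BEGIN has gone out for the transaction.
	 */
	ctx->commit_cb(ctx);

	/* Decide what to do based on whether the callback wrote anything. */
	update_progress(ctx);
}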
Re: Exit walsender before confirming remote flush in logical replication
On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi wrote: > > At Fri, 10 Feb 2023 12:40:43 +, "Hayato Kuroda (Fujitsu)" > wrote in > > Dear Amit, > > > > > Can't we have this option just as a bool (like shutdown_immediate)? > > > Why do we want to keep multiple modes? > > > > Of course we can use boolean instead, but current style is motivated by the > > post[1]. > > This allows to add another option in future, whereas I do not have idea now. > > > > I want to ask other reviewers which one is better... > > > > [1]: > > https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com > > IMHO I vaguely don't like that we lose a means to specify the default > behavior here. And I'm not sure we definitely don't need other than > flush and immedaite for both physical and logical replication. > If we can think of any use case that requires its extension then it makes sense to make it a non-boolean option but otherwise, let's keep things simple by having a boolean option. > If it's > not the case, I don't object to make it a Boolean. > Thanks. -- With Regards, Amit Kapila.
Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl
On Mon, Feb 13, 2023 at 11:27:58AM +0900, Kyotaro Horiguchi wrote:
> I think the output of --describe-config can currently be used only for
> consulting while editing a (possibly broken) config file. Thus I
> think there is no use in showing GUC_DISALLOW_IN_FILE items there unless
> we use help_config() for an on-session use.
>
> On the other hand, don't we need to remove the condition
> GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config()
> should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if
> it is GUC_NOT_IN_SAMPLE. I'm not sure whether there's any variable
> that is marked that way, though.

As in marked with GUC_NOT_IN_SAMPLE but not GUC_DISALLOW_IN_FILE? There are quite a lot, developer GUCs being one (say ignore_invalid_pages). We don't want to list them in the sample file so that common users don't play with them; still, they make sense if listed in a file.

If you add a check meaning that GUC_DISALLOW_IN_FILE implies GUC_NOT_IN_SAMPLE, where one change would need to be applied to config_file as all the other GUC_DISALLOW_IN_FILE GUCs already do that, you could remove GUC_DISALLOW_IN_FILE. However, GUC_NOT_IN_SAMPLE should be around so as not to expose options we don't want common users to know too much about.

The question about how much people rely on --describe-config these days is a good one, so perhaps there could be an argument for removing GUC_NOT_IN_SAMPLE from the set. TBH, I would be really surprised if anybody able to use a developer option writes a configuration file in an incorrect format and needs to use this option, though :)
--
Michael
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Feb 12, 2023 at 09:31:36AM -0800, Andres Freund wrote:
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

By which you imply that we'd still use memset() once on iov[PG_IOV_MAX], right?
--
Michael
Re: Exit walsender before confirming remote flush in logical replication
Here are some comments for patch v7-0002.

==
Commit Message

1.
This commit extends START_REPLICATION to accept SHUTDOWN_MODE clause. It is currently implemented only for logical replication.

~

"to accept SHUTDOWN_MODE clause." --> "to accept a SHUTDOWN_MODE clause."

==
doc/src/sgml/protocol.sgml

2.
START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE { 'wait_flush' | 'immediate' } ] [ ( option_name [ option_value ] [, ...] ) ]

~

IMO this should say shutdown_mode as it did before:

START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE shutdown_mode ] [ ( option_name [ option_value ] [, ...] ) ]

~~~

3.
+
+shutdown_mode
+
+
+ Determines the behavior of the walsender process at shutdown. If
+ shutdown_mode is 'wait_flush', the walsender waits
+ for all the sent WALs to be flushed on the subscriber side. This is
+ the default when SHUTDOWN_MODE is not specified. If shutdown_mode is
+ 'immediate', the walsender exits without
+ confirming the remote flush.
+
+

Is the font of the "shutdown_mode" correct? I expected it to be like the others (e.g. slot_name).

==
src/backend/replication/walsender.c

4.
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+	if (cmd->shutdownmode)
+	{
+		char	   *mode = cmd->shutdownmode;
+
+		if (pg_strcasecmp(mode, "wait_flush") == 0)
+			shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+		else if (pg_strcasecmp(mode, "immediate") == 0)
+			shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE;
+		else
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("invalid value for shutdown mode: \"%s\"", mode),
+					errhint("Available values: wait_flush, immediate."));
+	}
+
+}

Unnecessary extra whitespace at the end of the function.

==
src/include/nodes/replnodes.h

5.
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd
 	char	   *slotname;
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
+	char	   *shutdownmode;
 	List	   *options;
 } StartReplicationCmd;

IMO those last 2 members should have a comment, something like:

/* Only for logical replication */

because that will make it more clear why sometimes they are assigned and sometimes they are not.

==
src/include/replication/walreceiver.h

6.
Should the protocol version be bumped (and documented) now that START_REPLICATION supports a new extended syntax? Or is that done only for messages sent by pgoutput?

--
Kind Regards,
Peter Smith.
Fujitsu Australia.
RE: Time delayed LR (WAS Re: logical replication restrictions)
Hi, Horiguchi-san On Monday, February 13, 2023 10:26 AM Kyotaro Horiguchi wrote: > At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila > wrote in > > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila > wrote: > > > > > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi > > > wrote: > > > > We have suffered this kind of feedback silence many times. Thus I > > > > don't want to rely on luck here. I had in mind of exposing > > > > last_send itself or providing interval-calclation function to the logic. > > > > > > I think we have last_send time in send_feedback(), so we can expose > > > it if we want but how would that solve the problem you are worried > about? > > Wal receiver can avoid a too-long sleep by knowing when to wake up for the > next feedback. > > > I have an idea to use last_send time to avoid walsenders being > > timeout. Instead of directly using wal_receiver_status_interval as a > > minimum interval to send the feedback, we should check if it is > > greater than last_send time then we should send the feedback after > > (wal_receiver_status_interval - last_send). I think they can probably > > be different only on the very first time. Any better ideas? > > If it means MyLogicalRepWorker->last_send_time, it is not the last time when > walreceiver sent a feedback but the last time when > wal*sender* sent a data. So I'm not sure that works. > > We could use the variable that way, but AFAIU in turn when so many changes > have been spooled that the control doesn't return to LogicalRepApplyLoop > longer than wal_r_s_interval, maybe_apply_delay() starts calling > send_feedback() at every call after the first feedback timing. Even in that > case, send_feedback() won't send one actually until the next feedback timing, > I don't think that behavior is great. > > The only packets walreceiver sends back is the feedback packets and > currently only send_feedback knows the last feedback time. Thanks for your comments ! As described in your last sentence, in the latest patch v34 [1], we use the last time set in send_feedback() and based on it, we calculate and adjust the first timing of feedback message in maybe_apply_delay() so that we can send the feedback message following the interval of wal_receiver_status_interval. I wasn't sure if the above concern is still valid for this implementation. Could you please have a look at the latest patch and share your opinion ? [1] - https://www.postgresql.org/message-id/TYCPR01MB83736C50C98CB2153728A7A8EDDE9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
Re: Support logical replication of DDLs
On Fri, Feb 10, 2023 at 11:31 AM vignesh C wrote:
> On Fri, 10 Feb 2023 at 21:50, vignesh C wrote:
> > The attached v68 version patch has the changes for the same.
>
> I was not sure if we should support ddl replication of
> create/alter/drop subscription commands as there might be some data
> inconsistency issues in the following cases:
>
> #node1 who is running in port 5432
> create publication pub_node1 for all tables with (PUBLISH = 'insert, update, delete, truncate');
>
> #node2 who is running in port 5433
> create publication pub_node2 for all tables with (PUBLISH = 'insert, update, delete, truncate', ddl = 'all');
> create subscription sub_node2 connection 'dbname=postgres host=node1 port=5432' publication pub_node1;
>
> #node3
> create subscription sub_node3 connection 'dbname=postgres host=node2 port=5433' publication pub_node2;
>
> #node1
> create table t1(c1 int);
>
> #node2
> create table t1(c1 int);
> alter subscription sub_node2 refresh publication;
>
> # Additionally this command will be replicated to node3, creating a
> # subscription sub2_node2 in node3 which will subscribe data from node1
> create subscription sub2_node2 connection 'dbname=postgres host=node1 port=5432' publication pub_node1;
>
> After this any insert into t1 from node1 will be replicated to node2
> and node3; additionally node2's replicated data (which was replicated
> from node1) will also be sent to node3, causing inconsistency. If the
> table has unique or primary key constraints, it will lead to an error.
>
> Another option would be to replicate the create subscription in
> disabled state and not support ddl replication of the alter
> subscription commands which connect to the publisher, like:
> 1) Alter subscription sub1 enable;
> 2) Alter subscription sub1 refresh publication;

I think it will also be error-prone when the user tries to enable the replicated subscription on node3 later on, for example, when switching over from node2 to node3. There is a risk of duplicate or missing data on node3 if the switchover isn't done right.

> But in this case also, we will be able to support a few alter
> subscription commands and not support others. I feel it is better
> that we do not support ddl replication of the create/alter/drop
> subscription commands and let users handle the subscription commands.

+1 for not supporting subscription commands in the first version and letting users handle them.

Regards,
Zane
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Feb 12, 2023 at 11:01 PM Andres Freund wrote:
> On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > Thanks for looking at it. We know that we don't change the zbuffer in
> > the function, so can we avoid static const and have just a static
> > variable, like the attached
> > v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> > you see any problem with it?
>
> Making it static const is better, because it allows the memory for the
> variable to be put in a readonly section.

Okay.

> > 	/* Prepare to write out a lot of copies of our zero buffer at once. */
> > 	for (i = 0; i < lengthof(iov); ++i)
> > 	{
> > -		iov[i].iov_base = zbuffer.data;
> > +		iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
> > 		iov[i].iov_len = zbuffer_sz;
> > 	}
>
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

How about the attached patch? It makes the iovec a static variable and points it at the zero buffer only the first time through. This avoids the for-loop on every call.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 550b606affb614980a4b814d9e44e73d5489 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Mon, 13 Feb 2023 04:25:34 +
Subject: [PATCH v1] Optimize zero buffer handling in pg_pwrite_zeros()

1. Use static const zero-filled buffer to avoid memset-ing on every call.
2. Use static iovec and point the zero-filled buffer for the first time
   through to avoid for loop initializing iovec on every call.

Reported-by: Andres Freund
Author: Bharath Rupireddy
Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.de
---
 src/common/file_utils.c | 34 ++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..08688eb1b1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,25 +539,35 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
-	size_t		zbuffer_sz;
-	struct iovec iov[PG_IOV_MAX];
+	/*
+	 * Zero-filled buffer worth BLCKSZ. We use static variable to avoid zeroing
+	 * the buffer on every call. Also, with the const qualifier, it is ensured
+	 * that the buffer is placed in read-only section of the process's memory.
+	 */
+	static const PGAlignedBlock zbuffer = {0};
+	static size_t zbuffer_sz = BLCKSZ;
+	static bool first_time = true;
+	static struct iovec iov[PG_IOV_MAX] = {0};
 	int			blocks;
 	size_t		remaining_size = 0;
 	int			i;
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
-
-	/* Prepare to write out a lot of copies of our zero buffer at once. */
-	for (i = 0; i < lengthof(iov); ++i)
+	/*
+	 * If first time through, point the zero buffer to iovec structure. This
+	 * avoids for loop on every call.
+	 */
+	if (first_time)
 	{
-		iov[i].iov_base = zbuffer.data;
-		iov[i].iov_len = zbuffer_sz;
+		/* Prepare to write out a lot of copies of our zero buffer at once. */
+		for (i = 0; i < lengthof(iov); ++i)
+		{
+			iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
+			iov[i].iov_len = zbuffer_sz;
+		}
+
+		first_time = false;
 	}
 
 	/* Loop, writing as many blocks as we can for each system call. */
-- 
2.34.1
Re: Support logical replication of DDLs
On Sat, Feb 11, 2023 at 3:31 AM vignesh C wrote:
> On Fri, 10 Feb 2023 at 21:50, vignesh C wrote:
> > The attached v68 version patch has the changes for the same.
>
> I was not sure if we should support ddl replication of
> create/alter/drop subscription commands as there might be some data
> inconsistency issues in the following cases:
>
> #node1 who is running in port 5432
> create publication pub_node1 for all tables with (PUBLISH = 'insert, update, delete, truncate');
>
> #node2 who is running in port 5433
> create publication pub_node2 for all tables with (PUBLISH = 'insert, update, delete, truncate', ddl = 'all');
> create subscription sub_node2 connection 'dbname=postgres host=node1 port=5432' publication pub_node1;
>
> #node3
> create subscription sub_node3 connection 'dbname=postgres host=node2 port=5433' publication pub_node2;
>
> #node1
> create table t1(c1 int);
>
> #node2
> create table t1(c1 int);
> alter subscription sub_node2 refresh publication;
>
> # Additionally this command will be replicated to node3, creating a
> # subscription sub2_node2 in node3 which will subscribe data from node1
> create subscription sub2_node2 connection 'dbname=postgres host=node1 port=5432' publication pub_node1;
>
> After this any insert into t1 from node1 will be replicated to node2
> and node3, additionally node2's replicated data (which was replicated
> from node1) will also be sent to node3 causing inconsistency. If the
> table has unique or primary key constraints, it will lead to an error.
>
> Another option would be to replicate the create subscription in
> disabled state and not support few ddl replication of alter
> subscription which will connect to publisher like:
> 1) Alter subscription sub1 enable;
> 2) Alter subscription sub1 refresh publication;
>
> But in this case also, we will be able to support few alter
> subscription commands and not support few alter subscription commands.
> I feel it is better that we do not need to support ddl replication of
> create/alter/drop subscription command and let users handle the
> subscription commands.
> Thoughts?

So essentially, node3 ends up subscribed twice to the same table in node1:

node1 --> node2
  |         |
  |         | ddl
  |         v
  +------> node3

I think the suggested options are:

option #1. If you support CREATE SUBSCRIPTION DDL replication then you can end up with the conflict troubles you described above.

option #2. If you support CREATE SUBSCRIPTION DDL replication but only make sure it is disabled first, then your above scenario might be OK, but you will also need to *not* allow DDL replication of the ALTER SUBSCRIPTION which might cause it to become re-enabled. IMO adding tricky rules is just inviting more problems.

option #3. Do nothing, don't support it. +1, but see below for a variation of this.

~

Actually, I am sort of expecting lots of potential difficulties with DDL replication, and this CREATE SUBSCRIPTION problem is just one of them. IMO it would be a mistake to try and make the first version of these patches do *everything*. E.g. why invent tricky solutions to problems without yet knowing user expectations/requirements? Therefore, my first impression is to do a generic option #4:

option #4. This is a variation of "do nothing". My suggestion is you can still replicate every DDL via logical replication messages but just don't actually *apply* anything on the subscriber side for the commands which are problematic (like this one is). Instead of applying, the subscriber can just log a NOTICE about each command that was skipped. This will make it easier for the user to know what didn't happen, but they can just cut/paste that command from the NOTICE if they really want to.

Also, this option #4 is deliberately generic, which means you can do the same for every kind of DDL that proves too difficult to automate in the first version (I doubt CREATE SUBSCRIPTION will be the only example).

The option #4 result might look like this:

test_sub=# create subscription sub1 connection 'dbname=test_pub' publication pub1;
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
NOTICE:  subscription "sub1" skipping DDL: create subscription sub_node2 connection 'dbname=postgres host=node1 port=5432' publication pub_node1;
...

(And if it turns out users really do want this then it can be revisited in later patch versions.)

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Force testing of query jumbling code in TAP tests
Hi all,

As mentioned two times on this thread, there is not much coverage for the query jumbling code, even if it is in core:
https://www.postgresql.org/message-id/Y5BHOUhX3zTH/i...@paquier.xyz

The issue is that we have the options to enable it, but only pg_stat_statements is able to enable and stress it. This causes coverage to be missed for all query patterns that are not covered directly by pg_stat_statements, like XML expressions, various DML patterns, etc. More aggressive testing would also ensure that no nodes are marked as no_query_jumble while they should be included in a computation.

Attached is a patch to improve that. The main regression database is able to cover everything, basically, so I'd like to propose the addition of some extra configuration in 027_stream_regress.pl to enable pg_stat_statements. This could be added in the pg_upgrade tests, but that felt a bit less adapted here. Or can people think about cases where checking pg_stat_statements makes more sense after an upgrade or on a standby? One thing that makes sense for a standby is to check that the contents of pg_stat_statements are empty.

With this addition, the query jumbling gets covered at 95%~, while https://coverage.postgresql.org/src/backend/nodes/index.html reports currently 35%.

Thoughts or comments?
--
Michael

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 570bf42b58..c60314d195 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-
 
-EXTRA_INSTALL=contrib/test_decoding contrib/pg_prewarm
+EXTRA_INSTALL=contrib/pg_prewarm contrib/pg_stat_statements contrib/test_decoding
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl
index 69d6ddf281..9fb565f33c 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -14,6 +14,16 @@ $node_primary->init(allows_streaming => 1);
 $node_primary->adjust_conf('postgresql.conf', 'max_connections', '25');
 $node_primary->append_conf('postgresql.conf',
 	'max_prepared_transactions = 10');
+
+# Enable pg_stat_statements to force tests with do query jumbling.
+# pg_stat_statements.max should be large enough to hold all the entries
+# of the regression database.
+$node_primary->append_conf('postgresql.conf',
+	qq{shared_preload_libraries = 'pg_stat_statements'
+pg_stat_statements.max = 5
+compute_query_id = 'regress'
+});
+
 # We'll stick with Cluster->new's small default shared_buffers, but since that
 # makes synchronized seqscans more probable, it risks changing the results of
 # some test queries. Disable synchronized seqscans to prevent that.
Re: Introduce a new view for checkpointer related stats
On Fri, Feb 10, 2023 at 10:00 PM Bharath Rupireddy wrote:
>
> Thanks. FWIW, I've attached the patch introducing pg_stat_checkpointer
> as 0003 here.
>
> Please review the attached v7 patch set.

Needed a rebase. Please review the attached v8 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From c7d1a10b113c90e3712e5f836c21b5a40f909a24 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Mon, 13 Feb 2023 05:01:52 +
Subject: [PATCH v8] Drop buffers_backend column from pg_stat_bgwriter view

---
 doc/src/sgml/monitoring.sgml             |  9 -
 src/backend/catalog/system_views.sql     |  1 -
 src/backend/postmaster/checkpointer.c    | 20 +--
 .../utils/activity/pgstat_checkpointer.c |  2 --
 src/backend/utils/adt/pgstatfuncs.c      |  6 --
 src/include/catalog/pg_proc.dat          |  4
 src/include/pgstat.h                     |  1 -
 src/test/regress/expected/rules.out      |  1 -
 8 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dca50707ad..c0cae0f4f3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4059,15 +4059,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
-
-      buffers_backend bigint
-
-
-      Number of buffers written directly by a backend
-
-
       buffers_backend_fsync bigint
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..9bb96e27aa 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1112,7 +1112,6 @@ CREATE VIEW pg_stat_bgwriter AS
     pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
     pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
     pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-    pg_stat_get_buf_written_backend() AS buffers_backend,
     pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
     pg_stat_get_buf_alloc() AS buffers_alloc,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index aaad5c5228..baadd08044 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -91,17 +91,15 @@
  * requesting backends since the last checkpoint start.  The flags are
  * chosen so that OR'ing is the correct way to combine multiple requests.
  *
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes.  This counter should be wide enough that it
- * can't overflow during a single processing cycle.  num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
+ * num_backend_fsync counts the subset of buffer writes performed by user
+ * backend processes that also had to do their own fsync, because the
+ * checkpointer failed to absorb their request.
  *
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the checkpointer.
  *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
+ * Unlike the checkpoint fields, num_backend_fsync and the requests fields are
+ * protected by CheckpointerCommLock.
 *--
 */
 typedef struct
@@ -125,7 +123,6 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
-	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 	int			num_requests;	/* current # of requests */
@@ -1096,10 +1093,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-	/* Count all backend writes regardless of if they fit in the queue */
-	if (!AmBackgroundWriterProcess())
-		CheckpointerShmem->num_backend_writes++;
-
 	/*
 	 * If the checkpointer isn't running or the request queue is full, the
 	 * backend will have to perform its own fsync request.  But before forcing
@@ -1272,12 +1265,9 @@ AbsorbSyncRequests(void)
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 	/* Transfer stats counts into pending pgstats message */
-	PendingCheckpointerStats.buf_written_backend
-		+= CheckpointerShmem->num_backend_writes;
 	PendingCheckpointerStats.buf_fsync_backend
 		+= CheckpointerShmem->num_backend_fsync;
-	CheckpointerShmem->num_backend_writes = 0;
 	CheckpointerShmem->num_backend_fsync = 0;
 	/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/
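For readers skimming the patch, the behavior it leaves in place can be summarized with a small standalone sketch (toy types and names, not the real checkpointer code): backend writes are no longer counted here at all, and only the requests that the queue fails to absorb still bump a counter, via num_backend_fsync.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the shared checkpointer request queue. */
typedef struct
{
    int      num_requests;      /* current # of queued fsync requests */
    int      max_requests;      /* queue capacity */
    uint32_t num_backend_fsync; /* backend had to fsync on its own */
} ToySyncQueue;

/*
 * Returns true if the request was absorbed by the queue; false means the
 * caller must perform its own fsync, which is the one case that is still
 * counted.  The real code does this under CheckpointerCommLock and tries
 * to compact the queue before giving up.
 */
static bool
toy_forward_sync_request(ToySyncQueue *q)
{
    if (q->num_requests >= q->max_requests)
    {
        q->num_backend_fsync++;
        return false;
    }
    q->num_requests++;
    return true;
}
```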
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
Hi,

On 2/10/23 10:46 PM, Andres Freund wrote:
> Hi,
>
> On 2023-02-09 11:38:18 +0100, Drouvot, Bertrand wrote:
>> Please find attached a patch proposal for $SUBJECT. The idea has been
>> raised in [1] by Andres: it would allow to simplify even more the work
>> done to generate pg_stat_get_xact*() functions with Macros.
>
> Thanks!

Thanks for looking at it!

> I think this is useful beyond being able to generate those functions with
> macros. The fact that we had to deal with transactional code in
> pgstatfuncs.c meant that a lot of the relevant internals had to be exposed
> "outside" pgstat, which seems architecturally not great.

Right, good point. Indeed, with the reconciliation done in find_tabstat_entry(), all the pg_stat_get_xact*() functions (making use of find_tabstat_entry()) now "look the same" (whether they need to take live subtransactions into account or not).

> I'm not bothered by making all of pg_stat_get_xact* functions more
> expensive, they're not a hot code path. But if we need to, we could just
> add a parameter to find_tabstat_entry() indicating whether we need to
> reconcile or not.

I think that's a good idea to avoid doing extra work if not needed. V2 adds such a bool.

> 	/* save stats for this function, later used to compensate for recursion */
> -	fcu->save_f_total_time = pending->f_counts.f_total_time;
> +	fcu->save_f_total_time = pending->f_total_time;
>
> 	/* save current backend-wide total time */
> 	fcu->save_total = total_func_time;
>
> The diff is noisy due to all the mechanical changes like the above. Could
> that be split into a separate commit?

Fully agree, the PgStat_BackendFunctionEntry stuff will be done in a separate patch.

> find_tabstat_entry(Oid rel_id)
> {
> 	PgStat_EntryRef *entry_ref;
> +	PgStat_TableStatus *tablestatus = NULL;
>
> 	entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, MyDatabaseId, rel_id);
> 	if (!entry_ref)
> 		entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id);
>
> 	if (entry_ref)
> -		return entry_ref->pending;
> -	return NULL;
> +	{
> +		PgStat_TableStatus *tabentry = (PgStat_TableStatus *) entry_ref->pending;
>
> I'd add an early return for the !entry_ref case, that way you don't need
> to indent the bulk of the function.

Good point, done in V2.

> +		PgStat_TableXactStatus *trans;
> +
> +		tablestatus = palloc(sizeof(PgStat_TableStatus));
> +		memcpy(tablestatus, tabentry, sizeof(PgStat_TableStatus));
>
> For things like this I'd use
> 	*tablestatus = *tabentry;
> that way the compiler will warn you about mismatching types, and you don't
> need the sizeof().

Good point, done in V2.

> +		/* live subtransactions' counts aren't in t_counts yet */
> +		for (trans = tabentry->trans; trans != NULL; trans = trans->upper)
> +		{
> +			tablestatus->t_counts.t_tuples_inserted += trans->tuples_inserted;
> +			tablestatus->t_counts.t_tuples_updated += trans->tuples_updated;
> +			tablestatus->t_counts.t_tuples_deleted += trans->tuples_deleted;
> +		}
> +	}
>
> Hm, why do we end up with t_counts still being used here, but removed
> other places?

t_counts are not removed; maybe you are confused with the "f_counts" that were removed in V1 due to the PgStat_BackendFunctionEntry related changes?

> diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
> index 6737493402..40a6fbf871 100644
> --- a/src/backend/utils/adt/pgstatfuncs.c
> +++ b/src/backend/utils/adt/pgstatfuncs.c
> @@ -1366,7 +1366,10 @@ pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
> 	if ((tabentry = find_tabstat_entry(relid)) == NULL)
> 		result = 0;
> 	else
> +	{
> 		result = (int64) (tabentry->t_counts.t_numscans);
> +		pfree(tabentry);
> +	}
>
> 	PG_RETURN_INT64(result);
> }
>
> I don't think we need to bother with individual pfrees in this path. The
> caller will call the function in a dedicated memory context, that'll be
> reset very soon after this.

Oh right, the palloc is done in the ExprContext memory context that is reset soon after. Removing the pfrees in V2 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f793ac1516..1d3f068a1c 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -474,17 +474,33 @@ pgstat_fetch_stat_tabentry_ext(bool shared, Oid reloid)
  * If no entry found, return NULL, don't create a new one
  */
 PgStat_TableStatus *
-find_tabstat_entry(Oid rel_id)
+find_tabstat_entry(Oid rel_id, bool add_subtrans)
 {
 	PgStat_EntryRef *entry_ref;
+	PgStat_TableXactStatus *trans;
+
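For readers following along, the shape of the reconciliation that V2 implements can be illustrated with a standalone sketch; the structs and field names below are simplified stand-ins for PgStat_TableStatus and PgStat_TableXactStatus, not the real definitions.

```c
#include <stdint.h>

/* Simplified stand-ins for the pgstat structs. */
typedef struct ToyXactStatus
{
    int64_t tuples_inserted;
    int64_t tuples_updated;
    int64_t tuples_deleted;
    struct ToyXactStatus *upper;   /* enclosing (sub)transaction level */
} ToyXactStatus;

typedef struct
{
    int64_t t_tuples_inserted;
    int64_t t_tuples_updated;
    int64_t t_tuples_deleted;
    ToyXactStatus *trans;          /* innermost live xact level, or NULL */
} ToyTableStatus;

/*
 * Fill *out with a copy of *pending that has the counts of all live
 * (sub)transaction levels folded in, so callers get a reconciled
 * snapshot without pgstat internals leaking out of the module.
 */
static void
toy_reconcile(const ToyTableStatus *pending, ToyTableStatus *out)
{
    ToyXactStatus *xs;

    *out = *pending;               /* struct copy, as suggested in review */
    for (xs = pending->trans; xs != NULL; xs = xs->upper)
    {
        out->t_tuples_inserted += xs->tuples_inserted;
        out->t_tuples_updated  += xs->tuples_updated;
        out->t_tuples_deleted  += xs->tuples_deleted;
    }
}
```

Because the copy lives in a short-lived caller context, it can simply be abandoned, which is why the individual pfree() calls are unnecessary.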
Re: Making Vars outer-join aware
On Mon, Feb 13, 2023 at 7:58 AM Justin Pryzby wrote:
> The patch broke this query:
>
> select from pg_inherits inner join information_schema.element_types
> right join (select from pg_constraint as sample_2) on true
> on false, lateral (select scope_catalog, inhdetachpending from
> pg_publication_namespace limit 3);
> ERROR: could not devise a query plan for the given query

Thanks for the report! I've looked at it a little bit and traced it down to the function have_unsafe_outer_join_ref(). The comment there says

 * In practice, this test never finds a problem ...
 * ...
 * It still seems worth checking
 * as a backstop, but we don't go to a lot of trouble: just reject if the
 * unsatisfied part includes any outer-join relids at all.

This seems not correct, as shown by the counterexample. ISTM that we need to do the check honestly, as the other comment says:

 * If the parameterization is only partly satisfied by the outer rel,
 * the unsatisfied part can't include any outer-join relids that could
 * null rels of the satisfied part.

The NOT_USED part of the code is doing this check, but I think we need a little tweak: we should check the nullable side of the related outer joins against the satisfied part, rather than against inner_paramrels. Maybe something like attached.

However, this test seems to cost some cycles after the change, so I wonder whether it's worthwhile to perform it at all, considering that join order restrictions should be able to guarantee there is no problem here.

BTW, here is a simplified query that can trigger this issue on HEAD.

select * from t1
    inner join t2
    left join (select null as c from t3 left join t4 on true) as sub on true
    on true,
    lateral (select c, t1.a from t5 offset 0) ss;

Thanks
Richard

v1-0001-Fix-for-have_unsafe_outer_join_ref.patch
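To make the intended check concrete, here is a toy model using plain bitmasks in place of Bitmapset/SpecialJoinInfo (all types and names are illustrative stand-ins, not the planner's real API). The point is that a path is unsafe only when an outer join in the unsatisfied part of the parameterization can null a rel in the satisfied part, rather than rejecting any outer-join relid in the unsatisfied part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a SpecialJoinInfo: the join's own relid bit plus the
 * set of rels its nullable side can null. */
typedef struct
{
    uint64_t ojrelid;       /* bit identifying the outer join itself */
    uint64_t nullable_rels; /* rels nulled by this outer join */
} ToyOuterJoin;

/*
 * The "honest" version of the backstop check: return true (unsafe) only
 * if some outer join in the unsatisfied part of the parameterization
 * could null a rel in the part satisfied by the outer rel.
 */
static bool
toy_have_unsafe_outer_join_ref(uint64_t inner_paramrels,
                               uint64_t outer_relids,
                               const ToyOuterJoin *ojs, int nojs)
{
    uint64_t satisfied = inner_paramrels & outer_relids;
    uint64_t unsatisfied = inner_paramrels & ~outer_relids;

    for (int i = 0; i < nojs; i++)
    {
        if ((unsatisfied & ojs[i].ojrelid) &&
            (satisfied & ojs[i].nullable_rels))
            return true;    /* could null a rel of the satisfied part */
    }
    return false;
}
```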
Re: recovery modules
On Thu, Feb 09, 2023 at 02:48:26PM -0800, Nathan Bossart wrote:
> On Thu, Feb 09, 2023 at 12:18:55PM -0800, Andres Freund wrote:
>>> -typedef bool (*ArchiveCheckConfiguredCB) (void);
>>> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
>>>
>>> If true is returned, the server will proceed with
>>
>> Hm. I wonder if ArchiveCheckConfiguredCB() should actually work without the
>> state. We're not really doing anything yet, at that point, so it shouldn't
>> really need state?
>>
>> The reason I'm wondering is that I think we should consider calling this from
>> the GUC assignment hook, at least in postmaster. Which would make it more
>> convenient to not have state, I think?
>
> I agree that this callback should typically not need the state, but I'm not
> sure whether it fits into the assignment hook for archive_library. This
> callback is primarily meant for situations when you have archiving enabled,
> but your module isn't configured yet (e.g., archive_command is empty). In
> this case, we keep the WAL around, but we don't try to archive it until
> this hook returns true. It's up to each module to define that criteria. I
> can imagine someone introducing a GUC in their archive module that
> temporarily halts archiving via this callback. In that case, calling it
> via a GUC assignment hook probably won't work. In fact, checking whether
> archive_command is empty in that hook might not work either.

Keeping the state in the configure check callback does not strike me as a problem, FWIW.

>>> -typedef void (*ArchiveShutdownCB) (void);
>>> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);
>>
>> Perhaps mention that this needs to free state it allocated in the
>> ArchiveModuleState, or it'll be leaked?
>
> done
>
> I left this out originally because the archiver exits shortly after calling
> this. However, if you have DSM segments or something, it's probably wise
> to make sure those are cleaned up. And I suppose we might not always exit
> immediately after this callback, so establishing the habit of freeing the
> state could be a good idea. In addition to updating this part of the docs,
> I wrote a shutdown callback for basic_archive that frees its state.

This makes sense to me. Still, DSM segments had better be cleaned up with dsm_backend_shutdown().

+	basic_archive_context = data->context;
+	if (CurrentMemoryContext == basic_archive_context)
+		MemoryContextSwitchTo(TopMemoryContext);
+
+	if (MemoryContextIsValid(basic_archive_context))
+		MemoryContextDelete(basic_archive_context);

This is a bit confusing, because it means that we enter the shutdown callback with one context, but exit it under TopMemoryContext. Are you sure that this will be OK when there could be multiple callbacks piled up with before_shmem_exit()? shmem_exit() has nothing specific to memory contexts.

Is putting the new headers in src/include/postmaster/ the best move in the long term? Perhaps that's fine, but I was wondering whether a new location like archive/ would make more sense. pg_arch.h being in the postmaster section is fine.
--
Michael
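As a rough sketch of the pattern under discussion (headers elided; the BasicArchiveData struct and the assumption that the module's startup callback stored its private memory context in state->private_data are illustrative, not necessarily what the committed basic_archive does), a shutdown callback along these lines frees the module's state while avoiding deleting the context it is currently running in:

```c
/* Hypothetical private state for an archive module; assumed to be
 * allocated inside its own memory context at startup. */
typedef struct BasicArchiveData
{
	MemoryContext context;		/* module's private memory context */
} BasicArchiveData;

static void
basic_archive_shutdown(ArchiveModuleState *state)
{
	BasicArchiveData *data = (BasicArchiveData *) state->private_data;
	MemoryContext basic_archive_context;

	if (data == NULL)
		return;
	basic_archive_context = data->context;

	/* Don't delete the context we are currently executing in. */
	if (CurrentMemoryContext == basic_archive_context)
		MemoryContextSwitchTo(TopMemoryContext);

	/* Deleting the context also frees *data, which lives inside it. */
	if (MemoryContextIsValid(basic_archive_context))
		MemoryContextDelete(basic_archive_context);
	state->private_data = NULL;
}
```

The concern raised above is precisely about the side effect of the MemoryContextSwitchTo() here: callbacks stacked with before_shmem_exit() would observe a different CurrentMemoryContext than the one they were entered with.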
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
At Mon, 13 Feb 2023 08:09:50 +0100, "Drouvot, Bertrand" wrote in
>> I think this is useful beyond being able to generate those functions
>> with macros. The fact that we had to deal with transactional code in
>> pgstatfuncs.c meant that a lot of the relevant internals had to be
>> exposed "outside" pgstat, which seems architecturally not great.
>
> Right, good point.

Agreed.

> Removing the pfrees in V2 attached.

Ah, that sounds good.

 	if (!entry_ref)
+	{
 		entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id);
+		return tablestatus;
+	}

Shouldn't we return something if that call returns a non-null value?

So, since we want to hide the internals from pgstatfuncs, the additional flag should be gone. If the additional cost doesn't bother anyone, I don't mind removing the flag. The patch will get far simpler by that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
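In other words, a minimal sketch of the control flow being suggested (only the lookup skeleton; the copy-and-reconcile step is elided) would bail out only when the fallback lookup also fails:

```c
	/* Sketch of the intended early-return shape for find_tabstat_entry() */
	entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION,
										   MyDatabaseId, rel_id);
	if (!entry_ref)
	{
		entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION,
											   InvalidOid, rel_id);
		if (!entry_ref)
			return NULL;		/* no pending entry at all */
	}

	/* entry_ref->pending is valid here: copy it and fold in live subxacts */
```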
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sat, Feb 11, 2023 at 2:33 PM John Naylor wrote:
>
> I didn't get any closer to radix-tree regression,

Me neither. It seems that in v26, inserting chunks into node-32 is slow, but that needs more analysis. I'll share if I find something interesting.

> but I did find some inefficiencies in tidstore_add_tids() that are worth
> talking about first, addressed in a rough fashion in the attached .txt
> addendums that I can clean up and incorporate later.
>
> To start, I can reproduce the regression with this test as well:
>
> select * from bench_tidstore_load(0, 10 * 1000 * 1000);
>
> v15 + v26 store + adjustments:
>  mem_allocated | load_ms
> ---------------+---------
>       98202152 |    1676
>
> v26 0001-0008
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1826
>
> ...and reverting to the alternate way to update the parent didn't help:
>
> v26 0001-6, 0008, insert_inner w/ null parent
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1825
>
> ...and I'm kind of glad that wasn't the problem, because going back to that
> would be a pain for the shmem case.
>
> Running perf doesn't show anything much different in the proportions (note
> that rt_set must have been inlined when declared locally in v26):
>
> v15 + v26 store + adjustments:
> 65.88%  postgres  postgres  [.] tidstore_add_tids
> 10.74%  postgres  postgres  [.] rt_set
>  9.20%  postgres  postgres  [.] palloc0
>  6.49%  postgres  postgres  [.] rt_node_insert_leaf
>
> v26 0001-0008
> 78.50%  postgres  postgres  [.] tidstore_add_tids
>  8.88%  postgres  postgres  [.] palloc0
>  6.24%  postgres  postgres  [.] local_rt_node_insert_leaf
>
> v2699-0001: The first thing I noticed is that palloc0 is taking way more time
> than it should, and it's because the compiler doesn't know the values[] array
> is small. One reason we need to zero the array is to make the algorithm
> agnostic about what order the offsets come in, as I requested in a previous
> review. Thinking some more, I was way too paranoid about that. As long as
> access methods scan the line pointer array in the usual way, maybe we can
> just assert that the keys we create are in order, and zero any unused array
> entries as we find them. (I admit I can't actually think of a reason we would
> ever encounter offsets out of order.)

I think something like traversing a HOT chain could visit offsets out of order. But fortunately, in the heap case, we prune such collected TIDs before heap vacuum.

> Also, we can keep track of the last key we need to consider for insertion
> into the radix tree, and ignore the rest. That might shave a few cycles
> during the exclusive lock when the max offset of an LP_DEAD item < 64 on a
> given page, which I think would be common in the wild. I also got rid of the
> special case for non-encoding, since shifting by zero should work the same
> way. These together led to a nice speedup on the v26 branch:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1386
>
> v2699-0002: The next thing I noticed is forming a full ItemPointer to pass
> to tid_to_key_off(). That's bad for tidstore_add_tids() because
> ItemPointerSetBlockNumber() must do this in order to allow the struct to be
> SHORTALIGN'd:
>
> static inline void
> BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
> {
> 	blockId->bi_hi = blockNumber >> 16;
> 	blockId->bi_lo = blockNumber & 0xffff;
> }
>
> Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse
> the above process:
>
> static inline BlockNumber
> BlockIdGetBlockNumber(const BlockIdData *blockId)
> {
> 	return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
> }
>
> There is no reason to do any of this if we're not reading/writing directly
> to/from an on-disk tid etc. To avoid this, I created a new function
> encode_key_off() [name could be better], which deals with the raw block
> number that we already have. Then turn tid_to_key_off() into a wrapper around
> that, since we still need the full conversion for tidstore_lookup_tid().
>
> v2699-0003: Get rid of all the remaining special cases for encoding/or not. I
> am unaware of the need to optimize that case or treat it in any way
> differently. I haven't tested this on an installation with non-default
> blocksize and didn't measure this separately, but 0002+0003 gives:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1259
>
> If these are acceptable, I can incorporate them into a later patchset.

These are nice improvements! I agree with all changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
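For context, here is a standalone sketch of what an encode_key_off()-style helper plus the accumulate-and-flush loop can look like. The bit widths, names, and the rt_set() stub are assumptions for illustration only (the real patch derives the widths from the offset-number range): the block number we already have is combined with the offset directly, the low bits select a position in a 64-bit bitmap value, the rest form the radix tree key, and no BlockIdData round trip is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 11 bits of offset appended to the block number;
 * the low 6 bits index a bit within a 64-bit bitmap value. */
#define TOY_OFFSET_BITS 11
#define TOY_BITMAP_BITS 6

/* Stand-in for the real radix tree insert, which takes the tree too. */
static void
rt_set(uint64_t key, uint64_t value)
{
	(void) key;
	(void) value;
}

static inline uint64_t
toy_encode_key_off(uint32_t block, uint16_t offset, uint32_t *off)
{
	uint64_t	tid_i;

	assert(offset < (1 << TOY_OFFSET_BITS));
	tid_i = ((uint64_t) block << TOY_OFFSET_BITS) | offset;
	*off = tid_i & ((1 << TOY_BITMAP_BITS) - 1);	/* bit within value */
	return tid_i >> TOY_BITMAP_BITS;				/* radix tree key */
}

/*
 * Sketch of the add loop: offsets are assumed sorted (asserted below, per
 * the v2699-0001 idea), so bitmap values never need pre-zeroing -- each
 * value starts at 0 and is flushed when the key advances.
 */
static void
toy_add_tids(uint32_t block, const uint16_t *offsets, int n)
{
	uint64_t	cur_key = UINT64_MAX;
	uint64_t	value = 0;

	for (int i = 0; i < n; i++)
	{
		uint32_t	bit;
		uint64_t	key = toy_encode_key_off(block, offsets[i], &bit);

		assert(i == 0 || offsets[i] > offsets[i - 1]);
		if (key != cur_key)
		{
			if (value != 0)
				rt_set(cur_key, value);
			cur_key = key;
			value = 0;
		}
		value |= UINT64_C(1) << bit;
	}
	if (value != 0)
		rt_set(cur_key, value);
}
```

A tid_to_key_off()-style wrapper would then just extract the block and offset from an ItemPointer and call the encoder, keeping the full conversion only where an on-disk TID is actually in hand.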