Re: Refactoring of compression options in pg_basebackup
(Added Jeevan in CC, for awareness) On Mon, Jan 03, 2022 at 03:35:57PM +, gkokola...@pm.me wrote: > I propose to initialize streamer to NULL when declared at the top of > CreateBackupStreamer(). Yes, that may be noisy. Done this way. > You can see that after the check_pg_config() test, only 3 tests follow, > namely: > * $node->command_ok() > * is(scalar(), ...) > * is($gzip_is_valid, ...) Indeed. The CF bot was complaining about that, actually. Thinking more about this stuff, pg_basebackup --compress is an option that exists already for a couple of years, and that's independent of the backend-side compression that Robert and Jeevan are working on, so I'd like to move on this code cleanup. We can always figure out the LZ4 part for pg_basebackup after, if necessary. Attached is an updated patch. The CF bot should improve with that. -- Michael From a216d5569fa6d963c2f846d9f5539d5ea7ddbd62 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 5 Jan 2022 16:58:00 +0900 Subject: [PATCH v2] Refactor options of pg_basebackup for compression level and methods --- src/bin/pg_basebackup/pg_basebackup.c| 81 +++- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 47 +++- src/bin/pg_basebackup/walmethods.c | 47 doc/src/sgml/ref/pg_basebackup.sgml | 22 +- 4 files changed, 156 insertions(+), 41 deletions(-) diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index 1739ac6382..7f29200832 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -113,6 +113,7 @@ static bool showprogress = false; static bool estimatesize = true; static int verbose = 0; static int compresslevel = 0; +static WalCompressionMethod compression_method = COMPRESSION_NONE; static IncludeWal includewal = STREAM_WAL; static bool fastcheckpoint = false; static bool writerecoveryconf = false; @@ -359,7 +360,9 @@ usage(void) printf(_(" -X, --wal-method=none|fetch|stream\n" " include required WAL files with specified method\n")); printf(_(" -z, --gzip compress tar output\n")); - printf(_(" -Z, --compress=0-9 compress tar output with given compression level\n")); + printf(_(" -Z, --compress=1-9 compress tar output with given compression level\n")); + printf(_(" --compression-method=METHOD\n" + " method to compress data\n")); printf(_("\nGeneral options:\n")); printf(_(" -c, --checkpoint=fast|spread\n" " set fast or spread checkpointing\n")); @@ -524,7 +527,7 @@ LogStreamerMain(logstreamer_param *param) stream.do_sync); else stream.walmethod = CreateWalTarMethod(param->xlog, - COMPRESSION_NONE, /* ignored */ + compression_method, compresslevel, stream.do_sync); @@ -975,7 +978,7 @@ CreateBackupStreamer(char *archive_name, char *spclocation, bool is_recovery_guc_supported, bool expect_unterminated_tarfile) { - bbstreamer *streamer; + bbstreamer *streamer = NULL; bbstreamer *manifest_inject_streamer = NULL; bool inject_manifest; bool must_parse_archive; @@ -1034,19 +1037,22 @@ CreateBackupStreamer(char *archive_name, char *spclocation, archive_file = NULL; } + if (compression_method == COMPRESSION_NONE) + streamer = bbstreamer_plain_writer_new(archive_filename, + archive_file); #ifdef HAVE_LIBZ - if (compresslevel != 0) + else if (compression_method == COMPRESSION_GZIP) { strlcat(archive_filename, ".gz", sizeof(archive_filename)); streamer = bbstreamer_gzip_writer_new(archive_filename, archive_file, compresslevel); } - else #endif - streamer = bbstreamer_plain_writer_new(archive_filename, - archive_file); - + else + { + Assert(false); /* not reachable */ + } /* * If 
we need to parse the archive for whatever reason, then we'll @@ -1765,6 +1771,7 @@ main(int argc, char **argv) {"no-manifest", no_argument, NULL, 5}, {"manifest-force-encode", no_argument, NULL, 6}, {"manifest-checksums", required_argument, NULL, 7}, + {"compression-method", required_argument, NULL, 8}, {NULL, 0, NULL, 0} }; int c; @@ -1872,14 +1879,10 @@ main(int argc, char **argv) do_sync = false; break; case 'z': -#ifdef HAVE_LIBZ -compresslevel = Z_DEFAULT_COMPRESSION; -#else -compresslevel = 1; /* will be rejected below */ -#endif +compression_method = COMPRESSION_GZIP; break; case 'Z': -if (!option_parse_int(optarg, "-Z/--compress", 0, 9, +if (!option_parse_int(optarg, "-Z/--compress", 1, 9, &compresslevel)) exit(1); break; @@ -1941,6 +1944,18 @@ main(int argc, char **argv) case 7: manifest_checksums = pg_strdup(optarg); break; + case 8: +if (pg_strcasecmp(optarg, "gzip") == 0) + compress
Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set
On Thu, Dec 16, 2021 at 11:51:55AM +0900, Michael Paquier wrote: > I may have missed one thing or two, but I think that's pretty much > what we should be looking for to do the switch to TAP in terms of > coverage. Rebased patch to cool down the CF bot, as per the addition of --no-sync to pg_upgrade. -- Michael From f7af03ec5e5c6a685796d58b6d82eb9f603565de Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 5 Jan 2022 17:11:12 +0900 Subject: [PATCH v5] Switch tests of pg_upgrade to use TAP --- src/bin/pg_upgrade/.gitignore| 5 + src/bin/pg_upgrade/Makefile | 23 +- src/bin/pg_upgrade/TESTING | 33 ++- src/bin/pg_upgrade/t/001_basic.pl| 9 + src/bin/pg_upgrade/t/002_pg_upgrade.pl | 275 ++ src/bin/pg_upgrade/test.sh | 279 --- src/test/perl/PostgreSQL/Test/Cluster.pm | 25 ++ src/tools/msvc/vcregress.pl | 94 +--- 8 files changed, 355 insertions(+), 388 deletions(-) create mode 100644 src/bin/pg_upgrade/t/001_basic.pl create mode 100644 src/bin/pg_upgrade/t/002_pg_upgrade.pl delete mode 100644 src/bin/pg_upgrade/test.sh diff --git a/src/bin/pg_upgrade/.gitignore b/src/bin/pg_upgrade/.gitignore index 2d3bfeaa50..3b64522ab6 100644 --- a/src/bin/pg_upgrade/.gitignore +++ b/src/bin/pg_upgrade/.gitignore @@ -7,3 +7,8 @@ /loadable_libraries.txt /log/ /tmp_check/ + +# Generated by pg_regress +/sql/ +/expected/ +/results/ diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile index 44d06be5a6..35b6c123a5 100644 --- a/src/bin/pg_upgrade/Makefile +++ b/src/bin/pg_upgrade/Makefile @@ -28,6 +28,12 @@ OBJS = \ override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS) LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) +# required for 002_pg_upgrade.pl +REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB +REGRESS_OUTPUTDIR=$(abs_top_builddir)/src/bin/pg_upgrade +export REGRESS_OUTPUTDIR + all: pg_upgrade pg_upgrade: $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils @@ -49,17 +55,8 @@ clean distclean maintainer-clean: pg_upgrade_dump_globals.sql \ pg_upgrade_dump_*.custom pg_upgrade_*.log -# When $(MAKE) is present, make automatically infers that this is a -# recursive make. which is not actually what we want here, as that -# e.g. prevents output synchronization from working (as make thinks -# that the subsidiary make knows how to deal with that itself, but -# we're invoking a shell script that doesn't know). Referencing -# $(MAKE) indirectly avoids that behaviour. -# See https://www.gnu.org/software/make/manual/html_node/MAKE-Variable.html#MAKE-Variable -NOTSUBMAKEMAKE=$(MAKE) +check: + $(prove_check) -check: test.sh all temp-install - MAKE=$(NOTSUBMAKEMAKE) $(with_temp_install) bindir=$(abs_top_builddir)/tmp_install/$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) $< - -# installcheck is not supported because there's no meaningful way to test -# pg_upgrade against a single already-running server +installcheck: + $(prove_installcheck) diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING index 78b9747908..43a71566e2 100644 --- a/src/bin/pg_upgrade/TESTING +++ b/src/bin/pg_upgrade/TESTING @@ -2,21 +2,30 @@ THE SHORT VERSION - On non-Windows machines, you can execute the testing process -described below by running +described below by running the following command in this directory: make check -in this directory. This will run the shell script test.sh, performing -an upgrade from the version in this source tree to a new instance of -the same version. 
-To test an upgrade from a different version, you must have a built -source tree for the old version as well as this version, and you -must have done "make install" for both versions. Then do: +This will run the TAP tests to run pg_upgrade, performing an upgrade +from the version in this source tree to a new instance of the same +version. + +To test an upgrade from a different version, there are two options +available: + +1) You have a built source tree for the old version as well as this +version's binaries. Then set up the following variables before +launching the test: export oldsrc=...somewhere/postgresql (old version's source tree) -export oldbindir=...otherversion/bin (old version's installed bin dir) -export bindir=...thisversion/bin (this version's installed bin dir) -export libdir=...thisversion/lib (this version's installed lib dir) -sh test.sh +export oldinstall=...otherversion/ (old version's install base path) + +2) You have a dump that can be used to set up the old version, as well +as this version's binaries. Then set up the following variables: +export olddump=...somewhere/dump.sql (old version's dump) +export oldinstall=...otherversion/ (old version's install base path) + +Finally, the tests can be done by
Re: row filtering for logical replication
On Tue, Dec 28, 2021 at 6:33 PM houzj.f...@fujitsu.com wrote: > > On Mon, Dec 27, 2021 9:19 PM Hou Zhijie wrote: > > On Mon, Dec 27, 2021 9:16 PM houzj.f...@fujitsu.com > > wrote: > > > On Thur, Dec 23, 2021 4:28 PM Peter Smith wrote: > > > > Here is the v54* patch set: > > > > > > Attach the v55 patch set which add the following testcases in 0002 patch. > > When reviewing the row filter patch, I found few things that could be > improved. > 1) We could transform the same row filter expression twice when >ALTER PUBLICATION ... SET TABLE WHERE (...). Because we invoke >GetTransformedWhereClause in both AlterPublicationTables() and >publication_add_relation(). I was thinking it might be better if we only >transform the expression once in AlterPublicationTables(). > > 2) When transforming the expression, we didn’t set the correct p_sourcetext. >Since we need to transforming serval expressions which belong to different >relations, I think it might be better to pass queryString down to the > actual >transform function and set p_sourcetext to the actual queryString. > I have tried the following few examples to check the error_position and it seems to be showing correct position without your 0004 patch. postgres=# create publication pub for table t1 where (10); ERROR: argument of PUBLICATION WHERE must be type boolean, not type integer LINE 1: create publication pub for table t1 where (10); ^ Also, transformPubWhereClauses() seems to be returning the same list as it was passed to it. Do we really need to return anything from transformPubWhereClauses()? -- With Regards, Amit Kapila.
Re: Refactoring of compression options in pg_basebackup
On Wednesday, January 5th, 2022 at 9:00 AM, Michael Paquier wrote: > On Mon, Jan 03, 2022 at 03:35:57PM +, gkokola...@pm.me wrote: > > I propose to initialize streamer to NULL when declared at the top of > > CreateBackupStreamer(). > > Yes, that may be noisy. Done this way. Great. > > You can see that after the check_pg_config() test, only 3 tests follow, > > namely: > > * $node->command_ok() > > * is(scalar(), ...) > > * is($gzip_is_valid, ...) > > Indeed. The CF bot was complaining about that, actually. Great. > Thinking more about this stuff, pg_basebackup --compress is an option > that exists already for a couple of years, and that's independent of > the backend-side compression that Robert and Jeevan are working on, so > I'd like to move on this code cleanup. We can always figure out the > LZ4 part for pg_basebackup after, if necessary. I agree that the cleanup in itself is helpful. It feels awkward to have two utilities under the same path, with distinct options for the same functionality. When the backend-side compression is completed, will there really be a need for client-side compression? If yes, then it seems logical to have distinct options for them and this cleanup makes sense. If not, then it seems logical to maintain the current options list and 'simply' change the internals of the code, and this cleanup makes sense. > Attached is an updated patch. The CF bot should improve with that. +1 > -- > Michael Cheers, //Georgios
Re: refactoring basebackup.c
On 1/4/22 8:07 PM, Robert Haas wrote: Before sending an email like this, it would be a good idea to read the documentation for the --server-compression option. Sure, thanks Robert. One scenario where I feel the error message is confusing, and where it needs to be a little clearer if the combination is not supported at all: if we use -z (or -Z) with -t, we get this error [edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch -z pg_basebackup: error: only tar mode backups can be compressed Try "pg_basebackup --help" for more information. but after removing the -z option the backup is in tar mode only edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch [edb@centos7tushar bin]$ ls /tmp/test0 backup_manifest base.tar -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: row filtering for logical replication
On Wed, Jan 5, 2022 at 2:45 PM Amit Kapila wrote: > > On Tue, Dec 28, 2021 at 6:33 PM houzj.f...@fujitsu.com > wrote: > > > > On Mon, Dec 27, 2021 9:19 PM Hou Zhijie wrote: > > > On Mon, Dec 27, 2021 9:16 PM houzj.f...@fujitsu.com > > > > > > wrote: > > > > On Thur, Dec 23, 2021 4:28 PM Peter Smith wrote: > > > > > Here is the v54* patch set: > > > > > > > > Attach the v55 patch set which add the following testcases in 0002 > > > > patch. > > > > When reviewing the row filter patch, I found few things that could be > > improved. > > 1) We could transform the same row filter expression twice when > >ALTER PUBLICATION ... SET TABLE WHERE (...). Because we invoke > >GetTransformedWhereClause in both AlterPublicationTables() and > >publication_add_relation(). I was thinking it might be better if we only > >transform the expression once in AlterPublicationTables(). > > > > 2) When transforming the expression, we didn’t set the correct p_sourcetext. > >Since we need to transforming serval expressions which belong to > > different > >relations, I think it might be better to pass queryString down to the > > actual > >transform function and set p_sourcetext to the actual queryString. > > > > I have tried the following few examples to check the error_position > and it seems to be showing correct position without your 0004 patch. > postgres=# create publication pub for table t1 where (10); > ERROR: argument of PUBLICATION WHERE must be type boolean, not type integer > LINE 1: create publication pub for table t1 where (10); > > ^ > I understand why the error position could vary even though it is showing the correct location in the above example after reading another related email [1]. > Also, transformPubWhereClauses() seems to be returning the same list > as it was passed to it. Do we really need to return anything from > transformPubWhereClauses()? > One more point about this function: the patch seems to be doing some work even when where clause is not specified which can be avoided. Another minor comment: +static bool pgoutput_row_filter(enum ReorderBufferChangeType changetype, Do we need to specify the 'enum' type before changetype parameter? [1] - https://www.postgresql.org/message-id/1513381.1640626456%40sss.pgh.pa.us -- With Regards, Amit Kapila.
Re: Proposal: remove obsolete hot-standby testing infrastructure
On 03.01.22 22:50, Tom Lane wrote: The attached proposed patch removes some ancient infrastructure for manually testing hot standby. I doubt anyone has used this in years, because AFAICS there is nothing here that's not done better by the src/test/recovery TAP tests. (Or if there is, we ought to migrate it into the TAP tests.) I looked into this some time ago and concluded that this test contains a significant amount of testing that isn't obviously done anywhere else. I don't have the notes anymore, and surely some things have progressed since, but I wouldn't just throw the old test suite away without actually checking.
[PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]
Tom Lane writes: > Dag Lem writes: > >> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that >> two other modules are using UTF8 characters in tests - citext and >> unaccent. > > Yeah, neither of those have been upgraded to said best practice. > (If you feel like doing the legwork to improve that situation, > that'd be great.) > Please find attached a patch to run the previously commented-out UTF8-dependent tests for citext, according to best practice. For now I don't dare to touch the unaccent module, which seems to be UTF8-only anyway. Best regards Dag Lem diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile index a7de52928d..789932fe36 100644 --- a/contrib/citext/Makefile +++ b/contrib/citext/Makefile @@ -11,7 +11,7 @@ DATA = citext--1.4.sql \ citext--1.0--1.1.sql PGFILEDESC = "citext - case-insensitive character string data type" -REGRESS = citext +REGRESS = citext citext_utf8 ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out index 3bac0534fb..48b4de8993 100644 --- a/contrib/citext/expected/citext.out +++ b/contrib/citext/expected/citext.out @@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t; t (1 row) --- Multibyte sanity tests. Uncomment to run. --- SELECT 'À'::citext = 'À'::citext AS t; --- SELECT 'À'::citext = 'à'::citext AS t; --- SELECT 'À'::text = 'à'::text AS f; -- text wins. --- SELECT 'À'::citext <> 'B'::citext AS t; --- Test combining characters making up canonically equivalent strings. --- SELECT 'Ä'::text <> 'Ä'::text AS t; --- SELECT 'Ä'::citext <> 'Ä'::citext AS t; --- Test the Turkish dotted I. The lowercase is a single byte while the --- uppercase is multibyte. This is why the comparison code can't be optimized --- to compare string lengths. --- SELECT 'i'::citext = 'İ'::citext AS t; --- Regression. --- SELECT 'láska'::citext <> 'laská'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive; --- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative; -- Test > and >= SELECT 'B'::citext > 'a'::citext AS t; t diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out index 57fc863f7a..8ab4d4224e 100644 --- a/contrib/citext/expected/citext_1.out +++ b/contrib/citext/expected/citext_1.out @@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t; t (1 row) --- Multibyte sanity tests. Uncomment to run. --- SELECT 'À'::citext = 'À'::citext AS t; --- SELECT 'À'::citext = 'à'::citext AS t; --- SELECT 'À'::text = 'à'::text AS f; -- text wins. --- SELECT 'À'::citext <> 'B'::citext AS t; --- Test combining characters making up canonically equivalent strings. --- SELECT 'Ä'::text <> 'Ä'::text AS t; --- SELECT 'Ä'::citext <> 'Ä'::citext AS t; --- Test the Turkish dotted I. The lowercase is a single byte while the --- uppercase is multibyte. This is why the comparison code can't be optimized --- to compare string lengths. 
--- SELECT 'i'::citext = 'İ'::citext AS t; --- Regression. --- SELECT 'láska'::citext <> 'laská'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive; --- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative; -- Test > and >= SELECT 'B'::citext > 'a'::citext AS t; t diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out new file mode 100644 index 00..1f4fa79aff --- /dev/null +++ b/contrib/citext/expected/citext_utf8.out @@ -0,0 +1,119 @@ +/* + * This test must be run in a database with UTF-8 encoding, + * because other encodings don't support all the characters used. + */ +SELECT getdatabaseencoding() <> 'UTF8' + AS skip_test \gset +\if :skip_test +\quit +\endif +set client_encoding
RE: Optionally automatically disable logical replication subscriptions on error
On Tuesday, December 28, 2021 11:53 AM Wang, Wei/王 威 wrote: > On Thursday, December 16, 2021 8:51 PM osumi.takami...@fujitsu.com > wrote: > > Attached the updated patch v14. > > A comment to the timing of printing a log: Thank you for your review ! > After the log[1] was printed, I altered subscription's option > (DISABLE_ON_ERROR) from true to false before invoking > DisableSubscriptionOnError to disable subscription. Subscription was not > disabled. > [1] "LOG: logical replication subscription "sub1" will be disabled due to an > error" > > I found this log is printed in function WorkerErrorRecovery: > + ereport(LOG, > + errmsg("logical replication subscription \"%s\" will > be disabled due to an error", > +MySubscription->name)); > This log is printed here, but in DisableSubscriptionOnError, there is a check > to > confirm subscription's disableonerr field. If disableonerr is found changed > from > true to false in DisableSubscriptionOnError, subscription will not be > disabled. > > In this case, "disable subscription" is printed, but subscription will not be > disabled actually. > I think it is a little confused to user, so what about moving this message > after > the check which is mentioned above in DisableSubscriptionOnError? Makes sense. I moved the log print to after the check of whether the subscription needs to be disabled. I've also scrutinized and refined the new TAP test as part of this refactoring. As a result, I simplified wait_for_subscriptions() by removing some extra code, such as an escaped variable and one part of the WHERE clause. Another change is to replace two calls of wait_for_subscriptions() with polling_query_until() for the subscriber, to make the tests better suited to their purposes. Again, for this refinement, I've run a tight loop test to check for timing issues and found no problem. Best Regards, Takamichi Osumi v15-0001-Optionally-disable-subscriptions-on-error.patch Description: v15-0001-Optionally-disable-subscriptions-on-error.patch
Re: row filtering for logical replication
BTW I think it's not great to commit with the presented split. We would have non-trivial short-lived changes for no good reason (0002 in particular). I think this whole series should be a single patch, with the commit message being a fusion of messages explaining in full what the functional change is, listing all the authors together. Having a commit message like in 0001 where all the distinct changes are explained in separate sections with each section listing its own author, does not sound very useful or helpful. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end." (2nd Commandment for C programmers)
Re: Pluggable toaster
On Thu, 30 Dec 2021 at 16:40, Teodor Sigaev wrote: > We are working on custom toaster for JSONB [1], because current TOAST is > universal for any data type and because of that it has some disadvantages: > - "one toast fits all" may be not the best solution for particular > type or/and use cases > - it doesn't know the internal structure of data type, so it cannot > choose an optimal toast strategy > - it can't share common parts between different rows and even > versions of rows Agreed, Oleg has made some very clear analysis of the value of having a higher degree of control over toasting from within the datatype. In my understanding, we want to be able to 1. Access data from a toasted object one slice at a time, by using knowledge of the structure 2. If toasted data is updated, then update a minimum number of slice(s), without rewriting the existing slices 3. If toasted data is expanded, then allow new slices to be appended to the object without rewriting the existing slices > Modification of current toaster for all tasks and cases looks too > complex, moreover, it will not works for custom data types. Postgres > is an extensible database, why not to extent its extensibility even > further, to have pluggable TOAST! We propose an idea to separate > toaster from heap using toaster API similar to table AM API etc. > Following patches are applicable over patch in [1] ISTM that we would want the toast algorithm to be associated with the datatype, not the column? Can you explain your thinking? We already have the Expanded toast format, in-memory, which was designed specifically to allow us to access sub-structure of the datatype in-memory. So I was expecting to see an Expanded, on-disk, toast format that roughly matched that concept, since Tom has already shown us the way (varatt_expanded). This would be usable by both JSON and PostGIS. Some other thoughts: I imagine the data type might want to keep some kind of dictionary inside the main toast pointer, so we could make allowance for some optional datatype-specific private area in the toast pointer itself, allowing a mix of inline and out-of-line data, and/or a table of contents to the slices. I'm thinking we could also tackle these things at the same time: * We want to expand TOAST to 64-bit pointers, so we can have more pointers in a table * We want to avoid putting the data length into the toast pointer, so we can allow the toasted data to be expanded without rewriting everything (to avoid O(N^2) cost) -- Simon Riggs http://www.EnterpriseDB.com/
Re: refactoring basebackup.c
On Wed, Jan 5, 2022 at 5:11 AM tushar wrote: > One scenario where I feel error message is confusing and if it is not > supported at all then error message need to be a little bit more clear > > if we use -z (or -Z ) with -t , we are getting this error > [edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch -z > pg_basebackup: error: only tar mode backups can be compressed > Try "pg_basebackup --help" for more information. > > but after removing -z option backup is in tar mode only > > edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch > [edb@centos7tushar bin]$ ls /tmp/test0 > backup_manifest base.tar OK, fair enough, I can adjust the error message for that case. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Refactoring of compression options in pg_basebackup
On Mon, Dec 20, 2021 at 7:43 PM Michael Paquier wrote: > Yeah, consistency would be good. For the client-side compression of > LZ4, we have shaped things around the existing --compress option, and > there is 6f164e6 that offers an API to parse that at option-level, > meaning less custom error strings. I'd like to think that splitting > the compression level and the compression method is still the right > choice, except if --server-compression combined with a client-side > compression is a legal combination. This would not really make sense, > though, no? So we'd better block this possibility from the start? Right. It's blocked right now, but Tushar noticed on the other thread that the error message isn't as good as it could be, so I'll improve that a bit. Still the issue wasn't overlooked. > > If they don't decompress the backup and run pg_verifybackup on it then > > I'm not sure how much they help. Yet, I don't know how to do that > > portably. > > They help in checking that an environment does not use a buggy set of > GZIP, at least. Using pg_verifybackup on a base backup with > --format='t' could be tweaked with $ENV{TAR} for the tarballs > generation, for example, as we do in some other tests. Option sets > like "xvf" or "zxvf" should be rather portable across the buildfarm, > no? I'd like to think that this is not a requirement for adding > checks in the compression path, as a first step, though, but I agree > that it could be extended more. Oh, well, if we have a working tar available, then it's not so bad. I was thinking we couldn't really count on that, especially on Windows. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Refactoring of compression options in pg_basebackup
Robert Haas writes: > Oh, well, if we have a working tar available, then it's not so bad. I > was thinking we couldn't really count on that, especially on Windows. I think the existing precedent is to skip the test if tar isn't there, cf pg_basebackup/t/010_pg_basebackup.pl. But certainly the majority of buildfarm animals have it. regards, tom lane
Re: Refactoring of compression options in pg_basebackup
On Wed, Jan 5, 2022 at 4:17 AM wrote: > When the backend-side compression is completed, will there really be a need > for > client-side compression? If yes, then it seems logical to have distinct > options > for them and this cleanup makes sense. If not, then it seems logical to > maintain > the current options list and 'simply' change the internals of the code, and > this > cleanup makes sense. I think we're going to want to offer both options. We can't know whether the user prefers to consume CPU cycles on the server or on the client. Compressing on the server has the advantage of potentially saving transfer bandwidth, but the server is also often the busiest part of the whole system, and users are often keen to offload as much work as possible. Given that, I'd like us to be thinking about what the full set of options looks like once we have (1) compression on either the server or the client and (2) multiple compression algorithms and (3) multiple compression levels. Personally, I don't really like the decision made by this proposed patch. In this patch's view of the world, -Z is a way of providing the compression level for whatever compression algorithm you happen to have selected, but I think of -Z as being the upper-case version of -z, which I think of as selecting specifically gzip. It's not particularly intuitive to me that in a command like pg_basebackup --compress=NUMBER, the NUMBER is a compression level rather than an algorithm. So what I would propose is probably something like: pg_basebackup --compress=ALGORITHM [--compression-level=NUMBER] pg_basebackup --server-compress=ALGORITHM [--compression-level=NUMBER] And then make -z short for --compress=gzip and -Z NUMBER short for --compress=gzip --compression-level=NUMBER. That would be a backward-incompatible change to the definition of --compress, but as long as -Z works the same as today, I don't think many people will notice. If we like, we can notice if the argument to --compress is an integer and suggest using either -Z or --compress=gzip --compression-level=NUMBER instead. In the proposed patch, you end up with pg_basebackup --compression-method=lz4 -Z2 meaning compression with lz4 level 2. I find that quite odd, though as with all such things, opinions may vary. In my proposal, that would be an error, because it would be equivalent to --compress=lz4 --compress=gzip --compression-level=2, and would thus involve conflicting compression method specifications. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SQL:2011 application time
On 21.11.21 02:51, Paul A Jungwirth wrote: Here are updated patches. They are rebased and clean up some of my TODOs. This patch set looks very interesting. It's also very big, so it's difficult to see how to get a handle on it. I did a pass through it to see if there were any obvious architectural or coding style problems. I also looked at some of your TODO comments to see if I had something to contribute there. I'm confused about how to query tables based on application time periods. Online, I see examples using AS OF, but in the SQL standard I only see this used for system time, which we are not doing here. What is your understanding of that? v10-0001-Add-PERIODs.patch src/backend/commands/tablecmds.c Might be worth explaining somewhere why AT_PASS_ADD_PERIOD needs to be its own pass. -- Ah, this is explained in ATPrepCmd(). Maybe that is okay, but I would tend to prefer a comprehensive explanation here rather than sprinkled around. make_period_not_backward(): Hardcoding the name of the operator as "<" is not good. You should perhaps lookup the less-than operator in the type cache. Look around for TYPECACHE_LT_OPR for how this is usually done. validate_period(): Could use an explanatory comment. There are a bunch of output arguments, and it's not clear what all of this is supposed to do, and what "validating" is in this context. MergeAttributes(): I would perhaps initially just prohibit inheritance situations that involve periods on either side. (It should work for partitioning, IMO, but that should be easier to arrange.) AlterTableGetLockLevel(): The choice of AccessExclusiveLock looks correct. I think the whole thing can also be grouped with some of the other "affects concurrent SELECTs" cases? Maybe the node type Period could have a slightly more specific name, perhaps PeriodDef, analogous to ColumnDef? I didn't follow why indexes would have periods, for example, the new period field in IndexStmt. Is that explained anywhere? While reading this patch I kept wondering whether it would be possible to fold periods into pg_attribute, perhaps with negative attribute numbers. Have you looked into something like that? No doubt it's also complicated, but it might simplify some things, like the name conflict checking. v10-0002-Add-temporal-PRIMARY-KEY-and-UNIQUE-constraints.patch src/backend/catalog/Catalog.pm: I see you use this change in the subsequent patches, but I would recommend skipping all this. The comments added are kind of redundant with the descr fields anyway. transformIndexConstraint(): As above, we can't look up the && operator by name. In this case, I suppose we should look it up through the index AM support operators. Further, the additions to this function are very complicated and not fully explained. I'm suspicious about things like findNewOrOldColumn() -- generally we should look up columns by number not name. Perhaps you can add a header comment or split out the code further into smaller functions. pg_dump.c getIndexes() has been refactored since to make version-specific additions easier. But your patch is now failing to apply because of this. Of course, the main problem in this patch is that for most uses it requires btree_gist. I think we should consider moving that into core, or at least the support for types that are most relevant to this functionality, specifically the date/time types. Aside from user convenience, this would also allow writing more realistic test cases. 
v10-0003-Add-UPDATE-DELETE-FOR-PORTION-OF.patch Use of MINVALUE and MAXVALUE for unbounded seems problematic to me. (If it is some value, it is not really larger than any value.) We have the keyword UNBOUNDED, which seems better suited. src/backend/access/brin/brin_minmax_multi.c These renaming changes seem unrelated (but still seem like a good idea). Should they be progressed separately? Again, some hardcoded operator name lookup in this patch. I don't understand why a temporal primary key is required for doing UPDATE FOR PORTION OF. I don't see this in the standard. v10-0004-Add-temporal-FOREIGN-KEYs.patch Do we really need different trigger names depending on whether the foreign key is temporal? range_as_string() doesn't appear to be used anywhere. I ran out of steam on this patch, it's very big. But it seems sound in general. How to proceed. I suppose we could focus on committing 0001 and 0002 first. That would be a sensible feature set even if the remaining patches did not make a release. I do feel we need to get btree_gist into core. That might be a big job by itself. I'm also bemused why btree_gist is so bloated compared to btree_gin. btree_gin uses macros to eliminate duplicate code where btree_gist is full of copy-and-paste. So there are some opportunities there to make things more compact. Is there anything else you think we can do as preparatory work to make the main patches more manageable?
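For reference, the type-cache lookup suggested above for make_period_not_backward() usually looks roughly like this (a generic sketch, not taken from the patch; element_typeid stands in for the period's element type):

    TypeCacheEntry *typentry;

    /* look up the default btree "<" operator for the period's element type */
    typentry = lookup_type_cache(element_typeid, TYPECACHE_LT_OPR);
    if (!OidIsValid(typentry->lt_opr))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("could not identify a less-than operator for type %s",
                        format_type_be(element_typeid))));

    /* typentry->lt_opr can then be used instead of a hardcoded "<" lookup */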
Re: Converting WAL to SQL
On Tue, Jan 4, 2022 at 10:47:47AM -0300, Fabrízio de Royes Mello wrote: > > On Tue, Jan 4, 2022 at 9:22 AM Michael Paquier wrote: > > > > On Wed, Dec 29, 2021 at 08:50:23AM -0300, Fabrízio de Royes Mello wrote: > > > Try this: > > > https://github.com/michaelpq/pg_plugins/tree/main/decoder_raw > > > > You may want to be careful with this, and I don't know if anybody is > > using that for serious cases so some spots may have been missed. > > > > I used it in the past during a major upgrade process from 9.2 to 9.6. > > What we did was decode the 9.6 wal files and apply transactions to the > old 9.2 to keep it in sync with the new promoted version. This was our > "rollback" strategy if something went wrong with the new 9.6 version. How did you deal with the issue that SQL isn't granular enough (vs. row-level changes) to reproduce the result reliably, as outlined here? https://momjian.us/main/blogs/pgblog/2019.html#March_6_2019 -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Index-only scan for btree_gist turns bpchar to char
On Wed, 05 Jan 2022 at 03:19, Tom Lane wrote: > Alexander Lakhin writes: >> While testing the index-only scan fix, I've discovered that replacing >> the index-only scan with the index scan changes contrib/btree_gist >> output because index-only scan for btree_gist returns a string without >> padding. > > Ugh, yeah. This seems to be because gbt_bpchar_compress() strips > trailing spaces (using rtrim1) before storing the value. The > idea evidently is to simplify gbt_bpchar_consistent, but it's not > acceptable if the opclass is supposed to support index-only scan. > > I see two ways to fix this: > > * Disallow index-only scan, by removing the fetch function for this > opclass. This'd require a module version bump, so people wouldn't > get that fix automatically. > > * Change gbt_bpchar_compress to not trim spaces (it becomes just > like gbt_text_compress), and adapt gbt_bpchar_consistent to cope. > This does nothing for the problem immediately, unless you REINDEX > affected indexes --- but over time an index's entries would get > replaced with untrimmed versions. > > I also wondered if we could make the fetch function reconstruct the > padding, but it doesn't seem to have access to the necessary info. > If we fix this In the second way, the range query has the same results in both seq scan and index only scan. However, it will incur other problems. For the following query: SELECT *, octet_length(a) FROM chartmp WHERE a = '31b0'; Currently, we can get a | octet_length --+-- 31b0 |4 After fixed, we cannot get any result. For the equal condition, we must put the extra spaces to make it work. Here is a patch for POC testing. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. diff --git a/contrib/btree_gist/btree_text.c b/contrib/btree_gist/btree_text.c index 8019d11281..5fd425047f 100644 --- a/contrib/btree_gist/btree_text.c +++ b/contrib/btree_gist/btree_text.c @@ -121,16 +121,7 @@ gbt_bpchar_compress(PG_FUNCTION_ARGS) } if (entry->leafkey) - { - - Datum d = DirectFunctionCall1(rtrim1, entry->key); - GISTENTRY trim; - - gistentryinit(trim, d, - entry->rel, entry->page, - entry->offset, true); - retval = gbt_var_compress(&trim, &tinfo); - } + retval = gbt_var_compress(entry, &tinfo); else retval = entry; @@ -179,7 +170,6 @@ gbt_bpchar_consistent(PG_FUNCTION_ARGS) boolretval; GBT_VARKEY *key = (GBT_VARKEY *) DatumGetPointer(entry->key); GBT_VARKEY_R r = gbt_var_key_readable(key); - void *trim = (void *) DatumGetPointer(DirectFunctionCall1(rtrim1, PointerGetDatum(query))); /* All cases served by this function are exact */ *recheck = false; @@ -189,7 +179,7 @@ gbt_bpchar_consistent(PG_FUNCTION_ARGS) tinfo.eml = pg_database_encoding_max_length(); } - retval = gbt_var_consistent(&r, trim, strategy, PG_GET_COLLATION(), + retval = gbt_var_consistent(&r, query, strategy, PG_GET_COLLATION(), GIST_LEAF(entry), &tinfo, fcinfo->flinfo); PG_RETURN_BOOL(retval); }
Re: Index-only scan for btree_gist turns bpchar to char
Japin Li writes: > Here is a patch for POC testing. This is certainly not right. You've made gbt_bpchar_consistent work identically to gbt_text_consistent, but it needs to implement a test equivalent to bpchareq, ie ignore trailing spaces in both inputs. The minimum-effort fix would be to apply rtrim1 to both strings in gbt_bpchar_consistent, but I wonder if we can improve on that by pushing the ignore-trailing-spaces behavior further down. I didn't look yet at whether gbt_var_consistent can support any type-specific behavior. regards, tom lane
Re: refactoring basebackup.c
On Tue, Dec 28, 2021 at 1:12 PM Jeevan Ladhe wrote: > Hi Tushar, > > You need to apply Robert's v10 version patches 0002, 0003 and 0004, before > applying the lz4 patch(v8 version). > Please let me know if you still face any issues. > Thanks, Jeevan. I tested the --server-compression option using various other options of pg_basebackup, and also checked that -t/--server-compression from the v15 pg_basebackup will throw an error if the server version is v14 or below. Things are looking good to me. Two open issues - 1) the lz4 value is missing for --server-compression in pg_basebackup --help 2) error messages need to improve if using -t server with -z/-Z regards,
Re: SQL/JSON: functions
On 1/5/22 00:51, Himanshu Upadhyaya wrote: > On Thu, Dec 9, 2021 at 7:34 PM Himanshu Upadhyaya > wrote: >> 3) >> Is not that result of the two below queries should match because both are >> trying to retrieve the information from the JSON object. >> >> postgres=# SELECT JSON_OBJECT('track' VALUE '{ >> "segments": [ >> { >> "location": [ 47.763, 13.4034 ], >> "start time": "2018-10-14 10:05:14", >> "HR": 73 >> }, >> { >> "location": [ 47.706, 13.2635 ], >> "start time": "2018-10-14 101:39:21", >> "HR": 135 >> } >> ] >> } >> }')->'track'->'segments'; >> ?column? >> -- >> >> (1 row) >> >> postgres=# select '{ >> "track": { >> "segments": [ >> { >> "location": [ 47.763, 13.4034 ], >> "start time": "2018-10-14 10:05:14", >> "HR": 73 >> }, >> { >> "location": [ 47.706, 13.2635 ], >> "start time": "2018-10-14 10:39:21", >> "HR": 135 >> } >> ] >> } >> }'::jsonb->'track'->'segments'; >> >> ?column? >> --- >> [{"HR": 73, "location": [47.763, 13.4034], "start time": "2018-10-14 >> 10:05:14"}, {"HR": 135, "location": [47.706, 13.2635], "start time": >> "2018-10-14 10:39:21"}] >> (1 row) >> > just wanted to check your opinion on the above, is this an expected behaviour? Your VALUE clause is actually not legal JSON - it has one too many braces at the end. The reason postgres didn't complain about it is that JSON_OBJECT is treating it as a string. If you correct the JSON and cast it as jsonb you get the desired result: andrew=# SELECT JSON_OBJECT('track' VALUE '{ "segments": [ { "location": [ 47.763, 13.4034 ], "start time": "2018-10-14 10:05:14", "HR": 73 }, { "location": [ 47.706, 13.2635 ], "start time": "2018-10-14 101:39:21", "HR": 135 } ] }'::jsonb)->'track'->'segments'; ?column? [{"HR": 73, "location": [47.763, 13.4034], "start time": "2018-10-14 10:05:14"}, {"HR": 135, "location": [47.706, 13.2635], "start time": "2018-10-14 101:39:21"}] (1 row) >> Few comments For 0002-SQL-JSON-constructors-v59.patch: > Also, any thoughts on this? I will look at that separately. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: make tuplestore helper function
Thanks for the detailed review! On Fri, Dec 17, 2021 at 3:04 PM Justin Pryzby wrote: > > On Mon, Dec 13, 2021 at 05:27:52PM -0500, Melanie Plageman wrote: > > Done in attached v4 > > Thanks. > > I think these comments can be removed from the callers: > /* it's a simple type, so don't use get_call_result_type */ Done in attached v5 > You removed one call to tuplestore_donestoring(), but not the others. > I guess you could remove them all (optionally as a separate patch). I removed it in that one location because I wanted to get rid of the local variable it used. I am fine with removing the other occurrences, but I thought there might be some reason why instead of removing it, they made it into a no-op. > The rest of these are questions more than comments, and I'm not necessarily > suggesting to change the patch: > > This isn't new in your patch, but for me, it's more clear if the callers have > a > separate boolean for randomAccess, rather than having the long expression in > the function call: > > + tupstore = MakeFuncResultTuplestore(fcinfo, NULL, > + rsinfo->allowedModes & SFRM_Materialize_Random); > > vs > > randomAccess = (rsinfo->allowedModes & SFRM_Materialize_Random) != 0; > - tupstore = tuplestore_begin_heap(randomAccess, false, work_mem); > + tupstore = MakeFuncResultTuplestore(fcinfo, NULL, randomAccess); > > One could also argue for passing rsinfo->allowedModes, instead of > (rsinfo->allowedModes & SFRM_Materialize_Random). I've changed the patch to have MakeFuncResultTuplestore() check rsinfo->allowedModes and pass in the resulting boolean to tuplestore_begin_heap(). Because I modified the MakeFuncResultTuplestore() function signature to accommodate this, I had to collapse the two patches into one, as I couldn't maintain the call sites which passed in a constant. > There's a couples places that you're checking expectedDesc where it wasn't > being checked before. Is that some kind of live bug ? > pg_config() text_to_table() Yes, expectedDesc was accessed in these locations without checking that it is not NULL. Should that be a separate patch? > It'd be nice if the "expected tuple format" error didn't need to be duplicated > for each SRFs. I suppose the helper function could just take a boolean > determining whether or not to throw an error (passing "expectedDesc==NULL"). > You'd have to call the helper before CreateTupleDescCopy() - which you're > already doing in a couple places for similar reasons. So, I don't think it makes sense to move that error message into MakeFuncResultTuplestore() for the cases in which NULL is passed in for the tuple descriptor. expectedDesc is not accessed at all in these cases inside the function (since get_call_result_type()) is not called. For the cases when a tuple descriptor is passed in, only two callers actually check for expectedDesc -- each_worker_jsonb and each_worker. Therefore, I don't think it makes sense to add another parameter just to move that check inside for two callers. > I noticed that tuplestore.h isn't included most places. I suppose it's > included via execnodes.h. Your patch doesn't change that, arguably it > should've always been included. I've now included it in funcapi.c. - Melanie From 8ea7f90162f73049541fb6f53e714298a5ecfc41 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Tue, 2 Nov 2021 12:20:55 -0400 Subject: [PATCH v5] Add helper to make tuplestore And use it to refactor out existing duplicate code. 
Note that for func-api callers previously calling tuplestore_begin_heap() with a constant value for the randomAccess parameter, by using MakeFuncResultTuplestore(), they are now passing true or false based on whether or not SFRM_Materialize_Random is set in rsinfo->allowedModes. This is consistent with the instructions in src/backend/utils/fmgr/README, which state that tuplestores "must be created with randomAccess = true if SFRM_Materialize_Random is set in allowedModes, but it can (and preferably should) be created with randomAccess = false if not". Author: Melanie Plageman Reviewed-by: Justin Pryzby Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_azyd1Z3W_r7Ou4sorTjRCs%2BPxeHw1CWJeXKofkE6TuZg%40mail.gmail.com --- src/backend/access/transam/xlogfuncs.c| 16 +--- src/backend/commands/event_trigger.c | 32 +-- src/backend/commands/extension.c | 48 +-- src/backend/commands/prepare.c| 15 +--- src/backend/foreign/foreign.c | 51 +-- src/backend/libpq/hba.c | 19 + src/backend/replication/logical/launcher.c| 16 +--- .../replication/logical/logicalfuncs.c| 18 +--- src/backend/replication/logical/origin.c | 19 + src/backend/replication/slotfuncs.c | 16 +--- src/backend/replication/walsender.c | 16 +--- src/backend/storage/ipc/shmem.c | 16 +--- src/backend/utils/adt/datetime.c | 1
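Following the description above, the helper's shape is roughly as below (a sketch only; the exact signature and error handling are in the attached patch, and any differences here are mine):

    Tuplestorestate *
    MakeFuncResultTuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc)
    {
        ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
        bool        randomAccess;

        /* per src/backend/utils/fmgr/README, honor what allowedModes asks for */
        randomAccess = (rsinfo->allowedModes & SFRM_Materialize_Random) != 0;

        /* callers returning a composite type pass a tupdesc to fill in */
        if (tupdesc != NULL &&
            get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "return type must be a row type");

        return tuplestore_begin_heap(randomAccess, false, work_mem);
    }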
Re: Converting WAL to SQL
On Thu, Jan 6, 2022 at 12:19 AM Bruce Momjian wrote: > > On Tue, Jan 4, 2022 at 10:47:47AM -0300, Fabrízio de Royes Mello wrote: > > > > > > What we did was decode the 9.6 wal files and apply transactions to the > > old 9.2 to keep it in sync with the new promoted version. This was our > > "rollback" strategy if something went wrong with the new 9.6 version. > > How did you deal with the issue that SQL isn't granular enough (vs. > row-level changes) to reproduce the result reliably, as outlined here? This is a logical decoding plugin, so it's SQL containing decoded row-level changes. It will behave the same as a publication/subscription (apart from being far less performant, due to being plain SQL of course).
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
Hi, On 2021-12-29 11:31:51 -0800, Andres Freund wrote: > That's pretty much the same - XLogInsert() can trigger an > XLogWrite()/Flush(). > > I think it's a complete no-go to add throttling to these places. It's quite > possible that it'd cause new deadlocks, and it's almost guaranteed to have > unintended consequences (e.g. replication falling back further because > XLogFlush() is being throttled). I thought of another way to implement this feature. What if we checked the current distance somewhere within XLogInsert(), but only set InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we check if XLogDelayPending is true and sleep the appropriate time. That way the sleep doesn't happen with important locks held / within a critical section, but we still delay close to where we went over the maximum lag. And the overhead should be fairly minimal. I'm doubtful that implementing the waits on a transactional level provides a meaningful enough amount of control - there's just too much WAL that can be generated within a transaction. Greetings, Andres Freund
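To make the shape of that idea a bit more concrete, here is a minimal sketch; XLogDelayPending, the lag check, and the delay computation are hypothetical names invented for illustration, while InterruptPending, ProcessInterrupts() and pg_usleep() are the existing mechanisms:

    /*
     * In XLogInsert(): only set flags, so nothing expensive happens while
     * locks are held or inside a critical section.
     */
    if (ReplicationLagExceedsLimit())       /* hypothetical check */
    {
        InterruptPending = true;
        XLogDelayPending = true;            /* hypothetical new flag */
    }

    /* In ProcessInterrupts(): perform the actual sleep at the next safe point. */
    if (XLogDelayPending)
    {
        XLogDelayPending = false;
        pg_usleep(ComputeWalThrottleDelay());   /* hypothetical delay policy */
    }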
Re: Add 64-bit XIDs into PostgreSQL 15
On Thu, 30 Dec 2021 at 13:19, Maxim Orlov wrote: > > Hi, hackers! > > Long time wraparound was a really big pain for highly loaded systems. One > source of performance degradation is the need to vacuum before every > wraparound. > And there were several proposals to make XIDs 64-bit like [1], [2], [3] and > [4] to name a few. Very good to see this revived. > PFA updated working patch v6 for PG15 development cycle. > It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, > refactoring and was rebased to PG15. > > Main changes: > - Change TransactionId to 64bit This sounds like a good route to me. > - Disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax > remains 32bit > -- 32bit xid is named ShortTransactionId now. > -- Exception: see "double xmax" format below. > - Heap page format is changed to contain xid and multixact base value, > tuple's xmin and xmax are offsets from. > -- xid_base and multi_base are stored as a page special data. PageHeader >remains unmodified. > -- If after upgrade page has no free space for special data, tuples are >converted to "double xmax" format: xmin became virtual >FrozenTransactionId, xmax occupies the whole 64bit. >Page converted to new format when vacuum frees enough space. > - In-memory tuples (HeapTuple) were enriched with copies of xid_base and > multi_base from a page. I think we need more Overview of Behavior than is available with this patch, perhaps in the form of a README, such as in src/backend/access/heap/README.HOT. Most people's comments are about what the opportunities and problems caused, and mine are definitely there also. i.e. explain the user visible behavior. Please explain the various new states that pages can be in and what the effects are, My understanding is this would be backwards compatible, so we can upgrade to it. Please confirm. Thanks -- Simon Riggshttp://www.EnterpriseDB.com/
Re: [Proposal] Add foreign-server health checks infrastructure
On Tue, Jan 4, 2022 at 10:21 PM Shinya Kato wrote: > Thank you for the new patch! > > On 2021-12-15 15:40, kuroda.hay...@fujitsu.com wrote: > > Dear Kato-san, > > > > Thank you for giving comments! And sorry for late reply. > > I rebased my patches. > > > >> Even for local-only transaction, I thought it useless to execute > >> CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I > >> make it so that it determines outside whether it contains SQL to the > >> remote or not? > > > > Yeah, remote-checking timeout will be enable even ifa local > > transaction is opened. > > In my understanding, postgres cannot distinguish whether opening > > transactions > > are using only local object or not. > > My first idea was that turning on the timeout when GetFdwRoutineXXX > > functions > > were called, but in some cases FDWs may not use remote connection even > > if > > they call such a handler function. e.g. postgresExplainForeignScan(). > > Hence I tried another approach that unregister all checking callback > > the end of each transactions. Only FDWs can notice that transactions > > are remote or local, > > so they must register callback when they really connect to other > > database. > > This implementation will avoid calling remote checking in the case of > > local transaction, > > but multiple registering and unregistering may lead another overhead. > > I attached which implements that. > > > It certainly incurs another overhead, but I think it's better than the > previous one. > So far, I haven't encountered any problems, but I'd like other people's > opinions. > > >> The following points are minor. > >> 1. In foreign.c, some of the lines are too long and should be broken. > >> 2. In pqcomm.c, the newline have been removed, what is the intention > >> of > >> this? > >> 3. In postgres.c, > >> 3-1. how about inserting a comment between lines 2713 and 2714, > >> similar > >> to line 2707? > >> 3-2. the line breaks in line 3242 and line 3243 should be aligned. > >> 3-3. you should change > >> /* > >> * Skip checking foreign servers while reading > >> messages. > >> */ > >> to > >> /* > >> * Skip checking foreign servers while reading > >> messages. > >> */ > >> 4. In connection.c, There is a typo in line 1684, so "fucntion" should > >> be changed to "function". > > > > Maybe all of them were fixed. Thanks! > Thank you, and it looks good to me. > > > Hi, + UnregisterAllCheckingRemoteServersCallback(); UnregisterAllCheckingRemoteServersCallback -> UnregisterAllCheckingRemoteServersCallbacks + CheckingRemoteServersCallbackItem *item; + item = fdw_callbacks; The above two lines can be combined. +UnregisterCheckingRemoteServersCallback(CheckingRemoteServersCallback callback, + void *arg) Is the above func called anywhere ? + if (item->callback == callback && item->arg == arg) Is comparing argument pointers stable ? What if the same argument is cloned ? + CallCheckingRemoteServersCallbacks(); + + if (remote_servers_connection_check_interval > 0) + enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT, + remote_servers_connection_check_interval); Should the call to CallCheckingRemoteServersCallbacks() be placed under the if block checking remote_servers_connection_check_interval ? If remote_servers_connection_check_interval is 0, it seems the callbacks shouldn't be called. Cheers
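For clarity, the rearrangement suggested in the last comment would look roughly like this (names taken from the quoted patch; this is only a sketch of the suggestion, not patch text):

    if (remote_servers_connection_check_interval > 0)
    {
        CallCheckingRemoteServersCallbacks();
        enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
                             remote_servers_connection_check_interval);
    }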
Re: [PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]
Dag Lem writes: > Please find attached a patch to run the previously commented-out > UTF8-dependent tests for citext, according to best practice. For now I > don't dare to touch the unaccent module, which seems to be UTF8-only > anyway. I tried this on a bunch of different locale settings and concluded that we need to restrict the locale to avoid failures: it falls over with locale C. With that, it passes on all UTF8 LANG settings on RHEL8 and FreeBSD 12, and all except am_ET.UTF-8 on current macOS. I'm not sure what the deal is with am_ET, but macOS has a long and sad history of wonky UTF8 locales, so I was actually expecting worse. If the buildfarm shows more problems, we can restrict it further --- I won't be too upset if we end up restricting to just Linux systems, like collate.linux.utf8. Anyway, pushed to see what happens. regards, tom lane
Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
Hi, On 2021-12-28 15:49:00 +, Jelte Fennema wrote: > The first patch is a cleaned up version of my previous patch. I think I > addressed > all feedback on the previous version in that patch (e.g. removed windows > code, > fixed formatting). To me it seems a bit problematic to introduce a divergence between windows / everything else here. Isn't that just going to lead to other complaints just like this thread, where somebody discovered the hard way that there's platform dependent behaviour here? > The second patch is a new one, it implements honouring of the connect_timeout > connection option in PQcancel. This patch requires the first patch to also be > applied, > but since it seemed fairly separate and the code is not trivial I didn't want > the first > patch to be blocked on this. > > Finally, I would love it if once these fixes are merged the would also be > backpatched to > previous versions of libpq. Does that seem possible? As far as I can tell it > would be fine, > since it doesn't really change any of the public APIs. The only change is > that the pg_cancel > struct now has a few additional fields. But since that struct is defined in > libpq-int.h, so that > struct should not be used by users of libpq directly, right?. I'm not really convinced this is a good patch to backpatch. There does seem to be some potential for subtle breakage - code in signal handlers is notoriously finnicky, it's a rarely exercised code path, etc. It's also not fixing something that previously worked. > + * NOTE: These socket options are currently not set for Windows. The > + * reason is that signal safety in this function is very important, and > it > + * was not clear to if the functions required to set the socket options > on > + * Windows were signal-safe. > + */ > +#ifndef WIN32 > + if (!IS_AF_UNIX(cancel->raddr.addr.ss_family)) > + { > +#ifdef TCP_USER_TIMEOUT > + if (cancel->pgtcp_user_timeout >= 0) > + { > + if (setsockopt(tmpsock, IPPROTO_TCP, TCP_USER_TIMEOUT, > +(char *) > &cancel->pgtcp_user_timeout, > + > sizeof(cancel->pgtcp_user_timeout)) < 0) > + { > + strlcpy(errbuf, "PQcancel() -- > setsockopt(TCP_USER_TIMEOUT) failed: ", errbufsize); > + goto cancel_errReturn; > + } > + } > +#endif > + > + if (cancel->keepalives != 0) > + { > + int on = 1; > + > + if (setsockopt(tmpsock, > +SOL_SOCKET, SO_KEEPALIVE, > +(char *) &on, sizeof(on)) < > 0) > + { > + strlcpy(errbuf, "PQcancel() -- > setsockopt(SO_KEEPALIVE) failed: ", errbufsize); > + goto cancel_errReturn; > + } > + } This is very repetitive - how about introducing a helper function for this? > @@ -4467,8 +4601,8 @@ retry3: > > crp.packetlen = pg_hton32((uint32) sizeof(crp)); > crp.cp.cancelRequestCode = (MsgType) pg_hton32(CANCEL_REQUEST_CODE); > - crp.cp.backendPID = pg_hton32(be_pid); > - crp.cp.cancelAuthCode = pg_hton32(be_key); > + crp.cp.backendPID = pg_hton32(cancel->be_pid); > + crp.cp.cancelAuthCode = pg_hton32(cancel->be_key); Others might differ, but I'd separate changing the type passed to internal_cancel() into its own commit. Greetings, Andres Freund
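To flesh out the helper-function suggestion above, a minimal sketch could look like the following; the function name and signature are hypothetical (not from the posted patch), but it keeps the strlcpy()-into-errbuf error reporting the patch already uses:

/*
 * Hypothetical helper: set one integer-valued socket option, reporting
 * failure through the caller's error buffer.
 */
static bool
optional_setsockopt(int fd, int level, int optname, int optval,
                    const char *failmsg, char *errbuf, int errbufsize)
{
    if (setsockopt(fd, level, optname, (char *) &optval, sizeof(optval)) < 0)
    {
        strlcpy(errbuf, failmsg, errbufsize);
        return false;
    }
    return true;
}

Each repetitive block in PQcancel() could then collapse to a single call, e.g. for SO_KEEPALIVE:

    if (cancel->keepalives != 0 &&
        !optional_setsockopt(tmpsock, SOL_SOCKET, SO_KEEPALIVE, 1,
                             "PQcancel() -- setsockopt(SO_KEEPALIVE) failed: ",
                             errbuf, errbufsize))
        goto cancel_errReturn;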
Are we missing a dot in initdb's output?
Hi, this is more a cosmetic concern, but anyway: running initdb gives this output: ... creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... ok syncing data to disk ... ok initdb: warning: enabling "trust" authentication for local connections You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb. ... Shouldn't there be a "." after "authentication for local connections"? Probably it should be like this: initdb: warning: Enabling "trust" authentication for local connections. initdb's output a few lines earlier gives this, which all close with a "dot" and start with upper case: "The database cluster will be initialized with locale "en_US.UTF-8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled." Regards Daniel
Re: Converting WAL to SQL
On Wed, Jan 5, 2022 at 2:19 PM Julien Rouhaud wrote: > > On Thu, Jan 6, 2022 at 12:19 AM Bruce Momjian wrote: > > > > On Tue, Jan 4, 2022 at 10:47:47AM -0300, Fabrízio de Royes Mello wrote: > > > > > > > > > What we did was decode the 9.6 wal files and apply transactions to the > > > old 9.2 to keep it in sync with the new promoted version. This was our > > > "rollback" strategy if something went wrong with the new 9.6 version. > > > > How did you deal with the issue that SQL isn't granular enough (vs. > > row-level changes) to reproduce the result reliably, as outlined here? > > This is a logical decoding plugin, so it's SQL containing decoded > row-level changes. It will behave the same as a > publication/suscription (apart from being far less performant, due to > being plain SQL of course). Exactly! -- Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
Andres Freund writes: > On 2021-12-28 15:49:00 +, Jelte Fennema wrote: >> Finally, I would love it if once these fixes are merged the would also be >> backpatched to >> previous versions of libpq. > I'm not really convinced this is a good patch to backpatch. There does seem to > be some potential for subtle breakage - code in signal handlers is notoriously > finnicky, it's a rarely exercised code path, etc. It's also not fixing > something that previously worked. IMO, this is a new feature not a bug fix, and as such it's not backpatch material ... especially if we don't have 100.00% confidence in it. regards, tom lane
Re: Are we missing a dot in initdb's output?
"Daniel Westermann (DWE)" writes: > initdb: warning: enabling "trust" authentication for local connections > You can change this by editing pg_hba.conf or using the option -A, or > --auth-local and --auth-host, the next time you run initdb. > Shouldn't there be a "." after "authentication for local connections"? Meh, I think this is mimicking our coding style for primary vs. detail messages. regards, tom lane
Re: daitch_mokotoff module
Dag Lem writes: > Thomas Munro writes: > >> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem wrote: >>> However I guess this won't make any difference wrt. actually running the >>> tests, as long as there seems to be an encoding problem in the cfbot >> >> Fixed -- I told it to pull down patches as binary, not text. Now it >> makes commits that look healthier, and so far all the Unix systems >> have survived CI: >> >> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f >> http://cfbot.cputube.org/dag-lem.html >> > > Great! > > Dag > > After this I did the mistake of including a patch for citext in this thread, which is now picked up by cfbot instead of the Daitch-Mokotoff patch. Attaching the original patch again in order to hopefully fix my mistake. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..1d5bd84be8 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,8 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< > $@ diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..1b7263c349 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,593 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". 
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "daitch_mokotoff.h" + +#include "postgres.h" +#include "utils/builtins.h" +#include "mb/pg_wchar.h" + +#include +#include + +/* Internal C implementation */ +static char *_daitch_mokotoff(char *word, char *soundex, size_t n); + + +PG_FUN
Re: Collecting statistics about contents of JSONB columns
On Sat, Jan 1, 2022 at 11:07 AM Tomas Vondra wrote: > 0006-Add-jsonb-statistics-20211230.patch Hi Tomas, -CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text) +CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text) https://cirrus-ci.com/task/6405547984420864 It looks like there is a Unicode BOM sequence in there, which is upsetting our Windows testing but not the 3 Unixes (not sure why). Probably added by an editor.
Re: Logical insert/update/delete WAL records for custom table AMs
On Wed, 10 Nov 2021 at 03:17, Amit Kapila wrote: > > On Tue, Nov 9, 2021 at 5:12 AM Jeff Davis wrote: > > > > On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote: > > > I am not talking about decoding plugin but rather decoding itself, > > > basically, the work we do in reorderbuffer.c, decode.c, etc. The two > > > things I remember were tuple format and transaction ids as mentioned > > > in my previous email. > > > > If it's difficult to come up with something that will work for all > > table AMs, then it suggests that we might want to go towards fully > > extensible rmgr, and have a decoding method in the rmgr. > > > > I started a thread (with a patch) here: > > > > > > https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.ca...@j-davis.com > > > > > I think we should try to find a solution for > > > tuple format as the current decoding code relies on it if we want > > > decoding to deal with another table AMs transparently. > > > > Agreed, but it seems like we need basic extensibility first. For now, > > we'll need to convert to a heap tuple, > > > > Okay, but that might have a cost because we need to convert it before > WAL logging it, and then we probably also need to convert back to the > original table AM format during recovery/standby apply. I spoke with Jeff in detail about this patch in NYC Dec 21, and I now think it is worth pursuing. It seems much more likely that this would be acceptable than fully extensible rmgr. Amit asks a good question: should we be writing a message that seems to presume the old heap tuple format? My answer is that we clearly need it to be written in *some* common format, and the current heap format currently works, so de facto it is the format we should use. Right now, TAMs have to reformat back into this same format, so it is the way the APIs work. Put it another way: I don't see any other format that makes sense to use, now, but that could change in the future. So I'm signing up as a reviewer and we'll see if this is good to go. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Collecting statistics about contents of JSONB columns
On Fri, 31 Dec 2021 at 22:07, Tomas Vondra wrote: > The patch does something far more > elegant - it simply uses stavalues to store an array of JSONB documents, > each describing stats for one path extracted from the sampled documents. Sounds good. > I'm sure there's plenty open questions - for example I think we'll need > some logic to decide which paths to keep, otherwise the statistics can > get quite big, if we're dealing with large / variable documents. We're > already doing something similar for MCV lists. > > One of Nikita's patches not included in this thread allow "selective" > statistics, where you can define in advance a "filter" restricting which > parts are considered interesting by ANALYZE. That's interesting, but I > think we need some simple MCV-like heuristics first anyway. > > Another open question is how deep the stats should be. Imagine documents > like this: > >{"a" : {"b" : {"c" : {"d" : ... > > The current patch build stats for all possible paths: > > "a" > "a.b" > "a.b.c" > "a.b.c.d" > > and of course many of the paths will often have JSONB documents as > values, not simple scalar values. I wonder if we should limit the depth > somehow, and maybe build stats only for scalar values. The user interface for designing filters sounds hard, so I'd hope we can ignore that for now. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: biblio.sgml dead link
> On 5 Jan 2022, at 02:26, Michael Paquier wrote: > > On Tue, Jan 04, 2022 at 06:10:07PM +0100, Erik Rijkers wrote: >> The replacement is a ~200 euro pdf (2021). I'd be thankful if someone would >> send the pdf to me; maybe I can update my JSON tests. >> >> And we should remove that entry from the bibliography (or have it point to >> the new page [1]). > > Removing the entry seems a bit overdoing it to me, and updating to a > paywall does not sound completely right to me either. Another thing > that we could do is to just remove the link, but keep its reference. > > Any thoughts from others? We definitely shouldn't remove it, it's referenced from functions-json.html and that IMO adds value. I think we should remove the link, not because it costs money but because it can be purchased from several places, and choosing one to "favor" seems to invite criticism. Kind of how we don't link to an online store for buying the other books. -- Daniel Gustafsson https://vmware.com/
Re: pg_stat_statements and "IN" conditions
> On Tue, Jan 04, 2022 at 06:02:43PM -0500, Tom Lane wrote: > We can debate whether the rules proposed here are good for > pg_stat_statements or not, but it seems inevitable that they will be a > disaster for some other consumers of the query hash. Hm, which consumers do you mean here, potential extensions? Doesn't the ability to use an external module to compute the queryid make this situation possible anyway? > do you really think that a query with two int > parameters is equivalent to one with five float parameters for all > query-identifying purposes? Nope, and it will be hard to figure this out no matter which approach we're talking about, because it mostly depends on the context and type of queries, I guess. Instead, such functionality should allow some reasonable configuration. To be clear, the use case I have in mind here is not four or five constants, but rather a couple of hundred, where the chances that the whole construction was generated automatically by an ORM are higher than normal. > I can see the merits of allowing different numbers of IN elements > to be considered equivalent for pg_stat_statements, but this patch > seems to go far beyond that basic idea, and I fear the side-effects > will be very bad. Not sure why it goes far beyond that, but there were indeed two approaches under consideration, as I stated in the first message. I don't remember all the details anymore, but the other one revolved around doing similar things in a more limited fashion in transformAExprIn. The problem then would be carrying the information necessary to represent the act of "merging" some number of queryids together. Any thoughts here? The idea of keeping the original queryid untouched and adding another type of id instead sounds interesting, but it would add too much overhead for quite a small use case, I guess.
Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?
On Sun, Jun 13, 2021 at 8:25 PM Thomas Munro wrote: > Just as an FYI: this entry currently fails with "Timed out!" on cfbot > because of an oversight in the master branch[1], AFAICS. It should > pass again once that's fixed. > > [1] > https://www.postgresql.org/message-id/CA%2BhUKGLah2w1pWKHonZP_%2BEQw69%3Dq56AHYwCgEN8GDzsRG_Hgw%40mail.gmail.com That's fixed now. So what should we do about this patch? This is a bug, so it would be nice to do *something*. I don't really like the fact that this makes the behavior contingent on USE_ASSERT_CHECKING, and I suggest that you make a new symbol like USE_BARRIER_SMGR_RELEASE which by default gets defined on WIN32, but can be defined elsewhere if you want (see the treatment of EXEC_BACKEND in pg_config_manual.h). Furthermore, I can't see back-patching this, given that it would be the very first use of the barrier machinery. But I think it would be good to get something into master, because then we'd actually be using this procsignalbarrier stuff for something. On a good day we've fixed a bug. On a bad day we'll learn something new about how procsignalbarrier needs to work. -- Robert Haas EDB: http://www.enterprisedb.com
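For reference, the pg_config_manual.h treatment being suggested might look roughly like this (the symbol name comes from the mail above; the stanza itself is only an assumption modeled on how EXEC_BACKEND is handled, not code from any posted patch):

/*
 * Use the ProcSignalBarrier-based release of smgr file handles (e.g.
 * during DROP TABLESPACE).  Defined by default on Windows, where it is
 * needed; uncomment to force it on other platforms for testing.
 */
/* #define USE_BARRIER_SMGR_RELEASE */

#if !defined(USE_BARRIER_SMGR_RELEASE) && defined(WIN32)
#define USE_BARRIER_SMGR_RELEASE
#endif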
Re: Make relfile tombstone files conditional on WAL level
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund wrote: > I guess there's a somewhat hacky way to get somewhere without actually > increasing the size. We could take 3 bytes from the fork number and use that > to get to a 7 byte relfilenode portion. 7 bytes are probably enough for > everyone. > > It's not like we can use those bytes in a useful way, due to alignment > requirements. Declaring that the high 7 bytes are for the relNode portion and > the low byte for the fork would still allow efficient comparisons and doesn't > seem too ugly. I think this idea is worth more consideration. It seems like 2^56 relfilenodes ought to be enough for anyone, recalling that you can only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a bunch of code that is there to guard against relfilenodes being reused. In particular, we can remove the code that leaves a 0-length tombstone file around until the next checkpoint to guard against relfilenode reuse. On Windows, we still need https://commitfest.postgresql.org/36/2962/ because of the problem that Windows won't remove files from the directory listing until they are both unlinked and closed. But in general this seems like it would lead to cleaner code. For example, GetNewRelFileNode() needn't loop. If it allocates the smallest unsigned integer that the cluster (or database) has never previously assigned, the file should definitely not exist on disk, and if it does, an ERROR is appropriate, as the database is corrupted. This does assume that allocations from this new 56-bit relfilenode counter are properly WAL-logged. I think this would also solve a problem Dilip mentioned to me today: suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's been trying to do. Then suppose you do "ALTER DATABASE foo SET TABLESPACE used_recently_but_not_any_more". You might get an error complaining that "some relations of database \"%s\" are already in tablespace \"%s\"" because there could be tombstone files in that database. With this combination of changes, you could just use the barrier mechanism from https://commitfest.postgresql.org/36/2962/ to wait for those files to disappear, because they've got to be previously-unlinked files that Windows is still returning because they're still open -- or else they could be a sign of a corrupted database, but there are no other possibilities. I think, anyway. -- Robert Haas EDB: http://www.enterprisedb.com
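As a rough illustration of the layout Andres describes (purely a sketch; none of these names exist in any posted patch), packing the 56-bit relNode and the fork number into a single integer might look like:

/*
 * Hedged sketch of the "high 7 bytes for relNode, low byte for fork"
 * packing discussed above.  A plain uint64 comparison then orders first
 * by relNode and then by fork, which is what keeps comparisons cheap.
 */
typedef uint64 PackedRelFileNode;       /* hypothetical type name */

static inline PackedRelFileNode
pack_relnode_fork(uint64 relNode, uint8 forkNum)
{
    Assert(relNode < ((uint64) 1 << 56));   /* 2^56 relfilenodes max */
    return (relNode << 8) | forkNum;
}

static inline uint64
packed_relnode(PackedRelFileNode val)
{
    return val >> 8;
}

static inline uint8
packed_fork(PackedRelFileNode val)
{
    return (uint8) (val & 0xFF);
}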
Re: SQL:2011 application time
On Wed, Jan 5, 2022 at 11:07 AM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > On 21.11.21 02:51, Paul A Jungwirth wrote: > > Here are updated patches. They are rebased and clean up some of my > > TODOs. > > This patch set looks very interesting. It's also very big, so it's > difficult to see how to get a handle on it. I did a pass through it > to see if there were any obvious architectural or coding style > problems. I also looked at some of your TODO comments to see if I had > something to contribute there. > > I'm confused about how to query tables based on application time > periods. Online, I see examples using AS OF, but in the SQL standard > I only see this used for system time, which we are not doing here. > What is your understanding of that? > Paul has previously supplied me with this document https://cs.ulb.ac.be/public/_media/teaching/infoh415/tempfeaturessql2011.pdf and that formed the basis of a lot of my questions a few months earlier. There was similar work being done for system periods, which are a bit simpler but require a side (history) table to be created. I was picking people's brains about some aspects of system versioning to see if I could help bringing that into this already very large patchset, but haven't yet felt like I had done enough research to post it. It is my hope that we can at least get the syntax for both application and system versioning committed, even if it's just stubbed in with not-yet-supported errors.
Re: Pre-allocating WAL files
On 12/30/21, 3:52 AM, "Maxim Orlov" wrote: > I did check the patch too and found it to be ok. Check and check-world are > passed. > Overall idea seems to be good in my opinion, but I'm not sure where is the > optimal place to put the pre-allocation. > > On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov wrote: >> I've checked the patch v7. It applies cleanly, code is good, check-world >> tests passed without problems. >> I think it's ok to use checkpointer for this and the overall patch can be >> committed. But the seen performance gain makes me think again before adding >> this feature. I did tests myself a couple of months ago and got similar >> results. >> Really don't know whether is it worth the effort. Thank you both for your review. Nathan
Bugs in pgoutput.c
Commit 6ce16088b caused me to look at pgoutput.c's handling of cache invalidations, and I was pretty appalled by what I found. * rel_sync_cache_relation_cb does the wrong thing when called for a cache flush (i.e., relid == 0). Instead of invalidating all RelationSyncCache entries as it should, it will do nothing. * When rel_sync_cache_relation_cb does invalidate an entry, it immediately zaps the entry->map structure, even though that might still be in use (as per the adjacent comment that carefully explains why this isn't safe). I'm not sure if this could lead to a dangling-pointer core dump, but it sure seems like it could lead to failing to translate tuples that are about to be sent. * Similarly, rel_sync_cache_publication_cb is way too eager to reset the pubactions flags, which would likely lead to failing to transmit changes that we should transmit. The attached patch fixes these things, but I'm still pretty unhappy with the general design of the data structures in pgoutput.c, because there is this weird random mishmash of static variables along with a palloc'd PGOutputData struct. This cannot work if there are ever two active LogicalDecodingContexts in the same process. I don't think serial use of LogicalDecodingContexts (ie, destroy one and then make another) works very well either, because pgoutput_shutdown is a mere fig leaf that ignores all the junk the module previously made (in CacheMemoryContext no less). So I wonder whether either of those scenarios is possible/supported/ expected to be needed in future. Also ... maybe I'm not looking in the right place, but I do not see anything anywhere in logical decoding that is taking any lock on the relation being processed. How can that be safe? In particular, how do we know that the data collected by get_rel_sync_entry isn't already stale by the time we return from the function? Setting replicate_valid = true at the bottom of the function would overwrite any notification we might have gotten from a syscache callback while reading catalog data. regards, tom lane diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index a08da859b4..f7e991cd16 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -108,11 +108,13 @@ typedef struct RelationSyncEntry { Oid relid; /* relation oid */ + bool replicate_valid; /* overall validity flag for entry */ + bool schema_sent; List *streamed_txns; /* streamed toplevel transactions with this * schema */ - bool replicate_valid; + /* are we publishing this rel? */ PublicationActions pubactions; /* @@ -903,7 +905,9 @@ LoadPublications(List *pubnames) } /* - * Publication cache invalidation callback. + * Publication syscache invalidation callback. + * + * Called for invalidations on pg_publication. 
*/ static void publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue) @@ -1130,13 +1134,12 @@ get_rel_sync_entry(PGOutputData *data, Oid relid) HASH_ENTER, &found); Assert(entry != NULL); - /* Not found means schema wasn't sent */ + /* initialize entry, if it's new */ if (!found) { - /* immediately make a new entry valid enough to satisfy callbacks */ + entry->replicate_valid = false; entry->schema_sent = false; entry->streamed_txns = NIL; - entry->replicate_valid = false; entry->pubactions.pubinsert = entry->pubactions.pubupdate = entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false; entry->publish_as_relid = InvalidOid; @@ -1166,13 +1169,40 @@ get_rel_sync_entry(PGOutputData *data, Oid relid) { oldctx = MemoryContextSwitchTo(CacheMemoryContext); if (data->publications) + { list_free_deep(data->publications); - +data->publications = NIL; + } data->publications = LoadPublications(data->publication_names); MemoryContextSwitchTo(oldctx); publications_valid = true; } + /* + * Reset schema_sent status as the relation definition may have + * changed. Also reset pubactions to empty in case rel was dropped + * from a publication. Also free any objects that depended on the + * earlier definition. + */ + entry->schema_sent = false; + list_free(entry->streamed_txns); + entry->streamed_txns = NIL; + entry->pubactions.pubinsert = false; + entry->pubactions.pubupdate = false; + entry->pubactions.pubdelete = false; + entry->pubactions.pubtruncate = false; + if (entry->map) + { + /* + * Must free the TupleDescs contained in the map explicitly, + * because free_conversion_map() doesn't. + */ + FreeTupleDesc(entry->map->indesc); + FreeTupleDesc(entry->map->outdesc); + free_conversion_map(entry->map); + } + entry->map = NULL; + /* * Build publication cache. We can't use one provided by relcache as * relcache considers all publications given relation is in, but here @@ -1212,16 +1242,18 @@ get_rel_sync_entr
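For clarity, here is a sketch of how the relid == 0 (cache flush) case described at the top of this mail could be handled; this is an illustration of the idea only, not the attached patch, and it assumes entries are invalidated by clearing replicate_valid, as the patch does:

static void
rel_sync_cache_relation_cb(Datum arg, Oid relid)
{
    RelationSyncEntry *entry;

    /* nothing to do if the cache hasn't been created yet */
    if (RelationSyncCache == NULL)
        return;

    if (OidIsValid(relid))
    {
        /* invalidate the entry for this relation, if we have one */
        entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                                  &relid, HASH_FIND, NULL);
        if (entry != NULL)
            entry->replicate_valid = false;
    }
    else
    {
        /* complete cache flush: mark every entry invalid */
        HASH_SEQ_STATUS status;

        hash_seq_init(&status, RelationSyncCache);
        while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
            entry->replicate_valid = false;
    }
}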
Re: a misbehavior of partition row movement (?)
Pushed 0001. I had to adjust the query used in pg_dump; you changed the attribute name in the query used for pg15, but left unchanged the one for older branches, so pg_dump failed for all branches other than 15. Also, psql's describe.c required a small tweak to a version number test. https://github.com/alvherre/postgres/commit/3451612e0fa082d3ec953551c6d25432bd725502 Thanks! What was 0002 is attached, to keep cfbot happy. It's identical to your v11-0002. I have pushed it thinking that we would not backpatch any of this fix. However, after running the tests and realizing that I didn't need an initdb for either patch, I wonder if maybe the whole series *is* backpatchable. There is one possible problem, which is that psql and pg_dump would need testing to verify that they work decently (i.e. no crash, no misbehavior) with partitioned tables created with the original code. But there are few ABI changes, maybe we can cope and get all branches fixes instead of just 15. What do you think? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Los dioses no protegen a los insensatos. Éstos reciben protección de otros insensatos mejor dotados" (Luis Wu, Mundo Anillo) >From d1b067ad541f80191763e329577e0d3f62d00d82 Mon Sep 17 00:00:00 2001 From: amitlan Date: Mon, 11 Oct 2021 14:57:19 +0900 Subject: [PATCH v12] Enforce foreign key correctly during cross-partition updates When an update on a partitioned table referenced in foreign keys constraint causes a row to move from one partition to another, which is implemented by deleting the old row from the source leaf partition followed by inserting the new row into the destination leaf partition, firing the foreign key triggers on that delete event can result in surprising outcomes for those keys. For example, a given foreign key's delete trigger which implements the ON DELETE CASCADE clause of that key will delete any referencing rows, although it should not, because the referenced row is simply being moved into another partition. This commit teaches trigger.c to skip queuing such delete trigger events on the leaf partitions in favor of an UPDATE event fired on the "root" target relation. Doing so makes sense because both the old and the new tuple "logically" belong to the latter. The after trigger event queuing interface now allows passing the source and the destination partitions of a particular cross-partition update when registering the update event for the root partitioned table. Along with the 2 ctids of the old and the new tuple, an after trigger event now also stores the OIDs of those partitions. The tuples fetched from the source and the destination partitions are converted into the root table format before they are passed to the trigger function. The implementation currently has a limitation that only the foreign keys pointing into the query's target relation are considered, not those of its sub-partitioned partitions. That seems like a reasonable limitation, it sounds rare to have distinct foreign keys pointing into sub-partitioned partitions, but not into the root table. 
--- doc/src/sgml/ref/update.sgml | 7 + src/backend/commands/trigger.c| 322 +++--- src/backend/executor/execMain.c | 19 +- src/backend/executor/execReplication.c| 5 +- src/backend/executor/nodeModifyTable.c| 187 - src/backend/utils/adt/ri_triggers.c | 6 + src/include/commands/trigger.h| 4 + src/include/executor/executor.h | 3 +- src/include/nodes/execnodes.h | 3 + src/test/regress/expected/foreign_key.out | 204 +- src/test/regress/sql/foreign_key.sql | 135 - 11 files changed, 840 insertions(+), 55 deletions(-) diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml index 3fa54e5f70..3ba13010e7 100644 --- a/doc/src/sgml/ref/update.sgml +++ b/doc/src/sgml/ref/update.sgml @@ -316,6 +316,13 @@ UPDATE count partition (provided the foreign data wrapper supports tuple routing), they cannot be moved from a foreign-table partition to another partition. + + + An attempt of moving a row from one partition to another will fail if a + foreign key is found to directly reference a non-root partitioned table + in the partition tree, unless that table is also directly mentioned + in the UPDATEquery. + diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index 452b743f21..1ed6dd1b38 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -94,7 +94,11 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata, FmgrInfo *finfo, Instrumentation *instr, MemoryContext per_tuple_context); -static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo, +static void AfterTriggerSaveEvent(EState *estate, + ModifyTableState *mtstate, + ResultRelInfo *re
Re: row filtering for logical replication
FYI - v58 is currently known to be broken due to a recent commit [1]. I plan to post a v59* later today to address this as well as other recent review comments. -- [1] https://github.com/postgres/postgres/commit/6ce16088bfed97f982f66a9dc17b8364df289e4d Kind Regards, Peter Smith. Fujitsu Australia.
Re: Emit "checkpoint skipped because system is idle" message at LOG level if log_checkpoints is set
On Wed, Jan 05, 2022 at 10:45:06AM +0530, Dilip Kumar wrote: > On Wed, Jan 5, 2022 at 10:24 AM Bharath Rupireddy > wrote: > > > > Postgres server emits a message at DEBUG1 level when it skips a > > checkpoint. At times, developers might be surprised after figuring out > > from server logs that there were no checkpoints happening at all > > during a certain period of time when DEBUG1 messages aren't captured. > > How about emitting the message at LOG level if log_checkpoints is set? > > Patch attached. > > +1 to convert to LOG when log_checkpoints is set. I think it would be odd to write logs of increased severity for the case where we did not do anything. I think it really is a debug log. I don't think the log level should be changed to avoid "developer" confusion, as you said (I'm not sure if you mean a postgres developer or an application developer, though). Is there any evidence that this has caused user confusion in the last 4 years? |commit 6ef2eba3f57f17960b7cd4958e18aa79e357de2f |Author: Andres Freund |Date: Thu Dec 22 11:31:50 2016 -0800 | |Skip checkpoints, archiving on idle systems. Note that logging a message may not be benign; I think it could cause disks that would otherwise have been in power-saving mode to spin up, especially if you log to syslog, which can issue fsync. Also, part of the argument for enabling log_checkpoints by default was that a small, quiescent instance would not write logs every 5 minutes. -- Justin
Re: skip replication slot snapshot/map file removal during end-of-recovery checkpoint
On 12/23/21, 3:17 AM, "Bharath Rupireddy" wrote: > Currently the end-of-recovery checkpoint can be much slower, impacting > the server availability, if there are many replication slot files > .snap or map- to be enumerated and deleted. How about skipping > the .snap and map- file handling during the end-of-recovery > checkpoint? It makes the server available faster and the next regular > checkpoint can deal with these files. If required, we can have a GUC > (skip_replication_slot_file_handling or some other better name) to > control this default being the existing behavior. I suggested something similar as a possibility in the other thread where these tasks are being discussed [0]. I think it is worth considering, but IMO it is not a complete solution to the problem. If there are frequently many such files to delete and regular checkpoints are taking longer, the shutdown/end-of-recovery checkpoint could still take a while. I think it would be better to separate these tasks from checkpointing instead. Nathan [0] https://postgr.es/m/A285A823-0AF2-4376-838E-847FA4710F9A%40amazon.com
Re: Report checkpoint progress in server logs
On Wed, Dec 29, 2021 at 10:40:59AM -0500, Tom Lane wrote: > Magnus Hagander writes: > >> Therefore, reporting the checkpoint progress in the server logs, much > >> like [1], seems to be the best way IMO. > > > I find progress reporting in the logfile to generally be a terrible > > way of doing things, and the fact that we do it for the startup > > process is/should be only because we have no other choice, not because > > it's the right choice. > > I'm already pretty seriously unhappy about the log-spamming effects of > 64da07c41 (default to log_checkpoints=on), and am willing to lay a side > bet that that gets reverted after we have some field experience with it. > This proposal seems far worse from that standpoint. Keep in mind that > our out-of-the-box logging configuration still doesn't have any log > rotation ability, which means that the noisier the server is in normal > operation, the sooner you fill your disk. I think we are looking at three potential observable behaviors people might care about: * the current activity/progress of checkpoints * the historical reporting of checkpoint completion, mixed in with other log messages for later analysis * the aggregate behavior of checkpoint operation I think it is clear that checkpoint progress activity isn't useful for the server logs because that information has little historical value, but does fit for a progress view. As Tom already expressed, we will have to wait to see if non-progress checkpoint information in the logs has sufficient historical value. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Add 64-bit XIDs into PostgreSQL 15
On Tue, Jan 4, 2022 at 10:22:50PM +, Finnerty, Jim wrote: > I'm concerned about the maintainability impact of having 2 new > on-disk page formats. It's already complex enough with XIDs and > multixact-XIDs. > > If the lack of space for the two epochs in the special data area is > a problem only in an upgrade scenario, why not resolve the problem > before completing the upgrade process like a kind of post-process > pg_repack operation that converts all "double xmax" pages to > the "double-epoch" page format? i.e. maybe the "double xmax" > representation is needed as an intermediate representation during > upgrade, but after upgrade completes successfully there are no pages > with the "double-xmax" representation. This would eliminate a whole > class of coding errors and would make the code dealing with 64-bit > XIDs simpler and more maintainable. Well, yes, we could do this, and it would avoid the complexity of having to support two XID representations, but we would need to accept that fast pg_upgrade would be impossible in such cases, since every page would need to be checked and potentially updated. You might try to do this while the server is first started and running queries, but I think we found out from the online checkpoint patch that having the server in an intermediate state while running queries is very complex --- it might be simpler to just accept two XID formats all the time than enabling the server to run with two formats for a short period. My big point is that this needs more thought. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: GUC flags
On Wed, Jan 05, 2022 at 02:17:11PM +0900, Michael Paquier wrote: > On Tue, Jan 04, 2022 at 09:06:48PM -0600, Justin Pryzby wrote: > > I think pg_get_guc_flags() may be best, but I'm interested to hear other > > opinions. > > My opinion on this matter is rather close to what you have here with > handling things through one extra attribute. But I don't see the > point of using an extra function where users would need to do a manual > mapping of the flag bits back to a a text representation of them. If it were implemented as a function, this would be essentially internal and left undocumented. Only exposed for the purpose of re-implementing check_guc. > I would suggest to just add one text[] to pg_show_all_settings Good idea to use the backing function without updating the view. pg_settings is currently defined with "SELECT *". Is it fine to enumerate a list of columns instead ? >From 713046d82668c4e189b78e9e274b272bccaaa480 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sun, 5 Dec 2021 23:22:04 -0600 Subject: [PATCH v4 1/2] check_guc: fix absurd number of false positives Avoid false positives; Avoid various assumptions; Avoid list of exceptions; Simplify shell/awk/sed/grep; This requires GNU awk for RS as a regex. --- src/backend/utils/misc/check_guc | 69 +--- 1 file changed, 10 insertions(+), 59 deletions(-) diff --git a/src/backend/utils/misc/check_guc b/src/backend/utils/misc/check_guc index b171ef0e4ff..323ca13191b 100755 --- a/src/backend/utils/misc/check_guc +++ b/src/backend/utils/misc/check_guc @@ -1,78 +1,29 @@ -#!/bin/sh +#! /bin/sh +set -e -## currently, this script makes a lot of assumptions: +## this script makes some assumptions about the formatting of files it parses. ## in postgresql.conf.sample: ## 1) the valid config settings may be preceded by a '#', but NOT '# ' ## (we use this to skip comments) -## 2) the valid config settings will be followed immediately by ' =' -## (at least one space preceding the '=') -## in guc.c: -## 3) the options have PGC_ on the same line as the option -## 4) the options have '{' on the same line as the option - -## Problems -## 1) Don't know what to do with TRANSACTION ISOLATION LEVEL - -## if an option is valid but shows up in only one file (guc.c but not -## postgresql.conf.sample), it should be listed here so that it -## can be ignored -INTENTIONALLY_NOT_INCLUDED="debug_deadlocks in_hot_standby \ -is_superuser lc_collate lc_ctype lc_messages lc_monetary lc_numeric lc_time \ -pre_auth_delay role seed server_encoding server_version server_version_num \ -session_authorization trace_lock_oidmin trace_lock_table trace_locks trace_lwlocks \ -trace_notify trace_userlocks transaction_isolation transaction_read_only \ -zero_damaged_pages" ### What options are listed in postgresql.conf.sample, but don't appear ### in guc.c? 
-# grab everything that looks like a setting and convert it to lower case -SETTINGS=`grep ' =' postgresql.conf.sample | -grep -v '^# ' | # strip comments -sed -e 's/^#//' | -awk '{print $1}'` - -SETTINGS=`echo "$SETTINGS" | tr 'A-Z' 'a-z'` +# grab everything that looks like a setting +SETTINGS=`sed '/^#[[:alnum:]]/!d; s/^#//; s/ =.*//; /^include/d' postgresql.conf.sample` for i in $SETTINGS ; do - hidden=0 ## it sure would be nice to replace this with an sql "not in" statement - ## it doesn't seem to make sense to have things in .sample and not in guc.c -# for hidethis in $INTENTIONALLY_NOT_INCLUDED ; do -#if [ "$hidethis" = "$i" ] ; then -# hidden=1 -#fi -# done - if [ "$hidden" -eq 0 ] ; then -grep -i '"'$i'"' guc.c > /dev/null -if [ $? -ne 0 ] ; then - echo "$i seems to be missing from guc.c"; -fi; - fi + grep -i "\"$i\"" guc.c >/dev/null || +echo "$i seems to be missing from guc.c"; done ### What options are listed in guc.c, but don't appear ### in postgresql.conf.sample? # grab everything that looks like a setting and convert it to lower case - -SETTINGS=`grep '{.* PGC_' guc.c | awk '{print $1}' | \ - sed -e 's/{//g' -e 's/"//g' -e 's/,//'` - -SETTINGS=`echo "$SETTINGS" | tr 'A-Z' 'a-z'` - +SETTINGS=`gawk -F '[",]' 'BEGIN{RS="\n\t{\n"} /",[[:space:]]*PGC_.*.*gettext_noop/ && !/NOT_IN_SAMPLE/{print tolower($2)}' guc.c` for i in $SETTINGS ; do - hidden=0 - ## it sure would be nice to replace this with an sql "not in" statement - for hidethis in $INTENTIONALLY_NOT_INCLUDED ; do -if [ "$hidethis" = "$i" ] ; then - hidden=1 -fi - done - if [ "$hidden" -eq 0 ] ; then -grep -i '#'$i' ' postgresql.conf.sample > /dev/null -if [ $? -ne 0 ] ; then - echo "$i seems to be missing from postgresql.conf.sample"; -fi - fi + grep "#$i " postgresql.conf.sample >/dev/null || +echo "$i seems to be missing from postgresql.conf.sample"; done -- 2.17.1 >From d4773ee874ddc6aaa1238e4d1ae7ec6f8e20f36e Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 8 Dec 2021 12:06:35 -0600
Remove trailing comma from enums
I saw one (and then went looking and found some more) enum with a trailing comma. These are quite rare in the PG src, so I doubt they are intentional. PSA a patch to remove the trailing commas for all that I found. -- Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Remove-trailing-commas-from-enums.patch Description: Binary data
Re: Add 64-bit XIDs into PostgreSQL 15
On Thu, Dec 30, 2021 at 03:15:16PM +0300, Maxim Orlov wrote: > PFA updated working patch v6 for PG15 development cycle. > It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, > refactoring and was rebased to PG15. > > Main changes: > - Change TransactionId to 64bit > - Disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax > remains 32bit > -- 32bit xid is named ShortTransactionId now. > -- Exception: see "double xmax" format below. > - Heap page format is changed to contain xid and multixact base value, > tuple's xmin and xmax are offsets from. > -- xid_base and multi_base are stored as a page special data. PageHeader > remains unmodified. > -- If after upgrade page has no free space for special data, tuples are > converted to "double xmax" format: xmin became virtual > FrozenTransactionId, xmax occupies the whole 64bit. > Page converted to new format when vacuum frees enough space. I think it is a great idea to allow the 64-XID to span the 32-bit xmin and xmax fields when needed. It would be nice if we can get focus on this feature so we are sure it gets into PG 15. Can we add this patch incrementally so people can more easily analyze it? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
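For readers skimming the thread, the per-page bases quoted above boil down to a special-area struct along these lines (a sketch only; the field names and exact layout in the actual patch may differ), with each tuple's 32-bit xmin/xmax interpreted as offsets from these values:

/*
 * Hedged sketch of the page special data described above.  Because it
 * lives in the special space, PageHeaderData itself stays unmodified.
 */
typedef struct HeapPageSpecialData
{
    TransactionId pd_xid_base;      /* base for tuple xmin/xmax offsets */
    TransactionId pd_multi_base;    /* base for multixact id offsets */
} HeapPageSpecialData;

typedef HeapPageSpecialData *HeapPageSpecial;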
Re: [PATCH] Accept IP addresses in server certificate SANs
On Mon, 2022-01-03 at 16:19 +, Jacob Champion wrote: > On Fri, 2021-12-17 at 15:40 +0900, Kyotaro Horiguchi wrote: > > > > + inet_net_pton_ipv4(const char *src, u_char *dst) > > (calls inet_net_pton_ipv4_internal(src, dst, true)) > > + inet_pton_ipv4(const char *src, u_char *dst) > > (calls inet_net_pton_ipv4_internal(src, dst, false)) > > Sounds good, I will make that change. Thanks for the feedback! v3 implements a pg_inet_pton(), but for IPv6 instead of IPv4 as presented above (since we only need inet_pton() for IPv6 in this case). It's split into a separate patch (0003) for ease of review. Thanks! --Jacob From 3081b1e0f7af60241978c77bba71e806ba5c25f6 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Wed, 24 Nov 2021 14:46:11 -0800 Subject: [PATCH v3 1/3] Move inet_net_pton() to src/port This will be helpful for IP address verification in libpq. --- src/backend/utils/adt/Makefile | 1 - src/include/port.h | 3 +++ src/include/utils/builtins.h| 4 src/port/Makefile | 1 + src/{backend/utils/adt => port}/inet_net_pton.c | 16 +++- src/tools/msvc/Mkvcbuild.pm | 2 +- 6 files changed, 20 insertions(+), 7 deletions(-) rename src/{backend/utils/adt => port}/inet_net_pton.c (96%) diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 41b486bcef..d173d52157 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -43,7 +43,6 @@ OBJS = \ geo_selfuncs.o \ geo_spgist.o \ inet_cidr_ntop.o \ - inet_net_pton.o \ int.o \ int8.o \ json.o \ diff --git a/src/include/port.h b/src/include/port.h index fd9c9d6f94..cca29839bc 100644 --- a/src/include/port.h +++ b/src/include/port.h @@ -518,6 +518,9 @@ extern int pg_codepage_to_encoding(UINT cp); extern char *pg_inet_net_ntop(int af, const void *src, int bits, char *dst, size_t size); +/* port/inet_net_pton.c */ +extern int pg_inet_net_pton(int af, const char *src, void *dst, size_t size); + /* port/pg_strong_random.c */ extern void pg_strong_random_init(void); extern bool pg_strong_random(void *buf, size_t len); diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h index b07eefaf1e..a5a18e62ba 100644 --- a/src/include/utils/builtins.h +++ b/src/include/utils/builtins.h @@ -92,10 +92,6 @@ extern int xidComparator(const void *arg1, const void *arg2); extern char *pg_inet_cidr_ntop(int af, const void *src, int bits, char *dst, size_t size); -/* inet_net_pton.c */ -extern int pg_inet_net_pton(int af, const char *src, - void *dst, size_t size); - /* network.c */ extern double convert_network_to_scalar(Datum value, Oid typid, bool *failure); extern Datum network_scan_first(Datum in); diff --git a/src/port/Makefile b/src/port/Makefile index b3754d8940..7191cd6633 100644 --- a/src/port/Makefile +++ b/src/port/Makefile @@ -43,6 +43,7 @@ OBJS = \ bsearch_arg.o \ chklocale.o \ inet_net_ntop.o \ + inet_net_pton.o \ noblock.o \ path.o \ pg_bitutils.o \ diff --git a/src/backend/utils/adt/inet_net_pton.c b/src/port/inet_net_pton.c similarity index 96% rename from src/backend/utils/adt/inet_net_pton.c rename to src/port/inet_net_pton.c index d3221a1313..bae50ba67e 100644 --- a/src/backend/utils/adt/inet_net_pton.c +++ b/src/port/inet_net_pton.c @@ -14,14 +14,18 @@ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 
* - * src/backend/utils/adt/inet_net_pton.c + * src/port/inet_net_pton.c */ #if defined(LIBC_SCCS) && !defined(lint) static const char rcsid[] = "Id: inet_net_pton.c,v 1.4.2.3 2004/03/17 00:40:11 marka Exp $"; #endif +#ifndef FRONTEND #include "postgres.h" +#else +#include "postgres_fe.h" +#endif #include #include @@ -29,9 +33,19 @@ static const char rcsid[] = "Id: inet_net_pton.c,v 1.4.2.3 2004/03/17 00:40:11 m #include #include +#ifndef FRONTEND #include "utils/builtins.h" /* pgrminclude ignore */ /* needed on some * platforms */ #include "utils/inet.h" +#else +/* + * In a frontend build, we can't include inet.h, but we still need to have + * sensible definitions of these two constants. Note that pg_inet_net_ntop() + * assumes that PGSQL_AF_INET is equal to AF_INET. + */ +#define PGSQL_AF_INET (AF_INET + 0) +#define PGSQL_AF_INET6 (AF_INET + 1) +#endif static int inet_net_pton_ipv4(const char *src, u_char *dst); diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm index 404c45a6f3..fa87511e04 100644 --- a/src/tools/msvc/Mkvcbuild.pm +++ b/src/tools/msvc/Mkvcbuild.pm @@ -100,7 +100,7 @@ sub mkvcbuild our @pgportfiles = qw( chklocale.c explicit_bzero.c fls.c getpeereid.c getrusage.c inet_aton.c - getaddrinfo.c gettimeofday.c inet_net_ntop.c kill.c open.c + getaddrinfo.c gettimeofday.c inet_net_ntop.c inet_net_pton.c kill.c
Re: Add 64-bit XIDs into PostgreSQL 15
On Tue, Jan 4, 2022 at 05:49:07PM +, Finnerty, Jim wrote: > Hi Maxim, > I’m glad to see that you’re trying to carry the 64-bit XID work forward. > I > had not noticed that my earlier patch (also derived from Alexander Kortkov’s > patch) was responded to back in September. Perhaps we can merge some of the > code cleanup that it contained, such as using XID_FMT everywhere and creating > a > type for the kind of page returned by TransactionIdToPage() to make the code > cleaner. > > Is your patch functionally the same as the PostgresPro implementation? If > so, I think it would be helpful for everyone’s understanding to read the > PostgresPro documentation on VACUUM. See in particular section “Forced > shrinking pg_clog and pg_multixact” > > https://postgrespro.com/docs/enterprise/9.6/routine-vacuuming# > vacuum-for-wraparound Good point --- we still need vacuum freeze. It would be good to understand how much value we get in allowing vacuum freeze to be done less often --- how big can pg_xact/pg_multixact get before they are problems? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Add 64-bit XIDs into PostgreSQL 15
On Wed, Jan 05, 2022 at 06:51:37PM -0500, Bruce Momjian wrote: > On Tue, Jan 4, 2022 at 10:22:50PM +, Finnerty, Jim wrote: > > I'm concerned about the maintainability impact of having 2 new > > on-disk page formats. It's already complex enough with XIDs and > > multixact-XIDs. > > > > If the lack of space for the two epochs in the special data area is > > a problem only in an upgrade scenario, why not resolve the problem > > before completing the upgrade process like a kind of post-process > > pg_repack operation that converts all "double xmax" pages to > > the "double-epoch" page format? i.e. maybe the "double xmax" > > representation is needed as an intermediate representation during > > upgrade, but after upgrade completes successfully there are no pages > > with the "double-xmax" representation. This would eliminate a whole > > class of coding errors and would make the code dealing with 64-bit > > XIDs simpler and more maintainable. > > Well, yes, we could do this, and it would avoid the complexity of having > to support two XID representations, but we would need to accept that > fast pg_upgrade would be impossible in such cases, since every page > would need to be checked and potentially updated. > > You might try to do this while the server is first started and running > queries, but I think we found out from the online checkpoint patch that I think you meant the online checksum patch. Which this reminded me of, too. https://commitfest.postgresql.org/31/2611/ > having the server in an intermediate state while running queries is very > complex --- it might be simpler to just accept two XID formats all the > time than enabling the server to run with two formats for a short > period. My big point is that this needs more thought. -- Justin
Re: Add 64-bit XIDs into PostgreSQL 15
On Wed, Jan 5, 2022 at 06:12:26PM -0600, Justin Pryzby wrote: > On Wed, Jan 05, 2022 at 06:51:37PM -0500, Bruce Momjian wrote: > > On Tue, Jan 4, 2022 at 10:22:50PM +, Finnerty, Jim wrote: > > > I'm concerned about the maintainability impact of having 2 new > > > on-disk page formats. It's already complex enough with XIDs and > > > multixact-XIDs. > > > > > > If the lack of space for the two epochs in the special data area is > > > a problem only in an upgrade scenario, why not resolve the problem > > > before completing the upgrade process like a kind of post-process > > > pg_repack operation that converts all "double xmax" pages to > > > the "double-epoch" page format? i.e. maybe the "double xmax" > > > representation is needed as an intermediate representation during > > > upgrade, but after upgrade completes successfully there are no pages > > > with the "double-xmax" representation. This would eliminate a whole > > > class of coding errors and would make the code dealing with 64-bit > > > XIDs simpler and more maintainable. > > > > Well, yes, we could do this, and it would avoid the complexity of having > > to support two XID representations, but we would need to accept that > > fast pg_upgrade would be impossible in such cases, since every page > > would need to be checked and potentially updated. > > > > You might try to do this while the server is first started and running > > queries, but I think we found out from the online checkpoint patch that > > I think you meant the online checksum patch. Which this reminded me of, too. > > https://commitfest.postgresql.org/31/2611/ Sorry, yes, I have checkpoint on my mind. ;-) -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: make tuplestore helper function
On Wed, Jan 05, 2022 at 12:09:16PM -0500, Melanie Plageman wrote: > On Fri, Dec 17, 2021 at 3:04 PM Justin Pryzby wrote: > > The rest of these are questions more than comments, and I'm not necessarily > > suggesting to change the patch: > > > > This isn't new in your patch, but for me, it's more clear if the callers > > have a > > separate boolean for randomAccess, rather than having the long expression in > > the function call: > > > > + tupstore = MakeFuncResultTuplestore(fcinfo, NULL, > > + rsinfo->allowedModes & SFRM_Materialize_Random); > > > > vs > > > > randomAccess = (rsinfo->allowedModes & SFRM_Materialize_Random) != > > 0; > > - tupstore = tuplestore_begin_heap(randomAccess, false, work_mem); > > + tupstore = MakeFuncResultTuplestore(fcinfo, NULL, randomAccess); > > > > One could also argue for passing rsinfo->allowedModes, instead of > > (rsinfo->allowedModes & SFRM_Materialize_Random). > > I've changed the patch to have MakeFuncResultTuplestore() check > rsinfo->allowedModes and pass in the resulting boolean to > tuplestore_begin_heap(). Because I modified the > MakeFuncResultTuplestore() function signature to accommodate this, I had > to collapse the two patches into one, as I couldn't maintain the call > sites which passed in a constant. > > There's a couples places that you're checking expectedDesc where it wasn't > > being checked before. Is that some kind of live bug ? > > pg_config() text_to_table() > > Yes, expectedDesc was accessed in these locations without checking that > it is not NULL. Should that be a separate patch? I don't know. The patch is already easy to review, since it's mostly limited to removing code and fixing inconsistencies (NULL check) and possible inefficiencies (randomAccess). If the expectedDesc NULL check were an 0001 patch, then 0002 (the main patch) would be even easier to review. Only foreign.c is different. > > You removed one call to tuplestore_donestoring(), but not the others. > > I guess you could remove them all (optionally as a separate patch). > > I removed it in that one location because I wanted to get rid of the > local variable it used. What local variable ? I see now that logicalfuncs.c never had a local variable called tupstore. I'm sure it was intended since 2014 (b89e15105) to say tupestore_donestoring(p->tupstore). But donestoring(typo) causes no error, since the define is a NOP. src/include/utils/tuplestore.h:#define tuplestore_donestoring(state) ((void) 0) > I am fine with removing the other occurrences, > but I thought there might be some reason why instead of removing it, > they made it into a no-op. I assume Tom left it (actually, added it back in dd04e958c) to avoid breaking extensions for no reason. And maybe to preserve the possbility of at some point in the future doing something again during the "done" step. I'd leave it for input from a committer about those: - remove tuplestore_donestoring() calls ? - split expectedDesc NULL check to an 0001 patch ? - anything other opened questions ? I'm marking this as RFC, with the caveat that the newline before MakeFuncResultTuplestore went missing again :) -- Justin
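For context, the helper shape being discussed is roughly the following sketch (the tuple-descriptor handling and the expectedDesc checks mentioned above are omitted, and the real signature in the patch differs; this only illustrates moving the allowedModes test inside the helper):

Tuplestorestate *
MakeFuncResultTuplestore(FunctionCallInfo fcinfo)
{
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    bool        randomAccess;

    /* callers are assumed to have validated rsinfo already */
    randomAccess = (rsinfo->allowedModes & SFRM_Materialize_Random) != 0;

    return tuplestore_begin_heap(randomAccess, false, work_mem);
}

Callers then no longer spell out the SFRM_Materialize_Random test themselves.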
Re: Remove trailing comma from enums
On Thu, Jan 6, 2022 at 12:56 PM Peter Smith wrote: > I saw one (and then went looking and found some more) enum with a > trailing comma. > > These are quite rare in the PG src, so I doubt they are intentional. > > PSA a patch to remove the trailing commas for all that I found. -1. I don't see the problem with C99 trailing commas. They avoid noisy diff lines when patches add/remove items.
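To make the diff-noise argument concrete, here is a tiny illustrative enum (not taken from the tree):

typedef enum DemoColor
{
    DEMO_RED,
    DEMO_GREEN,
    DEMO_BLUE,      /* C99 trailing comma: appending a new member later
                     * adds one line without touching this one */
} DemoColor;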
Re: Proposal: remove obsolete hot-standby testing infrastructure
Peter Eisentraut writes: > On 03.01.22 22:50, Tom Lane wrote: >> The attached proposed patch removes some ancient infrastructure for >> manually testing hot standby. > I looked into this some time ago and concluded that this test contains a > significant amount of testing that isn't obviously done anywhere else. > I don't have the notes anymore, and surely some things have progressed > since, but I wouldn't just throw the old test suite away without > actually checking. Fair enough ... so I looked, and there's not much at all that I'm worried about. hs_standby_allowed: This is basically checking that the standby can see data from the primary, which we surely have covered. Although it does also cover propagation of nextval, which AFAICS is not tested in src/test/recovery, so perhaps that's worth troubling over. There are also some checks that particular commands are allowed on the standby, which seem to me to be not too helpful; see also comments on the next file. hs_standby_disallowed: Inverse of the above: check that some commands are disallowed. We check some of these in 001_stream_rep.pl, and given the current code structure in utility.c (ClassifyUtilityCommandAsReadOnly etc), I do not see much point in adding more test cases of the same sort. The only likely new bug in that area would be misclassification of some new command, and no amount of testing of existing cases will catch that. There are also tests that particular functions are disallowed, which isn't something that goes through ClassifyUtilityCommandAsReadOnly. Nonetheless, adding more test cases here wouldn't help catch future oversights of that type, so I remain unexcited. hs_standby_functions: Mostly also checking that things are disallowed. There's also a test of pg_cancel_backend, which is cute but probably suffers from timing instability (i.e., delayed arrival of the signal might change the output). Moreover, pg_cancel_backend is already covered in the isolation tests, and I see no reason to think it'd operate differently on a standby. hs_standby_check: Checks pg_is_in_recovery(), which is checked far more thoroughly by pg_ctl/t/003_promote.pl. hs_primary_extremes: Checks that we can cope with deep subtransaction nesting. Maybe this is worth preserving, but I sort of doubt it --- the standby doesn't even see the nesting does it? Also checks that the standby can cope with 257 exclusive locks at once, which corresponds to no interesting limit that I know of. So basically, I'd be inclined to add a couple of tests of sequence-update propagation to src/test/recovery and call it good. regards, tom lane
Re: Remove trailing comma from enums
Thomas Munro writes: > On Thu, Jan 6, 2022 at 12:56 PM Peter Smith wrote: >> These are quite rare in the PG src, so I doubt they are intentional. >> PSA a patch to remove the trailing commas for all that I found. > -1. I don't see the problem with C99 trailing commas. They avoid > noisy diff lines when patches add/remove items. I think they're rare because up till very recently we catered to pre-C99 compilers that wouldn't accept them. There's not much point in insisting on that now, though. Personally I'm less excited than Thomas about trailing commas being good for reducing diff noise, mainly because I think that "add new entries at the end" is an anti-pattern, and if you put new items where they logically belong then the problem is much rarer. But I'm not going to argue against committers who want to do it like that, either. regards, tom lane
Re: Map WAL segment files on PMEM as WAL buffers
Rebased. On Fri, Nov 5, 2021 at 3:47 PM Takashi Menjo wrote: > > Hi Daniel, > > The issue you told has been fixed. I attach the v5 patchset to this email. > > The v5 has all the patches in the v4, and in addition, has the > following two new patches: > > - (v5-0002) Support build with MSVC on Windows: Please have > src\tools\msvc\config.pl as follows to "configure --with-libpmem:" > > $config->{pmem} = 'C:\path\to\pmdk\x64-windows'; > > - (v5-0006) Compatible to Windows: This patch resolves conflicting > mode_t typedefs and libpmem API variants (U or W, like Windows API). > > Best regards, > Takashi > > On Thu, Nov 4, 2021 at 5:46 PM Takashi Menjo wrote: > > > > Hello Daniel, > > > > Thank you for your comment. I had the following error message with > > MSVC on Windows. It looks the same as what you told me. I'll fix it. > > > > | > cd src\tools\msvc > > | > build > > | (..snipped..) > > | Copying pg_config_os.h... > > | Generating configuration headers... > > | undefined symbol: HAVE_LIBPMEM at src/include/pg_config.h line 347 > > at C:/Users/menjo/Documents/git/postgres/src/tools/msvc/Mkvcbuild.pm > > line 860. > > > > Best regards, > > Takashi > > > > > > On Wed, Nov 3, 2021 at 10:04 PM Daniel Gustafsson wrote: > > > > > > > On 28 Oct 2021, at 08:09, Takashi Menjo wrote: > > > > > > > Rebased, and added the patches below into the patchset. > > > > > > Looks like the 0001 patch needs to be updated to support Windows and > > > MSVC. See > > > src/tools/msvc/Mkvcbuild.pm and Solution.pm et.al for inspiration on how > > > to add > > > the MSVC equivalent of --with-libpmem. Currently the patch fails in the > > > "Generating configuration headers" step in Solution.pm. > > > > > > -- > > > Daniel Gustafsson https://vmware.com/ > > > > > > > > > -- > > Takashi Menjo > > > > -- > Takashi Menjo -- Takashi Menjo v6-0003-Add-wal_pmem_map-to-GUC.patch Description: Binary data v6-0001-Add-with-libpmem-option-for-PMEM-support.patch Description: Binary data v6-0004-Export-InstallXLogFileSegment.patch Description: Binary data v6-0002-Support-build-with-MSVC-on-Windows.patch Description: Binary data v6-0005-Map-WAL-segment-files-on-PMEM-as-WAL-buffers.patch Description: Binary data v6-0008-Let-wal_pmem_map-be-constant-unless-with-libpmem.patch Description: Binary data v6-0006-Compatible-to-Windows.patch Description: Binary data v6-0009-Ensure-WAL-mappings-before-assertion.patch Description: Binary data v6-0007-WAL-statistics-in-cases-of-wal_pmem_map-true.patch Description: Binary data v6-0010-Update-document.patch Description: Binary data v6-0011-Preallocate-and-initialize-more-WAL-if-wal_pmem_m.patch Description: Binary data
Re: Remove trailing comma from enums
On Thu, Jan 6, 2022 at 12:23 PM Tom Lane wrote: > > Thomas Munro writes: > > On Thu, Jan 6, 2022 at 12:56 PM Peter Smith wrote: > >> These are quite rare in the PG src, so I doubt they are intentional. > >> PSA a patch to remove the trailing commas for all that I found. > > > -1. I don't see the problem with C99 trailing commas. They avoid > > noisy diff lines when patches add/remove items. > > I think they're rare because up till very recently we catered to > pre-C99 compilers that wouldn't accept them. There's not much > point in insisting on that now, though. > > Personally I'm less excited than Thomas about trailing commas > being good for reducing diff noise, mainly because I think > that "add new entries at the end" is an anti-pattern, and > if you put new items where they logically belong then the > problem is much rarer. But I'm not going to argue against > committers who want to do it like that, either. FWIW, the background of this was that one of these examples overlapped with a feature currently in development and it just caused a waste of everyone's time by firstly "fixing" (removing) the extra comma and then getting multiple code reviews saying the change was unrelated to that feature and so having to remove that fix again. So I felt removing all such commas at HEAD not only makes all the enums consistent, but it may prevent similar time-wasting for others in the future. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: row filtering for logical replication
On Thu, Jan 6, 2022 at 1:10 AM Alvaro Herrera wrote: > > BTW I think it's not great to commit with the presented split. We would > have non-trivial short-lived changes for no good reason (0002 in > particular). I think this whole series should be a single patch, with Yes, we know that eventually these parts will be combined and committed as a single patch. What you see now is still a work-in-progress. The current separation has been mostly for helping multiple people collaborate without too much clashing. e.g., the 0002 patch has been kept separate just to help do performance testing of that part in isolation. > the commit message being a fusion of messages explaining in full what > the functional change is, listing all the authors together. Having a > commit message like in 0001 where all the distinct changes are explained > in separate sections with each section listing its own author, does not > sound very useful or helpful. > Yes, the current v58-0001 commit message is just a combination of previous historical patch comments as each of them got merged back into the main patch. This message format was just a quick/easy way to ensure that no information was accidentally lost along the way. We understand that prior to the final commit this will all need to be fused together just like you are suggesting. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: biblio.sgml dead link
On Wed, Jan 05, 2022 at 09:56:26PM +0100, Daniel Gustafsson wrote: > I think we should remove the link, not because it costs money but because it > can be purchased from several places, and choosing one to "favor" seems to > invite criticism. Kind of how we don't link to an online store for buying the > other books. Yeah, that's my feeling as well. So done this way across the board. -- Michael
Re: Add 64-bit XIDs into PostgreSQL 15
Hi! On Thu, Jan 6, 2022 at 3:02 AM Bruce Momjian wrote: > > On Thu, Dec 30, 2021 at 03:15:16PM +0300, Maxim Orlov wrote: > > PFA updated working patch v6 for PG15 development cycle. > > It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, > > refactoring and was rebased to PG15. > > > > Main changes: > > - Change TransactionId to 64bit > > - Disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax > > remains 32bit > > -- 32bit xid is named ShortTransactionId now. > > -- Exception: see "double xmax" format below. > > - Heap page format is changed to contain xid and multixact base value, > > tuple's xmin and xmax are offsets from. > > -- xid_base and multi_base are stored as a page special data. PageHeader > >remains unmodified. > > -- If after upgrade page has no free space for special data, tuples are > >converted to "double xmax" format: xmin became virtual > >FrozenTransactionId, xmax occupies the whole 64bit. > >Page converted to new format when vacuum frees enough space. > > I think it is a great idea to allow the 64-XID to span the 32-bit xmin > and xmax fields when needed. It would be nice if we can get focus on > this feature so we are sure it gets into PG 15. Can we add this patch > incrementally so people can more easily analyze it? I see at least the following major issues/questions in this patch. 1) Current code relies on the fact that TransactionId can be atomically read from/written to shared memory. With 32-bit systems and 64-bit TransactionId, that's not true anymore. Therefore, the patch has concurrency issues on 32-bit systems. We need to carefully review these issues and provide a fallback for 32-bit systems. I suppose nobody is thinking about dropping 32-bit systems, right? Also, I wonder how critical the overhead for 32-bit systems is for us. They are going to become legacy, so overhead isn't so critical, right? 2) With this patch we still need to freeze to cut SLRUs. This is especially problematic with Multixacts, because systems heavily using row-level locks can consume an enormous amount of multixacts. That is especially problematic when we have 2x bigger multixacts. We probably can provide an alternative implementation for multixact vacuum, which doesn't require scanning all the heaps. That is a fair amount of work though. The clog is much smaller and we can cut it more rarely. Perhaps we could tolerate freezing to cut clog, couldn't we? 3) The 2x bigger in-memory representation of TransactionId has overheads. In particular, it could mitigate the effect of recent advancements from Andres Freund. I'm not exactly sure we should/can do something with this. But I think this should be at least carefully analyzed. 4) SP-GiST index stores TransactionId on pages. Could we tolerate dropping SP-GiST indexes on a major upgrade? Or do we have to invent something smarter? 5) The 32-bit limitation within the page mentioned upthread by Stephen Frost should also be carefully considered. Ideally, we should get rid of it, but I don't have particular ideas in this field for now. At least, we should make sure we did our best at error reporting and monitoring capabilities. I think the realistic goal for the PG 15 development cycle would be agreement on a roadmap for all the items above (and probably some initial implementations). -- Regards, Alexander Korotkov
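To visualize the on-page scheme being reviewed here, a rough sketch of the xid_base idea follows. The function name and the ShortTransactionId typedef below are illustrative assumptions, not the patch's code, and since the patch widens TransactionId itself to 64 bits, a plain uint64 is used for the result:

#include "postgres.h"
#include "access/transam.h"

typedef uint32 ShortTransactionId;  /* 32-bit xid as kept in the tuple header */

/*
 * Sketch: tuples keep 32-bit xmin/xmax, the page's special area keeps a
 * 64-bit xid_base, and the full value is rebuilt by addition.  Special
 * XIDs below FirstNormalTransactionId are not offset.
 */
static inline uint64
FullXidFromPageBase(uint64 xid_base, ShortTransactionId xid)
{
    if (xid < FirstNormalTransactionId)
        return (uint64) xid;
    return xid_base + xid;
}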
Re: row filtering for logical replication
On Wed, Jan 5, 2022 at 9:52 PM Amit Kapila wrote: > ... > Another minor comment: > +static bool pgoutput_row_filter(enum ReorderBufferChangeType changetype, > > Do we need to specify the 'enum' type before changetype parameter? > That is because there is currently no typedef for the enum ReorderBufferChangeType. Of course, it is easy to add a typedef and then this 'enum' is not needed in the signature, but I wasn't sure if adding a new typedef strictly belonged as part of this Row-Filter patch or not. -- Kind Regards. Peter Smith. Fujitsu Australia.
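For clarity, the change being talked about is just this kind of thing (members trimmed for the example; the real enum in reorderbuffer.h has more, and the 0002 patch may do it slightly differently):

typedef enum ReorderBufferChangeType
{
    REORDER_BUFFER_CHANGE_INSERT,
    REORDER_BUFFER_CHANGE_UPDATE,
    REORDER_BUFFER_CHANGE_DELETE
    /* ... remaining members elided ... */
} ReorderBufferChangeType;

/* With the typedef in place, the 'enum' keyword can be dropped from the
 * function signature (other parameters omitted in this sketch): */
static bool pgoutput_row_filter(ReorderBufferChangeType changetype);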
RE: Optionally automatically disable logical replication subscriptions on error
On Wednesday, January 5, 2022 8:53 PM osumi.takami...@fujitsu.com wrote: > > On Tuesday, December 28, 2021 11:53 AM Wang, Wei/王 威 > wrote: > > On Thursday, December 16, 2021 8:51 PM osumi.takami...@fujitsu.com > > wrote: > > > Attached the updated patch v14. > > > > A comment to the timing of printing a log: > Thank you for your review ! > > > After the log[1] was printed, I altered subscription's option > > (DISABLE_ON_ERROR) from true to false before invoking > > DisableSubscriptionOnError to disable subscription. Subscription was not > > disabled. > > [1] "LOG: logical replication subscription "sub1" will be disabled due to > > an > > error" > > > > I found this log is printed in function WorkerErrorRecovery: > > + ereport(LOG, > > + errmsg("logical replication subscription \"%s\" will > > be disabled due to an error", > > + MySubscription->name)); > > This log is printed here, but in DisableSubscriptionOnError, there is a > > check to > > confirm subscription's disableonerr field. If disableonerr is found changed > > from > > true to false in DisableSubscriptionOnError, subscription will not be > > disabled. > > > > In this case, "disable subscription" is printed, but subscription will not > > be > > disabled actually. > > I think it is a little confused to user, so what about moving this message > > after > > the check which is mentioned above in DisableSubscriptionOnError? > Makes sense. I moved the log print after > the check of the necessity to disable the subscription. > > Also, I've scrutinized and refined the new TAP test as well for refactoring. > As a result, I fixed wait_for_subscriptions() > so that some extra codes that can be simplified, > such as escaped variable and one part of WHERE clause, are removed. > Other change I did is to replace two calls of wait_for_subscriptions() > with polling_query_until() for the subscriber, in order to > make the tests better and more suitable for the test purposes. > Again, for this refinement, I've conducted a tight loop test > to check no timing issue and found no problem. > Thanks for updating the patch. Here are some comments: 1) + /* +* We would not be here unless this subscription's disableonerr field was +* true when our worker began applying changes, but check whether that +* field has changed in the interim. +*/ + if (!subform->subdisableonerr) + { + /* +* Disabling the subscription has been done already. No need of +* additional work. +*/ + heap_freetuple(tup); + table_close(rel, RowExclusiveLock); + CommitTransactionCommand(); + return; + } I don't understand what "Disabling the subscription has been done already" means; I think we only get here when subdisableonerr was changed in the interim. Should we modify this comment? Or remove it because there are already some explanations before. 2) + /* Set the subscription to disabled, and note the reason. */ + values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(false); + replaces[Anum_pg_subscription_subenabled - 1] = true; I didn't see the code corresponding to "note the reason". Should we modify the comment? 3) + booldisableonerr; /* Whether errors automatically disable */ This comment is hard to understand. Maybe it can be changed to: Indicates if the subscription should be automatically disabled when subscription workers detect any errors. Regards, Tang
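For context when reading the review points above, the catalog update under discussion boils down to something like the sketch below. This is only an approximation of the patch's DisableSubscriptionOnError() (the real function also handles subscription locking, error reporting, and transaction state), and subdisableonerr is a column introduced by the patch, not present in current sources:

#include "postgres.h"
#include "access/htup_details.h"
#include "access/table.h"
#include "catalog/indexing.h"
#include "catalog/pg_subscription.h"
#include "utils/rel.h"
#include "utils/syscache.h"

static void
DisableSubscriptionSketch(Oid subid)
{
    Relation    rel;
    HeapTuple   tup;
    Form_pg_subscription subform;
    Datum       values[Natts_pg_subscription];
    bool        nulls[Natts_pg_subscription];
    bool        replaces[Natts_pg_subscription];

    rel = table_open(SubscriptionRelationId, RowExclusiveLock);
    tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for subscription %u", subid);
    subform = (Form_pg_subscription) GETSTRUCT(tup);

    /* Skip the update if disable_on_error was switched off in the meantime. */
    if (subform->subdisableonerr)
    {
        memset(values, 0, sizeof(values));
        memset(nulls, false, sizeof(nulls));
        memset(replaces, false, sizeof(replaces));

        /* Only subenabled changes; all other columns are kept as-is. */
        values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(false);
        replaces[Anum_pg_subscription_subenabled - 1] = true;

        tup = heap_modify_tuple(tup, RelationGetDescr(rel),
                                values, nulls, replaces);
        CatalogTupleUpdate(rel, &tup->t_self, tup);
    }

    heap_freetuple(tup);
    table_close(rel, RowExclusiveLock);
}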
Re: ICU for global collation
On Tue, Jan 04, 2022 at 05:03:10PM +0100, Peter Eisentraut wrote: > On 04.01.22 03:21, Julien Rouhaud wrote: > > > > - if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID) > > > - mylocale = pg_newlocale_from_collation(collid); > > > + if (!lc_collate_is_c(collid)) > > > + { > > > + if (collid != DEFAULT_COLLATION_OID) > > > + mylocale = pg_newlocale_from_collation(collid); > > > + else if (default_locale.provider == COLLPROVIDER_ICU) > > > + mylocale = &default_locale; > > > + } > > > > There are really a lot of places with this new code. Maybe it could be some > > new function/macro to wrap that for the normal case (e.g. not formatting.c)? > > Right, we could just put this into pg_newlocale_from_collation(), but the > comment there says > > * In fact, they shouldn't call this function at all when they are dealing > * with the default locale. That can save quite a bit in hotspots. > > I don't know how to assess that. > > We could pack this into a macro or inline function if we are concerned about > this. Yes, that was my idea: just have a new function (an inline function or a macro, then, since pg_newlocale_from_collation() clearly warns about performance concerns) that has the whole is-not-c-collation-and-is-default-collation-or-icu-collation logic and calls pg_newlocale_from_collation() only when needed.
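A minimal sketch of such a wrapper, assuming the shape of the quoted hunk: "default_locale" is the variable added by the patch under review (it does not exist in current sources), the helper name is an assumption, and callers would still check lc_collate_is_c() first, as in the hunk above:

#include "postgres.h"
#include "catalog/pg_collation.h"
#include "utils/pg_locale.h"

static inline pg_locale_t
pg_newlocale_from_collation_or_default(Oid collid)
{
    if (collid != DEFAULT_COLLATION_OID)
        return pg_newlocale_from_collation(collid);
    if (default_locale.provider == COLLPROVIDER_ICU)
        return &default_locale;
    return (pg_locale_t) 0;     /* default libc locale: no pg_locale_t needed */
}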
Re: Map WAL segment files on PMEM as WAL buffers
The cfbot showed issues compiling on linux and windows. http://cfbot.cputube.org/takashi-menjo.html https://cirrus-ci.com/task/6125740327436288 [02:30:06.538] In file included from xlog.c:38: [02:30:06.538] ../../../../src/include/access/xlogpmem.h:32:42: error: unknown type name ‘tli’ [02:30:06.538]32 | PmemXLogEnsurePrevMapped(XLogRecPtr ptr, tli) [02:30:06.538] | ^~~ [02:30:06.538] xlog.c: In function ‘GetXLogBuffer’: [02:30:06.538] xlog.c:1959:19: warning: implicit declaration of function ‘PmemXLogEnsurePrevMapped’ [-Wimplicit-function-declaration] [02:30:06.538] 1959 |openLogSegNo = PmemXLogEnsurePrevMapped(endptr, tli); https://cirrus-ci.com/task/6688690280857600?logs=build#L379 [02:33:25.752] c:\cirrus\src\include\access\xlogpmem.h(33,1): error C2081: 'tli': name in formal parameter list illegal (compiling source file src/backend/access/transam/xlog.c) [c:\cirrus\postgres.vcxproj] I'm attaching a probable fix. Unfortunately, for patches like this, most of the functionality isn't exercised unless the library is installed and compilation and runtime are enabled by default. In 0009: recaluculated => recalculated 0010-Update-document should be squished with 0003-Add-wal_pmem_map-to-GUC (and maybe 0002 and 0001). I believe the patches after 0005 are more WIP, so it's fine if they're not squished yet. I'm not sure what the point is of this one: 0008-Let-wal_pmem_map-be-constant-unl + ereport(ERROR, + (errcode_for_file_access(), +errmsg("could not pmem_map_file \"%s\": %m", path))); => The outer parenthesis are not needed since e3a87b4. >From e5614f2ea3ff6aaf016343f81f74366440e18f6f Mon Sep 17 00:00:00 2001 From: Takashi Menjo Date: Tue, 23 Mar 2021 13:32:27 +0900 Subject: [PATCH 01/13] Add --with-libpmem option for PMEM support --- configure | 99 ++ configure.ac | 17 +++ src/include/pg_config.h.in | 6 +++ 3 files changed, 122 insertions(+) diff --git a/configure b/configure index 3b19105328d..22c364fac4f 100755 --- a/configure +++ b/configure @@ -699,6 +699,7 @@ with_gnu_ld LD LDFLAGS_SL LDFLAGS_EX +with_libpmem LZ4_LIBS LZ4_CFLAGS with_lz4 @@ -868,6 +869,7 @@ with_libxslt with_system_tzdata with_zlib with_lz4 +with_libpmem with_gnu_ld with_ssl with_openssl @@ -1576,6 +1578,7 @@ Optional Packages: use system time zone data in DIR --without-zlib do not use Zlib --with-lz4 build with LZ4 support + --with-libpmem build with PMEM support --with-gnu-ld assume the C compiler uses GNU ld [default=no] --with-ssl=LIB use LIB for SSL/TLS support (openssl) --with-openssl obsolete spelling of --with-ssl=openssl @@ -9033,6 +9036,41 @@ fi done fi +# +# libpmem +# +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with PMEM support" >&5 +$as_echo_n "checking whether to build with PMEM support... " >&6; } + + + +# Check whether --with-libpmem was given. +if test "${with_libpmem+set}" = set; then : + withval=$with_libpmem; + case $withval in +yes) + +$as_echo "#define USE_LIBPMEM 1" >>confdefs.h + + ;; +no) + : + ;; +*) + as_fn_error $? "no argument expected for --with-libpmem option" "$LINENO" 5 + ;; + esac + +else + with_libpmem=no + +fi + + +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libpmem" >&5 +$as_echo "$with_libpmem" >&6; } + + # # Assignments # @@ -13504,6 +13542,56 @@ fi fi +if test "$with_libpmem" = yes; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_memcpy in -lpmem" >&5 +$as_echo_n "checking for pmem_memcpy in -lpmem... 
" >&6; } +if ${ac_cv_lib_pmem_pmem_memcpy+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_check_lib_save_LIBS=$LIBS +LIBS="-lpmem $LIBS" +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. */ +#ifdef __cplusplus +extern "C" +#endif +char pmem_memcpy (); +int +main () +{ +return pmem_memcpy (); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + ac_cv_lib_pmem_pmem_memcpy=yes +else + ac_cv_lib_pmem_pmem_memcpy=no +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +LIBS=$ac_check_lib_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_memcpy" >&5 +$as_echo "$ac_cv_lib_pmem_pmem_memcpy" >&6; } +if test "x$ac_cv_lib_pmem_pmem_memcpy" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_LIBPMEM 1 +_ACEOF + + LIBS="-lpmem $LIBS" + +else + as_fn_error $? "library 'libpmem' (version >= 1.5) is required for PMEM support" "$LINENO" 5 +fi + +fi + ## ## H
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Wed, Jan 5, 2022 at 11:16 PM Andres Freund wrote: > > Hi, > > On 2021-12-29 11:31:51 -0800, Andres Freund wrote: > > That's pretty much the same - XLogInsert() can trigger an > > XLogWrite()/Flush(). > > > > I think it's a complete no-go to add throttling to these places. It's quite > > possible that it'd cause new deadlocks, and it's almost guaranteed to have > > unintended consequences (e.g. replication falling back further because > > XLogFlush() is being throttled). > > I thought of another way to implement this feature. What if we checked the > current distance somewhere within XLogInsert(), but only set > InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we > check if XLogDelayPending is true and sleep the appropriate time. > > That way the sleep doesn't happen with important locks held / within a > critical section, but we still delay close to where we went over the maximum > lag. And the overhead should be fairly minimal. +1, this sounds like a really good idea to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Bugs in pgoutput.c
On Thu, Jan 6, 2022 at 3:42 AM Tom Lane wrote: > > Commit 6ce16088b caused me to look at pgoutput.c's handling of > cache invalidations, and I was pretty appalled by what I found. > > * rel_sync_cache_relation_cb does the wrong thing when called for > a cache flush (i.e., relid == 0). Instead of invalidating all > RelationSyncCache entries as it should, it will do nothing. > > * When rel_sync_cache_relation_cb does invalidate an entry, > it immediately zaps the entry->map structure, even though that > might still be in use (as per the adjacent comment that carefully > explains why this isn't safe). I'm not sure if this could lead > to a dangling-pointer core dump, but it sure seems like it could > lead to failing to translate tuples that are about to be sent. > > * Similarly, rel_sync_cache_publication_cb is way too eager to > reset the pubactions flags, which would likely lead to failing > to transmit changes that we should transmit. > > The attached patch fixes these things, but I'm still pretty > unhappy with the general design of the data structures in > pgoutput.c, because there is this weird random mishmash of > static variables along with a palloc'd PGOutputData struct. > This cannot work if there are ever two active LogicalDecodingContexts > in the same process. I don't think serial use of LogicalDecodingContexts > (ie, destroy one and then make another) works very well either, > because pgoutput_shutdown is a mere fig leaf that ignores all the > junk the module previously made (in CacheMemoryContext no less). > So I wonder whether either of those scenarios is possible/supported/ > expected to be needed in future. > > Also ... maybe I'm not looking in the right place, but I do not > see anything anywhere in logical decoding that is taking any lock > on the relation being processed. How can that be safe? > We don't need to acquire a lock on relation while decoding changes from WAL because it uses a historic snapshot to build a relcache entry and all the later changes to the rel are absorbed while decoding WAL. It is important to not acquire a lock on user-defined relations during decoding otherwise it could lead to deadlock as explained in the email [1]. * Would it be better if we move all the initialization done by patch in get_rel_sync_entry() to a separate function as I expect future patches might need to reset more things? * * logical decoding callback calls - but invalidation events can come in - * *during* a callback if we access the relcache in the callback. Because - * of that we must mark the cache entry as invalid but not remove it from - * the hash while it could still be referenced, then prune it at a later - * safe point. - * - * Getting invalidations for relations that aren't in the table is - * entirely normal, since there's no way to unregister for an invalidation - * event. So we don't care if it's found or not. + * *during* a callback if we do any syscache or table access in the + * callback. As we don't take locks on tables, can invalidation events be accepted during table access? I could be missing something but I see relation.c accepts invalidation messages only when lock mode is not 'NoLock'. [1] - https://www.postgresql.org/message-id/CAA4eK1Ks%2Bp8wDbzhDr7yMYEWDbWFRJAd_uOY-moikc%2Bzr9ER%2Bg%40mail.gmail.com -- With Regards, Amit Kapila.
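To spell out the first point, the cache-flush case (relid == 0) has to invalidate every entry. A minimal sketch of that behaviour, as it would sit inside pgoutput.c (RelationSyncCache and RelationSyncEntry are that file's private hash table and entry struct), is below; the committed fix may well look up a single entry when relid is valid instead of scanning the whole table:

static void
rel_sync_cache_relation_cb(Datum arg, Oid relid)
{
    HASH_SEQ_STATUS status;
    RelationSyncEntry *entry;

    if (RelationSyncCache == NULL)
        return;

    hash_seq_init(&status, RelationSyncCache);
    while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
    {
        /* InvalidOid means "flush everything"; otherwise match the relation. */
        if (!OidIsValid(relid) || entry->relid == relid)
            entry->replicate_valid = false;     /* mark only; free later, at a safe point */
    }
}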
Re: row filtering for logical replication
On Thu, Jan 6, 2022 at 8:43 AM Peter Smith wrote: > > On Wed, Jan 5, 2022 at 9:52 PM Amit Kapila wrote: > > > ... > > > Another minor comment: > > +static bool pgoutput_row_filter(enum ReorderBufferChangeType changetype, > > > > Do we need to specify the 'enum' type before changetype parameter? > > > > That is because there is currently no typedef for the enum > ReorderBufferChangeType. > But I see that the 0002 patch is already adding the required typedef. > Of course, it is easy to add a typedef and then this 'enum' is not > needed in the signature, but I wasn't sure if adding a new typedef > strictly belonged as part of this Row-Filter patch or not. > I don't see any harm in doing so. -- With Regards, Amit Kapila.
Re: In-placre persistance change of a relation
At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund wrote in > The tap tests seems to fail on all platforms. See > https://cirrus-ci.com/build/4911549314760704 > > E.g. the linux failure is > > [16:45:15.569] > [16:45:15.569] # Failed test 'inserted' > [16:45:15.569] # at t/027_persistence_change.pl line 121. > [16:45:15.569] # Looks like you failed 1 test of 25. > [16:45:15.569] [16:45:15] t/027_persistence_change.pl .. > [16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100) > [16:45:15.569] Failed 1/25 subtests > [16:45:15.569] [16:45:15] > [16:45:15.569] > [16:45:15.569] Test Summary Report > [16:45:15.569] --- > [16:45:15.569] t/027_persistence_change.pl(Wstat: 256 Tests: 25 > Failed: 1) > [16:45:15.569] Failed test: 18 > [16:45:15.569] Non-zero exit status: 1 > [16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr 0.03 sys + > 48.94 cusr 17.13 csys = 66.24 CPU) > > https://api.cirrus-ci.com/v1/artifact/task/4785083130314752/tap/src/test/recovery/tmp_check/log/regress_log_027_persistence_change Thank you very much. It still doesn't fail on my devlopment environment (CentOS8), but I found a silly bug of the test script. I'm still not sure the reason the test item failed but I repost the updated version then watch what the CI says. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From e2deae1bef19827803e0e8f85b1e45e3fcd88505 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 11 Nov 2020 21:51:11 +0900 Subject: [PATCH v14 1/2] In-place table persistence change Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller. Also this allows for the cleanup of files left behind in the crash of the transaction that created it. 
--- src/backend/access/rmgrdesc/smgrdesc.c| 52 ++ src/backend/access/transam/README | 8 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 545 +- src/backend/commands/tablecmds.c | 266 +++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 +++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 +++ src/backend/storage/smgr/md.c | 93 ++- src/backend/storage/smgr/smgr.c | 32 + src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h| 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h| 17 + src/test/recovery/t/027_persistence_change.pl | 247 24 files changed, 1707 insertions(+), 182 deletions(-) create mode 100644 src/test/recovery/t/027_persistence_change.pl diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 773d57..d251f22207 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: +action = "CREATE"; +break; + case XLOG_SMGR_MARK_UNLINK: +action = "DELETE"; +break; + default: +action = ""; +break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } }
Re: In-placre persistance change of a relation
Hi, On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi wrote: >At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund wrote in >> The tap tests seems to fail on all platforms. See >> https://cirrus-ci.com/build/4911549314760704 >> >> E.g. the linux failure is >> >> [16:45:15.569] >> [16:45:15.569] # Failed test 'inserted' >> [16:45:15.569] # at t/027_persistence_change.pl line 121. >> [16:45:15.569] # Looks like you failed 1 test of 25. >> [16:45:15.569] [16:45:15] t/027_persistence_change.pl .. >> [16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100) >> [16:45:15.569] Failed 1/25 subtests >> [16:45:15.569] [16:45:15] >> [16:45:15.569] >> [16:45:15.569] Test Summary Report >> [16:45:15.569] --- >> [16:45:15.569] t/027_persistence_change.pl(Wstat: 256 Tests: 25 >> Failed: 1) >> [16:45:15.569] Failed test: 18 >> [16:45:15.569] Non-zero exit status: 1 >> [16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr 0.03 sys >> + 48.94 cusr 17.13 csys = 66.24 CPU) >> >> https://api.cirrus-ci.com/v1/artifact/task/4785083130314752/tap/src/test/recovery/tmp_check/log/regress_log_027_persistence_change > >Thank you very much. It still doesn't fail on my devlopment >environment (CentOS8), but I found a silly bug of the test script. >I'm still not sure the reason the test item failed but I repost the >updated version then watch what the CI says. Fwiw, you can now test the same way as cfbot does with a lower turnaround time, as explained in src/tools/ci/README -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Map WAL segment files on PMEM as WAL buffers
On Thu, Jan 6, 2022 at 5:00 PM Justin Pryzby wrote: > I'm attaching a probable fix. Unfortunately, for patches like this, most of > the functionality isn't exercised unless the library is installed and > compilation and runtime are enabled by default. By the way, you could add a separate patch marked not-for-commit that adds, say, an apt-get command to the Linux task in the .cirrus.yml file, and whatever --with-blah stuff you might need to the configure part, if that'd be useful to test this code. Eventually, if we wanted to support that permanently for all CI testing, we'd want to push package installation down to the image building scripts (not in the pg source tree) so that CI starts with everything we need pre-installed, but one of the goals of the recent CI work was to make it possible for patches that include dependency changes to be possible (for example the alternative SSL library threads).
Re: Skipping logical replication transactions on subscriber side
On Wed, Jan 5, 2022 at 9:48 AM Dilip Kumar wrote: > > On Wed, Jan 5, 2022 at 9:01 AM Amit Kapila wrote: > > > > On Mon, Dec 27, 2021 at 9:54 AM Masahiko Sawada > > wrote: > > > > Do you mean to say that you want to omit it even when we are > > committing the changes? > > > > > Apart from that, I'm vaguely concerned that the logic seems to be > > > getting complex. Probably it comes from the fact that we store > > > skip_xid in the catalog and update the catalog to clear/set the > > > skip_xid. It might be worth revisiting the idea of storing skip_xid on > > > shmem (e.g., ReplicationState)? > > > > > > > IIRC, the problem with that idea was that we won't remember skip_xid > > information after server restart and the user won't even know that it > > has to set it again. > > > I agree, that if we don't keep it in the catalog then after restart if > the transaction replayed again then the user has to set the skip xid > again and that would be pretty inconvenient because the user might > have to analyze the failure again and repeat the same process he did > before restart. But OTOH the combination of restart and the skip xid > might not be very frequent so this might not be a very bad option. > Basically, I am in favor of storing it in a catalog as that solution > looks cleaner at least from the user pov but if we think there are a > lot of complexities from the implementation pov then we might analyze > the approach of storing in shmem as well. > Fair point, but I think it is better to see the patch or the problems that can't be solved if we pursue storing it in catalog. Even, if we decide to store it in shmem, we need to invent some way to inform the user that we have not honored the previous setting of skip_xid and it needs to be reset again. -- With Regards, Amit Kapila.
Re: Refactoring of compression options in pg_basebackup
On Wed, Jan 05, 2022 at 10:33:38AM -0500, Robert Haas wrote: > I think we're going to want to offer both options. We can't know > whether the user prefers to consume CPU cycles on the server or on the > client. Compressing on the server has the advantage of potentially > saving transfer bandwidth, but the server is also often the busiest > part of the whole system, and users are often keen to offload as much > work as possible. Yeah. There are cases for both. I just got to wonder whether it makes sense to allow both server-side and client-side compression to be used at the same time. That would be a rather strange case, but well, with the correct set of options that could be possible. > Given that, I'd like us to be thinking about what the full set of > options looks like once we have (1) compression on either the server > or the client and (2) multiple compression algorithms and (3) multiple > compression levels. Personally, I don't really like the decision made > by this proposed patch. In this patch's view of the world, -Z is a way > of providing the compression level for whatever compression algorithm > you happen to have selected, but I think of -Z as being the upper-case > version of -z which I think of as selecting specifically gzip. It's > not particularly intuitive to me that in a command like pg_basebackup > --compress=, is a compression level rather than > an algorithm. So what I would propose is probably something like: > > pg_basebackup --compress=ALGORITHM [--compression-level=NUMBER] > pg_basebackup --server-compress=ALGORITHM [--compression-level=NUMBER] > > And then make -z short for --compress=gzip and -Z short for > --compress=gzip --compression-level=. That would be a > backward-incompatible change to the definition of --compress, but as > long as -Z works the same as today, I don't think many people will > notice. If we like, we can notice if the argument to --compress is an > integer and suggest using either -Z or --compress=gzip > --compression-level= instead. My view of things is slightly different, aka I'd rather keep --compress to mean a compression level with an integer option, but introduce a --compression-method={lz4,gzip,none}, with -Z being a synonym of --compression-method=gzip. That's at least the path we chose for pg_receivewal. I don't mind sticking with one way or another, as what you are proposing is basically the same thing I have in mind, but both tools ought to use the same set of options. Hmm. Perhaps at the end the problem is with --compress, where we don't know if it means a compression level or a compression method? For me, --compress means the former, and for you the latter. So a third way of seeing things is to drop completely --compress, but have one --compression-method and one --compression-level. That would bring a clear split. Or just one --compression-method for the client-side compression as you are proposing for the server-side compression, however I'd like to think that a split between the method and level is more intuitive. > In the proposed patch, you end up with pg_basebackup > --compression-method=lz4 -Z2 meaning compression with lz4 level 2. I > find that quite odd, though as with all such things, opinions may > vary. In my proposal, that would be an error, because it would be > equivalent to --compress=lz4 --compress=gzip --compression-level=2, > and would thus involve conflicting compression method specifications. It seems to me that you did not read the patch closely enough. 
The attached patch does not add support for LZ4 in pg_basebackup on the client-side yet. Once it gets added, though, the idea is that using --compress with LZ4 would result in an error. That's what happens with pg_receivewal on HEAD, for one. The patch just shapes things to plug LZ4 more easily in the existing code of pg_basebackup.c, and walmethods.c. So.. As of now, it is actually possible to cut the pie in three parts. There are no real objections to the cleanup of walmethods.c and the addition of some conditional TAP tests with pg_basebackup and client-side compression, as far as I can see, only to the option renaming part. Attached are two patches, then. 0001 is the cleanup of walmethods.c to rely on the compression method, with more tests (tests that could be moved into their own patch, as well). 0002 is the addition of the options I suggested upthread, but we may change that depending on what gets used for the server-side compression for consistency so I am not suggesting to merge that until we agree on the full picture. The point of this thread was mostly about 0001, so I am fine to discard 0002. Thoughts? -- Michael From b2beb69c23ea8346a684e8c7ae5a5b60bade2066 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 6 Jan 2022 13:46:56 +0900 Subject: [PATCH v3 1/2] Refactor walmethods.c's tar method to use compression method Some TAP tests are added for pg_basebackup --compress, while on it. --- src/bin/pg_basebackup/pg_baseback
Re: Delay the variable initialization in get_rel_sync_entry
On Wed, Jan 5, 2022 at 10:31 AM Michael Paquier wrote: > On Fri, Dec 24, 2021 at 01:27:26PM +, houzj.f...@fujitsu.com wrote: > > Here is the perf result of pgoutput_change after applying the patch. > > I didn't notice something else that stand out. > > > > |--2.99%--pgoutput_change > > |--1.80%--get_rel_sync_entry > > |--1.56%--hash_search > > > > Also attach complete profiles. > > Thanks. I have also done my own set of measurements, and the > difference is noticeable in the profiles I looked at. So, applied > down to 13. Thanks. I agree the variables were being defined in the wrong place before this patch. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Map WAL segment files on PMEM as WAL buffers
On Thu, Jan 06, 2022 at 05:52:01PM +1300, Thomas Munro wrote: > On Thu, Jan 6, 2022 at 5:00 PM Justin Pryzby wrote: > > I'm attaching a probable fix. Unfortunately, for patches like this, most of > > the functionality isn't exercised unless the library is installed and > > compilation and runtime are enabled by default. > > By the way, you could add a separate patch marked not-for-commit that > adds, say, an apt-get command to the Linux task in the .cirrus.yml > file, and whatever --with-blah stuff you might need to the configure > part, if that'd be useful to test this code. In general, I think that's more or less essential. But in this case it really doesn't work :( running bootstrap script ... 2022-01-05 23:17:30.244 CST [12088] FATAL: file not on PMEM: path "pg_wal/00010001" -- Justin
Re: GUC flags
On Wed, Jan 05, 2022 at 05:55:17PM -0600, Justin Pryzby wrote: > pg_settings is currently defined with "SELECT *". Is it fine to enumerate a > list of columns instead ? I'd like to think that this is a better practice when it comes to documenting the columns, but I don't see an actual need for this extra complication here. > + initStringInfo(&ret); > + appendStringInfoChar(&ret, '{'); > + > + if (flags & GUC_NO_SHOW_ALL) > + appendStringInfo(&ret, "NO_SHOW_ALL,"); > + if (flags & GUC_NO_RESET_ALL) > + appendStringInfo(&ret, "NO_RESET_ALL,"); > + if (flags & GUC_NOT_IN_SAMPLE) > + appendStringInfo(&ret, "NOT_IN_SAMPLE,"); > + if (flags & GUC_EXPLAIN) > + appendStringInfo(&ret, "EXPLAIN,"); > + if (flags & GUC_RUNTIME_COMPUTED) > + appendStringInfo(&ret, "RUNTIME_COMPUTED,"); > + > + /* Remove trailing comma, if any */ > + if (ret.len > 1) > + ret.data[--ret.len] = '\0'; The way of building the text array is incorrect here. See heap_tuple_infomask_flags() in pageinspect as an example with all the HEAP_* flags. I think that you should allocate an array of Datums, use CStringGetTextDatum() to assign each array element, wrapping the whole with construct_array() to build the final value for the parameter tuple. -- Michael
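Concretely, the suggested approach looks something like the sketch below; this is not the final patch, the helper name is an assumption, and only a few of the flag bits are shown:

#include "postgres.h"
#include "catalog/pg_type.h"
#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/guc.h"

static ArrayType *
guc_flags_to_text_array(int flags)
{
    Datum       flagdatums[16];
    int         nflags = 0;

    if (flags & GUC_NO_SHOW_ALL)
        flagdatums[nflags++] = CStringGetTextDatum("NO_SHOW_ALL");
    if (flags & GUC_NOT_IN_SAMPLE)
        flagdatums[nflags++] = CStringGetTextDatum("NOT_IN_SAMPLE");
    if (flags & GUC_EXPLAIN)
        flagdatums[nflags++] = CStringGetTextDatum("EXPLAIN");
    /* ... and so on for the other GUC_* bits of interest ... */

    /* text is a varlena type: elmlen -1, not by-value, int alignment */
    return construct_array(flagdatums, nflags, TEXTOID, -1, false, TYPALIGN_INT);
}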
Re: Refactoring of compression options in pg_basebackup
On Wed, Jan 05, 2022 at 10:22:06AM -0500, Tom Lane wrote: > I think the existing precedent is to skip the test if tar isn't there, > cf pg_basebackup/t/010_pg_basebackup.pl. But certainly the majority of > buildfarm animals have it. Even Windows environments should be fine, aka recent edc2332. -- Michael signature.asc Description: PGP signature
Re: GUC flags
At Tue, 4 Jan 2022 21:06:48 -0600, Justin Pryzby wrote in > On Wed, Jan 05, 2022 at 11:47:57AM +0900, Kyotaro Horiguchi wrote: > > At Tue, 28 Dec 2021 20:32:40 -0600, Justin Pryzby > > wrote in > In that case, *all* the flags should be exposed. There's currently 20, which > means it may not work well after all - it's already too long, and could get > longer, and/or overflow the alphabet... Yeah, if we show all 20 properties, the string is too long, and not every property can have a sensible abbreviation character. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: GUC flags
On Thu, Jan 06, 2022 at 02:19:08PM +0900, Michael Paquier wrote: > On Wed, Jan 05, 2022 at 05:55:17PM -0600, Justin Pryzby wrote: > > pg_settings is currently defined with "SELECT *". Is it fine to enumerate a > > list of columns instead ? > > I'd like to think that this is a better practice when it comes > documenting the columns, but I don't see an actual need for this extra > complication here. The reason is to avoid showing the flags in the pg_settings view, which should not be bloated just so we can retire check_guc. > > + initStringInfo(&ret); > > + appendStringInfoChar(&ret, '{'); > > + > > + if (flags & GUC_NO_SHOW_ALL) > > + appendStringInfo(&ret, "NO_SHOW_ALL,"); > > + if (flags & GUC_NO_RESET_ALL) > > + appendStringInfo(&ret, "NO_RESET_ALL,"); > > + if (flags & GUC_NOT_IN_SAMPLE) > > + appendStringInfo(&ret, "NOT_IN_SAMPLE,"); > > + if (flags & GUC_EXPLAIN) > > + appendStringInfo(&ret, "EXPLAIN,"); > > + if (flags & GUC_RUNTIME_COMPUTED) > > + appendStringInfo(&ret, "RUNTIME_COMPUTED,"); > > + > > + /* Remove trailing comma, if any */ > > + if (ret.len > 1) > > + ret.data[--ret.len] = '\0'; > > The way of building the text array is incorrect here. See > heap_tuple_infomask_flags() in pageinspect as an example with all the > HEAP_* flags. I think that you should allocate an array of Datums, > use CStringGetTextDatum() to assign each array element, wrapping the > whole with construct_array() to build the final value for the > parameter tuple. I actually did it that way last night ... however GetConfigOptionByNum() is expecting it to return a text string, not an array. -- Justin
Re: SQL:2011 application time
On Wed, Jan 5, 2022 at 8:07 AM Peter Eisentraut wrote: > > This patch set looks very interesting. Thank you for the review! I'll work on your feedback but in the meantime here are replies to your questions: > I'm confused about how to query tables based on application time > periods. Online, I see examples using AS OF, but in the SQL standard > I only see this used for system time, which we are not doing here. Correct, the standard only gives it for system time. I think application time is intended to be more "in user space" so it's fine to use regular operators in your WHERE condition against the time columns, whereas system time is more of a managed thing---automatic, read-only, possibly stored in a separate table. Having a special syntax cue lets the RDBMS know it needs to involve the historical records. > validate_period(): Could use an explanatory comment. There are a > bunch of output arguments, and it's not clear what all of this is > supposed to do, and what "validating" is in this context. I'm not too happy with that function, but a previous reviewer asked me to factor out what was shared between the CREATE TABLE and ALTER TABLE cases. It does some sanity checks on the columns you've chosen, and along the way it collects info about those columns that we'll need later. But yeah all those out parameters are pretty ugly. I'll see if I can come up with a stronger abstraction for it, and at the very least I'll add some comments. > MergeAttributes(): I would perhaps initially just prohibit inheritance > situations that involve periods on either side. (It should work for > partitioning, IMO, but that should be easier to arrange.) Okay. I'm glad to hear you think partitioning won't be too hard. It is one of the last things, but to me it's a bit intimidating. > I didn't follow why indexes would have periods, for example, the new > period field in IndexStmt. Is that explained anywhere? When you create a primary key or a unique constraint (which are backed by a unique index), you can give a period name to make it a temporal constraint. We create the index first and then create the constraint as a side-effect of that (e.g. index_create calls index_constraint_create). The analysis phase generates an IndexStmt. So I think this was mostly a way to pass the period info down to the constraint. It probably doesn't actually need to be stored on pg_index though. Maybe it does for index_concurrently_create_copy. I'll add some comments, but if you think it's the wrong approach let me know. > While reading this patch I kept wondering whether it would be possible > to fold periods into pg_attribute, perhaps with negative attribute > numbers. Have you looked into something like that? No doubt it's > also complicated, but it might simplify some things, like the name > conflict checking. Hmm, I thought that sort of thing would be frowned upon. :-) But also it seems like periods really do have a bunch of details they need beyond what other attributes have (e.g. the two source attributes, the matching range type, the period type (application-vs-system), maybe some extra things for table inheritance. Also are you sure we aren't already using negative attnums somewhere already? I thought I saw something like that. > Of course, the main problem in this patch is that for most uses it > requires btree_gist. I think we should consider moving that into > core, or at least the support for types that are most relevant to this > functionality, specifically the date/time types. 
> Aside from user convenience, this would also allow writing more realistic test cases. I think this would be great too. How realistic do you think it is? I figured since exclusion constraints are also pretty useless without btree_gist, it wasn't asking too much to have people install the extension, but still it'd be better if it were all built in. > src/backend/access/brin/brin_minmax_multi.c > > These renaming changes seem unrelated (but still seem like a good > idea). Should they be progressed separately? I can pull this out into a separate patch. I needed to do it because when I added an `#include ` somewhere, these conflicted with the range_{de,}serialize functions declared there. > I don't understand why a temporal primary key is required for doing > UPDATE FOR PORTION OF. I don't see this in the standard. You're right, it's not in the standard. I'm doing that because creating the PK is when we add the triggers to implement UPDATE FOR PORTION OF. I thought it was acceptable since we also require a PK/unique constraint as the referent of a foreign key. But we could avoid it if I went back to the executor-based FOR PORTION OF implementation, since that doesn't depend on triggers. What do you think? Also: I noticed recently that you can't use FOR PORTION OF against an updatable view. I'm working on a new patch set to fix that. But the main reason is this PK check. So that's maybe another reason to go back to the executor implementation.
RE: Optionally automatically disable logical replication subscriptions on error
On Thursday, January 6, 2022 12:17 PM Tang, Haiying/唐 海英 wrote: > Thanks for updating the patch. Here are some comments: Thank you for your review ! > 1) > + /* > + * We would not be here unless this subscription's disableonerr field > was > + * true when our worker began applying changes, but check whether > that > + * field has changed in the interim. > + */ > + if (!subform->subdisableonerr) > + { > + /* > + * Disabling the subscription has been done already. No need > of > + * additional work. > + */ > + heap_freetuple(tup); > + table_close(rel, RowExclusiveLock); > + CommitTransactionCommand(); > + return; > + } > > I don't understand what does "Disabling the subscription has been done > already" > mean, I think we only run here when subdisableonerr is changed in the interim. > Should we modify this comment? Or remove it because there are already some > explanations before. Removed. The description you pointed out was redundant. > 2) > + /* Set the subscription to disabled, and note the reason. */ > + values[Anum_pg_subscription_subenabled - 1] = > BoolGetDatum(false); > + replaces[Anum_pg_subscription_subenabled - 1] = true; > > I didn't see the code corresponding to "note the reason". Should we modify the > comment? Fixed the comment by removing the part. We come here when an error occurred and the reason is printed as log so no need to note more reason. > 3) > + booldisableonerr; /* Whether errors automatically > disable */ > > This comment is hard to understand. Maybe it can be changed to: > > Indicates if the subscription should be automatically disabled when > subscription workers detect any errors. Agreed. Fixed. Kindly have a look at the attached v16. Best Regards, Takamichi Osumi v16-0001-Optionally-disable-subscriptions-on-error.patch Description: v16-0001-Optionally-disable-subscriptions-on-error.patch
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Wed, Jan 5, 2022 at 9:46 AM Andres Freund wrote: > Hi, > > On 2021-12-29 11:31:51 -0800, Andres Freund wrote: > > That's pretty much the same - XLogInsert() can trigger an > > XLogWrite()/Flush(). > > > > I think it's a complete no-go to add throttling to these places. It's > quite > > possible that it'd cause new deadlocks, and it's almost guaranteed to > have > > unintended consequences (e.g. replication falling back further because > > XLogFlush() is being throttled). > > I thought of another way to implement this feature. What if we checked the > current distance somewhere within XLogInsert(), but only set > InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() > we > check if XLogDelayPending is true and sleep the appropriate time. > > That way the sleep doesn't happen with important locks held / within a > critical section, but we still delay close to where we went over the > maximum > lag. And the overhead should be fairly minimal. > +1 to the idea, this way we can fairly throttle large and smaller transactions the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case. > I'm doubtful that implementing the waits on a transactional level provides > a > meaningful enough amount of control - there's just too much WAL that can be > generated within a transaction. > > > Greetings, > > Andres Freund >
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM wrote: > On Wed, Jan 5, 2022 at 9:46 AM Andres Freund wrote: >> >> Hi, >> >> On 2021-12-29 11:31:51 -0800, Andres Freund wrote: >> > That's pretty much the same - XLogInsert() can trigger an >> > XLogWrite()/Flush(). >> > >> > I think it's a complete no-go to add throttling to these places. It's quite >> > possible that it'd cause new deadlocks, and it's almost guaranteed to have >> > unintended consequences (e.g. replication falling back further because >> > XLogFlush() is being throttled). >> >> I thought of another way to implement this feature. What if we checked the >> current distance somewhere within XLogInsert(), but only set >> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we >> check if XLogDelayPending is true and sleep the appropriate time. >> >> That way the sleep doesn't happen with important locks held / within a >> critical section, but we still delay close to where we went over the maximum >> lag. And the overhead should be fairly minimal. > > > +1 to the idea, this way we can fairly throttle large and small transactions the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case. Generally, while checking for interrupts we should not be holding any LWLock, which means we don't risk holding any buffer locks, so any other reader can continue to read from those buffers. We will only be holding some heavyweight locks, like relation/tuple locks, but those will not impact anyone except writers trying to update the same tuple or DDL that wants to modify the table definition, so I don't think we have any issue here, because we don't want other writers to continue anyway. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
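To illustrate the division of labour being agreed on here (flag-setting at the WAL insertion site, sleeping only at interrupt-processing time), a rough sketch follows. XLogDelayPending and both helper names are assumptions made up for illustration, not code from any posted patch; InterruptPending and pg_usleep() are the existing facilities:

#include "postgres.h"
#include <signal.h>
#include "miscadmin.h"

static volatile sig_atomic_t XLogDelayPending = false;

/* Called from within XLogInsert(): only set flags, never sleep here. */
static void
MaybeRequestXLogThrottle(uint64 replica_lag_bytes, uint64 max_lag_bytes)
{
    if (replica_lag_bytes > max_lag_bytes)
    {
        XLogDelayPending = true;
        InterruptPending = true;
    }
}

/* Called from ProcessInterrupts(), i.e. outside critical sections and LWLocks. */
static void
ProcessXLogThrottle(void)
{
    if (XLogDelayPending)
    {
        XLogDelayPending = false;
        pg_usleep(10000L);      /* 10ms; a real patch would scale this with the lag */
    }
}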
Re: 回复:Re: Re: 回复:Re: Is it worth pushing conditions to sublink/subplan?
I corrected it according to your suggestion. Thanks. Wenjing. Zhihong Yu wrote on Saturday, December 25, 2021 at 02:26: > > > On Thu, Dec 23, 2021 at 3:52 AM 曾文旌(义从) > wrote: > >> >> Fixed a bug found during testing. >> >> >> Wenjing >> >> Hi, > + if (condition_is_safe_pushdown_to_sublink(rinfo, > expr_info->outer)) > + { > + /* replace qual expr from outer var = const to var = const > and push down to sublink query */ > + sublink_query_push_qual(subquery, (Node > *)copyObject(rinfo->clause), expr_info->outer, expr_info->inner); > > Since sublink_query_push_qual() is always guarded > by condition_is_safe_pushdown_to_sublink(), it seems > sublink_query_push_qual() can be folded into > condition_is_safe_pushdown_to_sublink(). > > For generate_base_implied_equalities(): > > + if (ec->ec_processed) > + { > + ec_index++; > + continue; > + } > + else if (list_length(ec->ec_members) > 1) > > Minor comment: the keyword `else` can be omitted (due to `continue` above). > > +* Since there may be an unexpanded sublink in the targetList, > +* we'll skip it for now. > > Since there may be an -> If there is an > > + {"lazy_process_sublink", PGC_USERSET, QUERY_TUNING_METHOD, > + gettext_noop("enable lazy process sublink."), > > Looking at existing examples from src/backend/utils/misc/guc.c, > enable_lazy_sublink_processing seems to be consistent with existing guc > variable naming. > > +lazy_process_sublinks(PlannerInfo *root, bool single_result_rte) > > lazy_process_sublinks -> lazily_process_sublinks > > + else > + { > /* There shouldn't be any OJ info to translate, as yet */ > Assert(subroot->join_info_list == NIL); > > Indentation for the else block is off. > > + if (istop) > + f->quals = preprocess_expression_ext(root, f->quals, > EXPRKIND_QUAL, false); > + else > + f->quals = preprocess_expression_ext(root, f->quals, > EXPRKIND_QUAL, true); > > The above can be written as: > > + f->quals = preprocess_expression_ext(root, f->quals, > EXPRKIND_QUAL, !istop); > > For find_equal_conditions_contain_uplevelvar_in_sublink_query(): > + context.has_unexpected_expr == false && > `!context.has_unexpected_expr` should suffice > > equal_expr_safety_check -> is_equal_expr_safe > > Cheers > > 0001-poc-pushdown-qual-to-sublink-v5.patch Description: Binary data
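As a rough illustration of the class of queries this optimization targets (the schema and query below are hypothetical, not taken from the patch): when the outer relation carries an equality on the correlation column, the same constant can be pushed into the sublink's subquery, so the subquery scans far fewer rows.

    -- Hypothetical tables for illustration only.
    CREATE TABLE orders   (customer_id int, order_total numeric);
    CREATE TABLE payments (customer_id int, amount numeric);

    -- The outer qual o.customer_id = 42 combines with the correlation
    -- p.customer_id = o.customer_id, so p.customer_id = 42 could be pushed
    -- into the sublink's subquery instead of aggregating all customers.
    SELECT *
    FROM   orders o
    WHERE  o.customer_id = 42
      AND  o.order_total = (SELECT sum(p.amount)
                            FROM   payments p
                            WHERE  p.customer_id = o.customer_id);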
Re: sqlsmith: ERROR: XX000: bogus varno: 2
On Tue, Dec 21, 2021 at 6:20 AM Tom Lane wrote: > Robert Haas writes: > > OK ... but my point is that dump and restore does work. So whatever > > cases pg_get_expr() doesn't work must be cases that aren't needed for > > that to happen. Otherwise this problem would have been found long ago. > > pg_get_expr doesn't (or didn't) depend on expression_tree_walker, > so there wasn't a problem there before. I am worried that there > might be other code paths, now or in future, that could try to apply > expression_tree_walker/mutator to relpartbound trees, which is > why I think it's a reasonable idea to teach them about such trees. > > > I realize that's a deep hole out of which we're unlikely to be able to > > climb in the short or even medium term, but we don't have to keep > > digging. We either make a rule that pg_get_expr() can apply to > > everything stored in the catalogs and produce sensible answers, which > > seems to be what you prefer, or we make it return nice errors for the > > cases that it can't handle nicely, or some combination of the two. And > > whatever we decide, we also document and enforce everywhere. > > I think having pg_get_expr throw an error for a query, as opposed to an > expression, is fine. What I don't want to do is subdivide things a lot > more finely than that; thus lumping "relpartbound" into "expression" > seems like a reasonable thing to do. Especially since we already did it > six years ago. I admit that it was an oversight on my part that relpartbound trees are not recognized by nodeFuncs.c. :-( Thanks for addressing that in the patch you posted. I guess fixing only expression_tree_walker/mutator() suffices for now, but curious to know if it was intentional that you decided not to touch the following sites: exprCollation(): it would perhaps make sense to return the collation assigned to the 1st element of listdatums/lowerdatums/upperdatums, especially given that transformPartitionBoundValue() does assign a collation to the values in those lists based on the parent's partition key specification. exprType(): could be handled similarly queryjumble.c: JumbleExpr(): whose header comment says: * expression_tree_walker() does, and therefore it's coded to be as parallel * to that function as possible. * ... * Note: the reason we don't simply use expression_tree_walker() is that the * point of that function is to support tree walkers that don't care about * most tree node types, but here we care about all types. We should complain * about any unrecognized node type. or maybe not, because relpartbound contents ought never reach queryjumble.c? -- Amit Langote EDB: http://www.enterprisedb.com
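For concreteness, handling relpartbound trees in expression_tree_walker() takes roughly the following shape; this is only a sketch of the idea, not the committed change, and the walker-on-list pattern mirrors what nodeFuncs.c already does for other list-valued fields:

    case T_PartitionBoundSpec:
        {
            PartitionBoundSpec *pbs = (PartitionBoundSpec *) node;

            /* the datum lists are the expression-bearing fields */
            if (walker((Node *) pbs->listdatums, context))
                return true;
            if (walker((Node *) pbs->lowerdatums, context))
                return true;
            if (walker((Node *) pbs->upperdatums, context))
                return true;
        }
        break;
    case T_PartitionRangeDatum:
        {
            PartitionRangeDatum *prd = (PartitionRangeDatum *) node;

            if (walker(prd->value, context))
                return true;
        }
        break;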
Re: generic plans and "initial" pruning
On Fri, Dec 31, 2021 at 7:56 AM Amit Langote wrote: > > On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat > wrote: >> >> On Sat, Dec 25, 2021 at 9:06 AM Amit Langote wrote: >> > >> > Executing generic plans involving partitions is known to become slower >> > as partition count grows due to a number of bottlenecks, with >> > AcquireExecutorLocks() showing at the top in profiles. >> > >> > Previous attempt at solving that problem was by David Rowley [1], >> > where he proposed delaying locking of *all* partitions appearing under >> > an Append/MergeAppend until "initial" pruning is done during the >> > executor initialization phase. A problem with that approach that he >> > has described in [2] is that leaving partitions unlocked can lead to >> > race conditions where the Plan node belonging to a partition can be >> > invalidated when a concurrent session successfully alters the >> > partition between AcquireExecutorLocks() saying the plan is okay to >> > execute and then actually executing it. >> > >> > However, using an idea that Robert suggested to me off-list a little >> > while back, it seems possible to determine the set of partitions that >> > we can safely skip locking. The idea is to look at the "initial" or >> > "pre-execution" pruning instructions contained in a given Append or >> > MergeAppend node when AcquireExecutorLocks() is collecting the >> > relations to lock and consider relations from only those sub-nodes >> > that survive performing those instructions. I've attempted >> > implementing that idea in the attached patch. >> > >> >> In which cases, we will have "pre-execution" pruning instructions that >> can be used to skip locking partitions? Can you please give a few >> examples where this approach will be useful? > > > This is mainly to be useful for prepared queries, so something like: > > prepare q as select * from partitioned_table where key = $1; > > And that too when execute q(…) uses a generic plan. Generic plans are > problematic because it must contain nodes for all partitions (without any > plan time pruning), which means CheckCachedPlan() has to spend time > proportional to the number of partitions to determine that the plan is still > usable / has not been invalidated; most of that is AcquireExecutorLocks(). > > Other bottlenecks, not addressed in this patch, pertain to some executor > startup/shutdown subroutines that process the range table of a PlannedStmt in > its entirety, whose length is also proportional to the number of partitions > when the plan is generic. > >> The benchmark is showing good results, indeed. > Indeed. Here are few comments for v1 patch: + /* Caller error if we get here without contains_init_steps */ + Assert(pruneinfo->contains_init_steps); - prunedata = prunestate->partprunedata[i]; - pprune = &prunedata->partrelprunedata[0]; - /* Perform pruning without using PARAM_EXEC Params */ - find_matching_subplans_recurse(prunedata, pprune, true, &result); + if (parentrelids) + *parentrelids = NULL; You got two blank lines after Assert. -- + /* Set up EState if not in the executor proper. */ + if (estate == NULL) + { + estate = CreateExecutorState(); + estate->es_param_list_info = params; + free_estate = true; } ... [Skip] + if (free_estate) + { + FreeExecutorState(estate); + estate = NULL; } I think this work should be left to the caller. -- /* * Stuff that follows matches exactly what ExecCreatePartitionPruneState() * does, except we don't need a PartitionPruneState here, so don't call * that function. * * XXX some refactoring might be good. 
*/ +1, while doing it would be nice if foreach_current_index() is used instead of the i & j sequence in the respective foreach() block, IMO. -- + while ((i = bms_next_member(validsubplans, i)) >= 0) + { + Plan *subplan = list_nth(subplans, i); + + context->relations = + bms_add_members(context->relations, + get_plan_scanrelids(subplan)); + } I think instead of get_plan_scanrelids() the GetLockableRelations_worker() can be used; if so, then no need to add get_plan_scanrelids() function. -- /* Nodes containing prunable subnodes. */ + case T_MergeAppend: + { + PlannedStmt *plannedstmt = context->plannedstmt; + List *rtable = plannedstmt->rtable; + ParamListInfo params = context->params; + PartitionPruneInfo *pruneinfo; + Bitmapset *validsubplans; + Bitmapset *parentrelids; ... if (pruneinfo && pruneinfo->contains_init_steps) { int i; ... return false; } } break; Most of the declarations need t
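For readers who want to reproduce the bottleneck being discussed, a minimal setup could look like the following; plan_cache_mode is only used here to force the generic-plan path, and the table and partition names are illustrative:

    -- A partitioned table with many partitions makes the effect visible.
    CREATE TABLE partitioned_table (key int, val text) PARTITION BY HASH (key);
    DO $$
    BEGIN
      FOR i IN 0..999 LOOP
        EXECUTE format(
          'CREATE TABLE partitioned_table_%s PARTITION OF partitioned_table
             FOR VALUES WITH (MODULUS 1000, REMAINDER %s)', i, i);
      END LOOP;
    END $$;

    PREPARE q AS SELECT * FROM partitioned_table WHERE key = $1;

    -- Force the generic plan so that CheckCachedPlan()/AcquireExecutorLocks()
    -- must process a plan that still contains every partition.
    SET plan_cache_mode = force_generic_plan;
    EXECUTE q(42);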
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Wed, Jan 5, 2022 at 10:05 PM Dilip Kumar wrote: > On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM > wrote: > > > On Wed, Jan 5, 2022 at 9:46 AM Andres Freund wrote: > >> > >> Hi, > >> > >> On 2021-12-29 11:31:51 -0800, Andres Freund wrote: > >> > That's pretty much the same - XLogInsert() can trigger an > >> > XLogWrite()/Flush(). > >> > > >> > I think it's a complete no-go to add throttling to these places. It's > quite > >> > possible that it'd cause new deadlocks, and it's almost guaranteed to > have > >> > unintended consequences (e.g. replication falling back further because > >> > XLogFlush() is being throttled). > >> > >> I thought of another way to implement this feature. What if we checked > the > >> current distance somewhere within XLogInsert(), but only set > >> InterruptPending=true, XLogDelayPending=true. Then in > ProcessInterrupts() we > >> check if XLogDelayPending is true and sleep the appropriate time. > >> > >> That way the sleep doesn't happen with important locks held / within a > >> critical section, but we still delay close to where we went over the > maximum > >> lag. And the overhead should be fairly minimal. > > > > > > +1 to the idea, this way we can fairly throttle large and smaller > transactions the same way. I will work on this model and share the patch. > Please note that the lock contention still exists in this case. > > Generally while checking for the interrupt we should not be holding > any lwlock that means we don't have the risk of holding any buffer > locks, so any other reader can continue to read from those buffers. > We will only be holding some heavyweight locks like relation/tuple > lock but that will not impact anyone except the writers trying to > update the same tuple or the DDL who want to modify the table > definition so I don't think we have any issue here because anyway we > don't want other writers to continue. > Yes, it should be ok. I was just making it explicit on Andres' previous comment on holding the heavyweight locks when wait before the commit. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
Re: pl/pgsql feature request: shorthand for argument and local variable references
Hi Pavel, On Wed, Mar 24, 2021 at 08:09:13AM +0100, Pavel Stehule wrote: > > I am continuing to talk with Hannu about syntax. I think so using the > compile option is a good direction, but maybe renaming from "routine_label" > to "topscope_label" (proposed by Hannu) or maybe "outerscope_label" can be > a good idea. It's not clear to me what is the status of this patch. You sent some updated patch since with some typo fixes but I don't see any conclusion about the direction this patch should go and/or how to name the new option. Should the patch be reviewed or switched to Waiting on Author?
Re: a misbehavior of partition row movement (?)
Hi Alvaro, On Thu, Jan 6, 2022 at 7:27 AM Alvaro Herrera wrote: > Pushed 0001. Thank you. > I had to adjust the query used in pg_dump; you changed the attribute > name in the query used for pg15, but left unchanged the one for older > branches, so pg_dump failed for all branches other than 15. Oops, should've tested that. > Also, > psql's describe.c required a small tweak to a version number test. Ah, yes -- 15, not 14 anymore. > https://github.com/alvherre/postgres/commit/3451612e0fa082d3ec953551c6d25432bd725502 > > Thanks! What was 0002 is attached, to keep cfbot happy. It's identical > to your v11-0002. > > I have pushed it thinking that we would not backpatch any of this fix. > However, after running the tests and realizing that I didn't need an > initdb for either patch, I wonder if maybe the whole series *is* > backpatchable. Interesting thought. We do lack help from trigger.c in the v12 branch in that there's no Trigger.tgisclone, which is used in a couple of places in the fix. I haven't checked how big of a deal it would be to back-port Trigger.tgisclone to v12, but maybe that's doable. > There is one possible problem, which is that psql and pg_dump would need > testing to verify that they work decently (i.e. no crash, no > misbehavior) with partitioned tables created with the original code. I suppose you mean checking if the psql and pg_dump after applying *0001* work sanely with partitioned tables defined without 0001? Will test that. > But there are few ABI changes, maybe we can cope and get all branches > fixes instead of just 15. > > What do you think? Yeah, as long as triggers are configured as required by the fix, and that would be ensured if we're able to back-patch 0001 down to v12, I suppose it would be nice to get this fixed down to v12. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Emit "checkpoint skipped because system is idle" message at LOG level if log_checkpoints is set
At Wed, 5 Jan 2022 17:18:06 -0600, Justin Pryzby wrote in > On Wed, Jan 05, 2022 at 10:45:06AM +0530, Dilip Kumar wrote: > > On Wed, Jan 5, 2022 at 10:24 AM Bharath Rupireddy > > wrote: > > > > > > Postgres server emits a message at DEBUG1 level when it skips a > > > checkpoint. At times, developers might be surprised after figuring out > > > from server logs that there were no checkpoints happening at all > > > during a certain period of time when DEBUG1 messages aren't captured. > > > How about emitting the message at LOG level if log_checkpoints is set? > > > Patch attached. > > > > +1 to convert to LOG when log_checkpoints is set. > > I think it would be odd to write logs of increased severity, for the case where > we did not do anything. I think it really is a debug log. > > I don't think the log level should be changed to avoid "developer" confusion, > as you said (I'm not sure if you mean a postgres developer or an application > developer, though). > > Is there any evidence that this has caused user confusion in the last 4 years ? > > |commit 6ef2eba3f57f17960b7cd4958e18aa79e357de2f > |Author: Andres Freund > |Date: Thu Dec 22 11:31:50 2016 -0800 > | > |Skip checkpoints, archiving on idle systems. > > Note that logging a message may not be benign ; I think it could cause the > disks to spin up, that would othewise have been in power saving mode, > especially if you log to syslog, which can issue fsync. Also, part of the > argument for enabling log_checkpoint by default was that a small, quiescent > instance would not write logs every 5 minutes. Agreed. -1 to just raising the elevel of the message. If someone is keen to see certain debug messages, that is already achievable for arbitrary messages by lowering log_min_messages and then installing a custom filter as emit_log_hook. That adds some overhead to processes that don't care about the message, but the overhead would be avoidable with a *bit* of a dirty hack in the filter. We might want to discuss a more convenient or cleaner way to get the same result. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
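To make that suggestion concrete, a minimal sketch of such a filter module might look like the following; the module and function names are hypothetical, and the string matched against depends on the exact wording of the DEBUG1 message in the server version at hand:

    /* Hypothetical emit_log_hook filter -- names are illustrative only. */
    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/elog.h"

    PG_MODULE_MAGIC;

    void _PG_init(void);

    static emit_log_hook_type prev_emit_log_hook = NULL;

    static void
    checkpoint_skip_filter(ErrorData *edata)
    {
        /*
         * With log_min_messages lowered to DEBUG1, suppress all sub-LOG
         * messages except the "checkpoint skipped" one we want to see.
         */
        if (edata->elevel < LOG &&
            (edata->message == NULL ||
             strstr(edata->message, "checkpoint skipped") == NULL))
            edata->output_to_server = false;

        if (prev_emit_log_hook)
            prev_emit_log_hook(edata);
    }

    void
    _PG_init(void)
    {
        prev_emit_log_hook = emit_log_hook;
        emit_log_hook = checkpoint_skip_filter;
    }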
Re: In-placre persistance change of a relation
At Wed, 05 Jan 2022 20:42:32 -0800, Andres Freund wrote in > Hi, > > On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi > wrote: > >I'm still not sure the reason the test item failed but I repost the > >updated version then watch what the CI says. > > Fwiw, you can now test the same way as cfbot does with a lower turnaround > time, as explained in src/tools/ci/README Fantastic! I'll give it a try. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Make relfile tombstone files conditional on WAL level
On Thu, Jan 6, 2022 at 3:07 AM Robert Haas wrote: > > On Mon, Aug 2, 2021 at 6:38 PM Andres Freund wrote: > > I guess there's a somewhat hacky way to get somewhere without actually > > increasing the size. We could take 3 bytes from the fork number and use that > > to get to a 7 byte relfilenode portion. 7 bytes are probably enough for > > everyone. > > > > It's not like we can use those bytes in a useful way, due to alignment > > requirements. Declaring that the high 7 bytes are for the relNode portion > > and > > the low byte for the fork would still allow efficient comparisons and > > doesn't > > seem too ugly. > > I think this idea is worth more consideration. It seems like 2^56 > relfilenodes ought to be enough for anyone, recalling that you can > only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a > bunch of code that is there to guard against relfilenodes being > reused. In particular, we can remove the code that leaves a 0-length > tombstone file around until the next checkpoint to guard against > relfilenode reuse. +1 > > I think this would also solve a problem Dilip mentioned to me today: > suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's > been trying to do. Then suppose you do "ALTER DATABASE foo SET > TABLESPACE used_recently_but_not_any_more". You might get an error > complaining that “some relations of database \“%s\” are already in > tablespace \“%s\“” because there could be tombstone files in that > database. With this combination of changes, you could just use the > barrier mechanism from https://commitfest.postgresql.org/36/2962/ to > wait for those files to disappear, because they've got to be > previously-unliked files that Windows is still returning because > they're still opening -- or else they could be a sign of a corrupted > database, but there are no other possibilities. Yes, this approach will solve the problem for the WAL-logged ALTER DATABASE we are facing. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
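Purely to illustrate the bit layout described above (this is not code from any patch): with the high 7 bytes holding the relfilenode counter and the low byte the fork number, comparisons stay a single 8-byte operation, and 2^56 relfilenodes can be allocated before wrapping.

    #include <stdint.h>

    /* Hypothetical packed representation: 56-bit relfilenode + 8-bit fork. */
    typedef uint64_t PackedRelFork;

    static inline PackedRelFork
    pack_relnode_fork(uint64_t relnode, uint8_t forknum)
    {
        /* relnode is assumed to fit in 56 bits */
        return (relnode << 8) | forknum;
    }

    static inline uint64_t
    packed_relnode(PackedRelFork packed)
    {
        return packed >> 8;
    }

    static inline uint8_t
    packed_fork(PackedRelFork packed)
    {
        return (uint8_t) (packed & 0xFF);
    }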
Re: Support tab completion for upper character inputs in psql
On 10.09.21 15:50, tanghy.f...@fujitsu.com wrote: (A test case for the enum case should be doable easily.) Test added. The enum test is failing on *some* platforms: t/010_tab_completion.pl .. 26/? # Failed test 'complete enum values' # at t/010_tab_completion.pl line 211. # Actual output was "ALTER TYPE mytype1 RENAME VALUE '\a\r\n'BLUE' 'bLACK' 'green' \r\npostgres=# ALTER TYPE mytype1 RENAME VALUE '" # Did not match "(?^:'bLACK' + 'BLUE' + 'green')" So the ordering of the suggested completions is different. I don't know offhand how that ordering is determined. Perhaps it's dependent on locale, readline version, or operating system. In any case, we need to figure this out to make this test stable.
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Thu, Jan 6, 2022 at 8:02 AM, Julien Rouhaud wrote: > Hi Pavel, > > On Wed, Mar 24, 2021 at 08:09:13AM +0100, Pavel Stehule wrote: > > > > I am continuing to talk with Hannu about syntax. I think so using the > > compile option is a good direction, but maybe renaming from > "routine_label" > > to "topscope_label" (proposed by Hannu) or maybe "outerscope_label" can > be > > a good idea. > > It's not clear to me what is the status of this patch. You sent some > updated > patch since with some typo fixes but I don't see any conclusion about the > direction this patch should go and/or how to name the new option. > > Should the patch be reviewed or switched to Waiting on Author? > I am not sure there is enough agreement on, or enough necessity for, this feature. In this discussion there were more general questions about the future development of plpgsql (about possible configurations and compiler parametrizations, and how to switch the configuration at the global and local levels). This is not fully resolved yet, which is probably the reason why the discussion about this specific (and very simple) feature has not reached a clear end. Generally, this is a recurring problem in Postgres. Personally, I don't want to write a complicated patch for this relatively trivial (and not too important) feature; of course, that means it cannot answer the more general questions. I think the plpgsql option can work well, maybe with the keyword "top_label" or "root_label". Some enhancement of the DECLARE statement could probably work too (if it is done well), but it requires more work, because the DECLARE statement has block scope, and then the alias of the top label would have to have block scope too. Theoretically we could expose it as a GUC as well, like plpgsql.variable_conflict. This simple design can work, in my opinion, but I understand that it is not a general answer for people who want more control over the behavior of plpgsql. I think that is a different issue and should be solved separately (and it is a pretty complex problem, because Postgres together with PL/pgSQL mixes different access models - access rights, search path, local scope - and we lack possible parametrization at the schema level, which would make the most sense for PL/pgSQL). Regards Pavel
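To show the shape of the proposal, here is a sketch under the assumption that the compile option ends up spelled #routine_label (or top_label / root_label as discussed above); none of this syntax is committed, and the function itself is purely illustrative:

    CREATE FUNCTION some_rather_long_function_name(param1 int, val int)
    RETURNS int AS $$
    #routine_label r   -- hypothetical: the proposed option, final name not settled
    DECLARE
        val int := 10;  -- shadows the parameter of the same name
    BEGIN
        -- Today the parameter must be qualified with the full routine name:
        --   some_rather_long_function_name.val
        -- With the proposal, the short alias refers to the top scope instead:
        RETURN r.param1 + r.val;
    END;
    $$ LANGUAGE plpgsql;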