Re: Parallel copy
Thanks Greg for the testing.

On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow wrote:
> > 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>

From the logs, I see that it happens only with the default postgresql.conf, and the table insertion times are inconsistent, especially between the 1st and 2nd runs; the variation in table insertion time is also larger. This is expected with the default postgresql.conf because of interference from background processes. That's the reason we usually run with a custom configuration, to measure the performance gain correctly.

br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG: totaltableinsertiontime = 155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG: totaltableinsertiontime = 42096.275 ms
2020-09-23 22:37:39.192 JST [43097] LOG: totaltableinsertiontime = 29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG: totaltableinsertiontime = 38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG: totaltableinsertiontime = 27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG: totaltableinsertiontime = 28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG: totaltableinsertiontime = 30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG: totaltableinsertiontime = 24624.045 ms

br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG: totaltableinsertiontime = 192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG: totaltableinsertiontime = 30999.942 ms
2020-09-24 11:15:50.381 JST [108935] LOG: totaltableinsertiontime = 31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG: totaltableinsertiontime = 31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG: totaltableinsertiontime = 26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG: totaltableinsertiontime = 66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG: totaltableinsertiontime = 38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG: totaltableinsertiontime = 28286.634 ms

br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG: totaltableinsertiontime = 20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG: totaltableinsertiontime = 23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG: totaltableinsertiontime = 19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG: totaltableinsertiontime = 21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG: totaltableinsertiontime = 23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG: totaltableinsertiontime = 21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG: totaltableinsertiontime = 25254.563 ms

br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG: totaltableinsertiontime = 16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG: totaltableinsertiontime = 15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG: totaltableinsertiontime = 16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG: totaltableinsertiontime = 15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG: totaltableinsertiontime = 16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG: totaltableinsertiontime = 15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG: totaltableinsertiontime = 17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG: totaltableinsertiontime = 15352.753 ms

> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>

Good to know :)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration
On Wed, Sep 23, 2020 at 08:11:59AM +0200, Peter Eisentraut wrote:
> This patch mixes up unsigned int and uint32 in random ways. The variable is
> uint32, but the format is %u and the max constant is UINT_MAX.
>
> I think just use unsigned int as the variable type. There is no need to use
> the bit-exact types. Note that the argument of alarm() is of type unsigned
> int.

Makes sense, thanks.
--
Michael

diff --git a/src/bin/pg_test_fsync/.gitignore b/src/bin/pg_test_fsync/.gitignore
index f3b5932498..5eb5085f45 100644
--- a/src/bin/pg_test_fsync/.gitignore
+++ b/src/bin/pg_test_fsync/.gitignore
@@ -1 +1,3 @@
 /pg_test_fsync
+
+/tmp_check/
diff --git a/src/bin/pg_test_fsync/Makefile b/src/bin/pg_test_fsync/Makefile
index 7632c94eb7..c4f9ae0664 100644
--- a/src/bin/pg_test_fsync/Makefile
+++ b/src/bin/pg_test_fsync/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_fsync$(X)'
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 6e47293123..c66492eba4 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -5,6 +5,7 @@
 
 #include "postgres_fe.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -62,7 +63,7 @@ do { \
 
 static const char *progname;
 
-static int	secs_per_test = 5;
+static unsigned int secs_per_test = 5;
 static int	needs_unlink = 0;
 static char full_buf[DEFAULT_XLOG_SEG_SIZE],
 		   *buf,
@@ -148,6 +149,8 @@ handle_args(int argc, char *argv[])
 	int			option;			/* Command line option */
 	int			optindex = 0;	/* used by getopt_long */
+	unsigned long optval;		/* used for option parsing */
+	char	   *endptr;
 
 	if (argc > 1)
 	{
@@ -173,7 +176,24 @@ handle_args(int argc, char *argv[])
 				break;
 
 			case 's':
-				secs_per_test = atoi(optarg);
+				errno = 0;
+				optval = strtoul(optarg, &endptr, 10);
+
+				if (endptr == optarg || *endptr != '\0' ||
+					errno != 0 || optval != (unsigned int) optval)
+				{
+					pg_log_error("invalid argument for option %s", "--secs-per-test");
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+
+				secs_per_test = (unsigned int) optval;
+				if (secs_per_test == 0)
+				{
+					pg_log_error("%s must be in range %u..%u",
+								 "--secs-per-test", 1, UINT_MAX);
+					exit(1);
+				}
 				break;
 
 			default:
@@ -193,8 +213,8 @@ handle_args(int argc, char *argv[])
 		exit(1);
 	}
 
-	printf(ngettext("%d second per test\n",
-					"%d seconds per test\n",
+	printf(ngettext("%u second per test\n",
+					"%u seconds per test\n",
 					secs_per_test),
 		   secs_per_test);
 
 #if PG_O_DIRECT != 0
diff --git a/src/bin/pg_test_fsync/t/001_basic.pl b/src/bin/pg_test_fsync/t/001_basic.pl
new file mode 100644
index 00..fe9c295c49
--- /dev/null
+++ b/src/bin/pg_test_fsync/t/001_basic.pl
@@ -0,0 +1,25 @@
+use strict;
+use warnings;
+
+use Config;
+use TestLib;
+use Test::More tests => 12;
+
+#
+# Basic checks
+
+program_help_ok('pg_test_fsync');
+program_version_ok('pg_test_fsync');
+program_options_handling_ok('pg_test_fsync');
+
+#
+# Test invalid option combinations
+
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', 'a' ],
+	qr/\Qpg_test_fsync: error: invalid argument for option --secs-per-test\E/,
+	'pg_test_fsync: invalid argument for option --secs-per-test');
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', '0' ],
+	qr/\Qpg_test_fsync: error: --secs-per-test must be in range 1..4294967295\E/,
+	'pg_test_fsync: --secs-per-test must be in range');
diff --git a/src/bin/pg_test_timing/.gitignore b/src/bin/pg_test_timing/.gitignore
index f6c664c765..e5aac2ab12 100644
--- a/src/bin/pg_test_timing/.gitignore
+++ b/src/bin/pg_test_timing/.gitignore
@@ -1 +1,3 @@
 /pg_test_timing
+
+/tmp_check/
diff --git a/src/bin/pg_test_timing/Makefile b/src/bin/pg_test_timing/Makefile
index 334d6ff5c0..52994b4103 100644
--- a/src/bin/pg_test_timing/Makefile
+++ b/src/bin/pg_test_timing/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_timing$(X)'
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e14802372b..894bb1d0d4 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -6,15 +6,17 @@
 
 #include "postgres_fe.h"
 
+#include 
+
 #include "getopt_long.h"
 #include "portability/instr_time.h"
 
 static const char *progname;
 
-static int32	test_duration = 3;
+static unsigned int test_duration = 3;
 
 static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(int32);
+static uint6
Re: Online checksums patch - once again
> On 24 Sep 2020, at 06:27, Michael Paquier wrote:
>
> On Wed, Sep 23, 2020 at 02:34:36PM +0200, Daniel Gustafsson wrote:
>> While in the patch I realized that the relationlist saved the relkind but the
>> code wasn't actually using it, so I've gone ahead and removed that with a lot
>> fewer palloc calls as a result. The attached v22 fixes that and the above.
>
> Some of the TAP tests are blowing up here, as the CF bot is telling:
> t/003_standby_checksum.pl .. 1/11 # Looks like you planned 11 tests but ran 4.
> # Looks like your test exited with 29 just after 4.
> t/003_standby_checksum.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

Interesting, I've been unable to trigger a fault and the latest Travis build was green. I'll continue to try and see if I can shake something loose.

cheers ./daniel
Re: [PATCH] Add features to pg_stat_statements
On 2020-09-24 14:15, legrand legrand wrote:
> Hi,
>
> Both are probably not needed. I would prefer it accessible in a view like:
>
> Event | date | victims
> Eviction ...
> Reset ...
> Part reset ...
>
> as there are other needs to track reset times.

Let me confirm one thing. Is the number of records in this view fixed at three? Or will a new record be appended every time an event (Eviction, Reset, or Part Reset) happens?

Regards,
Katsuragi Yuta
Re: A little enhancement for hashdisk testset
> On 24 Sep 2020, at 07:45, Hou, Zhijie wrote:
> I found some tables are not dropped at the end of the sql script.
> It does not cause any problem, but I think it's better to drop the tables in time.

There is value in keeping a representative set of objects around after regression tests, as the database which is left behind after a regression run is used as the input to the pg_upgrade tests. That being said, I don't think that was the intention for these relations in commit 92c58fd94801, but in general we should avoid cleaning up too much, to ensure that we stress pg_upgrade well enough (there is a delicate balance to keep test runtime short as well, of course).

cheers ./daniel
Re: Reply: how to create index concurrently on partitioned table
On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:
> Anyway, for now this is rebased on 83158f74d.

I have not thought yet about all the details of CIC and DIC and what you said upthread, but I have gone through CLUSTER for now, as a start. So, the goal of the patch, as I would define it, is to give a way for CLUSTER to work on a partitioned table using a given partitioned index. Hence, we would perform CLUSTER automatically from the top-most parent for any partitions that have an index in the same partition tree as the partitioned index defined in USING, switching indisclustered automatically depending on the index used.

+CREATE TABLE clstrpart3 PARTITION OF clstrpart DEFAULT PARTITION BY RANGE(a);
+CREATE TABLE clstrpart33 PARTITION OF clstrpart3 DEFAULT;
 CREATE INDEX clstrpart_idx ON clstrpart (a);
-ALTER TABLE clstrpart CLUSTER ON clstrpart_idx

So.. For any testing of partitioned trees, we should be careful to check whether relfilenodes have changed as part of an operation, to see if the operation has actually done something. From what I can see, attempting a CLUSTER on a top-most partitioned table fails to work on the child tables, but isn't the goal of the patch to make sure that if we attempt the operation on a partitioned table using a partitioned index, then the operation is also done on the partitions whose partition index is part of the same partition tree as the parent partitioned index?

If using CLUSTER on a new partitioned index with USING, it seems to me that we may want to check in the regression tests that indisclustered is correctly set for all the partitions concerned. It would also be good to check the case of a partition index tree that maps only partially onto the partition table tree (aka not all table partitions have a partition index), where those partitions don't get clustered because there is no index to work on.

+ MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+ inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ MemoryContextSwitchTo(old_context);

Er, isn't that incorrect? I would have assumed that what should be saved in the context of cluster is the list of RelToCluster items, and we should not do the lookup of the partitions in a different context, because this may allocate things we don't really need to keep around in the context dedicated to CLUSTER. Isn't NoLock unsafe here, even knowing that an exclusive lock is taken on the parent? It seems to me that at least schema changes should be prevented with a ShareLock in the first transaction building the list of elements to cluster.

+ /*
+  * We have a full list of direct and indirect children, so skip
+  * partitioned tables and just handle their children.
+  */
+ if (get_rel_relkind(relid) == RELKIND_PARTITIONED_TABLE)
+     continue;

It would be better to use RELKIND_HAS_STORAGE here.

All this stuff needs to be documented clearly.
--
Michael
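For illustration, the memory-context and lock points above amount to something like the following shape. This is only a sketch: the lock level, the variable names, the rtcs result list, and the use of IndexGetRelation() are assumptions for the example, not the patch's actual code.

	List	   *inhoids = find_all_inheritors(indexOid, ShareLock, NULL);
	ListCell   *lc;

	foreach(lc, inhoids)
	{
		Oid			partidxoid = lfirst_oid(lc);
		Oid			parttbloid = IndexGetRelation(partidxoid, false);
		RelToCluster *rtc;
		MemoryContext old_context;

		/* Skip relations without storage, e.g. partitioned tables. */
		if (!RELKIND_HAS_STORAGE(get_rel_relkind(parttbloid)))
			continue;

		/* Copy only the RelToCluster entries into the long-lived context. */
		old_context = MemoryContextSwitchTo(cluster_context);
		rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
		rtc->tableOid = parttbloid;
		rtc->indexOid = partidxoid;
		rtcs = lappend(rtcs, rtc);
		MemoryContextSwitchTo(old_context);
	}

This way the partition-tree lookup itself is done in the caller's short-lived context, and only the list that CLUSTER actually needs to keep survives in cluster_context.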
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada
> So with your idea, I think we require FDW developers to not call
> ereport(ERROR) as much as possible. If they need to use a function
> including palloc, lappend etc that could call ereport(ERROR), they
> need to use PG_TRY() and PG_CATCH() and return the control along with
> the error message to the transaction manager rather than raising an
> error. Then the transaction manager will emit the error message at an
> error level lower than ERROR (e.g., WARNING), and call the commit/rollback
> API again. But normally we do some cleanup on error, and in this case
> the retried commit/rollback is performed without any cleanup. Is that
> right? I'm not sure it's safe though.

Yes. It's legitimate to require the FDW commit routine to return control, because the prepare phase of 2PC is a promise to commit successfully. The second-phase commit should avoid doing anything that could fail. For example, if some memory is needed for commit, it should be allocated during prepare or earlier.

> IIUC aggregation functions can be pushed down to the foreign server
> but I have no idea whether a normal UDF in the select list is pushed down.
> I wonder if it isn't.

Oh, that's the current situation. Understood. I thought the UDF call is also pushed down, as I saw Greenplum does so. (Reading the manual, Greenplum disallows data updates in a UDF when it's executed on a remote segment server.)

(Aren't we overlooking something else that updates data on the remote server while the local server is unaware?)

Regards
Takayuki Tsunakawa
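For illustration only, the kind of commit callback being described could be shaped roughly like this. remote_commit_prepared() and the boolean-return convention are placeholders, not an existing API, and per the point above any memory the commit path needs should already have been allocated during prepare:

#include "postgres.h"
#include "foreign/foreign.h"
#include "utils/memutils.h"

static bool
example_commit_foreign_xact(ForeignServer *server, UserMapping *user,
							char **errmsg_out)
{
	MemoryContext oldcontext = CurrentMemoryContext;
	bool		ok = true;

	PG_TRY();
	{
		/* Issue COMMIT PREPARED on the remote server (placeholder helper). */
		remote_commit_prepared(server, user);
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		/* Capture and clear the error so control returns to the caller. */
		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();

		*errmsg_out = edata->message;
		ok = false;
	}
	PG_END_TRY();

	/* The transaction manager can report this at WARNING and retry later. */
	return ok;
}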
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, September 24, 2020 1:27 PM, Tsunakawa-san wrote:
> (1)
> + for (cur_blk = firstDelBlock[j]; cur_blk < nblocks; cur_blk++)
>
> The right side of "cur_blk <" should not be nblocks, because nblocks is not
> the number of blocks in the relation fork anymore.

Right. Fixed. It should be the total number of blocks of the relation fork.

> (2)
> + BlockNumber nblocks;
> + nblocks = smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];
>
> You should either:
>
> * Combine the two lines into one: BlockNumber nblocks = ...;
>
> or
>
> * Put an empty line between the two lines to separate declarations and
> execution statements.

Right. I separated them in the updated patch. And to prevent confusion, instead of nblocks, nTotalBlocks and nBlocksToInvalidate are used:

/* Get the total number of blocks for the supplied relation's fork */
nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

/* Get the total number of blocks to be invalidated for the specified fork */
nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];

> After correcting these, I think you can check the recovery performance.

I'll send performance measurement results in the next email. Thanks a lot for the reviews!

Regards,
Kirk Jamison

Attachments:
v16-Optimize-DropRelFileNodeBuffers-during-recovery.patch
v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Re: [Patch] Optimize dropping of relation buffers using dlist
Hello.

At Wed, 23 Sep 2020 05:37:24, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Jamison, Kirk/ジャミソン カーク

# Wow. I'm surprised to read it..

> > I revised the patch based on my understanding of Horiguchi-san's comment,
> > but I could be wrong.
> > Quoting:
> >
> > "
> > + /* Get the number of blocks for the supplied relation's fork */
> > + nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
> > + Assert(BlockNumberIsValid(nblocks));
> > +
> > + if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
> >
> > As mentioned upthread, the criterion for whether we do full-scan or
> > lookup-drop is how large a portion of NBuffers this relation-drop is
> > going to invalidate. So the nblocks above should be the sum of the number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks. Then once we decide to do the lookup-drop method, we
> > do that for all forks."
>
> One takeaway from Horiguchi-san's comment is to use the number of blocks to
> invalidate for comparison, instead of all blocks in the fork. That is, use
>
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
>
> Does this make sense?
>
> What do you think is the reason for summing up all forks? I didn't
> understand why. Typically, FSM and VM forks are very small. If the main
> fork is larger than NBuffers / 500, then v14 scans the entire shared buffers
> for the FSM and VM forks as well as the main fork, resulting in three scans
> in total.

I thought of summing up smgrnblocks(fork) - firstDelBlock[fork] over all forks. I don't mind omitting the non-main forks, but a comment explaining the reasoning would be needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
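A compact way to express that suggestion, for illustration (BUF_DROP_FULLSCAN_THRESHOLD is the name quoted from the patch; treat this as a sketch rather than the final code):

	BlockNumber	nBlocksToInvalidate = 0;

	/* Sum the number of to-be-truncated blocks over all designated forks. */
	for (int j = 0; j < nforks; j++)
		nBlocksToInvalidate += smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];

	if (nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
	{
		/* Look up and invalidate only the affected buffers, fork by fork. */
	}
	else
	{
		/* Fall back to scanning the whole of shared buffers, as today. */
	}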
RE: I'd like to discuss scaleout at PGCon
From: MauMau
> I intentionally have put little conclusion on our specification and
> design. I'd like you to look at recent distributed databases, and
> then think about and discuss what we want to aim for together. I feel
> it's better to separate a thread per topic or group of topics.

Finally, MariaDB is now equipped with a scale-out feature.

How MariaDB achieves global scale with Xpand
https://www.infoworld.com/article/3574077/how-mariadb-achieves-global-scale-with-xpand.html

I haven't read its documentation, and am not planning to read it now, but the feature looks nice. I hope this will also be a good stimulus. I believe we should be aware of competitiveness when designing -- "If we adopt this architecture or design to simplify the first version, will it really evolve naturally into a competitive product in the future without distorting the concept, design, and interface?"

Regards
Takayuki Tsunakawa
Re: Parallel copy
> > > Have you tested your patch when encoding conversion is needed? If so,
> > > could you please point out the email that has the test results.
> >
> > We have not yet done encoding testing, we will do and post the results
> > separately in the coming days.
>

Hi Ashutosh,

I ran the tests ensuring pg_server_to_any() gets called from copy.c. I specified the encoding option of the COPY command, with client and server encodings being UTF-8. Tests were performed with a custom postgresql.conf[1], 10 million rows, 5.2GB of data. The results are triplets of the form (exec time in sec, number of workers, gain).

Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X), (341.284, 4, 3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X), (129.843, 20, 9.04X), (134.511, 30, 8.72X)

Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X), (263.575, 4, 3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X), (178.704, 20, 4.54X), (199.402, 30, 4.06X)

Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X), (100.656, 30, 2.19X)

The results are similar to our earlier runs[2].

[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[2] https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] Add features to pg_stat_statements
Not limited to 3, more like a history table. Will have to think about whether the data is saved at shutdown.
Re: problem with RETURNING and update row movement
On Thu, Sep 24, 2020 at 2:47 PM Amit Langote wrote:
> On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita wrote:
> BTW, the discussion so far on the other thread is oblivious to the
> issue being discussed here, where we need to find a way to transfer
> system attributes between a pair of partitions that are possibly
> incompatible with each other in terms of what set of system attributes
> they support.

Yeah, we should discuss the two issues together.

> Although, if we prevent accessing system attributes
> when performing the operation on partitioned tables, like what you
> seem to propose below, then we wouldn't really have that problem.

Yeah, I think so.

> > Yeah, but for the other issue, I started thinking that we should just
> > forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD...
>
> When the command is being performed on a partitioned table you mean?

Yes. One concern about that is triggers: IIUC, triggers on a partition as-is may or may not be able to reference xmin/xmax/cmin/cmax, depending on whether a dedicated tuple slot for the partition is used or not. We should do something about this if we go in that direction?

> That is, it'd be okay to reference them when the command is being
> performed directly on a leaf partition, although it's another thing
> whether the leaf partitions themselves have sensible values to provide
> for them.

I think so too.

Best regards,
Etsuro Fujita
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila wrote:
>
> On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar wrote:
> >
> > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote:
> > >
> > > I am not sure if this suggestion makes it better than what is proposed
> > > by Dilip but I think we can declare them in define number order like
> > > below:
> > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
> > Done this way.
>
> - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> + options.proto.logical.proto_version = MySubscription->stream ?
> + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
>
> Here, I think instead of using MySubscription->stream, we should use
> the server/walrecv version number as we used at one place in tablesync.c.

I am not sure how we can do this. If the PG version number is 14, then we will always send LOGICALREP_PROTO_STREAM_VERSION_NUM? Then again we will face the same error, right? I think it should be strictly based on whether we have enabled streaming or not. The logical replication protocol is still the same; only if streaming is enabled do we expect the streaming protocol, otherwise not.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Thu, Sep 24, 2020 at 4:31 PM Dilip Kumar wrote:
>
> On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila wrote:
> >
> > On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar wrote:
> > >
> > > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote:
> > > >
> > > > I am not sure if this suggestion makes it better than what is proposed
> > > > by Dilip but I think we can declare them in define number order like
> > > > below:
> > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> > >
> > > Done this way.
> >
> > - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> > + options.proto.logical.proto_version = MySubscription->stream ?
> > + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
> >
> > Here, I think instead of using MySubscription->stream, we should use
> > the server/walrecv version number as we used at one place in tablesync.c.
>
> I am not sure how we can do this.

Have you checked what the function walrcv_server_version() will return? I was thinking that if we know the subscriber is connected to a publisher of version < 14, then we can send the right value.

--
With Regards,
Amit Kapila.
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila wrote:
>
> On Thu, Sep 24, 2020 at 4:31 PM Dilip Kumar wrote:
> >
> > On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila wrote:
> > >
> > > On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote:
> > > > >
> > > > > I am not sure if this suggestion makes it better than what is proposed
> > > > > by Dilip but I think we can declare them in define number order like
> > > > > below:
> > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> > > >
> > > > Done this way.
> > >
> > > - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> > > + options.proto.logical.proto_version = MySubscription->stream ?
> > > + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
> > >
> > > Here, I think instead of using MySubscription->stream, we should use
> > > the server/walrecv version number as we used at one place in tablesync.c.
> >
> > I am not sure how we can do this.
>
> Have you checked what the function walrcv_server_version() will
> return? I was thinking that if we know the subscriber is connected to
> a publisher of version < 14, then we can send the right value.

But suppose we know the publisher version is < 14 and the user has set streaming on while creating the subscription: will we still send the right version? In tablesync we are forming a query, so we form it as per the publisher's version. But here we are telling the publisher which protocol version we expect for the data transfer, so I feel it should be LOGICALREP_PROTO_VERSION_NUM if we expect a non-streaming transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect a streaming transfer. Now let the publisher decide whether it supports the protocol we have asked for or not, and if it doesn't, let it give an error. Am I missing something here?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Report error position in partition bound check
On Thu, Sep 24, 2020 at 7:19 AM Tom Lane wrote:
> I looked this over and pushed it with some minor adjustments.

Thank you.

> However, while I was looking at it I couldn't help noticing that
> transformPartitionBoundValue's handling of collation concerns seems
> less than sane. There are two things bugging me:
>
> 1. Why does it care about the expression's collation only when there's
> a top-level CollateExpr? For example, that means we get an error for
>
> regression=# create table p (f1 text collate "C") partition by list(f1);
> CREATE TABLE
> regression=# create table c1 partition of p for values in ('a' collate "POSIX");
> ERROR: collation of partition bound value for column "f1" does not match partition key collation "C"
>
> but not this:
>
> regression=# create table c2 partition of p for values in ('a' || 'b' collate "POSIX");
> CREATE TABLE
>
> Given that we will override the expression's collation with the partition
> column's collation anyway, I don't see why we have this check at all,
> so my preference is to just rip out the entire stanza beginning with
> "if (IsA(value, CollateExpr))". If we keep it, though, I think it needs
> to do something else that is more general.
>
> 2. Nothing is doing assign_expr_collations() on the partition expression.
> This can trivially be shown to cause problems:
>
> regression=# create table p (f1 bool) partition by list(f1);
> CREATE TABLE
> regression=# create table cf partition of p for values in ('a' < 'b');
> ERROR: could not determine which collation to use for string comparison
> HINT: Use the COLLATE clause to set the collation explicitly.
>
> If we want to rip out the collation mismatch error altogether, then
> fixing #2 would just require inserting assign_expr_collations() before
> the expression_planner() call. The other direction that would make
> sense to me is to perform assign_expr_collations() after
> coerce_to_target_type(), and then to complain if exprCollation()
> isn't default and doesn't match the partition collation. In any
> case a specific test for a CollateExpr seems quite wrong.

I tried implementing that as attached and one test failed:

create table test_part_coll_posix (a text) partition by range (a collate "POSIX");
...
create table test_part_coll_cast2 partition of test_part_coll_posix for values from (name 's') to ('z');
+ERROR: collation of partition bound value for column "a" does not match partition key collation "POSIX"
+LINE 1: ...ion of test_part_coll_posix for values from (name 's') to ('...

I dug up the discussion which resulted in this test being added and found that Peter E had opined that this failure should not occur [1]. Maybe that is why I put in that half-baked guard consisting of checking whether the erroneous collation comes from an explicit COLLATE clause. Now I think maybe giving an error is alright, but we should say in a DETAIL message what the expression's collation is, like as follows:

create table test_part_coll_cast2 partition of test_part_coll_posix for values from (name 's') to ('z');
+ERROR: collation of partition bound value for column "a" does not match partition key collation "POSIX"
+LINE 1: ...ion of test_part_coll_posix for values from (name 's') to ('...
+                                                                      ^
+DETAIL: The collation of partition bound value is "C".

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/04661508-b6f5-177e-6f6b-c4cb8426b9f0%402ndquadrant.com

Attachment: partition-bound-collation-check.patch
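Concretely, the suggested DETAIL could hang off the existing error report roughly like this (a sketch based on the message text above; exprCollOid stands for the bound expression's collation at that point):

	ereport(ERROR,
			(errcode(ERRCODE_DATATYPE_MISMATCH),
			 errmsg("collation of partition bound value for column \"%s\" does not match partition key collation \"%s\"",
					colName, get_collation_name(partCollation)),
			 errdetail("The collation of partition bound value is \"%s\".",
					   get_collation_name(exprCollOid)),
			 parser_errposition(pstate, exprLocation(value))));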
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Thu, Sep 24, 2020 at 5:11 PM Dilip Kumar wrote:
>
> On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila wrote:
> >
> > Have you checked what the function walrcv_server_version() will
> > return? I was thinking that if we know the subscriber is connected to
> > a publisher of version < 14, then we can send the right value.
>
> But suppose we know the publisher version is < 14 and the user has set
> streaming on while creating the subscription: will we still send the
> right version?

Yeah, we can send the version depending on which server we are talking to.

> In tablesync we are forming a query, so we form it
> as per the publisher's version. But here we are telling the publisher which
> protocol version we expect for the data transfer, so I feel it
> should be LOGICALREP_PROTO_VERSION_NUM if we expect a non-streaming
> transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect a
> streaming transfer.

I am not sure if this is the right strategy. See libpqrcv_startstreaming: even if the user asked for streaming, unless the server supports it we don't send the streaming option to the server. Another thing is that this check will become more complicated if we need another feature, like decode-prepared, to send a different version, or even for the same version we might need additional checks. Why do you want to send a streaming protocol version when we know the server doesn't support it? Let's behave similarly to what we do in libpqrcv_startstreaming.

--
With Regards,
Amit Kapila.
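In code form, that suggestion amounts to roughly the following (a sketch only; wrconn here stands for the apply worker's walreceiver connection, and the exact placement would differ in the real patch):

	int			server_version = walrcv_server_version(wrconn);

	/* Ask for the streaming protocol only if the publisher can speak it. */
	options.proto.logical.proto_version =
		(server_version >= 140000 && MySubscription->stream) ?
		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;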
Re: Resetting spilled txn statistics in pg_stat_replication
On Sat, Sep 19, 2020 at 1:48 PM Amit Kapila wrote:
>
> On Tue, Sep 8, 2020 at 7:02 PM Amit Kapila wrote:
> >
> > On Tue, Sep 8, 2020 at 7:53 AM Masahiko Sawada wrote:
> >
> > I have fixed these review comments in the attached patch.
>
> Apart from the above,
> (a) fixed one bug in ReorderBufferSerializeTXN() where we were
> updating the stats even when we have not spilled anything.
> (b) made changes in pgstat_read_db_statsfile_timestamp to return false
> when the replication slot entry is corrupt.
> (c) moved the declarations and definitions in pgstat.c to make them
> consistent with existing code
> (d) made another couple of cosmetic fixes and changed a few comments
> (e) Tested the patch by using a guc which allows spilling all the
> changes. See v4-0001-guc-always-spill

I have found a way to write the test case for this patch. It is based on the idea we used in stats.sql. As of now, I have kept the test as a separate patch. We can decide to commit the test part separately as it is slightly timing-dependent, but OTOH, as it is based on existing logic in stats.sql, there shouldn't be much of a problem. I have not changed anything apart from the test patch in this version. Note that the first patch is just a debugging kind of tool to test the patch.

Thoughts?

--
With Regards,
Amit Kapila.

Attachments:
v5-0001-guc-always-spill.patch
v5-0002-Track-statistics-for-spilling-of-changes-from-Reo.patch
v5-0003-Test-stats.patch
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
Hi Amit,

On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila wrote:
>
> On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar wrote:
> >
> > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote:
> > >
> > > I am not sure if this suggestion makes it better than what is proposed
> > > by Dilip but I think we can declare them in define number order like
> > > below:
> > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
> > Done this way.
>
> - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> + options.proto.logical.proto_version = MySubscription->stream ?
> + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
>
> Here, I think instead of using MySubscription->stream, we should use
> the server/walrecv version number as we used at one place in tablesync.c.

Should subscribers set the LR protocol value based on the publisher version they are communicating with, or based on whether streaming was enabled while creating the subscription? AFAIU, if we set this value based on the publisher version (which is, let's say, >= 14), then it's quite possible that the subscriber will start streaming changes for in-progress transactions even if streaming was disabled while creating the subscription, won't it? Please let me know if I am missing something here.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: fixing old_snapshot_threshold's time->xid mapping
The following review has been posted through the commitfest application:

make installcheck-world:  tested, passed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

Patch looks good to me.
Re: PostgreSQL 13 Release Timeline
On 9/10/20 1:08 PM, Jonathan S. Katz wrote:
> Hi,
>
> On 9/2/20 2:13 PM, Jonathan S. Katz wrote:
>
>> * PostgreSQL 13 Release Candidate 1 (RC1) will be released on September
>> 17, 2020.
>>
>> * In absence of any critical issues, PostgreSQL 13 will become generally
>> available on September 24, 2020.
>>
>> The aim is to have all outstanding open items committed before RC1.
>> Please ensure everything is committed by September 16, 2020 AoE. We will
>> send out a reminder prior to that date.
>
> As a friendly reminder, RC1 is one week away. I also realized I made an
> error in the commit date above for RC1. Sorry :( Corrected date below.
>
> Please ensure you have all of your commits in by **September 13, 2020
> AoE** so we can wrap RC1 and get it out the door.
>
> Presuming RC1 is successful, commits for GA must be in by **September
> 20, 2020 AoE**.
>
> Please try to close out open items[1] before RC1.

I am pleased to report that PostgreSQL 13 is now GA:

https://www.postgresql.org/about/news/2077/

Thank you everyone for your hard work and congratulations on another successful release!

Jonathan
Re: Parallel copy
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy wrote:
> > > > Have you tested your patch when encoding conversion is needed? If so,
> > > > could you please point out the email that has the test results.
> > >
> > > We have not yet done encoding testing, we will do and post the results
> > > separately in the coming days.
> >
>
> Hi Ashutosh,
>
> I ran the tests ensuring pg_server_to_any() gets called from copy.c. I
> specified the encoding option of COPY command, with client and server
> encodings being UTF-8.

Thanks Bharath for the testing. The results look impressive.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: PostgreSQL 13 Release Timeline
> I am pleased to report that PostgreSQL 13 is now GA:
>
> https://www.postgresql.org/about/news/2077/
>
> Thank you everyone for your hard work and congratulations on another
> successful release!

Well done team!

--
Simon Riggs
http://www.2ndQuadrant.com/
Mission Critical Databases
Re: problem with RETURNING and update row movement
On Thu, Sep 24, 2020 at 7:30 PM Etsuro Fujita wrote:
> On Thu, Sep 24, 2020 at 2:47 PM Amit Langote wrote:
> > On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita wrote:
> > > Yeah, but for the other issue, I started thinking that we should just
> > > forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD...
> >
> > When the command is being performed on a partitioned table you mean?
>
> Yes. One concern about that is triggers: IIUC, triggers on a
> partition as-is can or can not reference xmin/xmax/cmin/cmax depending
> on whether a dedicated tuple slot for the partition is used or not.
> We should do something about this if we go in that direction?

Maybe I'm missing something, but assuming that we're talking about prohibiting system attribute access in the RETURNING clause, how does that affect what triggers can or cannot do? AFAICS, only AFTER row-level triggers may sensibly access system attributes, and whether/how they can do so has not much to do with the slot that ExecInsert() gets the new tuple in. It seems that the AFTER trigger infrastructure remembers an affected tuple's ctid and fetches it just before calling the trigger function by asking the result relation's (e.g., a partition's) access method.

To illustrate, with HEAD:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);

create or replace function report_system_info () returns trigger language plpgsql as $$
begin
  raise notice 'ctid: %', new.ctid;
  raise notice 'xmin: %', new.xmin;
  raise notice 'xmax: %', new.xmax;
  raise notice 'cmin: %', new.cmin;
  raise notice 'cmax: %', new.cmax;
  raise notice 'tableoid: %', new.tableoid;
  return NULL;
end;
$$;

create trigger foo_after_trig after insert on foo for each row execute function report_system_info();

begin;

insert into foo values (1);
NOTICE:  ctid: (0,1)
NOTICE:  xmin: 532
NOTICE:  xmax: 0
NOTICE:  cmin: 0
NOTICE:  cmax: 0
NOTICE:  tableoid: 16387

insert into foo values (1);
NOTICE:  ctid: (0,2)
NOTICE:  xmin: 532
NOTICE:  xmax: 0
NOTICE:  cmin: 1
NOTICE:  cmax: 1
NOTICE:  tableoid: 16387

insert into foo values (1);
NOTICE:  ctid: (0,3)
NOTICE:  xmin: 532
NOTICE:  xmax: 0
NOTICE:  cmin: 2
NOTICE:  cmax: 2
NOTICE:  tableoid: 16387

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Re: PostgreSQL 13 Release Timeline
On Thu, Sep 24, 2020 at 9:30 AM Jonathan S. Katz wrote:
>
> I am pleased to report that PostgreSQL 13 is now GA:
>
> https://www.postgresql.org/about/news/2077/

I'm not sure if there was a "release announcement draft" thread (searching my folder didn't find one for GA, but it's also easily possible I missed it). I'm wondering if the link for incremental sort [1] should instead/also point to the examples in [2], which actually explain what incremental sort does.

Actually, should the GUC page link there? There's not an obvious anchor tag on the page (that I can tell) to use for such a link though...

If it'd be better to split this off to a new thread, I can do that too.

James

1: https://www.postgresql.org/docs/13/runtime-config-query.html#GUC-ENABLE-INCREMENTAL-SORT
2: https://www.postgresql.org/docs/13/using-explain.html
Re: PostgreSQL 13 Release Timeline
On 9/24/20 10:11 AM, James Coleman wrote:
> On Thu, Sep 24, 2020 at 9:30 AM Jonathan S. Katz wrote:
>>
>> I am pleased to report that PostgreSQL 13 is now GA:
>>
>> https://www.postgresql.org/about/news/2077/
>
> I'm not sure if there was a "release announcement draft" thread
> (searching my folder didn't find one for GA, but it's also easily
> possible I missed it)

Yes, there always is[1].

> I'm wondering if the link for incremental sort
> should instead/also point to the examples in [2] which actually
> explain what incremental sort does.
>
> Actually, should the GUC page link there? There's not an obvious
> anchor tag on the page (that I can tell) to use for such a link
> though...

When researching it, I believe I chose the GUC because it was anchorable.

> If it'd be better to split off this to a new thread, I can do that too.

Probably, because it will trigger discussion ;)

Jonathan

[1] https://www.postgresql.org/message-id/dfc987e3-bb8c-96f5-3bd2-11b9dcc96f19%40postgresql.org
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
> On 24 Sep 2020, at 04:53, Michael Paquier wrote:
> Enabling FIPS with OpenSSL 1.0.2 causes direct calls to the SHAXXX
> routines to fail:
> "Low level API call to digest SHA256 forbidden in fips mode"

Confirmed by running 1.0.2s-fips with fips_mode=yes in the alg_section in openssl.cnf.

> This got discussed back in 2018, but I never got back to it:
> https://www.postgresql.org/message-id/20180911030250.ga27...@paquier.xyz
>
> One thing I did not like in the past patch was that we did not
> handle failures if one of OpenSSL's calls failed, but this can easily
> be handled by using a trick similar to jsonapi.c to fail hard if that
> happens.
>
> It is worth noting that the low-level SHA routines have not been recommended
> for years in OpenSSL, and that they have been officially marked as
> deprecated in 3.0.0. So, while the changes in sha2.h don't make this
> stuff back-patchable per the ABI breakage it introduces, switching
> sha2_openssl.c to use EVP is a better move in the long term,

Agreed. Commit 5ff4a67f63f [0] moved contrib/pgcrypto away from using low-level functions for pretty much exactly the same reasons: they are advised against and break FIPS. With your patch I can run the SSL tests successfully both with and without FIPS. I'm in favor of moving to the EVP API.

> ..even if
> that means that SCRAM+FIPS would not work with PG 10~13, so the
> attached is something for HEAD, even if this would be possible to do
> in older releases as the routines used in the attached are available
> in versions of OpenSSL older than 1.0.1.

That's kind of a shame, but given the low volume of reports to -bugs and -hackers, the combination of SCRAM and FIPS might not be all too common. Since OpenSSL FIPS is EOL, it might also not increase until 3.0.0 comes with FIPS certification? If we really want to support it (which would require more evidence of it being a problem IMO), using the non-OpenSSL sha256 code would be one option, I guess?

cheers ./daniel

[0] https://postgr.es/m/561274f1.1030...@iki.fi
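For reference, the EVP-based shape being discussed looks roughly like the following sketch (not the committed code; EVP_MD_CTX_create()/EVP_MD_CTX_destroy() are used since they also exist in OpenSSL 1.0.x):

#include <openssl/evp.h>

static int
sha256_evp(const unsigned char *data, size_t len,
		   unsigned char *out, unsigned int *outlen)
{
	EVP_MD_CTX *ctx = EVP_MD_CTX_create();

	if (ctx == NULL)
		return -1;

	/* Route the digest through EVP so FIPS-enabled builds stay happy. */
	if (EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) <= 0 ||
		EVP_DigestUpdate(ctx, data, len) <= 0 ||
		EVP_DigestFinal_ex(ctx, out, outlen) <= 0)
	{
		EVP_MD_CTX_destroy(ctx);
		return -1;
	}

	EVP_MD_CTX_destroy(ctx);
	return 0;
}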
Re: [patch] Fix checksum verification in base backups for zero page headers
On 22.09.2020 17:30, Michael Banck wrote:
> Hi,
>
> On Tuesday, 22.09.2020 at 16:26 +0200, Michael Banck wrote:
>> On Wednesday, 02.09.2020 at 16:50 +0300, Anastasia Lubennikova wrote:
>>> I've looked through the previous discussion. As far as I got it, most of
>>> the controversy was about online checksums improvements.
>>> The warning about pd_upper inconsistency that you've added is a good
>>> addition. The patch is a bit messy, though, because a huge code block was
>>> shifted. Will it be different, if you just leave
>>> "if (!PageIsNew(page) && PageGetLSN(page) < startptr)" block as it was,
>>> and add "else if (PageIsNew(page) && !PageIsZero(page))" ?
>>
>> Thanks, that does indeed look better as a patch and I think it's fine
>> as-is for the code as well, I've attached a v2.
>
> Sorry, forgot to add you as reviewer in the proposed commit message, I've
> fixed that up now in V3.

Great. This version looks good to me. Thank you for answering my questions; I agree that we can work on them in separate threads. So I mark this one as Ready for Committer.

The only minor problem is a typo (?) in the proposed commit message: "If a page is all zero, consider that a checksum failure." It should be "If a page is NOT all zero...".

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
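The control flow being agreed on is roughly the following (PageIsZero is the helper added by the patch; this is just a sketch of the shape, not the exact hunk):

	if (!PageIsNew(page) && PageGetLSN(page) < startptr)
	{
		/* Verify the page checksum, as the code already does. */
	}
	else if (PageIsNew(page) && !PageIsZero(page))
	{
		/*
		 * The page header claims the page is new, but the rest of the page
		 * is not all-zero: count it as a checksum failure and warn, instead
		 * of silently skipping it.
		 */
	}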
Re: Refactor pg_rewind code and make it work against a standby
Thanks for the review! I'll post a new version shortly, with your comments incorporated. More detailed responses to a few of them below:

On 18/09/2020 10:41, Kyotaro Horiguchi wrote:
> I don't think filemap_finalize needs to iterate over filemap twice.

True, but I thought it's more clear that way, doing one thing at a time.

> hash_string_pointer is a copy of that of pg_verifybackup.c. Is it
> worth having in hashfn.h or .c?

I think it's fine for now. Maybe in the future if more copies crop up.

> --- a/src/bin/pg_rewind/pg_rewind.c
> +++ b/src/bin/pg_rewind/pg_rewind.c
> ...
> +	filemap_t  *filemap;
> ..
> +	filemap_init();
> ...
> +	filemap = filemap_finalize();
>
> I'm a bit confused by this, and realized that what filemap_init
> initializes is *not* the filemap, but the filehash. So for example,
> the names of the functions might be something like this?
>
> filehash_init()
> filemap = filehash_finalize()/create_filemap()

My thinking was that filemap_* is the prefix for the operations in filemap.c, hence filemap_init(). I can see the confusion, though, and I think you're right that renaming would be good. I renamed them to filehash_init() and decide_file_actions(). I think those names make the calling code in pg_rewind.c quite clear.

> /*
>  * Also check that full_page_writes is enabled. We can get torn pages if
>  * a page is modified while we read it with pg_read_binary_file(), and we
>  * rely on full page images to fix them.
>  */
> str = run_simple_query(conn, "SHOW full_page_writes");
> if (strcmp(str, "on") != 0)
> 	pg_fatal("full_page_writes must be enabled in the source server");
> pg_free(str);
>
> This is a part not changed by this patch set. But if we allow connecting
> to a standby, this check can be tricked by setting it off on the primary
> and "on" on the standby (FWIW, though). Some protection measure might be
> necessary.

Good point, the value in the primary is what matters. In fact, even when connected to the primary, the value might change while pg_rewind is running. I'm not sure how we could tighten that up.

> +	if (chunksize > rq->length)
> +	{
> +		pg_fatal("received more than requested for file \"%s\"",
> +				 rq->path);
> +		/* receiving less is OK, though */
>
> Don't we need to truncate the target file, though?

If a file is truncated in the source while pg_rewind is running, there should be a WAL record about the truncation that gets replayed when you start the server after pg_rewind has finished. We could truncate the file if we wanted to, but it's not necessary. I'll add a comment.

- Heikki
Re: Report error position in partition bound check
[ cc'ing Peter, since his opinion seems to have got us here in the first place ]

Amit Langote writes:
> On Thu, Sep 24, 2020 at 7:19 AM Tom Lane wrote:
>> However, while I was looking at it I couldn't help noticing that
>> transformPartitionBoundValue's handling of collation concerns seems
>> less than sane. There are two things bugging me:
>>
>> 1. Why does it care about the expression's collation only when there's
>> a top-level CollateExpr? For example, that means we get an error for
>>
>> regression=# create table p (f1 text collate "C") partition by list(f1);
>> CREATE TABLE
>> regression=# create table c1 partition of p for values in ('a' collate "POSIX");
>> ERROR: collation of partition bound value for column "f1" does not match partition key collation "C"
>>
>> but not this:
>>
>> regression=# create table c2 partition of p for values in ('a' || 'b' collate "POSIX");
>> CREATE TABLE
>>
>> Given that we will override the expression's collation with the partition
>> column's collation anyway, I don't see why we have this check at all,
>> so my preference is to just rip out the entire stanza beginning with
>> "if (IsA(value, CollateExpr))". If we keep it, though, I think it needs
>> to do something else that is more general.

> I dug up the discussion which resulted in this test being added and
> found that Peter E had opined that this failure should not occur [1].

Well, I agree with Peter to that extent, but my opinion is that *none* of these cases ought to be errors. What we're doing here is performing an implicit assignment-level coercion of the expression to the type of the column, and changing the collation is allowed as part of that:

regression=# create table foo (f1 text collate "C");
CREATE TABLE
regression=# insert into foo values ('a' COLLATE "POSIX");
INSERT 0 1
regression=# update foo set f1 = 'b' COLLATE "POSIX";
UPDATE 1

So I find it completely inconsistent that the partitioning logic complains about equivalent cases. I think we should just rip the whole thing out, as per the attached draft. This causes several regression test results to change, but AFAICS those are only there to exercise the error tests that I think we should get rid of.

			regards, tom lane

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 164312d60e..0dc03dd984 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -4183,50 +4183,6 @@ transformPartitionBoundValue(ParseState *pstate, Node *val,
 	 */
 	Assert(!contain_var_clause(value));
 
-	/*
-	 * Check that the input expression's collation is compatible with one
-	 * specified for the parent's partition key (partcollation). Don't throw
-	 * an error if it's the default collation which we'll replace with the
-	 * parent's collation anyway.
-	 */
-	if (IsA(value, CollateExpr))
-	{
-		Oid			exprCollOid = exprCollation(value);
-
-		/*
-		 * Check we have a collation iff it is a collatable type. The only
-		 * expected failures here are (1) COLLATE applied to a noncollatable
-		 * type, or (2) partition bound expression had an unresolved
-		 * collation. But we might as well code this to be a complete
-		 * consistency check.
-		 */
-		if (type_is_collatable(colType))
-		{
-			if (!OidIsValid(exprCollOid))
-				ereport(ERROR,
-						(errcode(ERRCODE_INDETERMINATE_COLLATION),
-						 errmsg("could not determine which collation to use for partition bound expression"),
-						 errhint("Use the COLLATE clause to set the collation explicitly.")));
-		}
-		else
-		{
-			if (OidIsValid(exprCollOid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATATYPE_MISMATCH),
-						 errmsg("collations are not supported by type %s",
-								format_type_be(colType))));
-		}
-
-		if (OidIsValid(exprCollOid) &&
-			exprCollOid != DEFAULT_COLLATION_OID &&
-			exprCollOid != partCollation)
-			ereport(ERROR,
-					(errcode(ERRCODE_DATATYPE_MISMATCH),
-					 errmsg("collation of partition bound value for column \"%s\" does not match partition key collation \"%s\"",
-							colName, get_collation_name(partCollation)),
-					 parser_errposition(pstate, exprLocation(value))));
-	}
-
 	/*
 	 * Coerce to the correct type. This might cause an explicit coercion step
 	 * to be added on top of the expression, which must be evaluated before
@@ -4253,6 +4209,7 @@ transformPartitionBoundValue(ParseState *pstate, Node *val,
 	 */
 	if (!IsA(value, Const))
 	{
+		assign_expr_collations(pstate, value);
 		value = (Node *) expression_planner((Expr *) value);
 		value = (Node *) evaluate_expr((Expr *) value, colType, colTypmod,
 									   partCollation);
Re: "cert" + clientcert=verify-ca in pg_hba.conf?
On Thu, Sep 24, 2020 at 12:44:01PM +0900, Michael Paquier wrote:
> On Tue, Sep 01, 2020 at 10:27:03PM -0400, Bruce Momjian wrote:
> > OK, good. Let's wait a few days and I will then apply it for PG 14.
>
> It has been a few days, and nothing has happened here. I have not
> looked at the patch in details, so I cannot say if that's fine or not,
> but please note that the patch fails to apply per the CF bot.

I will handle it.

--
Bruce Momjian  https://momjian.us
EnterpriseDB   https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee
Custom options for building extensions with --with-llvm
Hi,

In my extension I want to define some custom options for the compiler. I do it in the following way:

ifdef USE_DISK
CUSTOM_COPT += -DIMCS_DISK_SUPPORT
endif

So if I want to enable disk support, I build my extension as

make USE_DISK=1 clean all install

and it normally works... unless Postgres is built with --with-llvm. In that case it also tries to build LLVM bitcode using clang, and clang is not invoked with CUSTOM_COPT:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O3 -Wall -pthread -DIMCS_DISK_SUPPORT -fPIC -I. -I. -I../../src/include -D_GNU_SOURCE -c -o disk.o disk.c

vs.

/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I. -I../../src/include -D_GNU_SOURCE -flto=thin -emit-llvm -c -o disk.bc disk.c

I wonder, is there any way to pass custom compile options to clang?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On 24/09/2020 17:21, Daniel Gustafsson wrote:
> If we really want to support it (which would require more evidence of it being
> a problem IMO), using the non-OpenSSL sha256 code would be one option I guess?

That would technically work, but wouldn't it make the product as a whole not FIPS compliant? I'm not a FIPS lawyer, but as I understand it the point of FIPS is that all the crypto code is encapsulated in a certified module. Having your own SHA-256 implementation would defeat that.

- Heikki
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
> On 24 Sep 2020, at 18:21, Heikki Linnakangas wrote:
>
> On 24/09/2020 17:21, Daniel Gustafsson wrote:
>> If we really want to support it (which would require more evidence of it being
>> a problem IMO), using the non-OpenSSL sha256 code would be one option I guess?
>
> That would technically work, but wouldn't it make the product as a whole not
> FIPS compliant? I'm not a FIPS lawyer, but as I understand it the point of
> FIPS is that all the crypto code is encapsulated in a certified module.
> Having your own SHA-256 implementation would defeat that.

Doh, of course, I blame a lack of caffeine this afternoon. Having a private local sha256 implementation using the EVP_* API inside scram-common would maintain FIPS compliance and ABI compatibility, but would also be rather ugly.

cheers ./daniel
Incremental sort docs and release announcement
I'm breaking out a few questions I'd posed on another thread about the release timeline [1] into this new thread.

I noticed that on the PG13 release announcement the link for incremental sort goes to the GUC docs [2] because (as Jonathan Katz confirmed in the linked thread) that page actually has helpful anchor tags. But I'm wondering if instead/also it should point to the examples in the EXPLAIN docs [3], which actually explain what incremental sort does. In the initial patch discussion we ended up putting the explanation there because there was a desire to keep the GUC descriptions short.

But that raises a larger question: should the GUC page also link to the EXPLAIN examples? There's not an obvious anchor tag on the page (that I can tell) to use for such a link though... so that could get into an even larger question about adding those anchors.

It seems to me that we don't have a particularly great place in the docs for detailed explanations of the algorithms/nodes/etc. we use... unless the new glossary section in the docs could fill that role?

Any thoughts?

James

1: https://www.postgresql.org/message-id/CAAaqYe_kiFZ2T_KhHf8%2BjR_f4YnH%3DhdnN3QDCA1D%2B%2BF0GuHpdg%40mail.gmail.com
2: https://www.postgresql.org/docs/13/runtime-config-query.html#GUC-ENABLE-INCREMENTAL-SORT
3: https://www.postgresql.org/docs/13/using-explain.html
Re: Refactor pg_rewind code and make it work against a standby
On 20/09/2020 23:44, Soumyadeep Chakraborty wrote: Before getting into the code review for the patch, I wanted to know why we don't use a Bitmapset for target_modified_pages? Bitmapset is not available in client code. Perhaps it could be moved to src/common with some changes, but doesn't seem worth it until there's more client code that would need it. I'm not sure that a bitmap is the best data structure for tracking the changed blocks in the first place. A hash table might be better if there are only a few changed blocks, or something like src/backend/lib/integerset.c if there are many. But as long as the simple bitmap gets the job done, let's keep it simple. 2. Rename target_modified_pages to target_pages_to_overwrite? target_modified_pages can lead to confusion as to whether it includes pages that were modified on the target but not even present in the source and the other exclusionary cases. target_pages_to_overwrite is much clearer. Agreed, I'll rename it. Conceptually, while we're scanning source WAL, we're just making note of the modified blocks. The decision on what to do with them happens only later, in decide_file_action(). The difference is purely theoretical, though. There is no real decision to be made, all the modified blocks will be overwritten. So on the whole, I agree 'target_page_to_overwrite' is better. /* * If this is a relation file, copy the modified blocks. * * This is in addition to any other changes. */ iter = datapagemap_iterate(&entry->target_modified_pages); while (datapagemap_next(iter, &blkno)) { offset = blkno * BLCKSZ; source->queue_fetch_range(source, entry->path, offset, BLCKSZ); } pg_free(iter); Can we put this hunk into a static function overwrite_pages()? Meh, it's only about 10 lines of code, and one caller. 4. * block that have changed in the target system. It makes note of all the * changed blocks in the pagemap of the file. Can we replace the above with: * block that has changed in the target system. It decides if the given blkno in the target relfile needs to be overwritten from the source. Ok. Again conceptually though, process_target_wal_block_change() just collects information, and the decisions are made later. But you're right that we do leave out truncated-away blocks here, so we are doing more than just noting all the changed blocks. /* * Doesn't exist in either server. Why does it have an entry in the * first place?? */ return FILE_ACTION_NONE; Can we delete the above hunk and add the following assert to the very top of decide_file_action(): Assert(entry->target_exists || entry->source_exists); I would like to keep the check even when assertions are not enabled. I'll add an Assert(false) there. 7. Please address the FIXME for the symlink case: /* FIXME: Check if it points to the same target? */ It's not a new issue. Would be nice to fix, of course. I'm not sure what the right thing to do would be. If you have e.g. replaced postgresql.conf with a symlink that points outside the data directory, would it be appropriate to overwrite it? Or perhaps we should throw an error? We also throw an error if a file is a symlink in the source but a regular file in the target, or vice versa. 8. * it anyway. But there's no harm in copying it now.) and * copy them here. But we don't know which scenario we're * dealing with, and there's no harm in copying the missing * blocks now, so do it now. Could you add a line or two explaining why there is "no harm" in these two hunks above? The previous sentences explain that there's a WAL record covering them. 
So they will be overwritten by WAL replay anyway. Does it need more explanation? 14. queue_overwrite_range(), finish_overwrite() instead of queue_fetch_range(), finish_fetch()? Similarly update\ *_fetch_file_range() and *_finish_fetch() 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c Good idea! And fetch.h -> rewind_source.h. I also moved the code in execute_file_actions() function to pg_rewind.c, into a new function: perform_rewind(). It felt a bit silly to have just execute_file_actions() in a file of its own. perform_rewind() now does all the modifications to the data directory, writing the backup file. Except for writing the recovery config: that also needs to be when there's no rewind to do, so it makes sense to keep it separate. What do you think? 16. conn = PQconnectdb(connstr_source); if (PQstatus(conn) == CONNECTION_BAD) pg_fatal("could not connect to server: %s", PQerrorMessage(conn)); if (showprogress) pg_log_info("connected to server"); The above hunk should be part of init_libpq_source(). Consequently, init_libpq_source() should take a connection string instead of a conn. The libpq connection is also needed by WriteRecoveryConfig(), that's why it's not fully encapsulated in libpq_source. 19. typedef struct { const char *path; /* path relative to data directory root */ uint64
Re: proposal: possibility to read dumped table's name from file
Hi rebase + minor change - using pg_get_line_buf instead pg_get_line_append Regards Pavel diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index e09ed0a4c3..9d7ebaf922 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -757,6 +757,56 @@ PostgreSQL documentation + + --filter=filename + + +Read objects filters from the specified file. +If you use "-" as a filename, the filters are read from stdin. +The lines of this file must have the following format: + +(+|-)[tnfd] objectname + + + + +The first character specifies whether the object is to be included +(+) or excluded (-), and the +second character specifies the type of object to be filtered: +t (table), +n (schema), +f (foreign server), +d (table data). + + + +With the following filter file, the dump would include table +mytable1 and data from foreign tables of +some_foreign_server foreign server, but exclude data +from table mytable2. + ++t mytable1 ++f some_foreign_server +-d mytable2 + + + + +The lines starting with symbol # are ignored. +Previous white chars (spaces, tabs) are not allowed. These +lines can be used for comments, notes. Empty lines are ignored too. + + + +The --filter option works just like the other +options to include or exclude tables, schemas, table data, or foreign +tables, and both forms may be combined. Note that there are no options +to exclude a specific foreign table or to include a specific table's +data. + + + + --if-exists diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index f021bb72f4..a7824c5bf6 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -53,9 +53,11 @@ #include "catalog/pg_trigger_d.h" #include "catalog/pg_type_d.h" #include "common/connect.h" +#include "common/string.h" #include "dumputils.h" #include "fe_utils/string_utils.h" #include "getopt_long.h" +#include "lib/stringinfo.h" #include "libpq/libpq-fs.h" #include "parallel.h" #include "pg_backup_db.h" @@ -290,6 +292,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions, static char *get_synchronized_snapshot(Archive *fout); static void setupDumpWorker(Archive *AHX); static TableInfo *getRootTableInfo(TableInfo *tbinfo); +static void read_patterns_from_file(char *filename, DumpOptions *dopt); int @@ -364,6 +367,7 @@ main(int argc, char **argv) {"enable-row-security", no_argument, &dopt.enable_row_security, 1}, {"exclude-table-data", required_argument, NULL, 4}, {"extra-float-digits", required_argument, NULL, 8}, + {"filter", required_argument, NULL, 12}, {"if-exists", no_argument, &dopt.if_exists, 1}, {"inserts", no_argument, NULL, 9}, {"lock-wait-timeout", required_argument, NULL, 2}, @@ -603,6 +607,10 @@ main(int argc, char **argv) optarg); break; + case 12: /* filter implementation */ +read_patterns_from_file(optarg, &dopt); +break; + default: fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname); exit_nicely(1); @@ -1022,6 +1030,8 @@ help(const char *progname) " access to)\n")); printf(_(" --exclude-table-data=PATTERN do NOT dump data for the specified table(s)\n")); printf(_(" --extra-float-digits=NUM override default setting for extra_float_digits\n")); + printf(_(" --filter=FILENAMEdump objects and data based on the filter expressions\n" + " from the filter file\n")); printf(_(" --if-exists use IF EXISTS when dropping objects\n")); printf(_(" --include-foreign-data=PATTERN\n" " include data of foreign tables on foreign\n" @@ -18470,3 +18480,160 @@ appendReloptionsArrayAH(PQExpBuffer buffer, 
const char *reloptions, if (!res) pg_log_warning("could not parse reloptions array"); } + +/* + * Print error message and exit. + */ +static void +exit_invalid_filter_format(FILE *fp, char *filename, char *message, char *line, int lineno) +{ + pg_log_error("invalid format of filter file \"%s\": %s", + filename, + message); + + fprintf(stderr, "%d: %s\n", lineno, line); + + if (fp != stdin) + fclose(fp); + + exit_nicely(-1); +} + +/* + * Read dumped object specification from file + */ +static void +read_patterns_from_file(char *filename, DumpOptions *dopt) +{ + FILE *fp; + int lineno = 0; + StringInfoData line; + + /* use "-" as symbol for stdin */ + if (strcmp(filename, "-") != 0) + { + fp = fopen(filename, "r"); + if (!fp) + fatal("could not open the input file \"%s\": %m", + filename); + } + else + fp = stdin; + + initStringInfo(&line); + + while (pg_ge
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On 2020-09-24 18:21, Heikki Linnakangas wrote:
> That would technically work, but wouldn't it make the product as a whole not
> FIPS compliant? I'm not a FIPS lawyer, but as I understand it the point of
> FIPS is that all the crypto code is encapsulated in a certified module.
> Having your own SHA-256 implementation would defeat that.

Depends on what one considers to be covered by FIPS. The entire rest of SCRAM is custom code, so running it on top of the world's greatest SHA-256 implementation isn't going to make the end product any more trustworthy.

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Custom options for building extensions with --with--llvm
Hi,

On 2020-09-24 19:15:22 +0300, Konstantin Knizhnik wrote:
> In my extension I want to define some custom options for compiler.
> I do it in the following way:
>
> ifdef USE_DISK
> CUSTOM_COPT += -DIMCS_DISK_SUPPORT
> endif

Why aren't you adding it to PG_CPPFLAGS? That should work, and I think that's what several contrib modules are using. My understanding of CUSTOM_COPT is that it's for use in Makefile.custom, not for contrib modules etc?

> I wonder is there any way to pass custom compile options to clang?
> Thanks in advance,

It probably wouldn't hurt to add COPT to the respective rules in Makefile.global.in in some way. Shouldn't be too hard to do, if you want to write a patch. Probably just apply it to BITCODE_CFLAGS as well.

Greetings,

Andres Freund
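Spelled out against the Makefile fragment quoted above, the suggestion amounts to the following sketch (not tested here; Konstantin confirms downthread that PG_CPPFLAGS does the trick):

# Use PG_CPPFLAGS so the flag reaches both the regular object builds and
# the clang -emit-llvm bitcode builds done under --with-llvm.
ifdef USE_DISK
PG_CPPFLAGS += -DIMCS_DISK_SUPPORT
endif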
Re: proposal: schema variables
čt 24. 9. 2020 v 5:58 odesílatel Pavel Stehule napsal: > > > čt 24. 9. 2020 v 5:56 odesílatel Michael Paquier > napsal: > >> On Sat, Jul 11, 2020 at 06:44:24AM +0200, Pavel Stehule wrote: >> > rebase >> >> Per the CF bot, this needs an extra rebase as it does not apply >> anymore. This has not attracted much the attention of committers as >> well. >> > > I'll fix it today > fixed patch attached Regards Pavel > -- >> Michael >> > schema-variables-20200924.patch.gz Description: application/gzip
[patch] Concurrent table reindex per-index progress reporting
Hi, While working on a PG12-instance I noticed that the progress reporting of concurrent index creation for non-index relations fails to update the index/relation OIDs that it's currently working on in the pg_stat_progress_create_index view after creating the indexes. Please find attached a patch which properly sets these values at the appropriate places. Any thoughts? Matthias van de Meent From f41da096b1f36118917fe345e2a6fc89530a40c9 Mon Sep 17 00:00:00 2001 From: Matthias van de Meent Date: Thu, 24 Sep 2020 20:41:10 +0200 Subject: [PATCH] Report the active index for reindex table concurrently The pgstat_progress reporting for the multi-index path of reindexing failed to set the index and relation OIDs correctly for various stages, which resulted in a pg_stat_progress_create_index view that did not accurately represent the index that the progress was being reported on. This commit tags the correct index and relation for each of the concurrent index creation stages. Signed-off-by: Matthias van de Meent --- src/backend/commands/indexcmds.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index f1b5f87e6a..b2f04012a4 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -3431,6 +3431,15 @@ ReindexRelationConcurrently(Oid relationOid, int options) heapId = indexRel->rd_index->indrelid; index_close(indexRel, NoLock); + /* + * Configure progress reporting to report for this index. + * While we're at it, reset the phase as it is cleared by start_command. + */ + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId); + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId); + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE, + PROGRESS_CREATEIDX_PHASE_WAIT_1); + /* Perform concurrent build of new index */ index_concurrently_build(heapId, newIndexId); @@ -3477,6 +3486,15 @@ ReindexRelationConcurrently(Oid relationOid, int options) snapshot = RegisterSnapshot(GetTransactionSnapshot()); PushActiveSnapshot(snapshot); + /* + * Configure progress reporting to report for this index. + * While we're at it, reset the phase as it is cleared by start_command. + */ + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId); + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId); + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE, + PROGRESS_CREATEIDX_PHASE_WAIT_2); + validate_index(heapId, newIndexId, snapshot); /* -- 2.20.1
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On Thu, Sep 24, 2020 at 1:57 PM Peter Eisentraut wrote: > Depends on what one considers to be covered by FIPS. The entire rest of > SCRAM is custom code, so running it on top of the world's greatest > SHA-256 implementation isn't going to make the end product any more > trustworthy. I mean, the issue here, as is so often the case, is not what is actually more secure, but what meets the terms of some security standard. At least in the US, FIPS 140-2 compliance is a reasonably common need, so if we can make it easier for people who have that need to be compliant, they are more likely to use PostgreSQL, which seems like something that we should want. Our opinions about that standard do not matter to the users who are legally required to comply with it; the opinions of their lawyers and auditors do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Compatible defaults for LEAD/LAG
út 22. 9. 2020 v 2:33 odesílatel Tom Lane napsal: > Pavel Stehule writes: > > I see few possibilities how to finish and close this issue: > > 1. use anycompatible type and add note to documentation so expression of > > optional argument can change a result type, and so this is Postgres > > specific - other databases and ANSI SQL disallow this. > > It is the most simple solution, and it is good enough for this case, that > > is not extra important. > > 2. choose the correct type somewhere inside the parser - for these two > > functions. > > 3. introduce and implement some generic solution for this case - it can > be > > implemented on C level via some function helper or on syntax level > >CREATE OR REPLACE FUNCTION lag(a anyelement, offset int, default > defval > > a%type) ... > > "defval argname%type" means for caller's expression "CAST(defval AS > > typeof(argname))" > > I continue to feel that the spec's definition of this is not so > obviously right that we should jump through hoops to duplicate it. > In fact, I don't even agree that we need a disclaimer in the docs > saying that it's not quite the same. Only the most nitpicky > language lawyers would ever even notice. > > In hopes of moving this along a bit, I experimented with converting > the other functions I listed to use anycompatible. I soon found that > touching any functions/operators that are listed in operator classes > would open a can of worms far larger than I'd previously supposed. > To maintain consistency, we'd have to propagate the datatype changes > to *every* function/operator associated with those opfamilies --- many > of which only have one any-foo input and thus aren't on my original > list. (An example here is that extending btree array_ops will require > changing the max(anyarray) and min(anyarray) aggregates too.) We'd > then end up with a situation that would be rather confusing from a > user's standpoint, in that it'd be quite un-obvious why some array > functions use anyarray while other ones use anycompatiblearray. > > So I backed off to just changing the functions/operators that have > no opclass connections, such as array_cat. Even that has some > downsides --- for example, in the attached patch, it's necessary > to change some polymorphism.sql examples that explicitly reference > array_cat(anyarray). I wonder whether this change would break a > significant number of user-defined aggregates or operators. > > (Note that I think we'd have to resist the temptation to fix that > by letting CREATE AGGREGATE et al accept support functions that > take anyarray/anycompatiblearray (etc) interchangeably. A lot of > the security analysis that went into CVE-2020-14350 depended on > the assumption that utility commands only do exact lookups of > support functions. If we tried to be lax about this, we'd > re-introduce the possibility for hostile capture of function > references in extension scripts.) > > Another interesting issue, not seen in the attached but which > came up while I was experimenting with the more aggressive patch, > was this failure in the polymorphism test: > > select max(histogram_bounds) from pg_stats where tablename = 'pg_am'; > -ERROR: cannot compare arrays of different element types > +ERROR: function max(anyarray) does not exist > > That's because we don't accept pg_stats.histogram_bounds (of > declared type anyarray) as a valid input for anycompatiblearray. > I wonder if that isn't a bug we need to fix in the anycompatible > patch, independently of anything else. 
> We'd not be able to deduce
> an actual element type from such an input, but we already cannot
> do so when we match it to an anyarray parameter.
>
> Anyway, attached find
>
> 0001 - updates Vik's original patch to HEAD and tweaks the
> grammar in the docs a bit.
>
> 0002 - add-on patch to convert array_append, array_prepend,
> array_cat, array_position, array_positions, array_remove,
> array_replace, and width_bucket to use anycompatiblearray.
>
> I think 0001 is committable, but 0002 is just WIP since
> I didn't touch the docs. I'm slightly discouraged about
> whether 0002 is worth proceeding with. Any thoughts?

I think 0002 makes sense - more than the docs, I miss related regression tests, but that is partially covered by the anycompatible tests.

Anyway, I tested both patches and there is no problem with compilation, building the docs, and make check-world passes.

I'll mark this patch as ready for committer.

Best regards

Pavel

> regards, tom lane
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
> On 24 Sep 2020, at 21:22, Robert Haas wrote: > > On Thu, Sep 24, 2020 at 1:57 PM Peter Eisentraut > wrote: >> Depends on what one considers to be covered by FIPS. The entire rest of >> SCRAM is custom code, so running it on top of the world's greatest >> SHA-256 implementation isn't going to make the end product any more >> trustworthy. > > I mean, the issue here, as is so often the case, is not what is > actually more secure, but what meets the terms of some security > standard. Correct, IIUC in order to be FIPS compliant all cryptographic modules used must be FIPS certified. > At least in the US, FIPS 140-2 compliance is a reasonably > common need, so if we can make it easier for people who have that need > to be compliant, they are more likely to use PostgreSQL, which seems > like something that we should want. The proposed patch makes SCRAM+FIPS work for 14, question is if we need/want to try and address v10-13. cheers ./daniel
Re: fixing old_snapshot_threshold's time->xid mapping
On Wed, Sep 23, 2020 at 9:16 PM Thomas Munro wrote: > LGTM. Committed. Thomas, with respect to your part of this patch set, I wonder if we can make the functions that you're using to write tests safe enough that we could add them to contrib/old_snapshot and let users run them if they want. As you have them, they are hedged around with vague and scary warnings, but is that really justified? And if so, can it be fixed? It would be nicer not to end up with two loadable modules here, and maybe the right sorts of functions could even have some practical use. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Custom options for building extensions with --with--llvm
On 24.09.2020 21:37, Andres Freund wrote: Hi, On 2020-09-24 19:15:22 +0300, Konstantin Knizhnik wrote: In my extension I want to define some custom options for compiler. I do it in the following way: ifdef USE_DISK CUSTOM_COPT += -DIMCS_DISK_SUPPORT endif Why aren't you adding it to PG_CPPFLAGS? That should work, and I think that's what several contrib modules are using. Thank you. PG_CPPFLAGS works correctly.
Re: Inconsistent Japanese name order in v13 contributors list
On Mon, Sep 21, 2020 at 10:31:31AM -0300, Alvaro Herrera wrote: > On 2020-Sep-18, Bruce Momjian wrote: > > > This thread from 2015 is the most comprehensive discussion I remember of > > Japanese name ordering, including a suggestion to use small caps: > > > > > > https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2 > > > > I have been following this guidance ever since. > > Right. > > About smallcaps, we didn't do it then because there was no way known to > us to use them in our then-current toolchain. But we've changed now to > XML and apparently it is possible to use them -- we could try something > like and define a CSS rule like > > .caps_lastname {font-variant: small-caps;} Yes, that's what I use for CSS. > (Apparently you also need to set 'emphasis.propagates.style' to 1 in the > stylesheet). This does it for HTML. You also need some FO trick to > cover the PDF (probably something like what a042750646db did to change > catalog_table_entry formatting.) Yeah, we have PDF output to worry about too. It is easy in LaTeX. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: [PATCH] Automatic HASH and LIST partition creation
On 24.09.2020 06:27, Michael Paquier wrote: On Mon, Sep 14, 2020 at 02:38:56PM +0300, Anastasia Lubennikova wrote: Fixed. This was also caught by cfbot. This version should pass it clean. Please note that regression tests are failing, because of 6b2c4e59. -- Michael Thank you. Updated patch is attached. Open issues for review: - new syntax; - generation of partition names; - overall patch review and testing, especially with complex partitioning clauses. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company commit faa5805b839effd9d8220eff787fb0995276c370 Author: anastasia Date: Mon Sep 14 11:34:42 2020 +0300 Auto generated HASH and LIST partitions. New syntax: CREATE TABLE tbl_hash (i int) PARTITION BY HASH (i) CONFIGURATION (modulus 3); CREATE TABLE tbl_list (i int) PARTITION BY LIST (i) CONFIGURATION (values in (1, 2), (3, 4) DEFAULT PARTITION tbl_default); With documentation draft. diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 087cad184c..ff9a7eda09 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -29,6 +29,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI ] ) [ INHERITS ( parent_table [, ... ] ) ] [ PARTITION BY { RANGE | LIST | HASH } ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ] +[ CONFIGURATION ( partition_bound_auto_spec ) ] [ USING method ] [ WITH ( storage_parameter [= value] [, ... ] ) | WITHOUT OIDS ] [ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ] @@ -41,6 +42,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI [, ... ] ) ] [ PARTITION BY { RANGE | LIST | HASH } ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ] +[ CONFIGURATION ( partition_bound_auto_spec ) ] [ USING method ] [ WITH ( storage_parameter [= value] [, ... ] ) | WITHOUT OIDS ] [ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ] @@ -53,6 +55,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI [, ... ] ) ] { FOR VALUES partition_bound_spec | DEFAULT } [ PARTITION BY { RANGE | LIST | HASH } ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ] +[ CONFIGURATION ( partition_bound_auto_spec ) ] [ USING method ] [ WITH ( storage_parameter [= value] [, ... ] ) | WITHOUT OIDS ] [ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ] @@ -96,6 +99,11 @@ FROM ( { partition_bound_expr | MIN TO ( { partition_bound_expr | MINVALUE | MAXVALUE } [, ...] ) | WITH ( MODULUS numeric_literal, REMAINDER numeric_literal ) +and partition_bound_auto_spec is: + +VALUES IN ( partition_bound_expr [, ...] ), [( partition_bound_expr [, ...] )] [, ...] [DEFAULT PARTITION defailt_part_name] +MODULUS numeric_literal + index_parameters in UNIQUE, PRIMARY KEY, and EXCLUDE constraints are: [ INCLUDE ( column_name [, ... ] ) ] @@ -383,6 +391,11 @@ WITH ( MODULUS numeric_literal, REM however, you can define these constraints on individual partitions. + + Range and list partitioning also support automatic creation of partitions + with an optional CONFIGURATION clause. + + See for more discussion on table partitioning. @@ -391,6 +404,38 @@ WITH ( MODULUS numeric_literal, REM + +CONFIGURATION ( partition_bound_auto_spec ) ] + + + The optional CONFIGURATION clause used together + with PARTITION BY specifies a rule of generating bounds + for partitions of the partitioned table. 
All partitions are created automatically + along with the parent table. + + Any indexes, constraints and user-defined row-level triggers that exist + in the parent table are cloned on the new partitions. + + + + The partition_bound_auto_spec + must correspond to the partitioning method and partition key of the + parent table, and must not overlap with any existing partition of that + parent. The form with VALUES IN is used for list partitioning + and the form with MODULUS is used for hash partitioning. + List partitioning can also provide a default partition using + DEFAULT PARTITION. + + + + Automatic range partitioning is not supported yet. + + + + + + + PARTITION OF parent_table { FOR VALUES partition_bound_spec | DEFAULT } diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 0409a40b82..6893fa5495 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4628,6 +4628,7 @@ _copyPartitionSpec(const PartitionSpec *from) COPY_STRING_FIELD(strategy); COPY_NODE_FIELD(partParams); + COPY_NODE_FIELD(autopart); COPY_LOCATION_FIELD(location); return newnode; @@ -4650,6 +465
Re: WIP: BRIN multi-range indexes
On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra wrote: > > On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote: > >While playing around with the numbers I had an epiphany: At the > >defaults, the filter already takes up ~4.3kB, over half the page. > >There is no room for another tuple, so if we're only indexing one > >column, we might as well take up the whole page. > > Hmm, yeah. I may be wrong but IIRC indexes don't support external > storage but compression is still allowed. So even if those defaults are > a bit higher than needed that should make the bloom filters a bit more > compressible, and thus fit multiple BRIN tuples on a single page. > Not sure about how much we want to rely on these optimizations, though, > considering multi-column indexes kinda break this. Yeah. Okay, then it sounds like we should go in the other direction, as the block comment at the top of brin_bloom.c implies. Indexes with multiple bloom-indexed columns already don't fit in one 8kB page, so I think every documented example should have a much lower pages_per_range. Using 32 pages per range with max tuples gives n = 928. With default p, that's about 1.1 kB per brin tuple, so one brin page can index 224 pages, much more than with the default 128. Hmm, how ugly would it be to change the default range size depending on the opclass? If indexes don't support external storage, that sounds like a pain to add. Also, with very small fpr, you can easily get into many megabytes of filter space, which kind of defeats the purpose of brin in the first place. There is already this item from the brin readme: * Different-size page ranges? In the current design, each "index entry" in a BRIN index covers the same number of pages. There's no hard reason for this; it might make sense to allow the index to self-tune so that some index entries cover smaller page ranges, if this allows the summary values to be more compact. This would incur larger BRIN overhead for the index itself, but might allow better pruning of page ranges during scan. In the limit of one index tuple per page, the index itself would occupy too much space, even though we would be able to skip reading the most heap pages, because the summary values are tight; in the opposite limit of a single tuple that summarizes the whole table, we wouldn't be able to prune anything even though the index is very small. This can probably be made to work by using the range map as an index in itself. This sounds like a lot of work, but would be robust. Anyway, given that this is a general problem and not specific to the prime partition algorithm, I'll leave that out of the attached patch, named as a .txt to avoid confusing the cfbot. > >We could also generate the primes via a sieve instead, which is really > >fast (and more code). That would be more robust, but that would require > >the filter to store the actual primes used, so 20 more bytes at max k = > >10. We could hard-code space for that, or to keep from hard-coding > >maximum k and thus lowest possible false positive rate, we'd need more > >bookkeeping. > > > > I don't think the efficiency of this code matters too much - it's only > used once when creating the index, so the simpler the better. Certainly > for now, while testing the partitioning approach. To check my understanding, isn't bloom_init() called for every tuple? Agreed on simplicity so done this way. > >So, with the partition approach, we'd likely have to set in stone > >either max nbits, or min target false positive rate. 
The latter option
>seems more principled, not only for the block size, but also since the
>target fp rate is already fixed by the reloption, and as I've
>demonstrated, we can often go above and beyond the reloption even
>without changing k.
>
> That seems like a rather annoying limitation, TBH.

I don't think the latter is that bad. I've capped k at 10 for demonstration's sake
(928 is from using 32 pages per range):

 n    k    m      p
 928  7    8895   0.01
 928  10   13343  0.001   (lowest p supported in patch set)
 928  13   17790  0.0001
 928  10   18280  0.0001  (still works with lower k, needs higher m)
 928  10   17790  0.00012 (even keeping m from row #3, capping k doesn't degrade p much)

Also, k seems pretty robust against small changes as long as m isn't artificially constrained and as long as p is small. So I *think* it's okay to cap k at 10 or 12, and not bother with adjusting m, which worsens space issues. As I found before, lowering k raises target fpr, but seems more robust to overshooting ndistinct. In any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to make the array of primes variable length, which could be done by putting the filter offset calculation into a macro. But having two variable-length arrays seems messy.

> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
>bits, but we can't set/read off the end
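For reference, the m and k values in that table follow from the textbook Bloom filter sizing formulas; a quick standalone sketch of the arithmetic (not code from the patch):

/*
 * Sketch (not from the patch): the table above follows the textbook Bloom
 * filter sizing formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2 ~ -log2(p).
 * For n = 928, p = 0.01 this gives m ~ 8895 and k ~ 7, matching the first row.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      n = 928.0;
    double      p = 0.01;
    double      m = -n * log(p) / (log(2) * log(2));
    double      k = m / n * log(2);

    printf("m = %.0f, k = %.1f\n", m, k);   /* prints m = 8895, k = 6.6 -> round up to 7 */
    return 0;
}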
Re: Add session statistics to pg_stat_database
Hello Laurenz, Thanks for submitting this! Please find my feedback below. * Are we trying to capture ONLY client initiated disconnects in m_aborted (we are not handling other disconnects by not accounting for EOF..like if psql was killed)? If yes, why? * pgstat_send_connstats(): How about renaming the "force" argument to "disconnected"? * > static TimestampTz pgStatActiveStart = DT_NOBEGIN; > static PgStat_Counter pgStatActiveTime = 0; > static TimestampTz pgStatTransactionIdleStart = DT_NOBEGIN; > static PgStat_Counter pgStatTransactionIdleTime = 0; > static bool pgStatSessionReported = false; > bool pgStatSessionDisconnected = false; I think we can house all of these globals inside PgBackendStatus and can follow the protocol for reading/writing fields in PgBackendStatus. Refer: PGSTAT_{BEGIN|END}_WRITE_ACTIVITY Also, some of these fields are not required: I don't think we need pgStatActiveStart and pgStatTransactionIdleStart - instead of these two we could use PgBackendStatus.st_state_start_timestamp which marks the beginning TS of the backend's current state (st_state). We can look at that field along with the current and to-be-transitioned-to states inside pgstat_report_activity() when there is a transition away from STATE_RUNNING, STATE_IDLEINTRANSACTION or STATE_IDLEINTRANSACTION_ABORTED, in order to update pgStatActiveTime and pgStatTransactionIdleTime. We would also need to update those counters on disconnect/PGSTAT_STAT_INTERVAL timeout if the backend's current state was STATE_RUNNING, STATE_IDLEINTRANSACTION or STATE_IDLEINTRANSACTION_ABORTED (in pgstat_send_connstats()) pgStatSessionDisconnected is not required as it can be determined if a session has been disconnected by looking at the force argument to pgstat_report_stat() [unless we would want to distinguish between client-initiated disconnects, which I am not sure why, as I have brought up above]. pgStatSessionReported is not required. We can glean this information by checking if the function local static last_report in pgstat_report_stat() is 0 and passing this on as another param "first_report" to pgstat_send_connstats(). * PGSTAT_FILE_FORMAT_ID needs to be updated when a stats collector data structure changes and we had a change in PgStat_StatDBEntry. * We can directly use PgBackendStatus.st_proc_start_timestamp for calculating m_session_time. We can also choose to report session uptime even when the report is for the not-disconnect case (PGSTAT_STAT_INTERVAL elapsed). No reason why not. Then we would need to pass in the value of last_report to pgstat_send_connstats() -> calculate m_session_time to be number of time units from PgBackendStatus.st_proc_start_timestamp for the first report and then number of time units from the last_report for all subsequent reports. * We would need to bump the catalog version since we have made changes to system views. Refer: #define CATALOG_VERSION_NO Regards, Soumyadeep (VMware)
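For reference, the write side of the PgBackendStatus protocol mentioned above looks roughly like the following sketch (st_session_active_time is a hypothetical field used only for illustration, not an existing member):

/*
 * Sketch of the st_changecount protocol; the field name below is
 * hypothetical and only shows where a counter moved into PgBackendStatus
 * would be updated.
 */
volatile PgBackendStatus *beentry = MyBEEntry;

PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
beentry->st_session_active_time += active_time;    /* hypothetical field */
PGSTAT_END_WRITE_ACTIVITY(beentry);

Readers then retry until they observe an unchanged, even st_changecount, as pgstat_read_current_status() already does.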
RE: Transactions involving multiple postgres foreign servers, take 2
From: Ashutosh Bapat > The way I am looking at is to put the parallelism in the resolution > worker and not in the FDW. If we use multiple resolution workers, they > can fire commit/abort on multiple foreign servers at a time. From a single session's view, yes. However, the requests from multiple sessions are processed one at a time within each resolver, because the resolver has to call the synchronous FDW prepare/commit routines and wait for the response from the remote server. That's too limiting. > But if we want parallelism within a single resolution worker, we will > need a separate FDW APIs for firing asynchronous commit/abort prepared > txn and fetching their results resp. But given the variety of FDWs, > not all of them will support asynchronous API, so we have to support > synchronous API anyway, which is what can be targeted in the first > version. I agree in that most FDWs will be unlikely to have asynchronous prepare/commit functions, as demonstrated by the fact that even Oracle and Db2 don't implement XA asynchronous APIs. That's one problem of using FDW for Postgres scale-out. When we enhance FDW, we have to take care of other DBMSs to make the FDW interface practical. OTOH, we want to make maximum use of Postgres features, such as libpq asynchronous API, to make Postgres scale-out as performant as possible. But the scale-out design is bound by the FDW interface. I don't feel accepting such less performant design is an attitude of this community, as people here are strict against even 1 or 2 percent performance drop. > Thinking more about it, the core may support an API which accepts a > list of prepared transactions, their foreign servers and user mappings > and let FDW resolve all those either in parallel or one by one. So > parallelism is responsibility of FDW and not the core. But then we > loose parallelism across FDWs, which may not be a common case. Hmm, I understand asynchronous FDW relation scan is being developed now, in the form of cooperation between the FDW and the executor. If we make just the FDW responsible for prepare/commit parallelism, the design becomes asymmetric. As you say, I'm not sure if the parallelism is wanted among different types, say, Postgres and Oracle. In fact, major DBMSs don't implement XA asynchronous API. But such lack of parallelism may be one cause of the bad reputation that 2PC (of XA) is slow. > Given the complications around this, I think we should go ahead > supporting synchronous API first and in second version introduce > optional asynchronous API. How about the following? * Add synchronous and asynchronous versions of prepare/commit/abort routines and a routine to wait for completion of asynchronous execution in FdwRoutine. They are optional. postgres_fdw can implement the asynchronous routines using libpq asynchronous functions. Other DBMSs can implement XA asynchronous API for them in theory. 
* The client backend uses asynchronous FDW routines if available:

/* Issue asynchronous prepare | commit | rollback to FDWs that support it */
foreach (per each foreign server used in the transaction)
{
    if (fdwroutine->{prepare | commit | rollback}_async_func)
        fdwroutine->{prepare | commit | rollback}_async_func(...);
}

/* Wait for completion of asynchronous prepare | commit | rollback */
foreach (per each foreign server used in the transaction)
{
    if (fdwroutine->{prepare | commit | rollback}_async_func)
        ret = fdwroutine->wait_for_completion(...);
}

/* Issue synchronous prepare | commit | rollback to FDWs that don't support it */
foreach (per each foreign server used in the transaction)
{
    if (fdwroutine->{prepare | commit | rollback}_async_func == NULL)
        ret = fdwroutine->{prepare | commit | rollback}_func(...);
}

* The client backend asks the resolver to commit or roll back the remote transaction only when the remote transaction fails (due to a failure of the remote server or the network). That is, the resolver is not involved during normal operation.

This will not be complex, and can be included in the first version, if we really want to use FDW for Postgres scale-out.

Regards
Takayuki Tsunakawa
Re: Fix inconsistency in jsonpath .datetime()
On Sun, Sep 20, 2020 at 2:23 AM Nikita Glukhov wrote: > The beta-tester of PG13 reported a inconsistency in our current jsonpath > datetime() method implementation. By the standard format strings in > datetime() > allows only characters "-./,':; " to be used as separators in format strings. > But our to_json[b]() serializes timestamps into XSD format with "T" separator > between date and time, so the serialized data cannot be parsed back by > jsonpath > and it looks inconsistent: > > =# SELECT to_jsonb('2020-09-19 23:45:06'::timestamp); >to_jsonb > --- > "2020-09-19T23:45:06" > (1 row) > > =# SELECT jsonb_path_query(to_jsonb('2020-09-19 23:45:06'::timestamp), >'$.datetime()'); > ERROR: datetime format is not recognized: "2020-09-19T23:45:06" > HINT: Use a datetime template argument to specify the input data format. > > =# SELECT jsonb_path_query(to_jsonb('2020-09-19 23:45:06'::timestamp), >'$.datetime("-mm-dd HH:MI:SS")'); > ERROR: unmatched format separator " " > > =# SELECT jsonb_path_query(to_jsonb('2020-09-19 23:45:06'::timestamp), >'$.datetime("-mm-dd\"T\"HH:MI:SS")'); > ERROR: invalid datetime format separator: """ > > > > Excerpt from SQL-2916 standard (5.3 , page 197): > > ::= > > > ::= >[ ] > > ::= > > > > > Attached patch #2 tries to fix this problem by enabling escaped characters in > standard mode. I'm not sure is it better to enable the whole set of text > separators or only the problematic "T" character, allow only quoted text > separators or not. > > Patch #1 is a more simple fix (so it comes first) removing excess space > between > time and timezone fields in built-in format strings used for datetime type > recognition. (It seemed to work as expected with extra space in earlier > version of the patch in which standard mode has not yet been introduced). Jsonpath .datetime() was developed as an implementation of corresponding parts of SQL Standard. Patch #1 fixes inconsistency between our implementation and Standard. I'm going to backpatch it to v13. There is also inconsistency among to_json[b]() and jsonpath .datetime(). In this case, I wouldn't say the problem is on the jsonpath side. to_json[b]() makes special exceptions for datetime types and converts them not using standard output function, but using javascript-compatible format (see f30015b6d7). Luckily, our input function for timestamp[tz] datatypes doesn't use strict format parsing, so it can work with output of to_json[b](). But according to SQL Standard, jsonpath .datetime() implements strict format parsing, so it can't work with output of to_json[b](). So, I wouldn't say in this case it's an inconsistency in the jsonpath .datetime() method. But, given now it's not an appropriate time for redesigning to_json[b](), we should probably improve jsonpath .datetime() method to understand more formats. So, patch #2 is probably acceptable, and even might be backpatched v13. One thing I don't particularly like is "In standard mode format string characters are strictly matched or matched to spaces." Instead, I would like to just strictly match characters and just add more options to fmt_str[]. Other opinions? -- Regards, Alexander Korotkov
Re: history file on replica and double switchover
Hi, My understanding is that the "archiver" won't even start if "archive_mode = on" has been set on a "replica". Therefore, either (XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode != ARCHIVE_MODE_OFF) will produce the same result. Please see how the "archiver" is started in src/backend/postmaster/postmaster.c 5227 /* 5228 * Start the archiver if we're responsible for (re-)archiving received 5229 * files. 5230 */ 5231 Assert(PgArchPID == 0); 5232 if (XLogArchivingAlways()) 5233 PgArchPID = pgarch_start(); I did run the nice script "double_switchover.sh" using either "always" or "on" on patch v1 and v2. They all generate the same results below. In other words, whether history file (0003.history) is archived or not depends on "archive_mode" settings. echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf $ ls -l archive -rw--- 1 david david 16777216 Sep 24 14:40 00010002 ... ... -rw--- 1 david david 16777216 Sep 24 14:40 0002000A -rw--- 1 david david 41 Sep 24 14:40 0002.history -rw--- 1 david david 83 Sep 24 14:40 0003.history echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf $ ls -l archive -rw--- 1 david david 16777216 Sep 24 14:47 00010002 ... ... -rw--- 1 david david 16777216 Sep 24 14:47 0002000A -rw--- 1 david david 41 Sep 24 14:47 0002.history Personally, I prefer patch v2 since it allies to the statement here, https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY "If archive_mode is set to on, the archiver is not enabled during recovery or standby mode. If the standby server is promoted, it will start archiving after the promotion, but will not archive any WAL it did not generate itself." By the way, I think the last part of the sentence should be changed to something like below: "but will not archive any WAL, which was not generated by itself." Best regards, David On 2020-09-17 10:18 a.m., Fujii Masao wrote: On 2020/09/04 13:53, Fujii Masao wrote: On 2020/09/04 8:29, Anastasia Lubennikova wrote: On 27.08.2020 16:02, Grigory Smolkin wrote: Hello! I`ve noticed, that when running switchover replica to master and back to replica, new history file is streamed to replica, but not archived, which is not great, because it breaks PITR if archiving is running on replica. The fix looks trivial. Bash script to reproduce the problem and patch are attached. Thanks for the report. I agree that it looks like a bug. +1 + /* Mark history file as ready for archiving */ + if (XLogArchiveMode != ARCHIVE_MODE_OFF) + XLogArchiveNotify(fname); I agree that history file should be archived in the standby when archive_mode=always. But why do we need to do when archive_mode=on? I'm just concerned about the case where the primary and standby have the shared archive area, and archive_mode is on. So I updated the patch so that walreceiver marks the streamed history file as ready for archiving only when archive_mode=always. Patch attached. Thought? Regards, -- David Software Engineer Highgo Software Inc. (Canada) www.highgo.ca
Re: Handing off SLRU fsyncs to the checkpointer
On Wed, Sep 23, 2020 at 1:56 PM Thomas Munro wrote: > As for the ShutdownXXX() functions, I haven't yet come up with any > reason for this code to exist. Emboldened by a colleague's inability > to explain to me what that code is doing for us, here is a new version > that just rips it all out. Rebased. Tom, do you have any thoughts on ShutdownCLOG() etc? From 07196368e6cd36df5910b536e93570c9acff4ff2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 25 Sep 2020 11:16:03 +1200 Subject: [PATCH v7] Defer flushing of SLRU files. Previously, we called fsync() after writing out individual pg_xact, pg_multixact and pg_commit_ts pages due to cache pressure, leading to regular I/O stalls in user backends and recovery. Collapse requests for the same file into a single system call as part of the next checkpoint, as we already did for relation files, using the infrastructure developed by commit 3eb77eba. This causes a significant improvement to recovery performance. Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions, because they were redundant after the shutdown checkpoint that immediately precedes them. Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer that it applies to all the SLRU mini-buffer-pools as well as the main buffer pool. Also rearrange things so that data collected in CheckpointStats includes SLRU activity. Tested-by: Jakub Wartak Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com --- src/backend/access/transam/clog.c | 36 +++--- src/backend/access/transam/commit_ts.c | 32 +++-- src/backend/access/transam/multixact.c | 53 + src/backend/access/transam/slru.c | 154 + src/backend/access/transam/subtrans.c | 25 +--- src/backend/access/transam/xlog.c | 18 ++- src/backend/commands/async.c | 5 +- src/backend/storage/buffer/bufmgr.c| 7 -- src/backend/storage/lmgr/predicate.c | 8 +- src/backend/storage/sync/sync.c| 28 - src/include/access/clog.h | 3 + src/include/access/commit_ts.h | 3 + src/include/access/multixact.h | 4 + src/include/access/slru.h | 14 ++- src/include/storage/sync.h | 7 +- src/tools/pgindent/typedefs.list | 1 + 16 files changed, 233 insertions(+), 165 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 9e352d2658..c4c20f508b 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -42,6 +42,7 @@ #include "pg_trace.h" #include "pgstat.h" #include "storage/proc.h" +#include "storage/sync.h" /* * Defines for CLOG page sizes. A page is the same BLCKSZ as is used @@ -691,7 +692,8 @@ CLOGShmemInit(void) { XactCtl->PagePrecedes = CLOGPagePrecedes; SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, - XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER); + XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER, + SYNC_HANDLER_CLOG); } /* @@ -808,34 +810,15 @@ TrimCLOG(void) LWLockRelease(XactSLRULock); } -/* - * This must be called ONCE during postmaster or standalone-backend shutdown - */ -void -ShutdownCLOG(void) -{ - /* Flush dirty CLOG pages to disk */ - TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false); - SimpleLruFlush(XactCtl, false); - - /* - * fsync pg_xact to ensure that any files flushed previously are durably - * on disk. 
- */ - fsync_fname("pg_xact", true); - - TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false); -} - /* * Perform a checkpoint --- either during shutdown, or on-the-fly */ void CheckPointCLOG(void) { - /* Flush dirty CLOG pages to disk */ + /* Write dirty CLOG pages to disk */ TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true); - SimpleLruFlush(XactCtl, true); + SimpleLruWriteAll(XactCtl, true); TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true); } @@ -1026,3 +1009,12 @@ clog_redo(XLogReaderState *record) else elog(PANIC, "clog_redo: unknown op code %u", info); } + +/* + * Entrypoint for sync.c to sync clog files. + */ +int +clogsyncfiletag(const FileTag *ftag, char *path) +{ + return SlruSyncFileTag(XactCtl, ftag, path); +} diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index f6a7329ba3..af5d9baae5 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -555,7 +555,8 @@ CommitTsShmemInit(void) CommitTsCtl->PagePrecedes = CommitTsPagePrecedes; SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0, CommitTsSLRULock, "pg_commit_ts", - LWTRANCHE_COMMITTS_BUFFER); + LWTRANCHE_COMMITTS_BUFFER, + SYNC_HANDLER_COMMIT_TS); commitTsShared = ShmemInitStruct("CommitTs shared", sizeof(CommitTimestampShared), @@ -798,30 +799,14 @@ DeactivateCommitTs(void) LWLockRelease(CommitTsSLRULock); } -/* - * This must be called ONCE during postmaster or st
Re: WIP: BRIN multi-range indexes
On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote: On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra wrote: On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote: >While playing around with the numbers I had an epiphany: At the >defaults, the filter already takes up ~4.3kB, over half the page. >There is no room for another tuple, so if we're only indexing one >column, we might as well take up the whole page. Hmm, yeah. I may be wrong but IIRC indexes don't support external storage but compression is still allowed. So even if those defaults are a bit higher than needed that should make the bloom filters a bit more compressible, and thus fit multiple BRIN tuples on a single page. Not sure about how much we want to rely on these optimizations, though, considering multi-column indexes kinda break this. Yeah. Okay, then it sounds like we should go in the other direction, as the block comment at the top of brin_bloom.c implies. Indexes with multiple bloom-indexed columns already don't fit in one 8kB page, so I think every documented example should have a much lower pages_per_range. Using 32 pages per range with max tuples gives n = 928. With default p, that's about 1.1 kB per brin tuple, so one brin page can index 224 pages, much more than with the default 128. Hmm, how ugly would it be to change the default range size depending on the opclass? Not sure. What would happen for multi-column BRIN indexes with different opclasses? If indexes don't support external storage, that sounds like a pain to add. Also, with very small fpr, you can easily get into many megabytes of filter space, which kind of defeats the purpose of brin in the first place. True. There is already this item from the brin readme: * Different-size page ranges? In the current design, each "index entry" in a BRIN index covers the same number of pages. There's no hard reason for this; it might make sense to allow the index to self-tune so that some index entries cover smaller page ranges, if this allows the summary values to be more compact. This would incur larger BRIN overhead for the index itself, but might allow better pruning of page ranges during scan. In the limit of one index tuple per page, the index itself would occupy too much space, even though we would be able to skip reading the most heap pages, because the summary values are tight; in the opposite limit of a single tuple that summarizes the whole table, we wouldn't be able to prune anything even though the index is very small. This can probably be made to work by using the range map as an index in itself. This sounds like a lot of work, but would be robust. Yeah. I think it's a fairly independent / orthogonal project. Anyway, given that this is a general problem and not specific to the prime partition algorithm, I'll leave that out of the attached patch, named as a .txt to avoid confusing the cfbot. >We could also generate the primes via a sieve instead, which is really >fast (and more code). That would be more robust, but that would require >the filter to store the actual primes used, so 20 more bytes at max k = >10. We could hard-code space for that, or to keep from hard-coding >maximum k and thus lowest possible false positive rate, we'd need more >bookkeeping. > I don't think the efficiency of this code matters too much - it's only used once when creating the index, so the simpler the better. Certainly for now, while testing the partitioning approach. To check my understanding, isn't bloom_init() called for every tuple? 
Agreed on simplicity so done this way. No, it's only called for the first non-NULL value in the page range (unless I made a boo boo when writing that code). >So, with the partition approach, we'd likely have to set in stone >either max nbits, or min target false positive rate. The latter option >seems more principled, not only for the block size, but also since the >target fp rate is already fixed by the reloption, and as I've >demonstrated, we can often go above and beyond the reloption even >without changing k. > That seems like a rather annoying limitation, TBH. I don't think the latter is that bad. I've capped k at 10 for demonstration's sake.: (928 is from using 32 pages per range) nk m p 928 7 8895 0.01 928 10 13343 0.001 (lowest p supported in patch set) 928 13 17790 0.0001 928 10 18280 0.0001 (still works with lower k, needs higher m) 928 10 17790 0.00012 (even keeping m from row #3, capping k doesn't degrade p much) Also, k seems pretty robust against small changes as long as m isn't artificially constrained and as long as p is small. So I *think* it's okay to cap k at 10 or 12, and not bother with adjusting m, which worsens space issues. As I found before, lowering k raises target fpr, but seems more robust to overshooting ndistinct. In any case, we only need k * 2 bytes to store the partition lengths. The only way I see to avoid any limitation is to m
Re: Refactor pg_rewind code and make it work against a standby
On Thu, Sep 24, 2020 at 10:27 AM Heikki Linnakangas wrote: > >> /* > >> * If this is a relation file, copy the modified blocks. > >> * > >> * This is in addition to any other changes. > >> */ > >> iter = datapagemap_iterate(&entry->target_modified_pages); > >> while (datapagemap_next(iter, &blkno)) > >> { > >> offset = blkno * BLCKSZ; > >> > >> source->queue_fetch_range(source, entry->path, offset, BLCKSZ); > >> } > >> pg_free(iter); > > > > Can we put this hunk into a static function overwrite_pages()? > > Meh, it's only about 10 lines of code, and one caller. Fair. > > > 7. Please address the FIXME for the symlink case: > > /* FIXME: Check if it points to the same target? */ > > It's not a new issue. Would be nice to fix, of course. I'm not sure what > the right thing to do would be. If you have e.g. replaced > postgresql.conf with a symlink that points outside the data directory, > would it be appropriate to overwrite it? Or perhaps we should throw an > error? We also throw an error if a file is a symlink in the source but a > regular file in the target, or vice versa. > Hmm, I can imagine a use case for 2 different symlink targets on the source and target clusters. For example the primary's pg_wal directory can have a different symlink target as compared to a standby's (different mount points on the same network maybe?). An end user might not desire pg_rewind meddling with that setup or may desire pg_rewind to treat the source as a source-of-truth with respect to this as well and would want pg_rewind to overwrite the target's symlink. Maybe doing a check and emitting a warning with hint/detail is prudent here while taking no action. > > 8. > > > > * it anyway. But there's no harm in copying it now.) > > > > and > > > > * copy them here. But we don't know which scenario we're > > * dealing with, and there's no harm in copying the missing > > * blocks now, so do it now. > > > > Could you add a line or two explaining why there is "no harm" in these > > two hunks above? > > The previous sentences explain that there's a WAL record covering them. > So they will be overwritten by WAL replay anyway. Does it need more > explanation? Yeah you are right, that is reason enough. > > 14. queue_overwrite_range(), finish_overwrite() instead of > > queue_fetch_range(), finish_fetch()? Similarly update\ > > *_fetch_file_range() and *_finish_fetch() > > > > > > 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c > > Good idea! And fetch.h -> rewind_source.h. +1. You might have missed the changes to rename "fetch" -> "overwrite" as was mentioned in 14. > > I also moved the code in execute_file_actions() function to pg_rewind.c, > into a new function: perform_rewind(). It felt a bit silly to have just > execute_file_actions() in a file of its own. perform_rewind() now does > all the modifications to the data directory, writing the backup file. > Except for writing the recovery config: that also needs to be when > there's no rewind to do, so it makes sense to keep it separate. What do > you think? I don't mind inlining that functionality into perform_rewind(). +1. Heads up: The function declaration for execute_file_actions() is still there in rewind_source.h. > > 16. > > > >> conn = PQconnectdb(connstr_source); > >> > >> if (PQstatus(conn) == CONNECTION_BAD) > >> pg_fatal("could not connect to server: %s", > >> PQerrorMessage(conn)); > >> > >> if (showprogress) > >> pg_log_info("connected to server"); > > > > > > The above hunk should be part of init_libpq_source(). 
Consequently, > > init_libpq_source() should take a connection string instead of a conn. > > The libpq connection is also needed by WriteRecoveryConfig(), that's why > it's not fully encapsulated in libpq_source. Ah. I find it pretty weird why we need to specify --source-server to have write-recovery-conf work. From the code, we only need the conn for calling PQserverVersion(), something we can easily get by slurping pg_controldata on the source side? Maybe we can remove this limitation? Regards, Soumyadeep (VMware)
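To make the check-and-warn idea for mismatched symlinks concrete, here is a minimal sketch in the style of pg_rewind's frontend code. The function name, its call site, and the idea of receiving the source's link target as a string are assumptions for illustration only, not part of the posted patch; it assumes postgres_fe.h, unistd.h and common/logging.h are available.

    /* Hypothetical helper: warn (but take no action) when symlink targets differ. */
    static void
    warn_on_symlink_mismatch(const char *path, const char *source_link_target)
    {
        char        link_target[MAXPGPATH];
        ssize_t     len;

        len = readlink(path, link_target, sizeof(link_target) - 1);
        if (len < 0)
        {
            pg_log_warning("could not read symbolic link \"%s\": %m", path);
            return;
        }
        link_target[len] = '\0';

        if (strcmp(link_target, source_link_target) != 0)
            pg_log_warning("symbolic link \"%s\" points to \"%s\" on the target but to \"%s\" on the source; leaving it unchanged",
                           path, link_target, source_link_target);
    }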
Re: history file on replica and double switchover
On 2020/09/25 8:15, David Zhang wrote: Hi, My understanding is that the "archiver" won't even start if "archive_mode = on" has been set on a "replica". Therefore, either (XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode != ARCHIVE_MODE_OFF) will produce the same result. Yes, the archiver isn't invoked in the standby when archive_mode=on. But, with the original patch, walreceiver creates .ready archive status file even when archive_mode=on and no archiver is running. This causes the history file to be archived after the standby is promoted and the archiver is invoked. With my patch, walreceiver creates .ready archive status for the history file only when archive_mode=always, like it does for the streamed WAL files. This is the difference between those two patches. To prevent the streamed timeline history file from being archived, IMO we should use the condition archive_mode==always in the walreceiver. Please see how the "archiver" is started in src/backend/postmaster/postmaster.c 5227 /* 5228 * Start the archiver if we're responsible for (re-)archiving received 5229 * files. 5230 */ 5231 Assert(PgArchPID == 0); 5232 if (XLogArchivingAlways()) 5233 PgArchPID = pgarch_start(); I did run the nice script "double_switchover.sh" using either "always" or "on" on patch v1 and v2. They all generate the same results below. In other words, whether history file (0003.history) is archived or not depends on "archive_mode" settings. echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf $ ls -l archive -rw--- 1 david david 16777216 Sep 24 14:40 00010002 ... ... -rw--- 1 david david 16777216 Sep 24 14:40 0002000A -rw--- 1 david david 41 Sep 24 14:40 0002.history -rw--- 1 david david 83 Sep 24 14:40 0003.history echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf $ ls -l archive -rw--- 1 david david 16777216 Sep 24 14:47 00010002 ... ... -rw--- 1 david david 16777216 Sep 24 14:47 0002000A -rw--- 1 david david 41 Sep 24 14:47 0002.history Personally, I prefer patch v2 since it allies to the statement here, https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY "If archive_mode is set to on, the archiver is not enabled during recovery or standby mode. If the standby server is promoted, it will start archiving after the promotion, but will not archive any WAL it did not generate itself." By the way, I think the last part of the sentence should be changed to something like below: "but will not archive any WAL, which was not generated by itself." I'm not sure if this change is an improvement or not. But if we apply the patch I proposed, maybe we should mention also history file here. For example, "but will not archive any WAL or timeline history files that it did not generate itself". Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
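For readers following along, the walreceiver-side condition being advocated here boils down to something like the following sketch; the variable name histfname and the exact placement in walreceiver.c are assumptions, the archive_mode test is the point under discussion.

    /*
     * After writing a streamed timeline history file, mark it ready for
     * archiving only if archive_mode is 'always', mirroring the handling of
     * streamed WAL segments.  With plain 'on', leave no .ready file behind,
     * so the file is not archived after a later promotion.
     */
    if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
        XLogArchiveNotify(histfname);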
Re: Handing off SLRU fsyncs to the checkpointer
Thomas Munro writes: > Tom, do you have any thoughts on ShutdownCLOG() etc? Hm, if we cannot reach that without first completing a shutdown checkpoint, it does seem a little pointless. It'd likely be a good idea to add a comment to CheckPointCLOG et al explaining that we expect $what-exactly to fsync the data we are writing before the checkpoint is considered done. regards, tom lane
Re: "cert" + clientcert=verify-ca in pg_hba.conf?
At Thu, 24 Sep 2020 11:43:40 -0400, Bruce Momjian wrote in > On Thu, Sep 24, 2020 at 12:44:01PM +0900, Michael Paquier wrote: > > On Tue, Sep 01, 2020 at 10:27:03PM -0400, Bruce Momjian wrote: > > > OK, good. Let's wait a few days and I will then apply it for PG 14. > > > > It has been a few days, and nothing has happened here. I have not > > looked at the patch in details, so I cannot say if that's fine or not, > > but please note that the patch fails to apply per the CF bot. > > I will handle it. Thank you Bruce, Michael. This is a rebased version. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 2978479ada887284eae0ed36c8acf29f1a002feb Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 21 Jul 2020 23:01:27 +0900 Subject: [PATCH v2] Allow directory name for GUC ssl_crl_file and connection option sslcrl X509_STORE_load_locations accepts a directory, which leads to on-demand loading method with which method only relevant CRLs are loaded. --- doc/src/sgml/config.sgml | 10 ++-- doc/src/sgml/libpq.sgml | 8 ++-- doc/src/sgml/runtime.sgml | 30 src/backend/libpq/be-secure-openssl.c | 12 - src/interfaces/libpq/fe-secure-openssl.c | 12 - src/test/ssl/Makefile | 20 +++- src/test/ssl/ssl/both-cas-1.crt | 46 +-- src/test/ssl/ssl/both-cas-2.crt | 46 +-- src/test/ssl/ssl/client+client_ca.crt | 28 +-- src/test/ssl/ssl/client-crldir/9bb9e3c3.r0| 11 + src/test/ssl/ssl/client-revoked.crt | 14 +++--- src/test/ssl/ssl/client.crl | 16 +++ src/test/ssl/ssl/client.crt | 14 +++--- src/test/ssl/ssl/client_ca.crt| 14 +++--- .../ssl/ssl/root+client-crldir/9bb9e3c3.r0| 11 + .../ssl/ssl/root+client-crldir/a3d11bff.r0| 11 + src/test/ssl/ssl/root+client.crl | 32 ++--- src/test/ssl/ssl/root+client_ca.crt | 32 ++--- .../ssl/ssl/root+server-crldir/a3d11bff.r0| 11 + .../ssl/ssl/root+server-crldir/a836cc2d.r0| 11 + src/test/ssl/ssl/root+server.crl | 32 ++--- src/test/ssl/ssl/root+server_ca.crt | 32 ++--- src/test/ssl/ssl/root.crl | 16 +++ src/test/ssl/ssl/root_ca.crt | 18 src/test/ssl/ssl/server-cn-and-alt-names.crt | 14 +++--- src/test/ssl/ssl/server-cn-only.crt | 14 +++--- src/test/ssl/ssl/server-crldir/a836cc2d.r0| 11 + .../ssl/ssl/server-multiple-alt-names.crt | 14 +++--- src/test/ssl/ssl/server-no-names.crt | 16 +++ src/test/ssl/ssl/server-revoked.crt | 14 +++--- src/test/ssl/ssl/server-single-alt-name.crt | 16 +++ src/test/ssl/ssl/server-ss.crt| 18 src/test/ssl/ssl/server.crl | 16 +++ src/test/ssl/ssl/server_ca.crt| 14 +++--- src/test/ssl/t/001_ssltests.pl| 31 - src/test/ssl/t/SSLServer.pm | 5 +- 36 files changed, 418 insertions(+), 252 deletions(-) create mode 100644 src/test/ssl/ssl/client-crldir/9bb9e3c3.r0 create mode 100644 src/test/ssl/ssl/root+client-crldir/9bb9e3c3.r0 create mode 100644 src/test/ssl/ssl/root+client-crldir/a3d11bff.r0 create mode 100644 src/test/ssl/ssl/root+server-crldir/a3d11bff.r0 create mode 100644 src/test/ssl/ssl/root+server-crldir/a836cc2d.r0 create mode 100644 src/test/ssl/ssl/server-crldir/a836cc2d.r0 diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8eabf93834..1ec4b54fcd 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1182,11 +1182,11 @@ include_dir 'conf.d' Specifies the name of the file containing the SSL server certificate -revocation list (CRL). -Relative paths are relative to the data directory. -This parameter can only be set in the postgresql.conf -file or on the server command line. -The default is empty, meaning no CRL file is loaded. +revocation list (CRL) or the directory containing CRL files. 
Relative +paths are relative to the data directory. This parameter can only be +set in the postgresql.conf file or on the server +command line. The default is empty, meaning no CRL file is +loaded. See for details. diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 3315f1dd05..1de85612f3 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -1711,10 +1711,12 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname This parameter specifies the file name of the SSL certificate -revocation list (CRL). Certificates listed in this file, if it -exists, will be rejected while attempting to aut
Re: Handing off SLRU fsyncs to the checkpointer
On Fri, Sep 25, 2020 at 12:05 PM Tom Lane wrote: > Thomas Munro writes: > > Tom, do you have any thoughts on ShutdownCLOG() etc? > > Hm, if we cannot reach that without first completing a shutdown checkpoint, > it does seem a little pointless. Thanks for the sanity check. > It'd likely be a good idea to add a comment to CheckPointCLOG et al > explaining that we expect $what-exactly to fsync the data we are writing > before the checkpoint is considered done. Good point. Done like this: + /* +* Write dirty CLOG pages to disk. This may result in sync requests queued +* for later handling by ProcessSyncRequests(), as part of the checkpoint. +*/ TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true); - SimpleLruFlush(XactCtl, true); + SimpleLruWriteAll(XactCtl, true); TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true); Here's a new version. The final thing I'm contemplating before pushing this is whether there may be hidden magical dependencies in the order of operations in CheckPointGuts(), which I've changed around. Andres, any comments? From a722d71c4296e5fae778a783eb8140be7887eced Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 25 Sep 2020 11:16:03 +1200 Subject: [PATCH v8] Defer flushing of SLRU files. Previously, we called fsync() after writing out individual pg_xact, pg_multixact and pg_commit_ts pages due to cache pressure, leading to regular I/O stalls in user backends and recovery. Collapse requests for the same file into a single system call as part of the next checkpoint, as we already did for relation files, using the infrastructure developed by commit 3eb77eba. This causes a significant improvement to recovery performance. Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions, because they were redundant after the shutdown checkpoint that immediately precedes them. Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer that it applies to all the SLRU mini-buffer-pools as well as the main buffer pool. Also rearrange things so that data collected in CheckpointStats includes SLRU activity. Reviewed-by: Tom Lane (ShutdownXXX removal part) Tested-by: Jakub Wartak Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com --- src/backend/access/transam/clog.c | 39 +++ src/backend/access/transam/commit_ts.c | 36 +++--- src/backend/access/transam/multixact.c | 57 + src/backend/access/transam/slru.c | 154 + src/backend/access/transam/subtrans.c | 25 +--- src/backend/access/transam/xlog.c | 18 ++- src/backend/commands/async.c | 5 +- src/backend/storage/buffer/bufmgr.c| 7 -- src/backend/storage/lmgr/predicate.c | 8 +- src/backend/storage/sync/sync.c| 28 - src/include/access/clog.h | 3 + src/include/access/commit_ts.h | 3 + src/include/access/multixact.h | 4 + src/include/access/slru.h | 14 ++- src/include/storage/sync.h | 7 +- src/tools/pgindent/typedefs.list | 1 + 16 files changed, 244 insertions(+), 165 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 9e352d2658..2ec6a924db 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -42,6 +42,7 @@ #include "pg_trace.h" #include "pgstat.h" #include "storage/proc.h" +#include "storage/sync.h" /* * Defines for CLOG page sizes. 
A page is the same BLCKSZ as is used @@ -691,7 +692,8 @@ CLOGShmemInit(void) { XactCtl->PagePrecedes = CLOGPagePrecedes; SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, - XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER); + XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER, + SYNC_HANDLER_CLOG); } /* @@ -808,34 +810,18 @@ TrimCLOG(void) LWLockRelease(XactSLRULock); } -/* - * This must be called ONCE during postmaster or standalone-backend shutdown - */ -void -ShutdownCLOG(void) -{ - /* Flush dirty CLOG pages to disk */ - TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false); - SimpleLruFlush(XactCtl, false); - - /* - * fsync pg_xact to ensure that any files flushed previously are durably - * on disk. - */ - fsync_fname("pg_xact", true); - - TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false); -} - /* * Perform a checkpoint --- either during shutdown, or on-the-fly */ void CheckPointCLOG(void) { - /* Flush dirty CLOG pages to disk */ + /* + * Write dirty CLOG pages to disk. This may result in sync requests queued + * for later handling by ProcessSyncRequests(), as part of the checkpoint. + */ TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true); - SimpleLruFlush(XactCtl, true); + SimpleLruWriteAll(XactCtl, true); TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true); } @@ -1026,3 +1012,12 @@ clog_redo(XLogReaderState *record) else elog(PANIC, "clog_redo: unknown op code %u", info); } + +/* + *
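To make the ordering question easier to reason about, this is roughly the shape CheckPointGuts() takes with the patch; it is simplified and partly reconstructed rather than copied from the diff (timing and trace instrumentation omitted), so treat it as a sketch.

    static void
    CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
    {
        CheckPointRelationMap();
        CheckPointReplicationSlots();
        CheckPointSnapBuild();
        CheckPointLogicalRewriteHeap();
        CheckPointReplicationOrigin();

        /* Write out dirty data in the SLRUs and the main buffer pool. */
        CheckPointCLOG();           /* now only queues sync requests */
        CheckPointCommitTs();
        CheckPointSUBTRANS();
        CheckPointMultiXact();
        CheckPointPredicate();
        CheckPointBuffers(flags);

        /* Perform all the fsyncs queued above, SLRU files included. */
        ProcessSyncRequests();

        /* We deliberately delay 2PC checkpointing as long as possible. */
        CheckPointTwoPhase(checkPointRedo);
    }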
pg_upgrade: fail early if a tablespace dir already exists for new cluster version
This should be caught during --check, or earlier in the upgrade, rather than only after dumping the schema. Typically, the tablespace dir would be left behind by a previous failed upgrade attempt, causing a cascade of failured upgrades. I guess I may not be the only one with a 3 year old wrapper which has a hack to check for this. |rm -fr pgsql12.dat pgsql13.dat |/usr/pgsql-12/bin/initdb -D pgsql12.dat --no-sync |/usr/pgsql-13/bin/initdb -D pgsql13.dat --no-sync |/usr/pgsql-12/bin/postgres -D pgsql12.dat -c port=5678 -k /tmp |mkdir tblspc tblspc/PG_13_202007203 |psql -h /tmp -p 5678 postgres -c "CREATE TABLESPACE one LOCATION '/home/pryzbyj/tblspc'" |/usr/pgsql-13/bin/pg_upgrade -D pgsql13.dat -d pgsql12.dat -b /usr/pgsql-12/bin |pg_upgrade_utility.log: |CREATE TABLESPACE "one" OWNER "pryzbyj" LOCATION '/home/pryzbyj/tblspc'; |psql:pg_upgrade_dump_globals.sql:27: ERROR: directory "/home/pryzbyj/tblspc/PG_13_202007201" already in use as a tablespace >From 3df104610d73833da60bffcc54756a6ecb2f390f Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 24 Sep 2020 19:49:40 -0500 Subject: [PATCH v1] pg_upgrade --check to avoid tablespace failure mode --- src/bin/pg_upgrade/check.c | 32 1 file changed, 32 insertions(+) diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index 2f7aa632c5..b07e63dab2 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -13,6 +13,7 @@ #include "fe_utils/string_utils.h" #include "mb/pg_wchar.h" #include "pg_upgrade.h" +#include "common/relpath.h" static void check_new_cluster_is_empty(void); static void check_databases_are_compatible(void); @@ -27,6 +28,7 @@ static void check_for_tables_with_oids(ClusterInfo *cluster); static void check_for_reg_data_type_usage(ClusterInfo *cluster); static void check_for_jsonb_9_4_usage(ClusterInfo *cluster); static void check_for_pg_role_prefix(ClusterInfo *cluster); +static void check_for_existing_tablespace_dirs(ClusterInfo *new_cluster); static char *get_canonical_locale_name(int category, const char *locale); @@ -187,6 +189,8 @@ check_new_cluster(void) check_is_install_user(&new_cluster); check_for_prepared_transactions(&new_cluster); + + check_for_existing_tablespace_dirs(&new_cluster); } @@ -541,6 +545,34 @@ create_script_for_cluster_analyze(char **analyze_script_file_name) check_ok(); } +/* + * Check that tablespace dirs for new cluster do not already exist. + * If they did, they'd cause an error while restoring global objects. + * This allows failing earlier rather than only after dumping schema. + */ +static void +check_for_existing_tablespace_dirs(ClusterInfo *new_cluster) +{ + char old_tablespace_dir[MAXPGPATH]; + + prep_status("Checking for pre-existing tablespace directories for new cluster version"); + +// if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804) + for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++) + { + struct stat statbuf; + snprintf(old_tablespace_dir, MAXPGPATH, "%s%c%s", +os_info.old_tablespaces[tblnum], +PATH_SEPARATOR, TABLESPACE_VERSION_DIRECTORY); + + canonicalize_path(old_tablespace_dir); + + // XXX: lstat ? + if (stat(old_tablespace_dir, &statbuf) == 0) + pg_fatal("tablespace directory already exists for new cluster version: \"%s\"\n", + old_tablespace_dir); + } +} /* * create_script_for_old_cluster_deletion() -- 2.17.0
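As a side note on the "XXX: lstat ?" comment in the patch above, a hedged variant of the check could look like the following; this is purely illustrative, and whether symlinks should be followed here is exactly the open question.

    struct stat statbuf;

    /*
     * Using lstat() instead of stat() would also catch a dangling or
     * unexpected symlink at the tablespace version path instead of
     * silently following it.
     */
    if (lstat(old_tablespace_dir, &statbuf) == 0)
        pg_fatal("tablespace directory already exists for new cluster version: \"%s\"\n",
                 old_tablespace_dir);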
Re: "cert" + clientcert=verify-ca in pg_hba.conf?
On Thu, Sep 24, 2020 at 11:43:40AM -0400, Bruce Momjian wrote: > I will handle it. Thanks. I have switched the patch as waiting on author due to the complaint of the CF bot for now, but if you feel that this does not require an extra round of review after the new rebase, of course please feel free. -- Michael signature.asc Description: PGP signature
Re: Memory allocation abstraction in pgcrypto
On Wed, Sep 23, 2020 at 04:57:44PM +0900, Michael Paquier wrote: > I doubt that anybody has been compiling with PX_OWN_ALLOC in the last > years, so let's remove this abstraction. And +1 for your patch. So, I have looked at that this morning again, and applied the thing. -- Michael signature.asc Description: PGP signature
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Thu, Sep 24, 2020 at 5:31 PM Amit Kapila wrote: > > On Thu, Sep 24, 2020 at 5:11 PM Dilip Kumar wrote: > > > > On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila wrote: > > > > > > > > > Have you checked what will function walrcv_server_version() will > > > return? I was thinking that if we know that subscriber is connected to > > > Publisher version < 14 then we can send the right value. > > > > But, suppose we know the publisher version is < 14 and user has set > > streaming on while creating subscriber then still we will send the > > right version? > > > > Yeah we can send the version depending on which server we are talking to? Ok > > I think tablesync we are forming a query so we are > > forming as per the publisher's version. But here we are sending which > > protocol version we are expecting for the data transfer so I feel it > > should be LOGICALREP_PROTO_VERSION_NUM if we expect non-streaming > > transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect the > > streaming transfer. > > > > I am not sure if this is the right strategy. See > libpqrcv_startstreaming, even if the user asked for streaming unless > the server supports it we don't send the streaming option to the user. > Another thing is this check will become more complicated if we need > another feature like decode prepared to send different version or even > if it is the same version, we might need additional checks. Why do you > want to send a streaming protocol version when we know the server > doesn't support it, lets behave similar to what we do in > libpqrcv_startstreaming. Okay, so even if the streaming is enabled we are sending the streaming on to the server versions which are >= 14 which make sense. But I still doubt that it is right thing to send the protocol version based on the server version. For example suppose in future PG20 we change the streaming protocol version to 3, that mean from PG14 to PG20 we may not support the streaming transmission but we still be able to support the normal transamission. Now if streaming is off then ideally we should send the LOGICALREP_PROTO_VERSION_NUM and that is still supporetd by the publisher but if we send the LOGICALREP_PROTO_STREAM_VERSION_NUM then it will error out because the streaming protocol changed in the latest subscriber and the same is not supported by the older publisher (14). So now if we want non streaming transmission from 14 to 20 and that should be allowed but if based on the server version we always send the LOGICALREP_PROTO_STREAM_VERSION_NUM then that will be a problem because our streaming version has changed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: fixing old_snapshot_threshold's time->xid mapping
On Thu, Sep 24, 2020 at 03:46:14PM -0400, Robert Haas wrote: > Committed. Cool, thanks. > Thomas, with respect to your part of this patch set, I wonder if we > can make the functions that you're using to write tests safe enough > that we could add them to contrib/old_snapshot and let users run them > if they want. As you have them, they are hedged around with vague and > scary warnings, but is that really justified? And if so, can it be > fixed? It would be nicer not to end up with two loadable modules here, > and maybe the right sorts of functions could even have some practical > use. I have switched this item as waiting on author in the CF app then, as we are not completely done yet. -- Michael signature.asc Description: PGP signature
Re: "cert" + clientcert=verify-ca in pg_hba.conf?
Kyotaro Horiguchi writes: > Thank you Bruce, Michael. This is a rebased version. I really strongly object to all the encoded data in this patch. One cannot read it, one cannot even easily figure out how long it is until the tests break by virtue of the certificates expiring. One can, however, be entirely certain that they *will* break at some point. I don't like the idea of time bombs in our test suite. That being the case, it'd likely be better to drop all the pre-made certificates and have the test scripts create them on the fly. That'd remove both the documentation problem (i.e., having readable info as to how the certificates were made) and the expiration problem. regards, tom lane
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Fri, Sep 25, 2020 at 7:21 AM Dilip Kumar wrote: > > On Thu, Sep 24, 2020 at 5:31 PM Amit Kapila wrote: > > > > On Thu, Sep 24, 2020 at 5:11 PM Dilip Kumar wrote: > > > > > > On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila > > > wrote: > > > > > > > > > > > > Have you checked what will function walrcv_server_version() will > > > > return? I was thinking that if we know that subscriber is connected to > > > > Publisher version < 14 then we can send the right value. > > > > > > But, suppose we know the publisher version is < 14 and user has set > > > streaming on while creating subscriber then still we will send the > > > right version? > > > > > > > Yeah we can send the version depending on which server we are talking to? > > Ok > > > > I think tablesync we are forming a query so we are > > > forming as per the publisher's version. But here we are sending which > > > protocol version we are expecting for the data transfer so I feel it > > > should be LOGICALREP_PROTO_VERSION_NUM if we expect non-streaming > > > transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect the > > > streaming transfer. > > > > > > > I am not sure if this is the right strategy. See > > libpqrcv_startstreaming, even if the user asked for streaming unless > > the server supports it we don't send the streaming option to the user. > > Another thing is this check will become more complicated if we need > > another feature like decode prepared to send different version or even > > if it is the same version, we might need additional checks. Why do you > > want to send a streaming protocol version when we know the server > > doesn't support it, lets behave similar to what we do in > > libpqrcv_startstreaming. > > Okay, so even if the streaming is enabled we are sending the streaming > on to the server versions which are >= 14 which make sense. But I > still doubt that it is right thing to send the protocol version based > on the server version. For example suppose in future PG20 we change > the streaming protocol version to 3, that mean from PG14 to PG20 we > may not support the streaming transmission but we still be able to > support the normal transamission. Now if streaming is off then > ideally we should send the LOGICALREP_PROTO_VERSION_NUM and that is > still supporetd by the publisher but if we send the > LOGICALREP_PROTO_STREAM_VERSION_NUM then it will error out because the > streaming protocol changed in the latest subscriber and the same is > not supported by the older publisher (14). > No that won't happen, we will send the version based on what the server version supports (in this case it would be 2 when we are talking to publisher 14). We can define a new macro or something like that. I feel the protocol should depend on whether the server supports it or not rather than based on some user specified option because it will make our life easier if tomorrow we want to enable streaming by default rather than based on option specified by the user. You can consider it this way that how would you handle this if we wouldn't have a user specified option (streaming) because that is generally the way it should work. -- With Regards, Amit Kapila.
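A minimal sketch of the selection Amit describes, as it could look on the apply-worker side; the connection variable and the exact spelling of the macros are assumptions for illustration, not the committed code.

    int     server_version = walrcv_server_version(wrconn);

    /*
     * Ask for the streaming protocol only when the publisher is new enough
     * to understand it; otherwise fall back to the base protocol version.
     */
    options.proto.logical.proto_version =
        (server_version >= 140000) ? LOGICALREP_PROTO_STREAM_VERSION_NUM
                                   : LOGICALREP_PROTO_VERSION_NUM;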
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Thu, Sep 24, 2020 at 6:33 PM Ashutosh Sharma wrote: > > Hi Amit, > > > Here, I think instead of using MySubscription->stream, we should use > > server/walrecv version number as we used at one place in tablesync.c. > > Should subscribers be setting the LR protocol value based on what is > the publisher version it is communicating with or should it be set > based on whether streaming was enabled or not while creating that > subscription? AFAIU if we set this value based on the publisher > version (which is lets say >= 14), then it's quite possible that the > subscriber will start streaming changes for the in-progress > transactions even if the streaming was disabled while creating the > subscription, won't it? > No that won't happen because we send this option to the server (publisher in this case) only when version is >=14 and user has specified this option. See the below check in function libpqrcv_startstreaming() { .. if (options->proto.logical.streaming && PQserverVersion(conn->streamConn) >= 14) appendStringInfo(&cmd, ", streaming 'on'"); .. } -- With Regards, Amit Kapila.
Re: New statistics for tuning WAL buffer size
On 2020-09-18 11:11, Kyotaro Horiguchi wrote: At Fri, 18 Sep 2020 09:40:11 +0900, Masahiro Ikeda wrote in Thanks. I confirmed that it causes HOT pruning or killing of dead index tuple if DecodeCommit() is called. As you said, DecodeCommit() may access the system table. ... The wals are generated only when logical replication is performed. So, I added pgstat_send_wal() in XLogSendLogical(). But, I concerned that it causes poor performance since pgstat_send_wal() is called per wal record, I think that's too frequent. If we want to send any stats to the collector, it is usually done at commit time using pgstat_report_stat(), and the function avoids sending stats too frequently. For logrep-worker, apply_handle_commit() is calling it. It seems to be the place if we want to send the wal stats. Or it may be better to call pgstat_send_wal() via pgstat_report_stat(), like pg_stat_slru(). Thanks for your comments. Since I changed to use pgstat_report_stat() and DecodeCommit() is calling it, the frequency to send statistics is not so high. Currently logrep-laucher, logrep-worker and autovac-launcher (and some other processes?) don't seem (AFAICS) sending scan stats at all but according to the discussion here, we should let such processes send stats. I added pgstat_report_stat() to logrep-laucher and autovac-launcher. As you said, logrep-worker already calls apply_handle_commit() and pgstat_report_stat(). The checkpointer doesn't seem to call pgstat_report_stat() currently, but since there is a possibility to send wal statistics, I added pgstat_report_stat(). Regards, -- Masahiro Ikeda NTT DATA CORPORATIONdiff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 4e0193a967..dd292fe27c 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -424,6 +424,13 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser + + pg_stat_walpg_stat_wal + One row only, showing statistics about the WAL writing activity. See +for details. + + + pg_stat_databasepg_stat_database One row per database, showing database-wide statistics. See @@ -3280,6 +3287,56 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + pg_stat_wal + + + pg_stat_wal + + + + The pg_stat_wal view will always have a + single row, containing data about the WAL writing activity of the cluster. + + + + pg_stat_wal View + + + + + Column Type + + + Description + + + + + + + + wal_buffers_full bigint + + + Number of WAL writes when the are full + + + + + + stats_reset timestamp with time zone + + + Time at which these statistics were last reset + + + + + + + + pg_stat_database @@ -4668,8 +4725,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i argument. The argument can be bgwriter to reset all the counters shown in the pg_stat_bgwriter -view, or archiver to reset all the counters shown in -the pg_stat_archiver view. +view, archiver to reset all the counters shown in +the pg_stat_archiver view ,or wal +to reset all the counters shown in the pg_stat_wal view. 
This function is restricted to superusers by default, but other users diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 61754312e2..3a06cacefb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2195,6 +2195,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic) WriteRqst.Flush = 0; XLogWrite(WriteRqst, false); LWLockRelease(WALWriteLock); + WalStats.m_wal_buffers_full++; TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE(); } /* Re-acquire WALBufMappingLock and retry */ diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index ed4f3f142d..643445c189 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -979,6 +979,11 @@ CREATE VIEW pg_stat_bgwriter AS pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset; +CREATE VIEW pg_stat_wal AS +SELECT +pg_stat_get_wal_buffers_full() AS wal_buffers_full, +pg_stat_get_wal_stat_reset_time() AS stats_reset; + CREATE VIEW pg_stat_progress_analyze AS SELECT S.pid AS pid, S.datid AS datid, D.datname AS datname, diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index 2cef56f115..1b224d479d 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -795,6 +795,8 @@ AutoVacLauncherMain(int argc, char *argv[
Re: Parallel INSERT (INTO ... SELECT ...)
On Thu, Sep 24, 2020 at 7:51 AM Andres Freund wrote: > > On 2020-09-22 14:55:21 +1000, Greg Nancarrow wrote: > > > > diff --git a/src/backend/access/heap/heapam.c > > b/src/backend/access/heap/heapam.c > > index 1585861..94c8507 100644 > > --- a/src/backend/access/heap/heapam.c > > +++ b/src/backend/access/heap/heapam.c > > @@ -2049,11 +2049,6 @@ heap_prepare_insert(Relation relation, HeapTuple > > tup, TransactionId xid, > >* inserts in general except for the cases where inserts generate a > > new > >* CommandId (eg. inserts into a table having a foreign key column). > >*/ > > - if (IsParallelWorker()) > > - ereport(ERROR, > > - (errcode(ERRCODE_INVALID_TRANSACTION_STATE), > > - errmsg("cannot insert tuples in a parallel > > worker"))); > > - > > I'm afraid that this weakens our checks more than I'd like. > I think we need to change/remove this check to allow inserts by parallel workers. I am not sure but maybe we can add an Assert to ensure that it is safe to perform insert via parallel worker. > What if this > ends up being invoked from inside C code? > I think it shouldn't be a problem unless one is trying to do something like insert into foreign key table. So, probably we can have an Assert to catch it if possible. Do you have any other idea? -- With Regards, Amit Kapila.
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On Thu, Sep 24, 2020 at 06:28:25PM +0200, Daniel Gustafsson wrote: > Doh, of course, I blame a lack of caffeine this afternoon. Having a private > local sha256 implementation using the EVP_* API inside scram-common would > maintain FIPS compliance and ABI compatibility, but would also be rather ugly. Even if we'd try to force our internal implementation of SHA256 on already-released branches instead of the one of OpenSSL, this would be an ABI break for compiled modules expected to work on this released branch as OpenSSL's internal SHA structures don't exactly match with our own implementation (think just about sizeof() or such). -- Michael signature.asc Description: PGP signature
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On Thu, Sep 24, 2020 at 09:44:57PM +0200, Daniel Gustafsson wrote: > On 24 Sep 2020, at 21:22, Robert Haas wrote: >> I mean, the issue here, as is so often the case, is not what is >> actually more secure, but what meets the terms of some security >> standard. > > Correct, IIUC in order to be FIPS compliant all cryptographic modules used > must > be FIPS certified. /me whistles, thinking about not using src/common/md5.c when building with OpenSSL to actually get a complaint if FIPS gets used. >> At least in the US, FIPS 140-2 compliance is a reasonably >> common need, so if we can make it easier for people who have that need >> to be compliant, they are more likely to use PostgreSQL, which seems >> like something that we should want. > > The proposed patch makes SCRAM+FIPS work for 14, question is if we need/want > to > try and address v10-13. Unfortunately, I don't have a good answer for that, except for the answers involving an ABI breakage. FWIW, the only users of those APIs I can find in the open wild are pgpool, which actually bundles a copy of the code in src/common/ so it does not matter, and pgbouncer, which uses a different compatibility layer to make the code compilable: https://sources.debian.org/src/pgbouncer/1.14.0-1/include/common/postgres_compat.h/?hl=26#L26 But it is not really possible to say if there is any closed code relying on that, so I'd like to fix that just on HEAD, about which I guess there would be no objections? -- Michael signature.asc Description: PGP signature
Re: Parallel INSERT (INTO ... SELECT ...)
> > What if this > > ends up being invoked from inside C code? > > > > I think it shouldn't be a problem unless one is trying to do something > like insert into foreign key table. So, probably we can have an Assert > to catch it if possible. Do you have any other idea? > Note that the planner code updated by the patch does avoid creating a Parallel INSERT plan in the case of inserting into a table with a foreign key (so commandIds won't be created in the parallel-worker code). I'm not sure how to distinguish the "invoked from inside C code" case though. Regards, Greg Nancarrow Fujitsu Australia
Re: history file on replica and double switchover
On 2020-09-24 5:00 p.m., Fujii Masao wrote: On 2020/09/25 8:15, David Zhang wrote: Hi, My understanding is that the "archiver" won't even start if "archive_mode = on" has been set on a "replica". Therefore, either (XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode != ARCHIVE_MODE_OFF) will produce the same result. Yes, the archiver isn't invoked in the standby when archive_mode=on. But, with the original patch, walreceiver creates .ready archive status file even when archive_mode=on and no archiver is running. This causes the history file to be archived after the standby is promoted and the archiver is invoked. With my patch, walreceiver creates .ready archive status for the history file only when archive_mode=always, like it does for the streamed WAL files. This is the difference between those two patches. To prevent the streamed timeline history file from being archived, IMO we should use the condition archive_mode==always in the walreceiver. +1 Please see how the "archiver" is started in src/backend/postmaster/postmaster.c 5227 /* 5228 * Start the archiver if we're responsible for (re-)archiving received 5229 * files. 5230 */ 5231 Assert(PgArchPID == 0); 5232 if (XLogArchivingAlways()) 5233 PgArchPID = pgarch_start(); I did run the nice script "double_switchover.sh" using either "always" or "on" on patch v1 and v2. They all generate the same results below. In other words, whether history file (0003.history) is archived or not depends on "archive_mode" settings. echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf $ ls -l archive -rw--- 1 david david 16777216 Sep 24 14:40 00010002 ... ... -rw--- 1 david david 16777216 Sep 24 14:40 0002000A -rw--- 1 david david 41 Sep 24 14:40 0002.history -rw--- 1 david david 83 Sep 24 14:40 0003.history echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf $ ls -l archive -rw--- 1 david david 16777216 Sep 24 14:47 00010002 ... ... -rw--- 1 david david 16777216 Sep 24 14:47 0002000A -rw--- 1 david david 41 Sep 24 14:47 0002.history Personally, I prefer patch v2 since it allies to the statement here, https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY "If archive_mode is set to on, the archiver is not enabled during recovery or standby mode. If the standby server is promoted, it will start archiving after the promotion, but will not archive any WAL it did not generate itself." By the way, I think the last part of the sentence should be changed to something like below: "but will not archive any WAL, which was not generated by itself." I'm not sure if this change is an improvement or not. But if we apply the patch I proposed, maybe we should mention also history file here. For example, "but will not archive any WAL or timeline history files that it did not generate itself". make sense for me Regards, -- David Software Engineer Highgo Software Inc. (Canada) www.highgo.ca
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On Fri, Sep 25, 2020 at 12:19:44PM +0900, Michael Paquier wrote: > Even if we'd try to force our internal implementation of SHA256 on > already-released branches instead of the one of OpenSSL, this would be > an ABI break for compiled modules expected to work on this released > branch as OpenSSL's internal SHA structures don't exactly match with > our own implementation (think just about sizeof() or such). Well, we could as well add one extra SHA API layer pointing to the EVP structures and APIs with new names, leaving the original ones in place, and then have SCRAM use the new ones, but I'd rather not go down that road for the back-branches. -- Michael signature.asc Description: PGP signature
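For reference, the kind of thin EVP-based layer being talked about is small. Below is a self-contained sketch using standard OpenSSL calls; the wrapper name is an assumption and error handling is minimal.

    #include <openssl/evp.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Compute a SHA-256 digest of data[0..len) into digest (32 bytes). */
    static bool
    sha256_evp(const unsigned char *data, size_t len, unsigned char *digest)
    {
        EVP_MD_CTX *ctx = EVP_MD_CTX_create();
        bool        ok;

        if (ctx == NULL)
            return false;

        ok = EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) == 1 &&
             EVP_DigestUpdate(ctx, data, len) == 1 &&
             EVP_DigestFinal_ex(ctx, digest, NULL) == 1;

        EVP_MD_CTX_destroy(ctx);
        return ok;
    }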
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
Michael Paquier writes: > On Fri, Sep 25, 2020 at 12:19:44PM +0900, Michael Paquier wrote: >> Even if we'd try to force our internal implementation of SHA256 on >> already-released branches instead of the one of OpenSSL, this would be >> an ABI break for compiled modules expected to work on this released >> branch as OpenSSL's internal SHA structures don't exactly match with >> our own implementation (think just about sizeof() or such). > Well, we could as well add one extra SHA API layer pointing to the EVP > structures and APIs with new names, leaving the original ones in > place, and then have SCRAM use the new ones, but I'd rather not go > down that road for the back-branches. Given the tiny number of complaints to date, it seems sufficient to me to deal with this in HEAD. regards, tom lane
Re: Parallel INSERT (INTO ... SELECT ...)
Hi Andres, On Thu, Sep 24, 2020 at 12:21 PM Andres Freund wrote: > >I think it'd be good if you outlined what your approach is to make this >safe. Some prior work has already been done to establish the necessary infrastructure to allow parallel INSERTs, in general, to be safe, except for cases where new commandIds would be generated in the parallel-worker code (such as inserts into a table having a foreign key) - these cases need to be avoided. See the following commits. 85f6b49 Allow relation extension lock to conflict among parallel group members 3ba59cc Allow page lock to conflict among parallel group members The planner code updated by the patch avoids creating a Parallel INSERT plan in the case of inserting into a table that has a foreign key. >> For cases where it can't be allowed (e.g. INSERT into a table with >> foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE >> ...") it at least allows parallelism of the SELECT part. > >I think it'd be good to do this part separately and first, independent >of whether the insert part can be parallelized. OK then, I'll try to extract that as a separate patch. >> Obviously I've had to update the planner and executor and >> parallel-worker code to make this happen, hopefully not breaking too >> many things along the way. > >Hm, it looks like you've removed a fair bit of checks, it's not clear to >me why that's safe in each instance. It should be safe for Parallel INSERT - but you are right, these are brute force removals (for the purpose of a POC patch) that should be tightened up wherever possible to disallow unsafe paths into that code. Problem is, currently there's not a lot of context information available to easily allow that, so some work needs to be done. >> - Currently only for "INSERT INTO ... SELECT ...". To support "INSERT >> INTO ... VALUES ..." would need additional Table AM functions for >> dividing up the INSERT work amongst the workers (currently only exists >> for scans). > >Hm, not entirely following. What precisely are you thinking of here? All I was saying is that for SELECTs, the work done by each parallel worker is effectively divided up by parallel-worker-related functions in tableam.c and indexam.c, and no such technology currently exists for dividing up work for the "INSERT ... VALUES" case. >I doubt it's really worth adding parallelism support for INSERT >... VALUES, the cost of spawning workers will almost always higher than >the benefit. You're probably right in doubting any benefit, but I wasn't entirely sure. >> @@ -116,7 +117,7 @@ toast_save_datum(Relation rel, Datum value, >> TupleDesc toasttupDesc; >> Datum t_values[3]; >> boolt_isnull[3]; >> - CommandId mycid = GetCurrentCommandId(true); >> + CommandId mycid = GetCurrentCommandId(!IsParallelWorker()); >> struct varlena *result; >> struct varatt_external toast_pointer; >> union > >Hm? Why do we need this in the various places you have made this change? It's because for Parallel INSERT, we're assigning the same command-id to each worker up-front during worker initialization (the commandId has been retrieved by the leader and passed through to each worker) and "currentCommandIdUsed" has been set true. See the AssignCommandIdForWorker() function in the patch. If you see the code of GetCurrentCommandId(), you'll see it Assert that it's not being run by a parallel worker if the parameter is true. 
I didn't want to remove yet another check, without being able to know the context of the caller, because only for Parallel INSERT do I know that "currentCommandIdUsed was already true at the start of the parallel operation". See the comment in that function. Anyway, that's why I'm passing "false" to relevant GetCurrentCommandId() calls if they're being run by a parallel (INSERT) worker. >> @@ -822,19 +822,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, >> Relation NewHeap, >> isdead = false; >> break; >> case HEAPTUPLE_INSERT_IN_PROGRESS: >> - >> /* >>* Since we hold exclusive lock on the >> relation, normally the >>* only way to see this is if it was inserted >> earlier in our >>* own transaction. However, it can happen in >> system >>* catalogs, since we tend to release write >> lock before >commit >> - * there. Give a warning if neither case >> applies; but in any >> - * case we had better copy it. >> + * there. In any case we had better copy it. >>*/ >> - if (!is_system_catalog && >> - >> !TransactionIdIsCurrentTransactionId>(HeapTuple
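To summarize the CommandId handling Greg describes, here is a hedged illustration; the shared struct, the helper names, and the AssignCommandIdForWorker() signature follow the description above and are assumptions, not verbatim patch code.

    /* Hypothetical DSM-resident state shared from leader to workers. */
    typedef struct ParallelInsertShared
    {
        CommandId   command_id;
    } ParallelInsertShared;

    /* Leader side, while initializing the parallel DSM state: */
    static void
    leader_store_command_id(ParallelInsertShared *shared)
    {
        /* GetCurrentCommandId(true) also marks currentCommandIdUsed */
        shared->command_id = GetCurrentCommandId(true);
    }

    /* Worker side, during startup, before executing the plan: */
    static void
    worker_adopt_command_id(ParallelInsertShared *shared)
    {
        /* per the patch description; the exact signature is an assumption */
        AssignCommandIdForWorker(shared->command_id, true);
    }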
Feature improvement of tab completion for DEALLOCATE
Hello! I’d like to improve the DEALLOCATE tab completion. The DEALLOCATE tab completion currently doesn’t offer “ALL”. Therefore, this patch fixes the problem. Regards, Naoki Nakamichi diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index 9c6f5ecb6a..d877cc86f0 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -2911,7 +2911,7 @@ psql_completion(const char *text, int start, int end) /* DEALLOCATE */ else if (Matches("DEALLOCATE")) - COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements); + COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements " UNION SELECT 'ALL'"); /* DECLARE */ else if (Matches("DECLARE", MatchAny))
Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away
On Wed, Sep 23, 2020 at 8:19 PM Fujii Masao wrote: > > > Please let me know if okay with the above agreed points, I will work on the > > new patch. > > Yes, please work on the patch! Thanks! I may revisit the above points later, > though ;) > Thanks, attaching v4 patch. Please consider this for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v4-Retry-Cached-Remote-Connections-For-postgres_fdw.patch Description: Binary data
Feature improvement for FETCH tab completion
Hello! I’d like to improve the FETCH tab completion. The FETCH tab completion currently offers only some of the keywords that can follow FETCH. Therefore, this patch fixes the problem. Previously the completion offered one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE; now it offers one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE, ALL, NEXT, PRIOR, FIRST, LAST, and it also completes the subsequent IN and FROM clauses. Regards, Naoki Nakamichi diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index d877cc86f0..eb103b784e 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -3076,19 +3076,22 @@ psql_completion(const char *text, int start, int end) COMPLETE_WITH("SELECT", "INSERT", "DELETE", "UPDATE", "DECLARE"); /* FETCH && MOVE */ - /* Complete FETCH with one of FORWARD, BACKWARD, RELATIVE */ + /* Complete FETCH with one of FORWARD, BACKWARD, RELATIVE, ALL, NEXT, PRIOR, FIRST, LAST */ else if (Matches("FETCH|MOVE")) - COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE"); - /* Complete FETCH with one of ALL, NEXT, PRIOR */ - else if (Matches("FETCH|MOVE", MatchAny)) - COMPLETE_WITH("ALL", "NEXT", "PRIOR"); + COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE", "ALL", "NEXT", "PRIOR", "FIRST", "LAST"); + /* Complete FETCH BACKWARD or FORWARD with one of ALL */ + else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD")) + COMPLETE_WITH("ALL"); /* - * Complete FETCH with "FROM" or "IN". These are equivalent, + * Complete FETCH with "FROM" or "IN". These are equivalent, * but we may as well tab-complete both: perhaps some users prefer one * variant or the other. */ - else if (Matches("FETCH|MOVE", MatchAny, MatchAny)) + else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE", MatchAny)) + COMPLETE_WITH("FROM", "IN"); + + else if (Matches("FETCH|MOVE", "ALL|NEXT|PRIOR|FIRST|LAST")) COMPLETE_WITH("FROM", "IN"); /* FOREIGN DATA WRAPPER */
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On 2020-09-24 21:44, Daniel Gustafsson wrote: On 24 Sep 2020, at 21:22, Robert Haas wrote: On Thu, Sep 24, 2020 at 1:57 PM Peter Eisentraut wrote: Depends on what one considers to be covered by FIPS. The entire rest of SCRAM is custom code, so running it on top of the world's greatest SHA-256 implementation isn't going to make the end product any more trustworthy. I mean, the issue here, as is so often the case, is not what is actually more secure, but what meets the terms of some security standard. Correct, IIUC in order to be FIPS compliant all cryptographic modules used must be FIPS certified. As I read FIPS 140-2, it just specifies what must be true of cryptographic modules that claim to follow that standard, it doesn't say that all cryptographic activity in an application or platform must only use such modules. (Notably, it doesn't even seem to define "cryptographic".) The latter may well be a requirement of a user or customer on top of the actual standard. However, again, the SCRAM implementation would already appear to fail that requirement because it uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a covered algorithm. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Feature improvement of tab completion for DEALLOCATE
On 2020/09/25 13:45, btnakamichin wrote: Hello! I’d like to improve the DEALLOCATE tab completion. The DEALLOCATE tab completion currently doesn’t offer “ALL”. Therefore, this patch fixes the problem. Thanks for the patch! It looks basically good, but I think it's better to add a newline just before UNION to avoid a long line, as follows. - COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements); + COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements + " UNION SELECT 'ALL'"); Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
Peter Eisentraut writes: > On 2020-09-24 21:44, Daniel Gustafsson wrote: >> Correct, IIUC in order to be FIPS compliant all cryptographic modules used >> must >> be FIPS certified. > As I read FIPS 140-2, it just specifies what must be true of > cryptographic modules that claim to follow that standard, it doesn't say > that all cryptographic activity in an application or platform must only > use such modules. (Notably, it doesn't even seem to define > "cryptographic".) The latter may well be a requirement of a user or > customer on top of the actual standard. Hm ... AFAICT, the intent of the "FIPS mode" in Red Hat's implementation, and probably other Linux distros, is exactly that thou shalt not use any non-FIPS-approved crypto code. By your reading above, there would be no need at all for a kernel-level enforcement switch. > However, again, the SCRAM > implementation would already appear to fail that requirement because it > uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a > covered algorithm. Ugh. But is there any available FIPS-approved library code that could be used instead? regards, tom lane
Re: vs formatting in the docs
On 2020-07-10 17:05, Peter Eisentraut wrote: On 2020-06-21 18:57, Dagfinn Ilmari Mannsåker wrote: Attached are two patches: the first adds the missing tags, the second adds them to all the SQL commands (specifically anything in man section 7). I have committed the first one. I have some concerns about the second one. If you look at the diff of the source of the man pages before and after, it creates a bit of a mess, even though the man pages look okay when rendered. I'll need to think about this a bit more. I asked about this on a DocBook discussion list. While the general answer is that you can do anything you want, it was clear that putting markup into title elements requires more careful additional customization and testing, and it's preferable to handle appearance issues on the link source side. (See also us dialing back the number of xreflabels recently.) This is also the direction that DocBook 5 appears to be taking, where you can put linkend attributes into most inline tags, so you could maybe do something like <command linkend="...">...</command>. This doesn't work for us yet, but the equivalent of that would be wrapping the inline tag in an explicit <link linkend="...">. So, based on that, I think the patch proposed here is not the right one, and we should instead be marking up the link sources appropriately. (This also implies that the already committed 0001 patch should be reverted.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Add features to pg_stat_statements
On 2020-09-24 18:55, legrand legrand wrote: Not limited to 3, Like an history table. Will have to think if data is saved at shutdown. Thanks. This design might introduce some additional complexity and open questions. For example: the maximum size of the "history table", how to evict entries from it, and how to manage the information carried by evicted entries. Also, providing a history table looks similar to logging. Providing the original view (# of dealloc and last_dealloc ts) plus optional logging looks like a simpler and better approach. Regards, Katsuragi Yuta
Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration
On 2020-09-24 09:12, Michael Paquier wrote: On Wed, Sep 23, 2020 at 08:11:59AM +0200, Peter Eisentraut wrote: This patch mixes up unsigned int and uint32 in random ways. The variable is uint32, but the format is %u and the max constant is UINT_MAX. I think just use unsigned int as the variable type. There is no need to use the bit-exact types. Note that the argument of alarm() is of type unsigned int. Makes sense, thanks. looks good to me -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
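A short sketch of the agreed style (plain unsigned int, matching alarm()'s prototype, %u in messages); the helper, option name, and error wording below are illustrative assumptions, not the committed code, and negative input is not specially rejected.

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static unsigned int secs_per_test = 5;      /* default, as in pg_test_fsync */

    static void
    parse_secs_per_test(const char *arg)
    {
        char       *endptr;
        unsigned long val = strtoul(arg, &endptr, 10);

        if (*endptr != '\0' || val < 1 || val > UINT_MAX)
        {
            fprintf(stderr, "--secs-per-test must be in range 1..%u\n", UINT_MAX);
            exit(1);
        }
        secs_per_test = (unsigned int) val;
    }

    /* later in the test loop: alarm(secs_per_test); */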
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Fri, Sep 25, 2020 at 8:12 AM Amit Kapila wrote: > > On Thu, Sep 24, 2020 at 6:33 PM Ashutosh Sharma wrote: > > > > Hi Amit, > > > > > Here, I think instead of using MySubscription->stream, we should use > > > server/walrecv version number as we used at one place in tablesync.c. > > > > Should subscribers be setting the LR protocol value based on what is > > the publisher version it is communicating with or should it be set > > based on whether streaming was enabled or not while creating that > > subscription? AFAIU if we set this value based on the publisher > > version (which is lets say >= 14), then it's quite possible that the > > subscriber will start streaming changes for the in-progress > > transactions even if the streaming was disabled while creating the > > subscription, won't it? > > > > No that won't happen because we send this option to the server > (publisher in this case) only when version is >=14 and user has > specified this option. See the below check in function > libpqrcv_startstreaming() > { > .. > if (options->proto.logical.streaming && > PQserverVersion(conn->streamConn) >= 14) > appendStringInfo(&cmd, ", streaming 'on'"); > .. > } > Ah, okay, understood, thanks. So, that means we can use the streaming protocol version if the server (publisher) supports it, regardless of whether the client (subscriber) has the streaming option enabled or not. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Feature improvement of tab completion for DEALLOCATE
On 2020-09-25 14:30, Fujii Masao wrote: On 2020/09/25 13:45, btnakamichin wrote: Hello! I’d like to improve the DEALLOCATE tab completion. The DEALLOCATE tab completion currently doesn’t offer “ALL”. Therefore, this patch fixes the problem. Thanks for the patch! It looks basically good, but I think it's better to add a newline just before UNION to avoid a long line, as follows. - COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements); + COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements + " UNION SELECT 'ALL'"); Regards, Thank you, I appreciate your comment. I have attached a patch with the newline. Regards, Naoki Nakamichi diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index d877cc86f0..75f81d66dc 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -2911,7 +2911,8 @@ psql_completion(const char *text, int start, int end) /* DEALLOCATE */ else if (Matches("DEALLOCATE")) - COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements " UNION SELECT 'ALL'"); + COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements + " UNION SELECT 'ALL'"); /* DECLARE */ else if (Matches("DECLARE", MatchAny))
Problem of ko.po on branch REL9_5_STABLE
Hi, When I checked the encoding of the PO files, I noticed that src/bin/pg_config/po/ko.po on the REL9_5_STABLE branch seemed a little different. The ‘Content-Type’ declared in this file and the file’s encoding differ from those of the other modules, as follows: src/bin/pg_config/po/ko.po: "Content-Type: text/plain; charset=euc-kr\n" src/bin/initdb/po/ko.po: "Content-Type: text/plain; charset=UTF-8\n" They even differ from the other branches, as follows: REL9_5_STABLE src/bin/pg_config/po/ko.po: "Content-Type: text/plain; charset=euc-kr\n" REL9_6_STABLE src/bin/pg_config/po/ko.po "Content-Type: text/plain; charset=UTF-8\n" Is there any particular reason for doing this? Or are there any compatibility problems to be aware of? Thanks in advance. -- Yang Rong Development Department II Software Division II Nanjing Fujitsu Nanda Software Tech. Co., Ltd.(FNST) ADDR.: No.6 Wenzhu Road, Software Avenue, Nanjing, 210012, China TEL : +86+25-86630566-9431 COINS: 7998-9431 FAX : +86+25-83317685 MAIL: yangr.f...@cn.fujitsu.com --
Re: Load TIME fields - proposed performance improvement
The patch has been re-implemented based on previous advice. Please see attached. ~ Test: A test table was created and 20 million rows inserted as follows: test=# create table t1 (id int, a timestamp, b time without time zone default '01:02:03', c date default CURRENT_DATE, d time with time zone default CURRENT_TIME, e time with time zone default LOCALTIME); CREATE TABLE $ time psql -d test -c "insert into t1(id, a) values(generate_series(1,2000), timestamp 'now');" ~ Observations: BEFORE PATCH perf results 6.18% GetSQLCurrentTime 5.73% GetSQLCurrentDate 5.20% GetSQLLocalTime 4.67% GetCurrentDateTime -.--% GetCurrentTimeUsec elapsed time Run1 1m57s Run2 1m58s Run3 2m00s AFTER PATCH perf results 1.77% GetSQLCurrentTime 0.12% GetSQLCurrentDate 0.50% GetSQLLocalTime 0.36% GetCurrentDateTime -.--% GetCurrentTimeUsec elapsed time Run1 1m36s Run2 1m36s Run3 1m36s (represents 19% improvement for this worst case table/data) ~ Note: I patched the function GetCurrentTimeUsec consistently with the others, but actually I was not able to discover any SQL syntax which could cause that function to be invoked multiple times. Perhaps the patch for that function should be removed? --- Kind Regards, Peter Smith Fujitsu Australia On Tue, Sep 22, 2020 at 1:06 PM Peter Smith wrote: > > Hi Tom. > > Thanks for your feedback. > > On Tue, Sep 22, 2020 at 12:44 PM Tom Lane wrote: > > > Still, for the size of the patch I'm envisioning, it'd be well > > worth the trouble. > > The OP patch I gave was just a POC to test the effect and to see if > the idea was judged as worthwhile... > > I will rewrite/fix it based on your suggestions. > > Kind Regards, > Peter Smith. > Fujitsu Australia. PS_cache_pg_tm-v01.patch Description: Binary data
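For readers who want the shape of the change without opening the attachment, the caching idea amounts to something like the sketch below for the GetCurrentDateTime() case; this is simplified and partly assumed, and it ignores the timezone-change handling and the other GetSQLCurrent* functions that the real patch also has to cover.

    void
    GetCurrentDateTime(struct pg_tm *tm)
    {
        static TimestampTz cache_ts = 0;
        static struct pg_tm cache_tm;
        TimestampTz cur_ts = GetCurrentTransactionStartTimestamp();
        int         tz;
        fsec_t      fsec;

        /*
         * The transaction start timestamp is constant for the life of a
         * statement such as a large COPY, so breaking it down into pg_tm once
         * and memoizing the result avoids the repeated timestamp2tm() work
         * that shows up in the profiles above.
         */
        if (cur_ts != cache_ts)
        {
            if (timestamp2tm(cur_ts, &tz, &cache_tm, &fsec, NULL,
                             session_timezone) != 0)
                ereport(ERROR,
                        (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                         errmsg("timestamp out of range")));
            cache_ts = cur_ts;
        }

        *tm = cache_tm;
    }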
Re: Feature improvement for FETCH tab completion
On 2020/09/25 14:24, btnakamichin wrote:

Hello! I'd like to improve the FETCH tab completion. The FETCH tab completion currently offers only some of the keywords that can follow FETCH. Therefore, this patch fixes the problem. The previous code completed with one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE; now it completes with one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE, ALL, NEXT, PRIOR, FIRST, LAST, and it also handles the subsequent IN and FROM clauses.

Thanks for the patch! Here are review comments.

+	/* Complete FETCH BACKWARD or FORWARD with one of ALL */
+	else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+		COMPLETE_WITH("ALL");

Not only "ALL" but also "FROM" and "IN" should be displayed here, because they can also follow "BACKWARD" and "FORWARD"?

	else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+	else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE", MatchAny))
+		COMPLETE_WITH("FROM", "IN");

This change seems to cause "FETCH FORWARD FROM " to display "FROM" and "IN". To avoid this confusing tab-completion, we should use something like MatchAnyExcept("FROM|IN") here instead?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
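To make the suggested behaviour concrete, here is a small stand-alone C sketch of the completion rules under discussion, using toy helpers rather than the real Matches()/MatchAnyExcept()/COMPLETE_WITH() machinery from tab-complete.c: after BACKWARD or FORWARD, offer ALL, FROM and IN; after a direction followed by anything other than FROM or IN, offer FROM and IN.

#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Does "word" match one of the '|'-separated "alternatives" (case-insensitive)? */
static int
matches_any(const char *word, const char *alternatives)
{
	char		buf[128];
	char	   *tok;

	snprintf(buf, sizeof(buf), "%s", alternatives);
	for (tok = strtok(buf, "|"); tok != NULL; tok = strtok(NULL, "|"))
		if (strcasecmp(word, tok) == 0)
			return 1;
	return 0;
}

/*
 * Suggest completions for "FETCH <words...>", where nwords is how many
 * words have been typed after FETCH so far (1 or 2 in the cases above).
 */
static const char *
suggest(const char *words[], int nwords)
{
	/* FETCH BACKWARD|FORWARD <tab>: also offer FROM and IN, not just ALL */
	if (nwords == 1 && matches_any(words[0], "BACKWARD|FORWARD"))
		return "ALL, FROM, IN";

	/* FETCH <direction> <anything except FROM/IN> <tab>: offer FROM and IN */
	if (nwords == 2 &&
		matches_any(words[0], "ABSOLUTE|BACKWARD|FORWARD|RELATIVE") &&
		!matches_any(words[1], "FROM|IN"))
		return "FROM, IN";

	return "(nothing)";
}

int
main(void)
{
	const char *c1[] = {"FORWARD"};
	const char *c2[] = {"FORWARD", "ALL"};
	const char *c3[] = {"FORWARD", "FROM"};

	printf("FETCH FORWARD <tab>      -> %s\n", suggest(c1, 1));
	printf("FETCH FORWARD ALL <tab>  -> %s\n", suggest(c2, 2));
	printf("FETCH FORWARD FROM <tab> -> %s\n", suggest(c3, 2));	/* excluded, mimicking MatchAnyExcept */
	return 0;
}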
Re: [patch] Concurrent table reindex per-index progress reporting
On Thu, Sep 24, 2020 at 09:19:18PM +0200, Matthias van de Meent wrote:
> While working on a PG12 instance I noticed that the progress reporting of
> concurrent index creation for non-index relations fails to update the
> index/relation OIDs that it's currently working on in the
> pg_stat_progress_create_index view after creating the indexes. Please find
> attached a patch which properly sets these values at the appropriate places.
>
> Any thoughts?

I agree that this is a bug and that we had better correctly report the heap and index IDs being worked on, as these could also belong to a toast relation of the parent table being reindexed. However, your patch is visibly incorrect in the two places you are updating.

> +	 * Configure progress reporting to report for this index.
> +	 * While we're at it, reset the phase as it is cleared by start_command.
> +	 */
> +	pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
> +	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId);
> +	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> +								 PROGRESS_CREATEIDX_PHASE_WAIT_1);

First, updating PROGRESS_CREATEIDX_PHASE to WAIT_1 here makes no sense, because this is not a wait phase, and index_build(), called within index_concurrently_build(), will set this state correctly a bit later, so IMO it is fine to leave it unset here; the build phase is by far the bulk of the operation that needs tracking. I think you are also missing updates of PROGRESS_CREATEIDX_COMMAND to PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY and of PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID, similar to what reindex_index() does.

> +	/*
> +	 * Configure progress reporting to report for this index.
> +	 * While we're at it, reset the phase as it is cleared by start_command.
> +	 */
> +	pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
> +	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId);
> +	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> +								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
> +
> 	validate_index(heapId, newIndexId, snapshot);

Basically the same three comments apply here: setting PROGRESS_CREATEIDX_PHASE to WAIT_2 makes no real sense, and validate_index() will update the phase as needed. This block should also set PROGRESS_CREATEIDX_COMMAND to PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY and PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID.

While reviewing this code, I also noticed that the wait state set just before dropping the indexes was wrong. The code was using WAIT_4, but this has to be WAIT_5. And this gives me the attached. This also means that we still report the table and index OIDs of the last index worked on during WAIT_2, WAIT_4 and WAIT_5, but we cannot start the command with the relationOid given as input to ReindexRelationConcurrently() either, as that could be a single index passed to REINDEX INDEX CONCURRENTLY. Matthias, does that look right to you?
-- Michael diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index f1b5f87e6a..d43051aea9 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -3409,6 +3409,7 @@ ReindexRelationConcurrently(Oid relationOid, int options) Oid oldIndexId = lfirst_oid(lc); Oid newIndexId = lfirst_oid(lc2); Oid heapId; + Oid indexam; /* Start new transaction for this index's concurrent build */ StartTransactionCommand(); @@ -3429,8 +3430,21 @@ ReindexRelationConcurrently(Oid relationOid, int options) */ indexRel = index_open(oldIndexId, ShareUpdateExclusiveLock); heapId = indexRel->rd_index->indrelid; + /* The access method of the old and new indexes match */ + indexam = indexRel->rd_rel->relam; index_close(indexRel, NoLock); + /* + * Update progress for the index to build, with the correct parent + * table involved. + */ + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId); + pgstat_progress_update_param(PROGRESS_CREATEIDX_COMMAND, + PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY); + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId); + pgstat_progress_update_param(PROGRESS_CREATEIDX_ACCESS_METHOD_OID, + indexam); + /* Perform concurrent build of new index */ index_concurrently_build(heapId, newIndexId); @@ -3458,6 +3472,8 @@ ReindexRelationConcurrently(Oid relationOid, int options) Oid heapId; TransactionId limitXmin; Snapshot snapshot; + Relation newIndexRel; + Oid indexam; StartTransactionCommand(); @@ -3468,8 +3484,6 @@ ReindexRelationConcurrently(Oid relationOid, in
Re: [patch] Fix checksum verification in base backups for zero page headers
Hi, Am Donnerstag, den 24.09.2020, 17:24 +0300 schrieb Anastasia Lubennikova: > So I mark this one as ReadyForCommitter. Thanks! > The only minor problem is a typo (?) in the proposed commit message. > "If a page is all zero, consider that a checksum failure." It should be > "If a page is NOT all zero...". Oh right, I've fixed up the commit message in the attached V4. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz From af83f4a42403e8a994e101086affafa86d67b52a Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Tue, 22 Sep 2020 16:09:36 +0200 Subject: [PATCH] Fix checksum verification in base backups for zero page headers We currently do not checksum a page if it is considered new by PageIsNew(). However, this means that if the whole page header is zero, PageIsNew() will consider that page new due to the pd_upper field being zero and no subsequent checksum verification is done. To fix, factor out the all-zeroes check from PageIsVerified() into a new method PageIsZero() and call that for pages considered new by PageIsNew(). If a page is not totally empty, consider that a checksum failure. Add one test to the pg_basebackup TAP tests to check this error. Reported-By: Andres Freund Reviewed-By: Asif Rehman, Anastasia Lubennikova Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de --- src/backend/replication/basebackup.c | 30 + src/backend/storage/page/bufpage.c | 35 +++- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 18 +- src/include/storage/bufpage.h| 11 +++--- 4 files changed, 73 insertions(+), 21 deletions(-) diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index b89df01fa7..c82765a429 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1746,6 +1746,36 @@ sendFile(const char *readfilename, const char *tarfilename, "be reported", readfilename))); } } +else +{ + /* + * We skip completely new pages after checking they are + * all-zero, since they don't have a checksum yet. + */ + if (PageIsNew(page) && !PageIsZero(page)) + { + /* + * pd_upper is zero, but the page is not all zero. We + * cannot run pg_checksum_page() on the page as it + * would throw an assertion failure. Consider this a + * checksum failure. 
+ */ + checksum_failures++; + + if (checksum_failures <= 5) + ereport(WARNING, + (errmsg("checksum verification failed in " + "file \"%s\", block %d: pd_upper " + "is zero but page is not all-zero", + readfilename, blkno))); + if (checksum_failures == 5) + ereport(WARNING, + (errmsg("further checksum verification " + "failures in file \"%s\" will not " + "be reported", readfilename))); + } +} + block_retry = false; blkno++; } diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index 4bc2bf955d..8be3cd6190 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -82,11 +82,8 @@ bool PageIsVerified(Page page, BlockNumber blkno) { PageHeader p = (PageHeader) page; - size_t *pagebytes; - int i; bool checksum_failure = false; bool header_sane = false; - bool all_zeroes = false; uint16 checksum = 0; /* @@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno) } /* Check all-zeroes case */ - all_zeroes = true; - pagebytes = (size_t *) page; - for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) - { - if (pagebytes[i] != 0) - { - all_zeroes = false; - break; - } - } - - if (all_zeroes) + if (PageIsZero(page)) return true; /* @@ -154,6 +140,25 @@ PageIsVerified(Page page, BlockNumber blkno) return false; } +/* + * PageIsZero + * Check that the page consists only of zero bytes. + * + */ +bool +PageIsZero(Page page) +{ + int i; + size_t *pagebytes = (size_t *) page; + + for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) + { + if (pagebytes[i] != 0) + return false; + } + + return true; +} /* * PageAddItemExtended diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index f674a7c94e..f5735569c5 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -6,7 +6,7 @@ use File::Basename qw(basename dirname); use File::Path qw(rmtree); use PostgresNode; use TestLib; -use Test::More tests => 109; +use Test::Mo
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On Fri, Sep 25, 2020 at 01:36:44AM -0400, Tom Lane wrote:
> Peter Eisentraut writes:
>> However, again, the SCRAM implementation would already appear to fail that
>> requirement because it uses a custom HMAC implementation, and HMAC is listed
>> in FIPS 140-2 as a covered algorithm.
>
> Ugh. But is there any available FIPS-approved library code that could be used instead?

That's a good point, and I think that this comes down to using OpenSSL's HMAC_* interface for this job when building with OpenSSL:
https://www.openssl.org/docs/man1.1.1/man3/HMAC.html

Worth noting that these have been deprecated in 3.0.0, as per the rather recent commit dbde472, which recommends the use of EVP_MAC_*() instead.
--
Michael

signature.asc Description: PGP signature
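For context, the one-shot interface from that man page can be exercised with a minimal sketch like the following (build with -lcrypto). This only illustrates the OpenSSL call itself; the inputs are arbitrary demo values, not real SCRAM key material, and the actual integration would go through the backend's own abstraction and error handling.

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

int
main(void)
{
	/* Arbitrary demo inputs, not real SCRAM key material. */
	const unsigned char key[] = "salted password";
	const unsigned char msg[] = "Client Key";
	unsigned char mac[EVP_MAX_MD_SIZE];
	unsigned int maclen = 0;

	/* One-shot HMAC-SHA-256 computed by OpenSSL rather than custom code. */
	if (HMAC(EVP_sha256(), key, (int) strlen((const char *) key),
			 msg, strlen((const char *) msg), mac, &maclen) == NULL)
	{
		fprintf(stderr, "HMAC failed\n");
		return 1;
	}

	for (unsigned int i = 0; i < maclen; i++)
		printf("%02x", mac[i]);
	printf("\n");
	return 0;
}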