Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Tue, Jan 26, 2021 at 9:18 AM japin wrote:
>
> Hi, hackers
>
> When I read the discussion in [1], I found that updating a subscription's
> publications is complicated.
>
> For example, I have 5 publications in a subscription.
>
> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 dbname=postgres'
> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
>
> If I want to drop "mypub4", we should use the following command:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5;
>
> Also, if I want to add "mypub7" and "mypub8", it will use:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5, mypub7, mypub8;
>
> The attached patch implements ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax;
> for the above two cases, we can use the following:
>
> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
>
> ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
>
> I think it's more convenient. Any thoughts?

+1 for the idea

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] remove pg_standby
On 2021/01/27 14:32, Thomas Munro wrote:
> On Wed, Jan 27, 2021 at 6:06 PM Michael Paquier wrote:
>> On Wed, Jan 27, 2021 at 04:13:24PM +1300, Thomas Munro wrote:
>>> I would like to commit this, because "waiting restore commands" have
>>> confusing interactions with my proposed prefetching-during-recovery
>>> patch[1].
>
> Here's a version that fixes an error when building the docs (there was a
> stray remaining ), and adds a commit message. Any objections?

I agree with this direction (i.e., removing pg_standby). BTW, last month when I gave a talk about the possible retirement of pg_standby at PostgreSQL Unconference Tokyo, no one in the audience complained about it.

But one question: shouldn't we follow the "usual" way of retiring a feature instead of dropping it immediately? That is, mark pg_standby as obsolete, announce that pg_standby will be dropped after several releases, and then drop it. This seems safer because there might still be some users. While it's marked as obsolete, the WAL prefetch feature might not work with pg_standby, but we can live with that because it's obsolete.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Parallel Inserts in CREATE TABLE AS
On Wed, Jan 27, 2021 at 1:25 PM Tang, Haiying wrote:
> I choose 5 cases which pick parallel insert plan in CTAS to measure the
> patched performance. Each case run 30 times.
>
> Most of the tests execution become faster with this patch.
>
> However, Test NO 4 (create table xxx as table xxx.) appears performance
> degradation. I tested various table size (2/10/20 millions), they all have a
> 6%-10% declines. I think it may need some check at this problem.
>
> Below are my test results. 'Test NO' is corresponded to 'Test NO' in attached
> test_ctas.sql file.
>
> reg% = (patched - master) / master
>
> Test NO | Test Case                      | reg% | patched(ms) | master(ms)
> --------|--------------------------------|------|-------------|------------
> 1       | CTAS select from table         | -9%  | 16709.50477 | 18370.76660
> 2       | Append plan                    | -14% | 16542.97807 | 19305.86600
> 3       | initial plan under Gather node | -5%  | 13374.27187 | 14120.02633
> 4       | CTAS table                     | 10%  | 20835.48800 | 18986.40350
> 5       | CTAS select from execute       | -6%  | 16973.73890 | 18008.59789
>
> About Test NO 4:
>
> In master (HEAD), this test case picks serial seq scan. The query plan looks like:
> --
> Seq Scan on public.tenk1 (cost=0.00..444828.12 rows=1012 width=244) (actual time=0.005..1675.268 rows=1000 loops=1)
>   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
> Planning Time: 0.053 ms
> Execution Time: 20165.023 ms
>
> With this patch, it will choose parallel seq scan and parallel insert. The query plan looks like:
> --
> Gather (cost=1000.00..370828.03 rows=1012 width=244) (actual time=20428.823..20437.143 rows=0 loops=1)
>   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
>   Workers Planned: 4
>   Workers Launched: 4
>   -> Create test
>     -> Parallel Seq Scan on public.tenk1 (cost=0.00..369828.03 rows=253 width=244) (actual time=0.021..411.094 rows=200 loops=5)
>        Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
>        Worker 0: actual time=0.023..390.856 rows=1858407 loops=1
>        Worker 1: actual time=0.024..468.587 rows=2264494 loops=1
>        Worker 2: actual time=0.023..473.170 rows=2286580 loops=1
>        Worker 3: actual time=0.027..373.727 rows=1853216 loops=1
> Planning Time: 0.053 ms
> Execution Time: 20437.643 ms
>
> test machine spec:
> CentOS 8.2, 128G RAM, 40 processors, disk SAS

Thanks a lot for the performance tests and test cases. I will analyze why the performance is degrading in that case and respond soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
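For anyone wanting to reproduce the regression in Test NO 4 locally, a rough sketch follows (the table name tenk1 is taken from the plans above; the copy-table name is made up, and sizes/results will of course differ from the report; the full setup is in the attached test_ctas.sql):

    -- Test NO 4 ("CTAS table"): copy an existing large table with CTAS and
    -- inspect which plan is chosen.
    DROP TABLE IF EXISTS tenk1_copy;
    EXPLAIN (ANALYZE, VERBOSE)
    CREATE TABLE tenk1_copy AS TABLE tenk1;
    -- On master this should show a serial Seq Scan; with the patch applied it
    -- should show a Gather node with a Parallel Seq Scan under the create step.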
Re: [PoC] Non-volatile WAL buffer
Hi, Now I have caught up with this thread. I see that many of you are interested in performance profiling. I share my slides in SNIA SDC 2020 [1]. In the slides, I had profiles focused on XLogInsert and XLogFlush (mainly the latter) for my non-volatile WAL buffer patchset. I found that the time for XLogWrite and locking/unlocking WALWriteLock were eliminated by the patchset. Instead, XLogInsert and WaitXLogInsertionsToFinish took more (or a little more) time than ever because memcpy-ing to PMEM (Optane PMem) is slower than to DRAM. For details, please see the slides. Best regards, Takashi [1] https://www.snia.org/educational-library/how-can-persistent-memory-make-databases-faster-and-how-could-we-go-ahead-2020 2021年1月26日(火) 18:50 Takashi Menjo : > Dear everyone, Tomas, > > First of all, the "v4" patchset for non-volatile WAL buffer attached to > the previous mail is actually v5... Please read "v4" as "v5." > > Then, to Tomas: > Thank you for your crash report you gave on Nov 27, 2020, regarding msync > patchset. I applied the latest msync patchset v3 attached to the previous > to master 411ae64 (on Jan18, 2021) then tested it, and I got no error when > pgbench -i -s 500. Please try it if necessary. > > Best regards, > Takashi > > > 2021年1月26日(火) 17:52 Takashi Menjo : > >> Dear everyone, >> >> Sorry but I forgot to attach my patchsets... Please see the files >> attached to this mail. Please also note that they contain some fixes. >> >> Best regards, >> Takashi >> >> >> 2021年1月26日(火) 17:46 Takashi Menjo : >> >>> Dear everyone, >>> >>> I'm sorry for the late reply. I rebase my two patchsets onto the latest >>> master 411ae64.The one patchset prefixed with v4 is for non-volatile WAL >>> buffer; the other prefixed with v3 is for msync. >>> >>> I will reply to your thankful feedbacks one by one within days. Please >>> wait for a moment. >>> >>> Best regards, >>> Takashi >>> >>> >>> 01/25/2021(Mon) 11:56 Masahiko Sawada : >>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra wrote: > > > > On 1/21/21 3:17 AM, Masahiko Sawada wrote: > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra > > wrote: > >> > >> Hi, > >> > >> I think I've managed to get the 0002 patch [1] rebased to master and > >> working (with help from Masahiko Sawada). It's not clear to me how it > >> could have worked as submitted - my theory is that an incomplete patch > >> was submitted by mistake, or something like that. > >> > >> Unfortunately, the benchmark results were kinda disappointing. For a > >> pgbench on scale 500 (fits into shared buffers), an average of three > >> 5-minute runs looks like this: > >> > >> branch 1163264 96 > >> > >> master 7291 87704165310150437 224186 > >> ntt 7912106095213206212410 237819 > >> simple-no-buffers 7654 96544115416 95828 103065 > >> > >> NTT refers to the patch from September 10, pre-allocating a large WAL > >> file on PMEM, and simple-no-buffers is the simpler patch simply removing > >> the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM. > >> > >> Note: The patch is just replacing the old implementation with mmap. > >> That's good enough for experiments like this, but we probably want to > >> keep the old one for setups without PMEM. But it's good enough for > >> testing, benchmarking etc. > >> > >> Unfortunately, the results for this simple approach are pretty bad. Not > >> only compared to the "ntt" patch, but even to master. 
I'm not entirely > >> sure what's the root cause, but I have a couple hypotheses: > >> > >> 1) bug in the patch - That's clearly a possibility, although I've tried > >> tried to eliminate this possibility. > >> > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than > >> NVMe storage, but still much slower than DRAM (both in terms of latency > >> and bandwidth, see [2] for some data). It's not terrible, but the > >> latency is maybe 2-3x higher - not a huge difference, but may matter for > >> WAL buffers? > >> > >> 3) PMEM does not handle parallel writes well - If you look at [2], > >> Figure 4(b), you'll see that the throughput actually *drops" as the > >> number of threads increase. That's pretty strange / annoying, because > >> that's how we write into WAL buffers - each thread writes it's own data, > >> so parallelism is not something we can get rid of. > >> > >> I've added some simple profiling, to measure number of calls / time for > >> each operation (use -DX
Re: simplifying foreign key/RI checks
At Sun, 24 Jan 2021 20:51:39 +0900, Amit Langote wrote in
> Here's v5.

At Mon, 25 Jan 2021 18:19:56 +0900, Amit Langote wrote in
>> Anybody else want to look this patch over before I mark it Ready For
>> Committer?
>
> Would be nice to have others look it over. Thanks.

This is a nice improvement.

0001 just looks fine.

0002:

 /* RI query type codes */
-/* these queries are executed against the PK (referenced) table: */
+/*
+ * 1 and 2 are no longer used, because PK (referenced) table is looked up
+ * directly using ri_ReferencedKeyExists().
 #define RI_PLAN_CHECK_LOOKUPPK          1
 #define RI_PLAN_CHECK_LOOKUPPK_FROM_PK  2
 #define RI_PLAN_LAST_ON_PK              RI_PLAN_CHECK_LOOKUPPK_FROM_PK

However, this patch does the following:

+    if (!ri_ReferencedKeyExists(pk_rel, fk_rel, newslot, riinfo))
+        ri_ReportViolation(riinfo,
+                           pk_rel, fk_rel,
+                           newslot,
+                           NULL,
+                           RI_PLAN_CHECK_LOOKUPPK, false);

It seems to me 1 (RI_PLAN_CHECK_LOOKUPPK) is still alive. (Yeah, I know that doesn't mean the usefulness of the macro but the mechanism the macro suggests, but it is confusing.)

On the other hand, RI_PLAN_CHECK_LOOKUPPK_FROM_PK and RI_PLAN_LAST_ON_PK seem to be no longer used. (Couldn't we remove them?) (About the latter, we can rewrite its only use, "if (qkey->constr_queryno <= RI_PLAN_LAST_ON_PK)", not to use the macro.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Is Recovery actually paused?
On Wed, 27 Jan 2021 13:29:23 +0530 Dilip Kumar wrote: > On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada > wrote: > > > > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas wrote: > > > > > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy > > > wrote: > > > > +1 to just show the recovery pause state in the output of > > > > pg_is_wal_replay_paused. But, should the function name > > > > "pg_is_wal_replay_paused" be something like > > > > "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists > > > > in a function, I expect a boolean output. Others may have better > > > > thoughts. > > > > > > Maybe we should leave the existing function pg_is_wal_replay_paused() > > > alone and add a new one with the name you suggest that returns text. > > > That would create less burden for tool authors. > > > > +1 > > > > Yeah, we can do that, I will send an updated patch soon. This means pg_is_wal_replay_paused is left without any change and this returns whether pause is requested or not? If so, it seems good to modify the documentation of this function in order to note that this could not return the actual pause state. Regards, Yugo Nagata -- Yugo NAGATA
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks for the review, please see the replies below. > On Jan 5, 2021, at 9:07 AM, Kyotaro Horiguchi wrote: > > At Wed, 8 Jul 2020 12:56:44 +, Paul Guo wrote in >> On 2020/01/15 19:18, Paul Guo wrote: >> I further fixed the last test failure (due to a small bug in the test, not >> in code). Attached are the new patch series. Let's see the CI pipeline >> result. >> >> Thanks for updating the patches! >> >> I started reading the 0003 patch. >> >> The approach that the 0003 patch uses is not the perfect solution. >> If the standby crashes after tblspc_redo() removes the directory and before >> its subsequent COMMIT record is replayed, PANIC error would occur since >> there can be some unresolved missing directory entries when we reach the >> consistent state. The problem would very rarely happen, though... >> Just idea; calling XLogFlush() to update the minimum recovery point just >> before tblspc_redo() performs destroy_tablespace_directories() may be >> safe and helpful for the problem? > > It seems to me that what the current patch does is too complex. What > we need to do here is to remember every invalid operation then forget > it when the prerequisite object is dropped. > > When a table space is dropped before consistency is established, we > don't need to care what has been performed inside the tablespace. In > this perspective, it is enough to remember tablespace ids when failed > to do something inside it due to the absence of the tablespace and > then forget it when we remove it. We could remember individual > database id to show them in error messages, but I'm not sure it's > useful. The reason log_invalid_page records block numbers is to allow > the machinery handle partial table truncations, but this is not the > case since dropping tablespace cannot leave some of containing > databases. > > As the result, we won't see an unresolved invalid operations in a > dropped tablespace. > > Am I missing something? Yes, removing the database id from the hash key in the log/forget code should be usually fine, but the previous code does stricter/safer checking. Consider the scenario: CREATE DATABASE newdb1 TEMPLATE template_db1; CREATE DATABASE newdb2 TEMPLATE template_db2; <- in case the template_db2 database directory is missing abnormally somehow. DROP DATABASE template_db1; The previous code could detect this but if we remove the database id in the code, this bad scenario is skipped. > > > dbase_redo: > + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode))) > + { > +XLogRecordMissingDir(xlrec->tablespace_id, InvalidOid, parent_path); > > This means "record the belonging table space directory if it is not > found OR it is not a directory". The former can be valid but the > latter is unconditionally can not (I don't think we bother considering > symlinks there). Again this is a safer check, in the case the parent_path is a file for example somehow, we should panic finally for the case and let the user checks and then does recovery again. > > +/* > + * Source directory may be missing. E.g. the template database used > + * for creating this database may have been dropped, due to reasons > + * noted above. Moving a database from one tablespace may also be a > + * partner in the crime. > + */ > +if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode))) > +{ > + XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, > src_path); > > This is a part of *creation* of the target directory. 
Lack of the > source directory cannot be valid even if the source directory is > dropped afterwards in the WAL stream and we can allow that if the > *target* tablespace is dropped afterwards. As the result, as I > mentioned above, we don't need to record about the database directory. > > By the way the name XLogLogMiss.. is somewhat confusing. How about > XLogReportMissingDir (named after report_invalid_page). Agree with you. Also your words remind me that we should skip the checking if the consistency point is reached. Here is a git diff against the previous patch. I’ll send out the new rebased patches after the consensus is reached. diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 7ade385965..c8fe3fe228 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -90,7 +90,7 @@ typedef struct xl_missing_dir static HTAB *missing_dir_tab = NULL; void -XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path) +XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path) { xl_missing_dir_key key; bool found; @@ -103,16 +103,6 @@ XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path) */ Assert(OidIsValid(spcNode)); - if (reachedConsistency) - { - if (dbNode == InvalidOid) - elog(PANIC, "cannot find directory %s (tablespace %d)", -
Re: Is Recovery actually paused?
On Wed, Jan 27, 2021 at 2:06 PM Yugo NAGATA wrote: > > On Wed, 27 Jan 2021 13:29:23 +0530 > Dilip Kumar wrote: > > > On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada > > wrote: > > > > > > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas wrote: > > > > > > > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy > > > > wrote: > > > > > +1 to just show the recovery pause state in the output of > > > > > pg_is_wal_replay_paused. But, should the function name > > > > > "pg_is_wal_replay_paused" be something like > > > > > "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists > > > > > in a function, I expect a boolean output. Others may have better > > > > > thoughts. > > > > > > > > Maybe we should leave the existing function pg_is_wal_replay_paused() > > > > alone and add a new one with the name you suggest that returns text. > > > > That would create less burden for tool authors. > > > > > > +1 > > > > > > > Yeah, we can do that, I will send an updated patch soon. > > This means pg_is_wal_replay_paused is left without any change and this > returns whether pause is requested or not? If so, it seems good to modify > the documentation of this function in order to note that this could not > return the actual pause state. Yes, we can say that it will return true if the replay pause is requested. I am changing that in my new patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
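To illustrate the interface being discussed, a sketch of how the split might look (the new function name is the one suggested upthread and the return values are only examples; nothing here is committed):

    -- Existing function: returns true as soon as a pause has been *requested*,
    -- even if the startup process has not actually stopped replaying yet.
    SELECT pg_is_wal_replay_paused();

    -- Proposed new function (name as suggested upthread): returns the actual
    -- state as text, e.g. something like 'not paused', 'pause requested', 'paused'.
    SELECT pg_get_wal_replay_pause_state();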
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Tue, Jan 26, 2021 at 9:18 AM japin wrote:
>
> When I read the discussion in [1], I found that update subscription's
> publications is complicated.
>
> For example, I have 5 publications in subscription.
>
> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 dbname=postgres'
> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
>
> If I want to drop "mypub4", we should use the following command:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5;
>
> Also, if I want to add "mypub7" and "mypub8", it will use:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5, mypub7, mypub8;
>
> Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, for the above
> two cases, we can use the following:
>
> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
>
> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub7, mypub8;
>
> I think it's more convenient. Any thoughts?

While the newly proposed syntax does seem to provide some ease for users, it offers nothing that we can't do with the current syntax. Also, in the current syntax, there is an additional provision for refreshing the existing publications as well. So, if the user has to change the existing subscription such that it has to (a) add new publication(s), (b) remove some publication(s), (c) refresh existing publication(s), then all can be done in one command, whereas with your newly proposed syntax the user has to write three separate commands.

Having said that, I don't deny the appeal of having separate commands for each of (a), (b), and (c).

--
With Regards,
Amit Kapila.
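To make the comparison concrete, here is a sketch of the two styles side by side. The ADD/DROP forms are only the syntax proposed by the patch in this thread, not something already in core:

    -- Current syntax: re-list everything, and optionally refresh, in one command.
    ALTER SUBSCRIPTION mysub1
        SET PUBLICATION mypub1, mypub2, mypub3, mypub5, mypub7, mypub8
        WITH (refresh = true);

    -- Proposed syntax: the same net change expressed as separate commands.
    ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
    ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
    ALTER SUBSCRIPTION mysub1 REFRESH PUBLICATION;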
RE: ECPG: proposal for new DECLARE STATEMENT
Dear Hackers,

I know I'm asking a big favor, but could you review (and commit) the patch? The status became RFC last November, but no one has looked at it since then. Maybe Meskes is quite busy and has no time to review it.

The main part of the patch is about 200 lines (that is, not so long), and I have probably reviewed more than that amount in other patches. I will review more, so I would be happy if this could be committed by the end of the next CF.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Two patches to speed up pg_rewind.
While reading the pg_rewind code I found two things that could speed up pg_rewind. Attached are the patches.

The first one: pg_rewind fsyncs the whole pgdata directory on the target by default, but that is a waste since usually just part of the files/directories on the target are modified. The other files on the target should have been flushed already, since pg_rewind requires a clean shutdown before doing the real work. This would help the scenario where the target postgres instance includes millions of files, which has been seen in a real environment. There are several things that may need further discussion:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data", but now the code guarded by it is only relevant to pre_sync_fname(), so we might want to rename it to something like HAVE_PRE_SYNC?

2. The pre_sync_fname() implementation. The code looks like this:

#if defined(HAVE_SYNC_FILE_RANGE)
	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

I'm a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED. I did not check the Linux kernel code, but according to the man page I suspect that this option might cause the kernel to evict the related pages from the page cache, which might not be something we expect. This is not a big issue since sync_file_range() should exist on most widely used Linux systems.

Also, I'm not sure how much we could benefit from the pre_sync code. Note that if the directory has a lot of files or the IO is fast, pre_sync_fname() might slow down the process instead. The reasons are: if there are a lot of files, it is possible that we need to read the already-synced-and-evicted inode from disk (by open()-ing) after rewinding, since the inode cache in the Linux kernel is limited; and if the IO is fast and the kernel does background dirty-page flushing quickly, pre_sync_fname() might just waste CPU cycles. A better solution might be to launch a separate pthread and fsync files one by one as pg_rewind finishes handling each file. pg_basebackup could use that solution as well. Anyway, this is independent of this patch.

The second one is to use copy_file_range() for the local rewind case to replace read()+write(). This introduces a copy_file_range() check and HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed. copy_file_range() was introduced in newer Linux kernels; on older Linux or other Unix-like OSes mmap() might be better than read()+write(), but copy_file_range() is more interesting given that it can skip the data copying in some file systems - this could be even more beneficial for Linux filesystems on network-based block storage.

Regards,
Paul

0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
Description: 0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch

0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description: 0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, Jan 27, 2021 at 2:30 PM Amit Kapila wrote: > > On Tue, Jan 26, 2021 at 9:18 AM japin wrote: > > > > > > When I read the discussion in [1], I found that update subscription's > > publications > > is complicated. > > > > For example, I have 5 publications in subscription. > > > > CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 > > dbname=postgres' > > PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5; > > > > If I want to drop "mypub4", we should use the following command: > > > > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, > > mypub5; > > > > Also, if I want to add "mypub7" and "mypub8", it will use: > > > > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, > > mypub5, mypub7, mypub8; > > > > Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, > > for the above > > two cases, we can use the following: > > > > ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4; > > > > ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub7, mypub8; > > > > I think it's more convenient. Any thoughts? > > > > While the new proposed syntax does seem to provide some ease for users > but it has nothing which we can't do with current syntax. Also, in the > current syntax, there is an additional provision for refreshing the > existing publications as well. So, if the user has to change the > existing subscription such that it has to (a) add new publication(s), > (b) remove some publication(s), (c) refresh existing publication(s) > then all can be done in one command whereas with your new proposed > syntax user has to write three separate commands. IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this option enough to achieve what we can with ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2 WITH (refresh = true);? Am I missing something here? > Having said that, I don't deny the appeal of having separate commands > for each of (a), (b), and (c). for (c) i.e. refresh existing publication do we need something like ALTER SUBSCRIPTION mysub1 REFRESH SUBSCRIPTION or some other syntax that only refreshes the subscription similar to ALTER SUBSCRIPTION mysub1 REFRESH PUBLICATION? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Freeze the inserted tuples during CTAS?
Here is the simple patch:

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index dce882012e..0391699423 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->rel = intoRelationDesc;
 	myState->reladdr = intoRelationAddr;
 	myState->output_cid = GetCurrentCommandId(true);
-	myState->ti_options = TABLE_INSERT_SKIP_FSM;
+	myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;

The MatView code already does this, and COPY does it if FREEZE is specified. I'm not sure what the community thinks about this. Personally, I care mostly about the all-visible setting that comes with TABLE_INSERT_FROZEN, since it would let index-only scans work right after creating an index on a CTAS table; otherwise people have to run VACUUM before index-only scans become effective. If people do not expect freezing, could we at least introduce an option to specify the visibility during the insert?

Regards,
Paul
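As an illustration of the index-only-scan point, a sketch with made-up table names:

    -- Today the rows written by CTAS are not marked all-visible, so an
    -- index-only scan right after building an index still has to visit the heap:
    CREATE TABLE big_copy AS SELECT * FROM big_src;
    CREATE INDEX ON big_copy (id);
    VACUUM big_copy;   -- currently needed just to set visibility-map bits
    -- With TABLE_INSERT_FROZEN set in intorel_startup(), the tuples could be
    -- written frozen/all-visible and the extra VACUUM step would not be needed.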
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, Jan 27, 2021 at 2:57 PM Bharath Rupireddy wrote: > > On Wed, Jan 27, 2021 at 2:30 PM Amit Kapila wrote: > > > > On Tue, Jan 26, 2021 at 9:18 AM japin wrote: > > > > > > > > > When I read the discussion in [1], I found that update subscription's > > > publications > > > is complicated. > > > > > > For example, I have 5 publications in subscription. > > > > > > CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 > > > dbname=postgres' > > > PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5; > > > > > > If I want to drop "mypub4", we should use the following command: > > > > > > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, > > > mypub5; > > > > > > Also, if I want to add "mypub7" and "mypub8", it will use: > > > > > > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, > > > mypub5, mypub7, mypub8; > > > > > > Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... > > > syntax, for the above > > > two cases, we can use the following: > > > > > > ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4; > > > > > > ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub7, mypub8; > > > > > > I think it's more convenient. Any thoughts? > > > > > > > While the new proposed syntax does seem to provide some ease for users > > but it has nothing which we can't do with current syntax. Also, in the > > current syntax, there is an additional provision for refreshing the > > existing publications as well. So, if the user has to change the > > existing subscription such that it has to (a) add new publication(s), > > (b) remove some publication(s), (c) refresh existing publication(s) > > then all can be done in one command whereas with your new proposed > > syntax user has to write three separate commands. > > IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION > mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this > option enough to achieve what we can with ALTER SUBSCRIPTION mysub1 > SET PUBLICATION mypub1, mypub2 WITH (refresh = true);? Am I missing > something here? > I feel the SET syntax would allow refreshing existing publications as well whereas, in Add, it will be only for new Publications. -- With Regards, Amit Kapila.
Re: Freeze the inserted tuples during CTAS?
Hi, I confirm that my analytic workflows often do the CTAS and VACUUM of the relation right after, before the index creation, to mark stuff as all-visible for IOS to work. Freezing and marking as visible will help. On Wed, Jan 27, 2021 at 12:29 PM Paul Guo wrote: > Here is the simple patch, > > diff --git a/src/backend/commands/createas.c > b/src/backend/commands/createas.c > index dce882012e..0391699423 100644 > --- a/src/backend/commands/createas.c > +++ b/src/backend/commands/createas.c > @@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation, > TupleDesc typeinfo) > myState->rel = intoRelationDesc; > myState->reladdr = intoRelationAddr; > myState->output_cid = GetCurrentCommandId(true); > - myState->ti_options = TABLE_INSERT_SKIP_FSM; > + myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN; > > MatView code already does this and COPY does this if specified. I’m not > sure how > does the community think about this. Actually personally I expect more > about the > all-visible setting due to TABLE_INSERT_FROZEN since I could easier use > index only scan > if we create an index and table use CTAS, else people have to use index > only scan > after vacuum. If people do not expect freeze could we at least introduce a > option to > specify the visibility during inserting? > > Regards, > Paul -- Darafei "Komяpa" Praliaskouski OSM BY Team - http://openstreetmap.by/
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila wrote: > > > While the new proposed syntax does seem to provide some ease for users > > > but it has nothing which we can't do with current syntax. Also, in the > > > current syntax, there is an additional provision for refreshing the > > > existing publications as well. So, if the user has to change the > > > existing subscription such that it has to (a) add new publication(s), > > > (b) remove some publication(s), (c) refresh existing publication(s) > > > then all can be done in one command whereas with your new proposed > > > syntax user has to write three separate commands. > > > > IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION > > mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this > > option enough to achieve what we can with ALTER SUBSCRIPTION mysub1 > > SET PUBLICATION mypub1, mypub2 WITH (refresh = true);? Am I missing > > something here? > > > > I feel the SET syntax would allow refreshing existing publications as > well whereas, in Add, it will be only for new Publications. I think the patch v2-0001 will refresh all the publications, I mean existing and newly added publications. IIUC the patch, it first fetches all the available publications with the subscriptions and it sees if that list has the given publication [1], if not, then adds it to the existing publications list and returns that list [2]. If the refresh option is specified as true with ALTER SUBSCRIPTION ... ADD PUBLICATION, then it refreshes all the returned publications [3]. I believe this is also true with ALTER SUBSCRIPTION ... DROP PUBLICATION. So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION will refresh the new and existing publications. [1] + +/* + * merge current subpublications and user specified by add/drop publications. + * + * If addpub == true, we will add the list of publications into current + * subpublications. Otherwise, we will delete the list of publications from + * current subpublications. + */ +static List * +merge_subpublications(HeapTuple tuple, TupleDesc tupledesc, + List *publications, bool addpub) [2] +publications = merge_subpublications(tup, RelationGetDescr(rel), [3] +/* Refresh if user asked us to. */ +if (refresh) +{ +if (!sub->enabled) +ereport(ERROR, +(errcode(ERRCODE_SYNTAX_ERROR), + errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"), + errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false)."))); + +/* Make sure refresh sees the new list of publications. */ +sub->publications = publications; + +AlterSubscription_refresh(sub, copy_data); With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, 27 Jan 2021 at 17:46, Bharath Rupireddy wrote: > On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila wrote: >> > > While the new proposed syntax does seem to provide some ease for users >> > > but it has nothing which we can't do with current syntax. Also, in the >> > > current syntax, there is an additional provision for refreshing the >> > > existing publications as well. So, if the user has to change the >> > > existing subscription such that it has to (a) add new publication(s), >> > > (b) remove some publication(s), (c) refresh existing publication(s) >> > > then all can be done in one command whereas with your new proposed >> > > syntax user has to write three separate commands. >> > >> > IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION >> > mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this >> > option enough to achieve what we can with ALTER SUBSCRIPTION mysub1 >> > SET PUBLICATION mypub1, mypub2 WITH (refresh = true);? Am I missing >> > something here? >> > >> >> I feel the SET syntax would allow refreshing existing publications as >> well whereas, in Add, it will be only for new Publications. > > I think the patch v2-0001 will refresh all the publications, I mean > existing and newly added publications. IIUC the patch, it first > fetches all the available publications with the subscriptions and it > sees if that list has the given publication [1], if not, then adds it > to the existing publications list and returns that list [2]. If the > refresh option is specified as true with ALTER SUBSCRIPTION ... ADD > PUBLICATION, then it refreshes all the returned publications [3]. I > believe this is also true with ALTER SUBSCRIPTION ... DROP > PUBLICATION. > > So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION > will refresh the new and existing publications. > Yes! It will refresh the new and existing publications. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, 27 Jan 2021 at 16:59, Amit Kapila wrote:
> On Tue, Jan 26, 2021 at 9:18 AM japin wrote:
>>
>> When I read the discussion in [1], I found that update subscription's
>> publications is complicated.
>>
>> For example, I have 5 publications in subscription.
>>
>> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 dbname=postgres'
>> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
>>
>> If I want to drop "mypub4", we should use the following command:
>>
>> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5;
>>
>> Also, if I want to add "mypub7" and "mypub8", it will use:
>>
>> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5, mypub7, mypub8;
>>
>> Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, for the above
>> two cases, we can use the following:
>>
>> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
>>
>> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub7, mypub8;
>>
>> I think it's more convenient. Any thoughts?
>
> While the new proposed syntax does seem to provide some ease for users
> but it has nothing which we can't do with current syntax. Also, in the
> current syntax, there is an additional provision for refreshing the
> existing publications as well. So, if the user has to change the
> existing subscription such that it has to (a) add new publication(s),
> (b) remove some publication(s), (c) refresh existing publication(s)
> then all can be done in one command whereas with your new proposed
> syntax user has to write three separate commands.

If we want to add and drop some publications, we can use SET PUBLICATION, and it is more convenient than ADD and DROP PUBLICATION; however, if we just want to add (or drop) a publication into (or from) a subscription which has many publications, then the new syntax is more convenient IMO.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi Joel, As always, great catch! Kindly find the updated patch (v15) below Changelog: - v15 (compatible with current master 2021-01-27, commit e42b3c3bd6a9c6233ac4c8a2e9b040367ba2f97c) * remove "EACH ELEMENT OF" from violation messages * accommodate tests accordingly Keep up the good work testing this patch. /Mark On Wed, Jan 27, 2021 at 5:11 AM Joel Jacobson wrote: > On Tue, Jan 26, 2021, at 12:59, Mark Rofail wrote: > > Please don't hesitate to give your feedback. > > The error message for insert or update violations looks fine: > > UPDATE catalog_clone.pg_extension SET extconfig = ARRAY[12345] WHERE oid = > 10; > ERROR: insert or update on table "pg_extension" violates foreign key > constraint "pg_extension_extconfig_fkey" > DETAIL: Key (EACH ELEMENT OF extconfig)=(12345) is not present in table > "pg_class". > > But when trying to delete a still referenced row, > the column mentioned in the "EACH ELEMENT OF" sentence > is not the array column, but the column in the referenced table: > > DELETE FROM catalog_clone.pg_class WHERE oid = 10; > ERROR: update or delete on table "pg_class" violates foreign key > constraint "pg_extension_extconfig_fkey" on table "pg_extension" > DETAIL: Key (EACH ELEMENT OF oid)=(10) is still referenced from table > "pg_extension". > > I think either the "EACH ELEMENT OF" part of the latter error message > should be dropped, > or the column name for the array column should be used. > > /Joel >
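For readers following along, a sketch of the kind of schema these tests exercise. The EACH ELEMENT OF reference syntax is what this patch series adds (it is not in vanilla PostgreSQL), and the table names here are made up:

    CREATE TABLE pk_side (id int PRIMARY KEY);
    CREATE TABLE fk_side (
        ids int[],
        FOREIGN KEY (EACH ELEMENT OF ids) REFERENCES pk_side (id)
    );

    INSERT INTO pk_side VALUES (10);
    INSERT INTO fk_side VALUES (ARRAY[10]);     -- ok
    INSERT INTO fk_side VALUES (ARRAY[12345]);  -- fails: element not present in pk_side
    DELETE FROM pk_side WHERE id = 10;          -- fails: still referenced from fk_side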
Re: poc - possibility to write window function in PL languages
On Wed, Jan 20, 2021 at 21:14, Pavel Stehule wrote:
>
> On Wed, Jan 20, 2021 at 21:07, Tom Lane wrote:
>> Pavel Stehule writes:
>>> The second question is work with partition context value. This should be
>>> only one value, and of only one but of any type per function. In this case
>>> we cannot use GET statements. I had an idea of enhancing declaration. Some
>>> like
>>>
>>> DECLARE
>>>   pcx PARTITION CONTEXT (int); -- read partition context
>>> BEGIN
>>>   pcx := 10; -- set partition context
>>>
>>> What do you think about it?
>>
>> Uh, what? I don't understand what this "partition context" is.
>
> It was my name for an access to window partition local memory -
> WinGetPartitionLocalMemory
>
> We need some interface for this cache

I have to think more about the declarative syntax. When I try to transform our WindowObject API directly, it looks like Cobol. It needs a different concept to be user friendly.

Regards

Pavel

>> regards, tom lane
Re: [PATCH] Disable bgworkers during servers start in pg_upgrade
Hi,

Andres Freund wrote:
> On 2021-01-21 16:23:58 +0100, Denis Laxalde wrote:
>> We found an issue in pg_upgrade on a cluster with a third-party
>> background worker. The upgrade goes fine, but the new cluster is then in
>> an inconsistent state. The background worker comes from the PoWA
>> extension but the issue does not appear to related to this particular
>> code.
>
> Well, it does imply that that backgrounder did something, as the pure
> existence of a bgworker shouldn't affect anything. Presumably the issue
> is that the bgworker actually does transactional writes, which causes
> problems because the xids / multixacts from early during pg_upgrade
> won't actually be valid after we do pg_resetxlog etc.
>
>> As a solution, it seems that, for similar reasons that we restrict
>> socket access to prevent accidental connections (from commit
>> f763b77193), we should also prevent background workers to start at this
>> step.
>
> I think that'd have quite the potential for negative impact - imagine
> extensions that refuse to be loaded outside of shared_preload_libraries
> (e.g. because they need to allocate shared memory) but that are required
> during the course of pg_upgrade (e.g. because it's a tableam, a PL or
> such). Those libraries will then tried to be loaded during the upgrade
> (due to the _PG_init() hook being called when functions from the
> extension are needed, e.g. the tableam or PL handler).
>
> Nor is it clear to me that the only way this would be problematic is
> with shared_preload_libraries. A library in local_preload_libraries, or
> a demand loaded library can trigger bgworkers (or database writes in
> some other form) as well.

Thank you for those insights!

> I wonder if we could
>
> a) set default_transaction_read_only to true, and explicitly change it
>    in the sessions that need that.
> b) when in binary upgrade mode / -b, error out on all wal writes in
>    sessions that don't explicitly set a session-level GUC to allow
>    writes.

Solution "a" appears to be enough to solve the problem described in my initial email. See attached patch implementing this.

Cheers,
Denis

>From d8b74f9220775736917b7fc08bbe397d7e2eedcd Mon Sep 17 00:00:00 2001
From: Denis Laxalde
Date: Wed, 20 Jan 2021 17:25:58 +0100
Subject: [PATCH] Set default transactions to read-only at servers start in
 pg_upgrade

In essence, this is for a similar reason that we use a restricted socket
access from f763b77193b04eba03a1f4ce46df34dc0348419e because background
workers may produce undesired activities during the upgrade.

Author: Denis Laxalde
Co-authored-by: Jehan-Guillaume de Rorthais
---
 src/bin/pg_upgrade/server.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 31b1425202..b17f348a5b 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -244,7 +244,7 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error)
 	 * vacuumdb --freeze actually freezes the tuples.
 	 */
 	snprintf(cmd, sizeof(cmd),
-			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s %s%s\" start",
+			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s %s%s -c default_transaction_read_only=on\" start",
 			 cluster->bindir, SERVER_LOG_FILE, cluster->pgconfig, cluster->port,
 			 (cluster->controldata.cat_ver >=
 			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
--
2.20.1
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, Jan 27, 2021 at 3:26 PM japin wrote: > > > On Wed, 27 Jan 2021 at 16:59, Amit Kapila wrote: > > On Tue, Jan 26, 2021 at 9:18 AM japin wrote: > >> > >> > >> When I read the discussion in [1], I found that update subscription's > >> publications > >> is complicated. > >> > >> For example, I have 5 publications in subscription. > >> > >> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 > >> dbname=postgres' > >> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5; > >> > >> If I want to drop "mypub4", we should use the following command: > >> > >> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, > >> mypub5; > >> > >> Also, if I want to add "mypub7" and "mypub8", it will use: > >> > >> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, > >> mypub5, mypub7, mypub8; > >> > >> Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, > >> for the above > >> two cases, we can use the following: > >> > >> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4; > >> > >> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub7, mypub8; > >> > >> I think it's more convenient. Any thoughts? > >> > > > > While the new proposed syntax does seem to provide some ease for users > > but it has nothing which we can't do with current syntax. Also, in the > > current syntax, there is an additional provision for refreshing the > > existing publications as well. So, if the user has to change the > > existing subscription such that it has to (a) add new publication(s), > > (b) remove some publication(s), (c) refresh existing publication(s) > > then all can be done in one command whereas with your new proposed > > syntax user has to write three separate commands. > > > > If we want add and drop some publications, we can use SET PUBLICATION, it > is more convenient than ADD and DROP PUBLICATION, however if we just want to > add (or drop) publication into (or from) subcription which has much > publications, > then the new syntax is more convenient IMO. > I agree with you that if we just want to add or remove a few publications in the existing subscription then providing the complete list is not convenient. The new syntax is way better, although I am not sure how frequently users need to add/remove publication in the subscription. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Wrong usage of RelationNeedsWAL
On Thu, Jan 21, 2021 at 01:23:36AM -0800, Noah Misch wrote: > On Thu, Jan 21, 2021 at 06:02:11PM +0900, Kyotaro Horiguchi wrote: > > At Thu, 21 Jan 2021 00:19:58 -0800, Noah Misch wrote in > > > On Thu, Jan 21, 2021 at 12:28:44AM +0900, Kyotaro Horiguchi wrote: > > > > However, with the previous patch, two existing tests sto_using_cursor > > > > and sto_using_select behaves differently from the master. That change > > > > is coming from the omission of actual LSN check in TestForOldSnapshot > > > > while wal_level=minimal. So we have no choice other than actually > > > > updating page LSN. > > > > > > > > In the scenario under discussion there are two places we need to do > > > > that. one is heap_page_prune, and the other is RelationCopyStorge. As > > > > a PoC, I used gistXLogAssignLSN() as is for thie purpose. See the > > > > attached third file. > > > > > > Fake LSNs make the system harder to understand, so I prefer not to spread > > > fake > > > LSNs to more access methods. What I had in mind is to simply suppress > > > early > > > pruning when the relation is skipping WAL. Attached. Is this > > > reasonable? It > > > passes the older tests. While it changes the sto_wal_optimized.spec > > > output, I > > > think it preserves the old_snapshot_threshold behavior contract. > > > > Perhaps I'm missing something, but the patch doesn't pass the v5-0001 > > test with wal_level=minimal? > > Correct. The case we must avoid is letting an old snapshot read an > early-pruned page without error. v5-0001 expects "ERROR: snapshot too old". > The patch suspends early pruning, so that error is not applicable. I think the attached version is ready. The changes since v6nm are cosmetic: - Wrote log messages - Split into two patches, since the user-visible bugs are materially different - Fixed typos - Ran perltidy Is it okay if I push these on Saturday, or would you like more time to investigate? Author: Noah Misch Commit: Noah Misch Fix error with CREATE PUBLICATION, wal_level=minimal, and new tables. CREATE PUBLICATION has failed spuriously when applied to a permanent relation created or rewritten in the current transaction. Make the same change to another site having the same semantic intent; the second instance has no user-visible consequences. Back-patch to v13, where commit c6b92041d38512a4176ed76ad06f713d2e6c01a8 broke this. Kyotaro Horiguchi Discussion: https://postgr.es/m/20210113.160705.2225256954956139776.horikyota@gmail.com diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c index 5f8e1c6..84d2efc 100644 --- a/src/backend/catalog/pg_publication.c +++ b/src/backend/catalog/pg_publication.c @@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel) errdetail("System tables cannot be added to publications."))); /* UNLOGGED and TEMP relations cannot be part of publication. */ - if (!RelationNeedsWAL(targetrel)) + if (targetrel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("table \"%s\" cannot be replicated", diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index da322b4..177e6e3 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -126,7 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent, relation = table_open(relationObjectId, NoLock); /* Temporary and unlogged relations are inaccessible during recovery. 
*/ - if (!RelationNeedsWAL(relation) && RecoveryInProgress()) + if (relation->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT && + RecoveryInProgress()) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot access temporary or unlogged relations during recovery"))); diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl index 0680f44..3d90f81 100644 --- a/src/test/subscription/t/001_rep_changes.pl +++ b/src/test/subscription/t/001_rep_changes.pl @@ -3,7 +3,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 23; +use Test::More tests => 24; # Initialize publisher node my $node_publisher = get_new_node('publisher'); @@ -358,3 +358,21 @@ is($result, qq(0), 'check replication origin was dropped on subscriber'); $node_subscriber->stop('fast'); $node_publisher->stop('fast'); + +# CREATE PUBLICATION while wal_level=minimal should succeed, with a WARNING +$node_publisher->append_conf( + 'postgresql.conf', qq( +wal_level=minimal +max_wal_senders=0 +)); +$node_publisher->start; +($result, my $retout, my $reterr) = $node_publis
Re: Protect syscache from bloating with negative cache entries
On 27/01/2021 03:13, Kyotaro Horiguchi wrote: At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi wrote in The commit 4656e3d668 (debug_invalidate_system_caches_always) conflicted with this patch. Rebased. At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi wrote in (I found a bug in a benchmark-aid function (CatalogCacheFlushCatalog2), I repost an updated version soon.) I noticed that a catcachebench-aid function CatalogCacheFlushCatalog2() allocates bucked array wrongly in the current memory context, which leads to a crash. This is a fixed it then rebased version. Thanks, with the scripts you provided, I was able to run the performance tests on my laptop, and got very similar results as you did. The impact of v7-0002-Remove-dead-flag-from-catcache-tuple.patch is very small. I think I could see it in the tests, but only barely. And the tests did nothing else than do syscache lookups; in any real world scenario, it would be lost in noise. I think we can put that aside for now, and focus on v6-0001-CatCache-expiration-feature.patch: The pruning is still pretty lethargic: - Entries created in the same transactions are never pruned away - The size of the hash table is never shrunk. So even though the patch puts a backstop to the hash table growing indefinitely, if you run one transaction that bloats the cache, it's bloated for the rest of the session. I think that's OK. We might want to be more aggressive in the future, but for now it seems reasonable to lean towards the current behavior where nothing is pruned. Although I wonder if we should try to set 'catcacheclock' more aggressively. I think we could set it whenever GetCurrentTimestamp() is called, for example. Given how unaggressive this mechanism is, I think it should be safe to enable it by default. What would be a suitable default for catalog_cache_prune_min_age? 30 seconds? Documentation needs to be updated for the new GUC. Attached is a version with a few little cleanups: - use TimestampTz instead of uint64 for the timestamps - remove assign_catalog_cache_prune_min_age(). All it did was convert the GUC's value from seconds to microseconds, and stored it in a separate variable. Multiplication is cheap, so we can just do it when we use the GUC's value instead. - Heikki >From 73583c2c9cd26bad7c3942eff1ddccb2ebb43222 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 27 Jan 2021 13:08:08 +0200 Subject: [PATCH v8 1/1] CatCache expiration feature Author: Kyotaro Horiguchi --- src/backend/access/transam/xact.c | 3 ++ src/backend/utils/cache/catcache.c | 82 +- src/backend/utils/misc/guc.c | 12 + src/include/utils/catcache.h | 17 +++ 4 files changed, 112 insertions(+), 2 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index a2068e3fd45..86888d24091 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1086,6 +1086,9 @@ static void AtStart_Cache(void) { AcceptInvalidationMessages(); + + if (xactStartTimestamp != 0) + SetCatCacheClock(xactStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index fa2b49c676e..e29f825687a 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,18 @@ #define CACHE_elog(...) 
#endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = -1; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -74,6 +84,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Index hashIndex, Datum v1, Datum v2, Datum v3, Datum v4); +static bool CatCacheCleanupOldEntries(CatCache *cp); static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); @@ -99,7 +110,6 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); - /* * internal support functions */ @@ -1264,6 +1274,9 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Record the last access timestamp */ + ct->lastaccess = catcacheclock; + /* * If it's a positive ent
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, Jan 27, 2021 at 3:16 PM Bharath Rupireddy wrote: > > On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila wrote: > > So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION > will refresh the new and existing publications. > That sounds a bit unusual to me because when the user has specifically asked to just ADD Publication, we might refresh some existing Publication along with it? -- With Regards, Amit Kapila.
Re: pg_upgrade fails with non-standard ACL
On Mon, Jan 25, 2021 at 10:14:43PM +0300, Anastasia Lubennikova wrote: > On 24.01.2021 11:39, Noah Misch wrote: > >On Thu, Jan 21, 2021 at 01:03:58AM +0300, Anastasia Lubennikova wrote: > >>On 03.01.2021 14:29, Noah Misch wrote: > >>>Overall, this patch predicts a subset of cases where pg_dump will emit a > >>>failing GRANT or REVOKE that targets a pg_catalog object. Can you write a > >>>code comment stating what it does and does not detect? I think it's okay > >>>to > >>>not predict every failure, but let's record the intended coverage. Here > >>>are a > >>>few examples I know; there may be more. The above query only finds GRANTs > >>>to > >>>non-pinned roles. pg_dump dumps the following, but the above query doesn't > >>>find them: > >>> > >>> REVOKE ALL ON FUNCTION pg_last_wal_replay_lsn FROM public; > >>> GRANT EXECUTE ON FUNCTION pg_reload_conf() TO pg_signal_backend; > >I see a new comment listing object types. Please also explain the lack of > >preventing REVOKE failures (first example query above) and the limitation > >around non-pinned roles (second example). > > 1) Could you please clarify, what do you mean by REVOKE failures? > > I tried following example: > > Start 9.6 cluster. > > REVOKE ALL ON function pg_switch_xlog() FROM public; > REVOKE ALL ON function pg_switch_xlog() FROM backup; > > The upgrade to the current master passes with and without patch. > It seems that current implementation of pg_upgrade doesn't take into account > revoke ACLs. I think you can observe it by adding "revoke all on function pg_stat_get_subscription from public;" to test_add_acl_to_catalog_objects.sql and then rerunning your test script. pg_dump will reproduce that REVOKE, which would fail if postgresql.org had removed that function. That's fine, so long as a comment mentions the limitation. > 2) As for pinned roles, I think we should fix this behavior, rather than > adding a comment. Because such roles can have grants on system objects. > > In earlier versions of the patch, I gathered ACLs directly from system > catalogs: pg_proc.proacl, pg_class.relacl pg_attribute.attacl and > pg_type.typacl. > > The only downside of this approach is that it cannot be easily extended to > other object types, as we need to handle every object type separately. > I don't think it is a big deal, as we already do it in > check_for_changed_signatures() > > And also the query to gather non-standard ACLs won't look as nice as now, > because of the need to parse arrays of aclitems. > > What do you think? Hard to say for certain without seeing the code both ways. I'm not generally enthusiastic about adding pg_upgrade code to predict whether the dump will fail to restore, because such code will never be as good as just trying the restore. The patch has 413 lines of code, which is substantial. I would balk if, for example, the patch tripled in size to catch more cases.
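To spell out the REVOKE limitation with a concrete sketch, reusing the example function already mentioned upthread (the failure scenario is hypothetical):

    -- Run on the old cluster before the upgrade:
    REVOKE ALL ON FUNCTION pg_last_wal_replay_lsn() FROM public;
    -- pg_dump will emit a matching REVOKE when pg_upgrade restores the schema;
    -- if a future release removed or changed the signature of that function,
    -- the restore step would fail, and the proposed pre-check query (which
    -- only looks for GRANTs to non-pinned roles) would not have predicted it.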
Re: Fix ALTER SUBSCRIPTION ... SET PUBLICATION documentation
On Tue, Jan 26, 2021 at 4:56 PM japin wrote: > > > Hi, > > When I read the documentation of ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (...), > it says "set_publication_option" only supports "refresh" in the documentation [1]. > However, we can also supply the "copy_data" option, and the code is: > I think there is a reference to the 'copy_data' option as well. There is a sentence saying: "Additionally, refresh options as described under REFRESH PUBLICATION may be specified." and if you then look at the refresh options, 'copy_data' is mentioned there; isn't that sufficient? -- With Regards, Amit Kapila.
Re: Parallel INSERT (INTO ... SELECT ...)
On Wed, Jan 27, 2021 at 2:13 PM Hou, Zhijie wrote: > > Hi, > > When testing the patch with the following kinds of SQL: > > --- > Insert into part_table select 1; > Insert into part_table select generate_series(1,1,1); > Insert into part_table select * from testfunc(); > --- > > we usually use these statements to initialize a table or for testing purposes. > > Personally, I think we do not need to do the parallel safety check for these cases, > because there seems to be no chance for the select part to be considered for parallelism. > > I thought we aimed to not check the safety unless parallelism is possible, > so I was wondering whether it is possible to avoid the check in these cases. > > I did a quick check on the code; an immature idea is to check whether there is > an RTE_RELATION in the query. > If not, we skip the safety check. > > I am not sure whether it is worth doing that, any thoughts? > Yes, I think it's worth it. It's surprising that there aren't really any optimizations for these with just the current Postgres parallel SELECT functionality (as there's currently no way to divide the work for these amongst the workers, even if the function/expression is parallel-safe). For the additional parallel-safety checks for INSERT, we currently check that RTE_SUBQUERY is in the range table. So I think we can additionally check that RTE_RELATION is in the subquery's range table (otherwise treat it as unsafe). Regards, Greg Nancarrow Fujitsu Australia
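To make the proposed heuristic concrete, here is a rough SQL illustration (part_table comes from the quoted tests; src_tbl is a made-up name, and this only sketches the rule being discussed, not the patch's actual implementation):

-- No base relation anywhere in the source query: a parallel plan cannot be
-- used for the SELECT part anyway, so the parallel-safety check could be
-- skipped entirely.
INSERT INTO part_table SELECT i FROM generate_series(1, 1000) i;

-- A relation does appear in the subquery's range table, so the additional
-- RTE_RELATION check would still classify this as worth examining.
INSERT INTO part_table SELECT a FROM (SELECT a FROM src_tbl) s;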
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, Jan 27, 2021 at 4:42 PM Amit Kapila wrote: > > On Wed, Jan 27, 2021 at 3:16 PM Bharath Rupireddy > wrote: > > > > On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila wrote: > > > > So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION > > will refresh the new and existing publications. > > > > That sounds a bit unusual to me because when the user has specifically > asked to just ADD Publication, we might refresh some existing > Publication along with it? Hmm. That's correct. I also feel we should not touch the existing publications; only the ones that are added/dropped should be refreshed, because there will be the overhead of a SQL query with more publications (in fetch_table_list) when AlterSubscription_refresh() is called with all the existing publications. We could just pass the newly added/dropped publications to AlterSubscription_refresh(). I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with refresh = true refreshes only the newly added publications, because what AlterSubscription_refresh() does is fetch the tables associated with the publications from the publisher, compare them with the previously fetched tables for those publications, and add the new tables or remove the tables that no longer exist in those publications. For ALTER SUBSCRIPTION ... DROP PUBLICATION we can do the same thing, i.e. refresh only the dropped publications. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
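To make that behaviour concrete, a sketch with made-up subscription and publication names, assuming the proposed syntax accepts the same WITH options as SET PUBLICATION:

-- Only pub_new's table list is fetched from the publisher and merged into
-- pg_subscription_rel; the already-subscribed publications are left alone.
ALTER SUBSCRIPTION sub_sales ADD PUBLICATION pub_new WITH (refresh = true, copy_data = true);

-- Only pub_old's tables are removed from pg_subscription_rel.
ALTER SUBSCRIPTION sub_sales DROP PUBLICATION pub_old WITH (refresh = true);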
Re: Fix ALTER SUBSCRIPTION ... SET PUBLICATION documentation
On Wed, Jan 27, 2021 at 4:57 PM Amit Kapila wrote: > > On Tue, Jan 26, 2021 at 4:56 PM japin wrote: > > > > > > Hi, > > > > When I read the documentation of ALTER SUBSCRIPTION ... SET PUBLICATION ... > > WITH (...), > > it says "set_publication_option" only supports "refresh" in the documentation > > [1]. > > However, we can also supply the "copy_data" option, and the code is: > > > > I think there is a reference to the 'copy_data' option as well. There > is a sentence saying: "Additionally, refresh options as described > under REFRESH PUBLICATION may be specified." and if you then look at the refresh > options, 'copy_data' is mentioned there; isn't that sufficient? Right. It looks like the copy_data option is indirectly mentioned with the statement "Additionally, refresh options as described under REFRESH PUBLICATION may be specified." under "set_publication_option". IMHO, we can keep it that way. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: simplifying foreign key/RI checks
On Wed, Jan 27, 2021 at 5:32 PM Kyotaro Horiguchi wrote: > At Sun, 24 Jan 2021 20:51:39 +0900, Amit Langote > wrote in > > Here's v5. > > At Mon, 25 Jan 2021 18:19:56 +0900, Amit Langote > wrote in > > > Anybody else want to look this patch over before I mark it Ready For > > > Committer? > > > > Would be nice to have others look it over. Thanks. > > This nice improvement. > > 0001 just looks fine. > > 0002: > > /* RI query type codes */ > -/* these queries are executed against the PK (referenced) table: */ > +/* > + * 1 and 2 are no longer used, because PK (referenced) table is looked up > + * directly using ri_ReferencedKeyExists(). > #define RI_PLAN_CHECK_LOOKUPPK 1 > #define RI_PLAN_CHECK_LOOKUPPK_FROM_PK 2 > #define RI_PLAN_LAST_ON_PK > RI_PLAN_CHECK_LOOKUPPK_FROM_PK > > However, this patch does. > > + if (!ri_ReferencedKeyExists(pk_rel, fk_rel, newslot, riinfo)) > + ri_ReportViolation(riinfo, > + pk_rel, fk_rel, > + newslot, > + NULL, > + RI_PLAN_CHECK_LOOKUPPK, > false); > > It seems to me 1 (RI_PLAN_CHECK_LOOKUPPK) is still alive. (Yeah, I > know that doesn't mean the usefulness of the macro but the mechanism > the macro suggests, but it is confusing.) On the other hand, > RI_PLAN_CHECK_LOOKUPPK_FROM_PK and RI_PLAN_LAST_ON_PK seem to be no > longer used. (Couldn't we remove them?) Yeah, better to just remove those _PK macros and say this module no longer runs any queries on the PK table. How about the attached? -- Amit Langote EDB: http://www.enterprisedb.com v6-0002-Avoid-using-SPI-for-some-RI-checks.patch Description: Binary data v6-0001-Export-get_partition_for_tuple.patch Description: Binary data
Re: Fix ALTER SUBSCRIPTION ... SET PUBLICATION documentation
On Wed, 27 Jan 2021 at 19:47, Bharath Rupireddy wrote: > On Wed, Jan 27, 2021 at 4:57 PM Amit Kapila wrote: >> >> On Tue, Jan 26, 2021 at 4:56 PM japin wrote: >> > >> > >> > Hi, >> > >> > When I read the documentation of ALTER SUBSCRIPTION ... SET PUBLICATION >> > ... WITH (...), >> > it says "set_publication_option" only supports "refresh" in the documentation >> > [1]. >> > However, we can also supply the "copy_data" option, and the code is: >> > >> >> I think there is a reference to the 'copy_data' option as well. There >> is a sentence saying: "Additionally, refresh options as described >> under REFRESH PUBLICATION may be specified." and if you then look at the refresh >> options, 'copy_data' is mentioned there; isn't that sufficient? > > Right. It looks like the copy_data option is indirectly mentioned with the > statement "Additionally, refresh options as described under REFRESH > PUBLICATION may be specified." under "set_publication_option". IMHO, > we can keep it that way. > My bad. It may be sufficient. Sorry for the noise. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Printing backtrace of postgres processes
On Thu, Jan 21, 2021 at 7:26 AM Tom Lane wrote: > > Craig Ringer writes: > > On Wed, 20 Jan 2021 at 05:23, Tom Lane wrote: > >> BTW, it also looks like the patch is doing nothing to prevent the > >> backtrace from being sent to the connected client. > > > I don't see a good reason to send a bt to a client. Even though these > > backtraces won't be analysing debuginfo and populating args, locals, etc, > > it should still just go to the server log. > > Yeah. That's easier than I was thinking, we just need to > s/LOG/LOG_SERVER_ONLY/. > > >> Maybe, given all of these things, we should forget using elog at > >> all and just emit the trace with fprintf(stderr). > > > That causes quite a lot of pain with MemoryContextStats() already > > True. Given the changes discussed in the last couple messages, I don't > see any really killer reasons why we can't ship the trace through elog. > We can always try that first, and back off to fprintf if we do find > reasons why it's too unstable. > Thanks all of them for the suggestions. Attached v3 patch which has fixes based on the suggestions. It includes the following fixes: 1) Removal of support to get callstack of all postgres process, user can get only one process callstack. 2) Update the documentation. 3) Added necessary checks for pg_print_callstack similar to pg_terminate_backend. 4) Changed LOG to LOG_SERVER_ONLY. Thoughts? Regards, Vignesh From dd51938225881be14cce2ba0e80ae9019a3f2bfc Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Wed, 27 Jan 2021 18:20:13 +0530 Subject: [PATCH v3] Print backtrace of postgres process that are part of this instance. The idea here is to implement & expose pg_print_callstack function, internally what this function does is, the connected backend will send SIGUSR1 signal by setting PMSIGNAL_BACKTRACE_EMIT to the postmaster process. Once the process receives this signal it will log the backtrace of the process. --- doc/src/sgml/func.sgml| 75 +++ src/backend/postmaster/autovacuum.c | 4 ++ src/backend/postmaster/checkpointer.c | 5 +++ src/backend/postmaster/interrupt.c| 5 +++ src/backend/storage/ipc/procsignal.c | 33 +++ src/backend/storage/ipc/signalfuncs.c | 59 ++- src/backend/tcop/postgres.c | 38 ++ src/backend/utils/init/globals.c | 1 + src/bin/pg_ctl/t/005_backtrace.pl | 73 ++ src/include/catalog/pg_proc.dat | 5 +++ src/include/miscadmin.h | 2 + src/include/storage/pmsignal.h| 2 + src/include/storage/procsignal.h | 3 ++ src/include/tcop/tcopprot.h | 1 + 14 files changed, 305 insertions(+), 1 deletion(-) create mode 100644 src/bin/pg_ctl/t/005_backtrace.pl diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index aa99665..4ff6e7f 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -24709,6 +24709,25 @@ SELECT collation for ('foo' COLLATE "de_DE"); however only superusers can terminate superuser backends. + + + + + pg_print_callstack + +pg_print_callstack ( pid integer ) +boolean + + +Prints the callstack whose backend process has the specified process ID. +Callstack will be printed based on the log configuration set. See + for more information. This is +allowed if the calling role is a member of the role whose backend +callstack is being printed or the calling role has been granted +pg_print_callstack, however only superusers can +print callstack of superuser backends. + + @@ -24728,6 +24747,62 @@ SELECT collation for ('foo' COLLATE "de_DE"); pg_stat_activity view. + +pg_print_callstack can be used to print callstack of +a backend porcess. 
For example: + +postgres=# select pg_print_callstack(pg_backend_pid()); + pg_print_callstack + + t +(1 row) + +The callstack will be logged in the log file. For example: +2021-01-27 11:33:50.247 IST [111735] LOG: current backtrace: +postgres: postgresdba postgres [local] SELECT(LogBackTrace+0x33) [0x9501cd] +postgres: postgresdba postgres [local] SELECT(ProcessInterrupts+0x774) [0x950bac] +postgres: postgresdba postgres [local] SELECT() [0x761ee8] +postgres: postgresdba postgres [local] SELECT() [0x71bc39] +postgres: postgresdba postgres [local] SELECT() [0x71e3df] +postgres: postgresdba postgres [local] SELECT(standard_ExecutorRun+0x1d6) [0x71c25d] +postgres: postgresdba postgres [local] SELECT(ExecutorRun+0x55) [0x71c085] +postgres: postgresdba postgres [local] SELECT() [0x953f3d] +postgres: postgresdba postgres [local] SELECT(PortalRun+0x262) [0x953bf6] +postgres: postgresdba postgres [local] SELECT() [0x94dafa] +postgres: postgresdba
Re: [PATCH] Disable bgworkers during servers start in pg_upgrade
Hi, On Wed, 27 Jan 2021 11:25:11 +0100 Denis Laxalde wrote: > Andres Freund a écrit : > > On 2021-01-21 16:23:58 +0100, Denis Laxalde wrote: > > > We found an issue in pg_upgrade on a cluster with a third-party > > > background worker. The upgrade goes fine, but the new cluster is then in > > > an inconsistent state. The background worker comes from the PoWA > > > extension but the issue does not appear to related to this particular > > > code. > > > > Well, it does imply that that backgrounder did something, as the pure > > existence of a bgworker shouldn't affect anything. Presumably the issue is > > that the bgworker actually does transactional writes, which causes problems > > because the xids / multixacts from early during pg_upgrade won't actually > > be valid after we do pg_resetxlog etc. Indeed, it does some writes. As soon as the powa bgworker starts, it takes "snapshots" of pg_stat_statements state and records them in a local table. To avoid concurrent runs, it takes a lock on some of its local rows using SELECT FOR UPDATE, hence the mxid consumption. The inconsistency shows up in at least two places: * the datminmxid and relminmxid fields pg_dump(all)'ed and restored in the new cluster; * the multixact fields in the control file, read during the check phase and restored later using pg_resetxlog. > > > As a solution, it seems that, for similar reasons that we restrict > > > socket access to prevent accidental connections (from commit > > > f763b77193), we should also prevent background workers to start at this > > > step. > > > > I think that'd have quite the potential for negative impact - [...] > > Thank you for those insights! +1 > > I wonder if we could > > > > a) set default_transaction_read_only to true, and explicitly change it > >in the sessions that need that. According to Denis' tests discussed off-list, this works fine with the powa bgworker, albeit with some complaints in the logs. However, I wonder how fragile it could be, as a bgworker could easily override either the GUC or use "BEGIN READ WRITE". I realize these are really uncommon practices, but bgworker code from third parties might be surprising. > > b) when in binary upgrade mode / -b, error out on all wal writes in > >sessions that don't explicitly set a session-level GUC to allow > >writes. This feels safer because it is more specific to the problem. But I wonder whether the benefit is worth adding some (limited?) complexity and a small/quick conditional block in a very hot code path for a very limited use case. What about c), where bgworkers are not loaded by default during binary upgrade mode / -b unless they explicitly set a bgw_flags bit (BGWORKER_BINARY_UPGRADE?) indicating they are required during pg_upgrade? Regards,
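To illustrate why (a) feels fragile: nothing stops a worker from doing something like the following (the table name is hypothetical), which silently defeats the read-only default:

BEGIN READ WRITE;                                      -- or: SET default_transaction_read_only = off;
UPDATE my_extension_state SET last_snapshot = now();  -- consumes an xid (and possibly mxids) during pg_upgrade
COMMIT;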
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
> On Jan 27, 2021, at 19:41, Bharath Rupireddy > wrote: > > On Wed, Jan 27, 2021 at 4:42 PM Amit Kapila wrote: >>> On Wed, Jan 27, 2021 at 3:16 PM Bharath Rupireddy >>> wrote: On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila wrote: >>> So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION >>> will refresh the new and existing publications. >> That sounds a bit unusual to me because when the user has specifically >> asked to just ADD Publication, we might refresh some existing >> Publication along with it? > > Hmm. That's correct. I also feel we should not touch the existing > publications, only the ones that are added/dropped should be > refreshed. Because there will be an overhead of a SQL with more > publications(in fetch_table_list) when AlterSubscription_refresh() is > called with all the existing publications. We could just pass in the > newly added/dropped publications to AlterSubscription_refresh(). > > I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with > refresh true refreshes only the newly added publications, because what > we do in AlterSubscription_refresh() is that we fetch the tables > associated with the publications from the publisher, compare them with > the previously fetched tables from that publication and add the new > tables or remove the table that don't exist in that publication > anymore. > > For ALTER SUBSCRIPTION ... DROP PUBLICATION, also we can do the same > thing i.e. refreshes only the dropped publications. > > Thoughts? Agreed. We just only need to refresh the added/dropped publications. Furthermore, for publications that will be dropped, we do not need the “copy_data” option, right? -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: [PATCH] Disable bgworkers during servers start in pg_upgrade
Oh, I forgot another point before sending my previous email. Maybe it would be worth adding some final safety checks in pg_upgrade itself? E.g., checking the coherency of the controldata and multixact values between the old and new clusters would have caught the inconsistency here.
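For reference, the per-database values in question can be inspected with plain SQL on both clusters; this is the bookkeeping that pg_dumpall carries over and that pg_resetxlog must stay consistent with:

-- run on the old and the new cluster and compare
SELECT datname, datfrozenxid, datminmxid FROM pg_database ORDER BY datname;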
Re: FailedAssertion in heap_index_delete_tuples at heapam.c:7220
On Wed, Jan 27, 2021 at 2:09 AM Peter Geoghegan wrote: > > On Tue, Jan 26, 2021 at 10:52 PM Jaime Casanova > wrote: > > ${subject} happened while executing ${attached query} at regresssion > > database, using 14dev (commit > > d5a83d79c9f9b660a6a5a77afafe146d3c8c6f46) and produced ${attached > > stack trace}. > > I see the bug: gistprunepage() calls > index_compute_xid_horizon_for_tuples() (which ultimately calls the > heapam.c callback for heap_index_delete_tuples()) with an empty array, > which we don't expect. The similar code within _hash_vacuum_one_page() > already only calls index_compute_xid_horizon_for_tuples() when > ndeletable > 0. > > The fix is obvious: Bring gistprunepage() in line with > _hash_vacuum_one_page(). I'll go push a fix for that now. > Thanks -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: Key management with tests
On Tue, Jan 26, 2021 at 05:53:01PM -0500, Bruce Momjian wrote: > On Tue, Jan 26, 2021 at 03:24:30PM -0500, Robert Haas wrote: > > I'm wondering whether you've considered storing all the keys in one > > file instead of a file per key. The reason I ask is that it seems to > > me that the key rotation procedure would be a lot simpler if it were > > all in one file. You could just create a temporary file and atomically > > rename it over the existing file. If you see a temporary file you're > > always free to remove it. This is a lot simpler than what you have > > right now. The "repair" concept pretty much goes away completely, > > which seems nice. Granted I don't know exactly how to store multiple > > keys in one file, but I bet there's some way to do it. > > We envisioned allowing heap/index key rotation by having a standby with > the same WAL key as the primary but different heap/index keys so that we > can failover to the standby to change the heap/index key and then change > the WAL key. This separation allows that. We also might need some > additional keys later and this allows that. I do like simplicity, but > the complexity here seems to serve a need. Just to close this issue, several scripts, e,g., PIV, AWS, need to store data to indicate the cluster encryption key used, and those need to be kept synchronized with the wrapped data keys. Having separate directories for each cluster key version allows that to work cleanly. > > The README in src/backend/crypto does not explain how the scripts in > > that directory are intended to be used. If I want to use AWS Secrets > > Manager with this feature, I can see that I should use > > ckey_aws.sh.sample as a basis for that integration, but I don't know > > what I should do with the script because the README says nothing about > > it. I am frankly pretty doubtful about the idea of shipping a bunch of > > /bin/sh scripts as a best practice; for one thing, that's totally > > unusable on Windows, and it also means that this is dependent on > > /bin/sh existing and having the behavior we expect and on all the > > other executables in these scripts as well. But, at the very least, > > there needs to be a clearer explanation of how the scripts are > > intended to be used, which parts people are supposed to modify, what > > arguments they're going to get called with, and things like that. > > I added comments to most of the scripts. I don't know what more I can > do, or what other language would be appropriate. I think someone would need to write Windows versions of these scripts. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Speeding up GIST index creation for tsvectors
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar wrote: > > On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin wrote: > > +1 > > This will make all INSERTs and UPDATES for tsvector's GiSTs. > > Oh, I didn't realize that this code is getting used in GIST index > insertion and creation too. Will check there. I ran some insert and update tests; they show only marginal improvement. So looks like the patch is mainly improving index builds. > > Meanwhile there are at least 4 incarnation of hemdistsign() functions that > > are quite similar. I'd propose to refactor them somehow... > > Yes, I hope we get the benefit there also. Before that, I thought I > should post the first use-case to get some early comments. Thanks for > your encouraging comments :) The attached v2 version of 0001 patch extends the hemdistsign() changes to the other use cases like intarray, ltree and hstore. I see the same index build improvement for all these types. Since for the gist index creation of some of these types the default value for siglen is small (8-20), I tested with small siglens. For siglens <= 20, particularly for values that are not multiples of 8 (e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index creation. It's probably because of an extra function call for pg_xorcount(); and also might be due to the extra logic in pg_xorcount() which becomes prominent for shorter traversals. So for siglen less than 32, I kept the existing method using byte-by-byte traversal. -- Thanks, -Amit Khandekar Huawei Technologies From b8905b47f7e0af8d4558fdac73cc2283c263cbcc Mon Sep 17 00:00:00 2001 From: Amit Khandekar Date: Thu, 10 Dec 2020 21:45:49 +0800 Subject: [PATCH 2/2] Avoid function pointer dereferencing for pg_popcount32/64() call. With USE_POPCNT_ASM set (which is true only on x86), we decide with the help of __get_cpuid() at runtime whether the CPU supports popcnt instruction, and hence we dynamically choose the corresponding function; thus pg_popcount32/64 is a function pointer. But for platforms with USE_POPCNT_ASM not set, we don't have to use dynamic assignment of functions, but we were still declaring pg_popcount64 as a function pointer. So arrange for direct function call so as to get rid of function pointer dereferencing each time pg_popcount32/64 is called. To do this, define pg_popcount64 to another function name (pg_popcount64_nonasm) rather than a function pointer, whenever USE_POPCNT_ASM is not defined. And let pg_popcount64_nonasm() be a static inline function so that whenever pg_popcount64() is called, the __builtin_popcount() gets called directly. For platforms not supporting __builtin_popcount(), continue using the slow version as is the current behaviour. pg_popcount64_slow() was actually a misnomer, because it was actually calling __builtin_popcount() which is not a slow function. Now with this commit, pg_popcount64_slow() indeed has only slow code. 
Tested this on ARM64, and the gist index creation for tsvectors speeds up by ~ 6% for default siglen (=124), and by 12% with siglen=700 --- src/include/port/pg_bitutils.h | 59 ++ src/port/pg_bitutils.c | 47 +-- 2 files changed, 67 insertions(+), 39 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 174df28e66..b039f94ee5 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -23,6 +23,19 @@ extern const uint8 pg_rightmost_one_pos[256]; extern const uint8 pg_number_of_ones[256]; #endif +/* + * On x86_64, we can use the hardware popcount instruction, but only if + * we can verify that the CPU supports it via the cpuid instruction. + * + * Otherwise, we fall back to __builtin_popcount if the compiler has that, + * or a hand-rolled implementation if not. + */ +#ifdef HAVE_X86_64_POPCNTQ +#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID) +#define USE_POPCNT_ASM 1 +#endif +#endif + /* * pg_leftmost_one_pos32 * Returns the position of the most significant set bit in "word", @@ -208,8 +221,54 @@ pg_ceil_log2_64(uint64 num) } /* Count the number of one-bits in a uint32 or uint64 */ + +/* + * With USE_POPCNT_ASM set (which is true only on x86), we decide at runtime + * whether the CPU supports popcnt instruction, and hence we dynamically choose + * the corresponding function; thus pg_popcount32/64 is a function pointer. But + * for platforms with USE_POPCNT_ASM not set, we don't have to use dynamic + * assignment of functions, so we arrange for direct function call so as to get + * rid of function pointer dereferencing each time pg_popcount32/64 is called. + */ +#ifdef USE_POPCNT_ASM extern int (*pg_popcount32) (uint32 word); extern int (*pg_popcount64) (uint64 word); +#else +#define pg_popcount32 pg_popcount32_nonasm +#define pg_popcount64 pg_popcount64_nonasm +#endif + +/* Slow versions of pg_popcount */ +#ifndef HAVE__BUILTIN_POPCOUNT +extern int pg_popcount32_slow(uint32 word); +extern int pg_popcount
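For context, these are the kinds of index builds being timed above; the siglen opclass parameter is what the 124/700 and 8-20 figures refer to (table and column names are invented):

CREATE INDEX ON docs  USING gist (body_tsv tsvector_ops (siglen = 124));  -- tsvector default
CREATE INDEX ON docs  USING gist (body_tsv tsvector_ops (siglen = 700));
CREATE INDEX ON trees USING gist (path gist_ltree_ops (siglen = 8));      -- small siglen, keeps the byte-wise loop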
Re: Online checksums patch - once again
Revisiting an issue we discussed earlier: On 25/11/2020 15:20, Daniel Gustafsson wrote: On 23 Nov 2020, at 18:36, Heikki Linnakangas wrote: On 17/11/2020 10:56, Daniel Gustafsson wrote: I've reworked this in the attached such that the enable_ and disable_ functions merely call into the launcher with the desired outcome, and the launcher is responsible for figuring out the rest. The datachecksumworker is now the sole place which initiates a state transfer. Well, you still fill the DatachecksumsWorkerShmem->operations array in the backend process that launches the datacheckumworker, not in the worker process. I find that still a bit surprising, but I believe it works. I'm open to changing it in case there are strong opinions, it just seemed the most natural to me. This kept bothering me, so I spent a while hacking this to my liking. The attached patch moves the code to fill in 'operations' from the backend to the launcher, so that the pg_enable/disable_checksums() call now truly just stores the desired state, checksum on or off, in shared memory, and launches the launcher process. The launcher process figures out how to get to the desired state. This removes the couple of corner cases that previously emitted a NOTICE about the processing being concurrently disabled or aborted. What do you think? I haven't done much testing, so if you adopt this approach, please check if I broke something in the process. This changes the way the abort works. If the pg_enable/disable_checksums() is called, while the launcher is already busy changing the state, pg_enable/disable_checksums() will just set the new desired state in shared memory anyway. The launcher process will notice that the target state changed some time later, and restart from scratch. A couple of other issues came up while doing that: - AbortProcessing() has two callers, one in code that runs in the launcher process, and another one in code that runs in the worker process. Is it really safe to use from the worker process? Calling ProcessAllDatabases() in the worker seems sketchy. (This is moot in this new patch version, as I removed AbortProcessing() altogether) - Is it possible that the worker keeps running after the launcher has already exited, e.g. because of an ERROR or SIGTERM? If you then quickly call pg_enable_checksums() again, can you end up with two workers running at the same time? Is that bad? On 26/01/2021 23:00, Daniel Gustafsson wrote: On 22 Jan 2021, at 12:55, Heikki Linnakangas wrote: @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname, relkind == RELKIND_MATVIEW) RelationInitTableAccessMethod(rel); + /* +* Set the data checksum state. Since the data checksum state can change at +* any time, the fetched value might be out of date by the time the +* relation is built. DataChecksumsNeedWrite returns true when data +* checksums are: enabled; are in the process of being enabled (state: +* "inprogress-on"); are in the process of being disabled (state: +* "inprogress-off"). Since relhaschecksums is only used to track progress +* when data checksums are being enabled, and going from disabled to +* enabled will clear relhaschecksums before starting, it is safe to use +* this value for a concurrent state transition to off. +* +* If DataChecksumsNeedWrite returns false, and is concurrently changed to +* true then that implies that checksums are being enabled. Worst case, +* this will lead to the relation being processed for checksums even though +* each page written will have them already. 
Performing this last shortens +* the window, but doesn't avoid it. +*/ + HOLD_INTERRUPTS(); + rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite(); + RESUME_INTERRUPTS(); + /* * Okay to insert into the relcache hash table. * I grepped for relhashcheckums, and concluded that the value in the relcache isn't actually used for anything. Not so! In heap_create_with_catalog(), the actual pg_class row is constructed from the relcache entry, so the value set in RelationBuildLocalRelation() finds its way to pg_class. Perhaps it would be more clear to pass relhachecksums directly as an argument to AddNewRelationTuple(). That way, the value in the relcache would be truly never used. I might be thick (or undercaffeinated) but I'm not sure I follow. AddNewRelationTuple calls InsertPgClassTuple which in turn avoids the relcache entry. Ah, you're right, I misread AddNewRelationTuple. That means that the relhaschecksums field in the relcache is never used? That's a clear rule. The changes to formrdesc() and RelationBuildLocalRelation() seem unnecessary then, we can always initialize relhaschecksums to false in the relcache. - Heikki >From 3ccb06a1e39b456d26a5e2f89c9b634f04b34307 Mon Sep 17
Inconsistent function definitions in ECPG's informix.c
I noticed that some of the newer compilers in the buildfarm (e.g., caiman, with gcc 11.0) whine about the definitions of rjulmdy() and rmdyjul() not quite matching their external declarations: informix.c:516:23: warning: argument 2 of type `short int[3]' with mismatched bound [-Warray-parameter=] 516 | rjulmdy(date d, short mdy[3]) | ~~^~ In file included from informix.c:10: ../include/ecpg_informix.h:38:31: note: previously declared as `short int *' 38 | extern int rjulmdy(date, short *); | ^~~ informix.c:567:15: warning: argument 1 of type `short int[3]' with mismatched bound [-Warray-parameter=] 567 | rmdyjul(short mdy[3], date * d) | ~~^~ In file included from informix.c:10: ../include/ecpg_informix.h:41:25: note: previously declared as `short int *' 41 | extern int rmdyjul(short *, date *); | ^~~ This isn't a bug really, since per the C spec these declarations are equivalent. But it'd be good to silence the warning before it gets any more common. The most conservative thing to do would be to take the user-visible extern declarations as being authoritative, and change informix.c to match. I'm slightly tempted to do the opposite though, on the grounds that showing the expected lengths of the arrays is useful. But I wonder if anyone's compatibility checker tools would (mistakenly) classify that as an ABI break. Thoughts? regards, tom lane
Re: Columns correlation and adaptive query optimization
On 27.01.2021 8:45, Yugo NAGATA wrote: On Mon, 25 Jan 2021 16:27:25 +0300 Konstantin Knizhnik wrote: Hello, Thank you for review. My answers are inside. Thank you for updating the patch and answering my questions. (2) If I understand correctly, your proposal consists of the following two features. 1. Add a feature to auto_explain that creates an extended statistic automatically if an error on estimated rows number is large. 2. Improve rows number estimation of join results by considering functional dependencies between vars in the join qual if the qual has more than one clauses, and also functional dependencies between a var in the join qual and vars in quals of the inner/outer relation. As you said, these two parts are independent each other, so one feature will work even if we don't assume the other. I wonder it would be better to split the patch again, and register them to commitfest separately. I agree with you that this are two almost unrelated changes, although without clausesel patch additional statistic can not improve query planning. I think extended statistics created by the auto_explain patch can improve query planning even without the clausesel patch. For example, suppose the following case: postgres=# create table t ( i int, j int); CREATE TABLE postgres=# insert into t select i/10, i/100 from generate_series(1,100) i; INSERT 0 100 postgres=# analyze t; ANALYZE postgres=# explain analyze select * from t where i = 100 and j = 10; QUERY PLAN - Seq Scan on t (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1) Filter: ((i = 100) AND (j = 10)) Rows Removed by Filter: 90 Planning Time: 0.199 ms Execution Time: 97.327 ms (5 rows) After applying the auto_explain patch (without clausesel patch) and issuing the query, additional statistics were created. postgres=# select * from t where i = 100 and j = 10; LOG: Add statistics t_i_j Then, after analyze, the row estimation was improved. postgres=# analyze t; ANALYZE postgres=# explain analyze select * from t where i = 100 and j = 10; postgres=# explain analyze select * from t where i = 100 and j = 10; QUERY PLAN -- Seq Scan on t (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1) Filter: ((i = 100) AND (j = 10)) Rows Removed by Filter: 90 Planning Time: 0.124 ms Execution Time: 95.383 ms (5 rows) So, I think the auto_explain patch is useful with just that as a tool to detect a gap between estimate and real and adjust the plan. Also, the clausesel patch would be useful without the auto_explain patch if an appropriate statistics are created. But I already have too many patches at commitfest. May be it will be enough to spit this patch into two? Although we can discuss both of these patches in this thread, I wonder we don't have to commit them together. (3) + DefineCustomBoolVariable("auto_explain.suggest_only", +"Do not create statistic but just record in WAL suggested create statistics statement.", +NULL, + &auto_explain_suggest_on To imply that this parameter is involving to add_statistics_threshold, it seems better for me to use more related name like add_statistics_suggest_only. Also, additional documentations for new parameters are required. Done. + + + + auto_explain.auto_explain.add_statistics_threshold (real) + + auto_explain.add_statistics_threshold configuration parameter + + + + + auto_explain.add_statistics_threshold sets the threshold for + actual/estimated #rows ratio triggering creation of multicolumn statistic + for the related columns. 
It can be used for adpative query optimization. + If there is large gap between real and estimated number of tuples for the + concrete plan node, then multicolumn statistic is created for involved + attributes. Zero value (default) disables implicit creation of multicolumn statistic. + + I wonder we need to say that this parameter has no effect unless log_analyze is enabled and that statistics are created only when the excution time exceeds log_min_duration, if these behaviors are intentional. In addition, additional statistics are created only if #rows is over-estimated and not if it is under-estimated. Although it seems good as a criterion for creating multicolumn statistic since extended statisstic is usually useful to fix over-estimation, I am not sure if we don't have to consider under-estimation case at all. (9) Curr
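For readers following along, the statistics object the auto_explain patch creates in the example above ("LOG: Add statistics t_i_j") is presumably equivalent to issuing something like:

CREATE STATISTICS t_i_j ON i, j FROM t;
ANALYZE t;   -- required before the planner can use the new statistics object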
Re: [PoC] Non-volatile WAL buffer
On 1/25/21 3:56 AM, Masahiko Sawada wrote: >> >> ... >> >> On 1/21/21 3:17 AM, Masahiko Sawada wrote: >>> ... >>> >>> While looking at the two methods: NTT and simple-no-buffer, I realized >>> that in XLogFlush(), NTT patch flushes (by pmem_flush() and >>> pmem_drain()) WAL without acquiring WALWriteLock whereas >>> simple-no-buffer patch acquires WALWriteLock to do that >>> (pmem_persist()). I wonder if this also affected the performance >>> differences between those two methods since WALWriteLock serializes >>> the operations. With PMEM, multiple backends can concurrently flush >>> the records if the memory region is not overlapped? If so, flushing >>> WAL without WALWriteLock would be a big benefit. >>> >> >> That's a very good question - it's quite possible the WALWriteLock is >> not really needed, because the processes are actually "writing" the WAL >> directly to PMEM. So it's a bit confusing, because it's only really >> concerned about making sure it's flushed. >> >> And yes, multiple processes certainly can write to PMEM at the same >> time, in fact it's a requirement to get good throughput I believe. My >> understanding is we need ~8 processes, at least that's what I heard from >> people with more PMEM experience. > > Thanks, that's good to know. > >> >> TBH I'm not convinced the code in the "simple-no-buffer" code (coming >> from the 0002 patch) is actually correct. Essentially, consider the >> backend needs to do a flush, but does not have a segment mapped. So it >> maps it and calls pmem_drain() on it. >> >> But does that actually flush anything? Does it properly flush changes >> done by other processes that may not have called pmem_drain() yet? I >> find this somewhat suspicious and I'd bet all processes that did write >> something have to call pmem_drain(). > For the record, from what I learned / been told by engineers with PMEM experience, calling pmem_drain() should properly flush changes done by other processes. So it should be sufficient to do that in XLogFlush(), from a single process. My understanding is that we have about three challenges here: (a) we still need to track how far we flushed, so this needs to be protected by some lock anyway (although perhaps a much smaller section of the function) (b) pmem_drain() flushes all the changes, so it flushes even "future" part of the WAL after the requested LSN, which may negatively affect performance, I guess. So I wonder if pmem_persist would be a better fit, as it allows specifying a range that should be persisted. (c) As mentioned before, PMEM behaves differently with concurrent access, i.e. it reaches peak throughput with a relatively low number of threads writing data, and then the throughput drops quite quickly. I'm not sure if the same thing applies to pmem_drain() too - if it does, we may need something like we have for insertions, i.e. a handful of locks allowing a limited number of concurrent inserts. > Yeah, in terms of experiments at least it's good to find out that the > approach mmapping each WAL segment is not good at performance. > Right. The problem with small WAL segments seems to be that each mmap causes the TLB to be thrown away, which means a lot of expensive cache misses. As the mmap needs to be done by each backend writing WAL, this is particularly bad with small WAL segments. The NTT patch works around that by doing just a single mmap. I wonder if we could pre-allocate and mmap small segments, and keep them mapped and just rename the underlying files when recycling them. 
That'd keep the regular segment files, as expected by various tools, etc. The question is what would happen when we temporarily need more WAL, etc. >>> >>> ... >>> >>> I think the performance improvement by NTT patch with the 16MB WAL >>> segment, the most common WAL segment size, is very good (150437 vs. >>> 212410 with 64 clients). But maybe evaluating writing WAL segment >>> files on PMEM DAX filesystem is also worth, as you mentioned, if we >>> don't do that yet. >>> >> >> Well, not sure. I think the question is still open whether it's actually >> safe to run on DAX, which does not have atomic writes of 512B sectors, >> and I think we rely on that e.g. for pg_config. But maybe for WAL that's >> not an issue. > > I think we can use the Block Translation Table (BTT) driver that > provides atomic sector updates. > But we have benchmarked that, see my message from 2020/11/26, which shows this table:

        master/btt    master/dax       ntt    simple
 ----------------------------------------------------
   1          5469          7402      7977      6746
  16         48222         80869    107025     82343
  32         73974        158189    214718    158348
  64         85921        154540    225715    164248
  96        150602        221159    237008    217253

Clearly, BTT is quite expensive. Maybe there's a way to tune that at filesystem/kernel level, I haven't tried that.
Improve join selectivity estimation using extended statistics
Original thread is: https://www.postgresql.org/message-id/flat/196f1e1a-5464-ed07-ab3c-0c9920564af7%40postgrespro.ru Following Yugo's advice, I have splitted this patch into two: 1. Extending auto_explain extension to generate extended statistics in case of bad selectivity estimation. 2. Taken in account extended statistics when computing join selectivity. Now this thread will contain only patches for join selectivity estimation. However, IIUC, the clausesel patch uses only functional dependencies statistics for improving join, so my question was about possibility to consider MCV in the clausesel patch. Sorry, do not have idea right now how to use MCV for better estimation of join selectivity. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index d263ecf..5446ad5 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -25,6 +25,8 @@ #include "utils/fmgroids.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" +#include "statistics/statistics.h" +#include "catalog/pg_statistic_ext.h" /* * Data structure for accumulating info about possible range-query @@ -56,6 +58,47 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root, / /* + * Find functional dependency between attributes using multicolumn statistic. + * relid: index of relation to which all considered attributes belong + * var: variable which dependencies are inspected + * attnums: set of considered attributes included specified variables + * This function return degree of strongest dependency between some subset of this attributes + * and specified variable or 0.0 if on dependency is found. + */ +double +find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums) +{ + RelOptInfo* rel = find_base_rel(root, relid); + if (rel->rtekind == RTE_RELATION && rel->statlist != NIL) + { + StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES, + &attnums, 1); + if (stat != NULL) + { + MVDependencies *dependencies = statext_dependencies_load(stat->statOid); + MVDependency *strongest = NULL; + int i; + for (i = 0; i < dependencies->ndeps; i++) + { +MVDependency *dependency = dependencies->deps[i]; +int n_dep_vars = dependency->nattributes - 1; +/* Dependency implies attribute */ +if (var->varattno == dependency->attributes[n_dep_vars]) +{ + while (--n_dep_vars >= 0 + && bms_is_member(dependency->attributes[n_dep_vars], attnums)); + if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree)) + strongest = dependency; +} + } + if (strongest) +return strongest->degree; + } + } + return 0.0; +} + +/* * clauselist_selectivity - * Compute the selectivity of an implicitly-ANDed list of boolean * expression clauses. The list can be empty, in which case 1.0 @@ -129,6 +172,9 @@ clauselist_selectivity_ext(PlannerInfo *root, RangeQueryClause *rqlist = NULL; ListCell *l; int listidx; + Bitmapset *clauses_attnums = NULL; + int n_clauses_attnums = 0; + int innerRelid = varRelid; /* * If there's exactly one clause, just go directly to @@ -139,6 +185,9 @@ clauselist_selectivity_ext(PlannerInfo *root, varRelid, jointype, sjinfo, use_extended_stats); + if (innerRelid == 0 && sjinfo) + bms_get_singleton_member(sjinfo->min_righthand, &innerRelid); + /* * Determine if these clauses reference a single relation. If so, and if * it has extended statistics, try to apply those. 
@@ -171,7 +220,6 @@ clauselist_selectivity_ext(PlannerInfo *root, Node *clause = (Node *) lfirst(l); RestrictInfo *rinfo; Selectivity s2; - listidx++; /* @@ -204,6 +252,7 @@ clauselist_selectivity_ext(PlannerInfo *root, else rinfo = NULL; + /* * See if it looks like a restriction clause with a pseudoconstant on * one side. (Anything more complicated than that might not behave in @@ -215,6 +264,42 @@ clauselist_selectivity_ext(PlannerInfo *root, OpExpr *expr = (OpExpr *) clause; bool varonleft = true; bool ok; + int oprrest = get_oprrest(expr->opno); + + /* Try to take in account functional dependencies between attributes */ + if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0) + { +Var* var = (Var*)linitial(expr->args); +if (!IsA(var, Var) || var->varno != innerRelid) +{ + var = (Var*)lsecond(expr->args); +} +if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid) +{ + clauses_attnums = bms_add_member(clauses_attnums, var->varattno); + if (n_clauses_attnums++ != 0) + { + double dep = find_var_dependency(root, innerRelid, var, clauses_attnums); +
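A minimal scenario of the kind the clausesel patch targets, with hypothetical tables in which one join column functionally determines the other:

CREATE TABLE fact (a int, b int);
CREATE TABLE dim  (a int, b int);
INSERT INTO dim SELECT i/10, i/100 FROM generate_series(1, 100000) i;  -- b is determined by a
CREATE STATISTICS dim_a_b (dependencies) ON a, b FROM dim;
ANALYZE dim;

-- Without the dependency, the planner multiplies the selectivities of the two
-- join clauses and underestimates the join size; with the patch the recorded
-- degree of the a => b dependency is used to discount the redundant clause.
EXPLAIN SELECT * FROM fact JOIN dim ON fact.a = dim.a AND fact.b = dim.b;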
Re: Printing backtrace of postgres processes
Hi, On 2021-01-27 19:05:16 +0530, vignesh C wrote: > /* > + * LogBackTrace > + * > + * Get the backtrace and log the backtrace to log file. > + */ > +void > +LogBackTrace(void) > +{ > + int save_errno = errno; > + > + void *buf[100]; > + int nframes; > + char **strfrms; > + StringInfoData errtrace; > + > + /* OK to process messages. Reset the flag saying there are more to do. > */ > + PrintBacktracePending = false; ISTM that it'd be better to do this in the caller, allowing this function to be used outside of signal triggered backtraces. > > +extern void LogBackTrace(void); /* Called from EmitProcSignalPrintCallStack > */ I don't think this comment is correct anymore? Greetings, Andres Freund
Re: cleaning up a few CLOG-related things
On Mon, Jan 25, 2021 at 2:11 PM Heikki Linnakangas wrote: > > [patches 0001 - 0003] > > Makes sense. Great. I committed the first one and will proceed with those as well. > > So, the CLOG page gets created when nextXid advances from the first > > value on the page to the second value on the page, not when it > > advances from the last value on the previous page to the first value > > on the current page. > Yes. It took me a moment to understand that explanation, though. I'd > phrase it something like "nextXid is the next XID that will be used, but > we want to set latest_page_number according to the last XID that's > already been used. So retreat by one." OK, updated the patch to use that language for the comment. > Having a separate FullTransactionIdToLatestPageNumber() function for > this seems like overkill to me. I initially thought so too, but it turned out to be pretty useful for writing debugging cross-checks and things, so I'm inclined to keep it even though I'm not at present proposing to commit any such debugging cross-checks. For example I tried making the main redo loop check whether XactCtl->shared->latest_page_number == FullTransactionIdToLatestPageNumber(nextXid) which turned out to be super-helpful in understanding things. -- Robert Haas EDB: http://www.enterprisedb.com v2-0001-In-clog_redo-don-t-set-XactCtl-shared-latest_page.patch Description: Binary data v2-0002-In-TrimCLOG-don-t-reset-XactCtl-shared-latest_pag.patch Description: Binary data v2-0003-In-StartupCLOG-correct-an-off-by-one-error.patch Description: Binary data
Re: Support for NSS as a libpq TLS backend
On Wed, 2021-01-27 at 16:39 +0900, Michael Paquier wrote: > My apologies for chiming in. I was looking at your patch set here, > and while reviewing the strong random and cryptohash parts I have > found a couple of mistakes in the ./configure part. I think that the > switch from --with-openssl to --with-ssl={openssl} could just be done > independently as a building piece of the rest, then the first portion > based on NSS could just add the minimum set in configure.ac. > > Please note that the patch set has been using autoconf from Debian, or > something forked from upstream. There were also missing updates in > several parts of the code base, and a lack of docs for the new > switch. I have spent time checking that with --with-openssl to make > sure that the obsolete grammar is still compatible, --with-ssl=openssl > and also without it. > > Thoughts? Seems good to me on Ubuntu; builds with both flavors. From peering at the Windows side: > --- a/src/tools/msvc/config_default.pl > +++ b/src/tools/msvc/config_default.pl > @@ -16,7 +16,7 @@ our $config = { > tcl => undef,# --with-tcl= > perl => undef,# --with-perl= > python=> undef,# --with-python= > - openssl => undef,# --with-openssl= > + openssl => undef,# --with-ssl=openssl with > uuid => undef,# --with-uuid= > xml => undef,# --with-libxml= > xslt => undef,# --with-libxslt= So to check understanding: the `openssl` config variable is still alive for MSVC builds; it just turns that into `--with-ssl=openssl` in the fake CONFIGURE_ARGS? Since SSL is an obsolete term, and the choice of OpenSSL vs NSS vs [nothing] affects server operation (such as cryptohash) regardless of whether or not connection-level TLS is actually used, what would you all think about naming this option --with-crypto? I.e. --with-crypto=openssl --with-crypto=nss --Jacob
Re: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM
On Fri, Jan 24, 2020 at 09:24:45PM +, Bossart, Nathan wrote: > On 1/21/20, 1:39 PM, "Vik Fearing" wrote: > > On 21/01/2020 22:21, Bossart, Nathan wrote: > >> I've attached a patch for a couple of new options for VACUUM: > >> MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP. The motive > >> behind these options is to allow table owners to easily vacuum only > >> the TOAST table or only the main relation. This is especially useful > >> for TOAST tables since roles do not have access to the pg_toast schema > >> by default and some users may find it difficult to discover the name > >> of a relation's TOAST table. > > > > > > Could you explain why one would want to do this? Autovacuum will > > already deal with the tables separately as needed, but I don't see when > > a manual vacuum would want to make this distinction. > > The main use case I'm targeting is when the level of bloat or > transaction ages of a relation and its TOAST table have significantly > diverged. In these scenarios, it could be beneficial to be able to > vacuum just one or the other, especially if the tables are large. This just came up for me: I have a daily maintenance script which pro-actively vacuums tables: freezing historic partitions, vacuuming current tables if the table's relfrozenxid is old, and to encourage indexonly scan. I'm checking the greatest(age(toast,main)) and vacuum the table (and implicitly its toast) whenever either is getting old. But it'd be more ideal if I could independently vacuum the main table if it's old, but not the toast table. -- Justin
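For what it's worth, here is a sketch of that maintenance logic, using the option names proposed in this thread (the syntax may of course change before commit):

-- Which half of each table is actually getting old?
SELECT c.oid::regclass AS tbl,
       age(c.relfrozenxid) AS main_age,
       age(t.relfrozenxid) AS toast_age
  FROM pg_class c
  LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
 WHERE c.relkind = 'r'
 ORDER BY greatest(age(c.relfrozenxid), age(t.relfrozenxid)) DESC;

-- Then freeze only the part that needs it:
VACUUM (FREEZE, MAIN_RELATION_CLEANUP FALSE) big_table;       -- only the TOAST table is old
VACUUM (FREEZE, SECONDARY_RELATION_CLEANUP FALSE) big_table;  -- only the main relation is old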
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Hello, hackers. I think I was able to fix the issue related to minRecoveryPoint and crash recovery. To make sure standby will be consistent after crash recovery, we need to take the current value of minRecoveryPoint into account while setting LP_DEAD hints (almost the same way as it is done for *heap* hint bits already). I have introduced new structure IndexHintBitsData: --- /* guaranteed not visible for all backends */ bool all_dead; /* latest removed xid if known */ TransactionId latest_removed_xid; /* lsn of page where dead tuple located */ XLogRecPtr page_lsn; --- This structure is filled by the `heap_hot_search_buffer` function. After, we decide to set or not `kill_prior_tuple` depending on its content (calling `IsMarkBufferDirtyIndexHintAllowed`). For primary - it is always safe to set LP_DEAD in index if `all_dead` == true. In the case of standby, we need to check `latest_removed_xid` (if available) first. If commit LSN of the latest removed xid is already lower than minRecoveryPoint (`XLogNeedsFlush`) - it is safe to set `kill_prior_tuple`. Sometimes we are not sure about the latest removed xid - heap record could be marked dead by the XLOG_HEAP2_CLEAN record, for example. In such a case we check the LSN of the *heap* page containing the tuple (LSN could be updated by other transactions already - but it does not matter in that situation). If page LSN is lower than minRecoveryPoint - it is safe to set LP_DEAD in the index too. Otherwise - just leave the index tuple alive. So, to bring it all together: * Normal operation, proc->indexIgnoreKilledTuples is true: It is safe for standby to use hint bits from the primary FPI because of XLOG_INDEX_HINT_BITS_HORIZON conflict resolution. It is safe for standby to set its index hint bits because `ComputeXidHorizons` honors other read-only procs xmin and lowest xid on primary (`KnownAssignedXidsGetOldestXmin`). * Normal operation, proc->indexIgnoreKilledTuples is false: Index hint bits are never set or taken into account. * Crash recovery, proc->indexIgnoreKilledTuples is true: It is safe for standby to use hint bits from the primary FPW because XLOG_INDEX_HINT_BITS_HORIZON is always logged before FPI, and commit record of transaction removed the tuple is logged before XLOG_INDEX_HINT_BITS_HORIZON. So, if FPI with hints was flushed (and taken into account by minRecoveryPoint) - both transaction-remover and horizon records are replayed before reading queries. It is safe for standby to use its hint bits because they can be set only if the commit record of transaction-remover is lower than minRecoveryPoint or LSN of heap page with removed tuples is lower than minRecoveryPoint. * Crash recovery, proc->indexIgnoreKilledTuples is false: Index hint bits are never set or taken into account. So, now it seems correct to me. Another interesting point here - now position of minRecoveryPoint affects performance a lot. It is happening already (because of *heap* hint bits) but after the patch, it is noticeable even more. Is there any sense to keep minRecoveryPoint at a low value? Rebased and updated patch in attachment. Will be happy if someone could recheck my ideas or even the code :) Thanks a lot, Michail. diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile index 96442ceb4e..6399184a8c 100644 --- a/src/test/recovery/Makefile +++ b/src/test/recovery/Makefile @@ -10,6 +10,7 @@ #- EXTRA_INSTALL=contrib/test_decoding +EXTRA_INSTALL+=contrib/pageinspect subdir = src/test/recovery top_builddir = ../../.. 
diff --git a/src/test/recovery/t/022_index_hint_bits.pl b/src/test/recovery/t/022_index_hint_bits.pl new file mode 100644 index 00..737dca0185 --- /dev/null +++ b/src/test/recovery/t/022_index_hint_bits.pl @@ -0,0 +1,282 @@ +# Checks that index hints on standby work as excepted. +use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 15; +use Config; + +# Initialize primary node +my $node_primary = get_new_node('primary'); +$node_primary->init(allows_streaming => 1); +$node_primary->append_conf('postgresql.conf', qq{ +autovacuum = off +enable_seqscan = off +enable_indexonlyscan = off +}); +$node_primary->start; + +$node_primary->safe_psql('postgres', 'CREATE EXTENSION pageinspect'); +# Create test table with primary index +$node_primary->safe_psql( +'postgres', 'CREATE TABLE test_index_hint (id int, value int)'); +$node_primary->safe_psql( +'postgres', 'CREATE INDEX test_index ON test_index_hint (value, id)'); +# Fill some data to it, note to not put a lot of records to avoid +# heap_page_prune_opt call which cause conflict on recovery hiding conflict +# caused due index hint bits +$node_primary->safe_psql('postgres', +'INSERT INTO test_index_hint VALUES (generate_series(1, 30), 0)'); +# And vacuum to allow index hint bits to be set
Re: Thoughts on "killed tuples" index hint bits support on standby
Hello, hackers. Sorry for necroposting, but if someone is interested - I hope the patch is ready now and available in the other thread [1]. Thanks, Michail. [1] https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi Mark, On Wed, Jan 27, 2021, at 10:58, Mark Rofail wrote: >Kindly find the updated patch (v15) below I think you forgot to attach the patch. /Joel
protect pg_stat_statements_info() for being used without the library loaded
Hi, Attached is a small patch for ${subject} -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 72a117fc19..62cccbfa44 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -1898,6 +1898,11 @@ pg_stat_statements_info(PG_FUNCTION_ARGS) Datum values[PG_STAT_STATEMENTS_INFO_COLS]; bool nulls[PG_STAT_STATEMENTS_INFO_COLS]; + if (!pgss || !pgss_hash) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("pg_stat_statements must be loaded via shared_preload_libraries"))); + /* Build a tuple descriptor for our result type */ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) elog(ERROR, "return type must be a row type");
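For context, a quick way to hit the unprotected path is a server where the extension's SQL objects exist but the library is not preloaded; a sketch:

-- shared_preload_libraries does not include pg_stat_statements,
-- yet the extension's SQL objects are installed:
CREATE EXTENSION pg_stat_statements;
SELECT * FROM pg_stat_statements_info();
-- with the patch this now fails cleanly, the same way pg_stat_statements() already does:
-- ERROR:  pg_stat_statements must be loaded via shared_preload_libraries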
Re: Allow matching whole DN from a client certificate
On Tue, 2021-01-26 at 18:43 +, Jacob Champion wrote: > On Tue, 2021-01-26 at 13:49 +0100, Daniel Gustafsson wrote: > > The OpenSSL X509_NAME_cmp function use RFC 5280 section 7.1 and RFC 4517 > > section 4.2.15 (which in turn reference RFC 4514 for the DN string format). > > libnss has CERT_AsciiToName which is referencing RFCs 1485, 1779 and 2253 in > > lib/certdb/alg1485.c. Comparing the two would be interesting. > > Yeah. I'll poke around a bit. Here's some output from a test program I threw together, which parses identical DER using OpenSSL and NSS and writes the corresponding string representation. > input/basic.conf: > ssl: CN=pchampion,OU=VMware > nss: CN=pchampion,OU=VMware > > input/escape.conf: > ssl: CN=\,\+\\\;\"\<\> > nss: CN=",+\\;\"<>" > > input/multivalue.conf: > ssl: CN=pchampion+SN=Champion+GN=Jacob,OU=VMware > nss: givenName=Jacob+SN=Champion+CN=pchampion,OU=VMware > > input/unicode.conf: > ssl: CN=οδυσσέας,OU=VMware > nss: CN=οδυσσέας,OU=VMware > > input/unprintable.conf: > ssl: CN=\01\,\02\,\03,OU=\01\02\03 > nss: CN="\01,\02,\03",OU=\01\02\03 basic.conf is exactly what it looks like: CN=pchampion,OU=VMware. Both implementations agree. escape.conf contains a CN with the literal value ,+\;"<> and you can see that NSS doesn't follow RFC 4514 here; it uses the older double-quoting form of escaping. There's a 15-year-old bug on this in NSS [1]. multivalue.conf contains a multivalued AVA with commonName "pchampion", givenName "Jacob", and surname "Champion". They aren't sorted in the same order, and the implementations even disagree on how to represent the givenName attribute. (I'm not convinced that either choice is RFC- 4514-compliant; it doesn't look like GN is registered with IANA as a short name for givenName.) unicode.conf contains a commonName of "οδυσσέας". Both implementations agree, but the only way I was able to get OpenSSL to produce this (rather than a string of escaped hex) was by using the flags XN_FLAG_RFC2253 & ~ASN1_STRFLGS_ESC_MSB in the call to X509_NAME_print_ex(). This should work fine for a database encoding of UTF-8, but would we need to convert for other encodings? Also, I'm not sure how this would handle certificates that aren't UTF-8 encoded. It looks like some UCS variants are legal? unprintable.conf contains the bytes 0x01, 0x02, and 0x03 in the commonName and organizationalUnit. They're backslash-escaped by both implementations, but if you add any other printable escaped characters (such as comma, in the CN example here) NSS will still double-quote the whole thing. --Jacob [1] https://bugzilla.mozilla.org/show_bug.cgi?id=355096
Re: archive status ".ready" files may be created too early
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi" wrote: > At Tue, 26 Jan 2021 19:13:57 +, "Bossart, Nathan" > wrote in >> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" wrote: >> > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan" >> > wrote in >> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" wrote: >> >> > You're right in that regard. There's a window where partial record is >> >> > written when write location passes F0 after insertion location passes >> >> > F1. However, remembering all spanning records seems overkilling to me. >> >> >> >> I'm curious why you feel that recording all cross-segment records is >> >> overkill. IMO it seems far simpler to just do that rather than try to >> > >> > Sorry, my words are not enough. Remembering all spanning records in >> > *shared memory* seems to be overkilling. Much more if it is stored in >> > shared hash table. Even though it rarely the case, it can fail hard >> > way when reaching the limit. If we could do well by remembering just >> > two locations, we wouldn't need to worry about such a limitation. >> >> I don't think it will fail if we reach max_size for the hash table. >> The comment above ShmemInitHash() has this note: >> >> * max_size is the estimated maximum number of hashtable entries. This is >> * not a hard limit, but the access efficiency will degrade if it is >> * exceeded substantially (since it's used to compute directory size and >> * the hash table buckets will get overfull). > > That description means that a shared hash has a directory with fixed > size thus there may be synonyms, which causes degradation. Even though > buckets are preallocated with the specified number, since the minimum > directory size is 256, buckets are allocated at least 256 in a long > run. Minimum on-the-fly allocation size is 32. I haven't calcuated > further precicely, but I'm worried about the amount of spare shared > memory the hash can allocate. On my machine, hash_estimate_size() for the table returns 5,968 bytes. That estimate is for a max_size of 16. In my testing, I've been able to need up to 6 elements in this table, but that required turning off synchronous_commit, adding a long sleep at the end of XLogWrite(), and increasing wal_buffers substantially. This leads me to think that a max_size of 16 elements is typically sufficient. (I may have also accidentally demonstrated that only storing two record boundaries could be insufficient.) >> > Another concern about the concrete patch: >> > >> > NotifySegmentsReadyForArchive() searches the shared hashacquiaing a >> > LWLock every time XLogWrite is called while segment archive is being >> > held off. I don't think it is acceptable and I think it could be a >> > problem when many backends are competing on WAL. >> >> This is a fair point. I did some benchmarking with a few hundred >> connections all doing writes, and I was not able to discern any >> noticeable performance impact. My guess is that contention on this >> new lock is unlikely because callers of XLogWrite() must already hold >> WALWriteLock. Otherwise, I believe we only acquire ArchNotifyLock no >> more than once per segment to record new record boundaries. > > Thanks. I agree that the reader-reader contention is not a problem due > to existing serialization by WALWriteLock. Adding an entry happens > only at segment boundary so the ArchNotifyLock doesn't seem to be a > problem. > > However the function prolongs the WALWriteLock section. 
> Couldn't we somehow move the call to NotifySegmentsReadyForArchive in XLogWrite > out of the WALWriteLock section? I don't see a clean way to do that. XLogWrite() assumes that WALWriteLock is held when it is called, and it doesn't release it at any point. I think we'd have to move NotifySegmentsReadyForArchive() to the callers of XLogWrite() if we wanted to avoid holding onto WALWriteLock for longer. Unless we can point to a measurable performance penalty, I'm not sure this is worth it. Nathan
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi Mark, On Wed, Jan 27, 2021, at 20:34, Mark Rofail wrote: >Here it is. >Array-ELEMENT-foreign-key-v15.patch Thanks. I've tested it and noticed there still seems to be a problem with int2vector? Reading your previous comment a few messages ago, it sounds like this was fixed, but perhaps not? This is the comment that made me think it was fixed: On Sun, Jan 24, 2021, at 21:46, Mark Rofail wrote: >Hello again Joel, >>UPDATE catalog_clone.pg_index SET indkey = '1 2 12345'::int2vector WHERE >>indexrelid = 2837; >>ERROR: operator does not exist: int2vector pg_catalog.@> smallint[] >>LINE 1: ...WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(p... >I apologize for my rash response, I did not quite understand your example at >first, it appears the UPDATE statement is >neither over the referencing nor >the referenced columns, I understand the problem now, please disregard the >previous >email. Thank you for this find, please find the fix below > >Changelog: >- v14 (compatible with current master 2021-01-24, commit >0c1e8845f28bd07ad381c8b0d6701575d967b88e) Here is a small test case that still fails on v15: CREATE TABLE a ( a_id smallint NOT NULL, PRIMARY KEY (a_id) ); CREATE TABLE b ( b_id integer NOT NULL, a_ids int2vector NOT NULL, PRIMARY KEY (b_id) ); ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a(a_id); INSERT INTO a (a_id) VALUES (1); INSERT INTO a (a_id) VALUES (2); INSERT INTO b (b_id, a_ids) VALUES (3, '1 2'::int2vector); ERROR: operator does not exist: smallint[] pg_catalog.<@ int2vector LINE 1: ..."."a" x WHERE ARRAY ["a_id"]::pg_catalog.anyarray OPERATOR(p... ^ HINT: No operator matches the given name and argument types. You might need to add explicit type casts. QUERY: SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "public"."a" x WHERE ARRAY ["a_id"]::pg_catalog.anyarray OPERATOR(pg_catalog.<@) $1::pg_catalog.anyarray FOR KEY SHARE OF x) z) /Joel
Re: Allow matching whole DN from a client certificate
On 1/27/21 3:14 PM, Jacob Champion wrote: > On Tue, 2021-01-26 at 18:43 +, Jacob Champion wrote: >> On Tue, 2021-01-26 at 13:49 +0100, Daniel Gustafsson wrote: >>> The OpenSSL X509_NAME_cmp function use RFC 5280 section 7.1 and RFC 4517 >>> section 4.2.15 (which in turn reference RFC 4514 for the DN string format). >>> libnss has CERT_AsciiToName which is referencing RFCs 1485, 1779 and 2253 in >>> lib/certdb/alg1485.c. Comparing the two would be interesting. >> Yeah. I'll poke around a bit. > Here's some output from a test program I threw together, which parses > identical DER using OpenSSL and NSS and writes the corresponding string > representation. > >> input/basic.conf: >> ssl: CN=pchampion,OU=VMware >> nss: CN=pchampion,OU=VMware >> >> input/escape.conf: >> ssl: CN=\,\+\\\;\"\<\> >> nss: CN=",+\\;\"<>" >> >> input/multivalue.conf: >> ssl: CN=pchampion+SN=Champion+GN=Jacob,OU=VMware >> nss: givenName=Jacob+SN=Champion+CN=pchampion,OU=VMware >> >> input/unicode.conf: >> ssl: CN=οδυσσέας,OU=VMware >> nss: CN=οδυσσέας,OU=VMware >> >> input/unprintable.conf: >> ssl: CN=\01\,\02\,\03,OU=\01\02\03 >> nss: CN="\01,\02,\03",OU=\01\02\03 > basic.conf is exactly what it looks like: CN=pchampion,OU=VMware. Both > implementations agree. > > escape.conf contains a CN with the literal value > > ,+\;"<> > > and you can see that NSS doesn't follow RFC 4514 here; it uses the > older double-quoting form of escaping. There's a 15-year-old bug on > this in NSS [1]. > > multivalue.conf contains a multivalued AVA with commonName "pchampion", > givenName "Jacob", and surname "Champion". They aren't sorted in the > same order, and the implementations even disagree on how to represent > the givenName attribute. (I'm not convinced that either choice is RFC- > 4514-compliant; it doesn't look like GN is registered with IANA as a > short name for givenName.) > > unicode.conf contains a commonName of "οδυσσέας". Both implementations > agree, but the only way I was able to get OpenSSL to produce this > (rather than a string of escaped hex) was by using the flags > > XN_FLAG_RFC2253 & ~ASN1_STRFLGS_ESC_MSB > > in the call to X509_NAME_print_ex(). This should work fine for a > database encoding of UTF-8, but would we need to convert for other > encodings? Also, I'm not sure how this would handle certificates that > aren't UTF-8 encoded. It looks like some UCS variants are legal? > > unprintable.conf contains the bytes 0x01, 0x02, and 0x03 in the > commonName and organizationalUnit. They're backslash-escaped by both > implementations, but if you add any other printable escaped characters > (such as comma, in the CN example here) NSS will still double-quote the > whole thing. > > --Jacob > > [1] https://bugzilla.mozilla.org/show_bug.cgi?id=355096 OK, that bug is a bit ugly. I'm not sure where we want to go with the present patch. Do we want to go with what we have and document the limitations more, or try to do something more elaborate? If so, exactly what? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-27 06:14, Michael Paquier wrote: On Wed, Jan 27, 2021 at 01:00:50AM +0300, Alexey Kondratov wrote: In the new 0002 I moved ACL check to the upper level, i.e. ExecReindex(), and removed expensive text generation in test. Not touched yet some of your previously raised concerns. Also, you made SetRelationTableSpace() to accept Relation instead of Oid, so now we have to open/close indexes in the ReindexPartitions(), I am not sure that I use proper locking there, but it works. Passing down Relation to the new routines makes the most sense to me because we force the callers to think about the level of locking that's required when doing any tablespace moves. + Relation iRel = index_open(partoid, ShareLock); + + if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid)) + SetRelationTableSpace(iRel, + params->tablespaceOid, + InvalidOid); Speaking of which, this breaks the locking assumptions of SetRelationTableSpace(). I feel that we should think harder about this part for partitioned indexes and tables because this looks rather unsafe in terms of locking assumptions with partition trees. If we cannot come up with a safe solution, I would be fine with disallowing TABLESPACE in this case, as a first step. Not all problems have to be solved at once, and even without this part the feature is still useful. I have read more about lock levels and ShareLock should prevent any kind of physical modification of indexes. We already hold ShareLock doing find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so using ShareLock seems to be safe here, but I will look on it closer. + /* It's not a shared catalog, so refuse to move it to shared tablespace */ + if (params->tablespaceOid == GLOBALTABLESPACE_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +errmsg("cannot move non-shared relation to tablespace \"%s\"", +get_tablespace_name(params->tablespaceOid; Why is that needed if CheckRelationTableSpaceMove() is used? This is from ReindexRelationConcurrently() where we do not use CheckRelationTableSpaceMove(). For me it makes sense to add only this GLOBALTABLESPACE_OID check there, since before we already check for system catalogs and after for temp relations, so adding CheckRelationTableSpaceMove() will be a double-check. - indexRelation->rd_rel->reltablespace, + OidIsValid(tablespaceOid) ? + tablespaceOid : indexRelation->rd_rel->reltablespace, Let's remove this logic from index_concurrently_create_copy() and let the caller directly decide the tablespace to use, without a dependency on InvalidOid in the inner routine. A share update exclusive lock is already hold on the old index when creating the concurrent copy, so there won't be concurrent schema changes. Changed. Also added tests for ACL checks, relfilenode changes. Added ACL recheck for multi-transactional case. Added info about TOAST index reindexing. Changed some comments. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom f176a6e5a81ab133fee849f72e4edb8b287d6062 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 27 Jan 2021 00:46:17 +0300 Subject: [PATCH v8] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. 
--- doc/src/sgml/ref/reindex.sgml | 31 +++- src/backend/catalog/index.c | 50 +- src/backend/commands/indexcmds.c | 112 - src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 9 +- src/test/regress/input/tablespace.source | 106 + src/test/regress/output/tablespace.source | 181 ++ 7 files changed, 481 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..e610a0f52c 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + Specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" and system (unless allow_system_table_mods + is set to TRUE) relations. If SCHEMA, + DATABASE or SYSTEM are specified, + then all "mapped" and system relations will be skipped and a single + WARNING will be generated. Indexes on TOAST tables + are reindexed, but not moved the new tablespace. + + + +
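For reference, with the grammar documented in the hunk above, usage would look roughly like this (relation and tablespace names are illustrative; this is the syntax proposed by the patch, not committed behaviour):

-- rebuild one table's indexes into a new tablespace:
REINDEX (TABLESPACE fast_ssd) TABLE measurements;

-- the option combines with the existing ones:
REINDEX (TABLESPACE fast_ssd, CONCURRENTLY, VERBOSE) INDEX measurements_pkey;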
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Thanks for updating the patch. I have just a couple comments on the new (and old) language. On Thu, Jan 28, 2021 at 12:19:06AM +0300, Alexey Kondratov wrote: > Also added tests for ACL checks, relfilenode changes. Added ACL recheck for > multi-transactional case. Added info about TOAST index reindexing. Changed > some comments. > + Specifies that indexes will be rebuilt on a new tablespace. > + Cannot be used with "mapped" and system (unless > allow_system_table_mods say mapped *or* system relations Or maybe: mapped or (unless >allow_system_table_mods<) system relations. > + is set to TRUE) relations. If > SCHEMA, > + DATABASE or SYSTEM are specified, > + then all "mapped" and system relations will be skipped and a single > + WARNING will be generated. Indexes on TOAST tables > + are reindexed, but not moved the new tablespace. moved *to* the new tablespace. I don't know if that needs to be said at all. We talked about it a lot to arrive at the current behavior, but I think that's only due to the difficulty of correcting the initial mistake. > + /* > + * Set the new tablespace for the relation. Do that only in the > + * case where the reindex caller wishes to enforce a new tablespace. I'd say just "/* Set new tablespace, if requested */ You wrote something similar in an earlier revision of your refactoring patch. > + * Mark the relation as ready to be dropped at transaction > commit, > + * before making visible the new tablespace change so as this > won't > + * miss things. This comment is vague. I think Michael first wrote this comment about a year ago. Does it mean "so the new tablespace won't be missed" ? Missed by what ? -- Justin
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
> > > Hey Joel, I apologize for my rash response, I did not quite understand your example > at first, it appears the UPDATE statement is neither over the > referencing nor the referenced columns > The v14 patch fixed the issue where an error would occur for an UPDATE statement that didn't target the referencing or referenced columns. Vectors as referencing columns are not supported and are out of scope for this patch; try using arrays instead. /Mark
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-Jan-28, Alexey Kondratov wrote: > I have read more about lock levels and ShareLock should prevent any kind of > physical modification of indexes. We already hold ShareLock doing > find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so > using ShareLock seems to be safe here, but I will look on it closer. You can look at lock.c where LockConflicts[] is; that would tell you that ShareLock indeed conflicts with ShareUpdateExclusiveLock ... but it does not conflict with itself! So it would be possible to have more than one process doing this thing at the same time, which surely makes no sense. I didn't look at the patch closely enough to understand why you're trying to do something like CLUSTER, VACUUM FULL or REINDEX without holding full AccessExclusiveLock on the relation. But do keep in mind that once you hold a lock on a relation, trying to grab a weaker lock afterwards is pretty pointless. -- Álvaro Herrera39°49'30"S 73°17'W "E pur si muove" (Galileo Galilei)
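The conflict-matrix point can be checked directly from psql; a small sketch (the table name is arbitrary):

-- session 1:
BEGIN;
LOCK TABLE part_parent IN SHARE MODE;

-- session 2: does NOT block, because SHARE does not conflict with itself
BEGIN;
LOCK TABLE part_parent IN SHARE MODE;

-- but this would block, since SHARE UPDATE EXCLUSIVE conflicts with SHARE:
-- LOCK TABLE part_parent IN SHARE UPDATE EXCLUSIVE MODE;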
Index predicate locking and serializability contention
Hi! We're currently having issues with serializable contention at our shop, and after tracking it down very carefully, we found that there are two main reasons for one such conflict: 1. Page-level predicate locks on primary key indexes, whose associated column gets its Id from a sequence. 2. An empty table which gets inserted to but has those inserted rows deleted before committing. We're confident they are the only remaining impediments to allowing transactions not to conflict with each other, because we have changed the code just in the right places to make sure that no conflicts arise when we do both of: - In the first case, the sequence's nextval and increment are set so that the first transaction gets an Id that is on a different index page than the Id the second transaction will get. - Not writing to the table that once got inserted to and emptied. Before this, we also tried setting enable_seqscan to off and inspecting the query plans and SIReadLocks carefully before committing to make sure sequential scans were avoided, but it wasn't sufficient. I believe in the first case the problem is one of granularity which has been mentioned before at https://www.postgresql.org/message-id/flat/20110503064807.GB85173%40csail.mit.edu#836599e3c18caf54052114d46f929cbb ). In the second case, I believe part of the problem could be due to how empty tables are predicately locked - according to https://dba.stackexchange.com/questions/246179/postgresql-serialisation-failure-on-different-ids . In our case, we use empty tables to keep complex invariants checked at the DB level by inserting into them with triggers and making sure deferrable constraints will fail if the rows are still there (thus forcing the committer to run a "consistency-enforcing" job before committing). I'm not sure if our use-case is too particular, but we have found in general that having little data - which some of our tables do, and will still have for the foreseeable future - is sometimes worse than having lots of it due to index locking granularity being at least at page-level. So I have a few questions: - Would index-key / index-gap locking avoid creating serialization anomalies for inserts of consecutive Ids that currently fall in the same index page? Is it in the roadmap? - A colleague made a suggestion which I found no mention of anywhere: would it be possible not to predicate-lock on indices for insertion into GENERATED AS IDENTITY columns, unless of course in the case of UPDATE, INSERT INTO .. OVERRIDING, ALTER TABLE .. RESTART WITH or other similarly conflicting statements? - Is there something that can be done for the problem with empty tables? We currently use Postgres 11, and anything that could help us change how we approach the problem on our side is very much welcome too! Thanks in advance, Marcelo.
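As a rough sketch of the first kind of conflict described above (schema, ids, and interleaving are illustrative, and whether the failure actually fires depends on the reads going through the index and on which index page the generated ids land):

CREATE TABLE events (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, payload text);

-- step 1, session A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO events (payload) VALUES ('a');   -- gets id 1
-- step 2, session B:
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO events (payload) VALUES ('b');   -- gets id 2
-- step 3, session A reads only its own id through the primary-key index:
SELECT * FROM events WHERE id = 1;
-- step 4, session B does the same for its id:
SELECT * FROM events WHERE id = 2;
-- step 5, session A:
COMMIT;
-- step 6, session B: can fail with SQLSTATE 40001 even though the id sets are disjoint,
-- because the page-level predicate locks cover the same primary-key index page:
COMMIT;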
Re: Is it useful to record whether plans are generic or custom?
On 2020/12/04 14:29, Fujii Masao wrote: On 2020/11/30 15:24, Tatsuro Yamada wrote: Hi Torikoshi-san, In this patch, exposing new columns is mandatory, but I think it's better to make it optional by adding a GUC something like 'pgss.track_general_custom_plans. I also feel it makes the number of columns too many. Just adding the total time may be sufficient. I think this feature is useful for DBA. So I hope that it gets committed to PG14. IMHO, many columns are Okay because DBA can select specific columns by their query. Therefore, it would be better to go with the current design. But that design may waste lots of memory. No? For example, when plan_cache_mode=force_custom_plan, the memory used for the columns for generic plans is not used. Regards, Sorry for the super delayed reply. I don't think so, because I suppose that a DBA uses plan_cache_mode when they face an inefficient execution plan. And the parameter will be used as a session-level GUC parameter, not a database-level one. Regards, Tatsuro Yamada
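As a side note on the plan_cache_mode point, a short sketch of the session-level usage a DBA would reach for when a generic plan misbehaves (table and parameter names are illustrative):

SET plan_cache_mode = force_custom_plan;
PREPARE recent_orders(int) AS SELECT count(*) FROM orders WHERE region_id = $1;
EXPLAIN (ANALYZE) EXECUTE recent_orders(42);  -- replanned with the actual parameter on each execution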
Re: Is it useful to record whether plans are generic or custom?
Torikoshi-san, On 2020/12/04 15:03, torikoshia wrote: ISTM now that creating pg_stat_statements_xxx views both for generic and custom plans is better than my PoC patch. And I'm also struggling with the following. | However, I also began to wonder how effective it would be to just | distinguish between generic and custom plans. Custom plans can | include all sorts of plans. and thinking cache validation, generic | plans can also include various plans. | Considering this, I'm starting to feel that it would be better to | not just keeping whether generic or custom but the plan itself as | discussed in the below thread. Yamada-san, Do you think it's effective to just distinguish between generic and custom plans? Sorry for the super delayed reply. Ah, it's okay. It would be better to check both pieces of information using a single view from the perspective of usability. However, it's also okay to divide the information into two views to use memory effectively. Regards, Tatsuro Yamada
Re: Is it useful to record whether plans are generic or custom?
Horiguchi-san, On 2020/12/04 15:37, Kyotaro Horiguchi wrote: And I'm also struggling with the following. | However, I also began to wonder how effective it would be to just | distinguish between generic and custom plans. Custom plans can | include all sorts of plans. and thinking cache validation, generic | plans can also include various plans. | Considering this, I'm starting to feel that it would be better to | not just keeping whether generic or custom but the plan itself as | discussed in the below thread. FWIW, that seems to me to be like some existing extension modules, pg_stat_plans or pg_store_plans. The former is faster but may lose plans; the latter doesn't lose plans but is slower. I feel that we'd better consider a simpler feature if we are intending it to be part of a contrib module. There is also pg_show_plans. Ideally, it would be better to be able to track all of the plan changes by checking some view, since plan stability is important for DBAs when they use PostgreSQL in mission-critical systems. I prefer that the feature be released as a contrib module. :-D Regards, Tatsuro Yamada
Re: Is it useful to record whether plans are generic or custom?
Hi Torikoshi-san, On 2021/01/12 20:36, torikoshia wrote: I suppose it would be normal practice to store past results of pg_stat_statements for future comparisons. If this is the case, I think that if we only add the number of generic plan executions, it will give us a hint to notice the cause of performance degradation due to changes in the plan between generic and custom. For example, if there is a clear difference in the number of times the generic plan is executed between before and after performance degradation as below, it would be natural to check if there is a problem with the generic plan. ... Attached a patch that just adds a generic call counter to pg_stat_statements. I think I'd like to use the view when we face a performance problem and need to find the reason. If we do fixed-point observation (or should I say time-series analysis?) of generic_calls, it allows us to notice counter changes, and we can tell whether the generic plan is the suspect or not. So the patch helps DBAs, I believe. Regards, Tatsuro Yamada
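To make the fixed-point-observation idea concrete, here is a sketch of what such monitoring could look like; generic_calls is the counter the attached patch is described as adding, and the snapshot table name is illustrative:

-- snapshot taken during normal operation:
CREATE TABLE pgss_snapshot AS
    SELECT queryid, calls, generic_calls FROM pg_stat_statements;

-- later, after a slowdown: which statements started running generic plans?
SELECT p.queryid,
       p.calls         - s.calls         AS new_calls,
       p.generic_calls - s.generic_calls AS new_generic_calls
FROM pg_stat_statements p
JOIN pgss_snapshot s USING (queryid)
WHERE p.generic_calls > s.generic_calls;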
Re: vacuum_cost_page_miss default value and modern hardware
On Thu, Jan 21, 2021 at 5:12 PM Peter Geoghegan wrote: > I'm going to go ahead with committing my patch to lower the default > next week. If anybody has any objections to that plan, please speak > up. Just pushed a commit that reduced the default for vacuum_cost_page_miss to 2. One more thing on this: I noticed that the docs for these settings say "There are many situations where it is not important that maintenance commands like VACUUM and ANALYZE finish quickly". That now seems a little dubious to me -- I wonder if we should refresh it. There are not that many problems that can be solved by making VACUUM slower. This is due to the visibility map, and to a lesser degree certain key improvements in index AMs. The text that I quoted was written in 2005, a time when the delay stuff was still very new, and even the earliest visibility map design was still a few years away. Thanks -- Peter Geoghegan
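For anyone who prefers the old behaviour, the setting remains an ordinary GUC; for example:

SHOW vacuum_cost_page_miss;                   -- 2 with the new default
ALTER SYSTEM SET vacuum_cost_page_miss = 10;  -- restore the historical default
SELECT pg_reload_conf();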
Re: Perform COPY FROM encoding conversions in larger chunks
Hi Heikki, 0001 through 0003 are straightforward, and I think they can be committed now if you like. 0004 is also pretty straightforward. The check you proposed upthread for pg_upgrade seems like the best solution to make that workable. I'll take a look at 0005 soon. I measured the conversions that were rewritten in 0003, and there is indeed a noticeable speedup: Big5 to EUC-TW: head 196ms, 0001-3 152ms. EUC-TW to Big5: head 190ms, 0001-3 144ms. I've attached the driver function for reference. Example use: select drive_conversion( 1000, 'euc_tw'::name, 'big5'::name, convert('a few kB of utf8 text here', 'utf8', 'euc_tw') ); I took a look at the test suite also, and the only thing to note is a couple of places where the comment doesn't match the code: + -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2) + byte1 = hex('0e'); + for byte2 in hex('a1')..hex('df') loop +return next b(byte1, byte2); + end loop; + + -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3) + byte1 = hex('0f'); + for byte2 in hex('a1')..hex('fe') loop +for byte3 in hex('a1')..hex('fe') loop + return next b(byte1, byte2, byte3); +end loop; + end loop; Not sure if it matters, but thought I'd mention it anyway. -- John Naylor EDB: http://www.enterprisedb.com drive_conversion.c Description: Binary data
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi, Mark: + if (ARR_NDIM(arr) != 1 || + ARR_HASNULL(arr) || + ARR_ELEMTYPE(arr) != CHAROID) + elog(ERROR, "confreftype is not a 1-D char array"); I think the ARR_HASNULL(arr) condition is not reflected in the error message. +* Array foreign keys support only UPDATE/DELETE NO ACTION, UPDATE/DELETE +* RESTRICT amd DELETE CASCADE actions I don't see CASCADE in the if condition that follows the above comment. + charreftype;/* FKCONSTR_REF_xxx code */ The code would be FKCONSTR_REF_EACH_ELEMENT and FKCONSTR_REF_PLAIN. I think you can mention them in the comment. Cheers On Wed, Jan 27, 2021 at 11:34 AM Mark Rofail wrote: > Hello Joel, > > >> I think you forgot to attach the patch. >> > Appears so, sorry about that. > > Here it is. > > /Mark >
RE: Enhance traceability of wal_level changes for backup management
Hello On Thursday, January 21, 2021 11:19 PM I wrote: > If no one disagrees with it, I'll proceed with (1) below. > > (1) writing the time or LSN in the control file to indicate when/where > wal_level > is changed to 'minimal' > from a higher level, to invalidate the old backups or make alerts to users. I attached the first patch, which implements this idea. It was aligned by pgindent and shows no regression. Best Regards, Takamichi Osumi trace_wal_level_drop_v01.patch Description: trace_wal_level_drop_v01.patch
Re: protect pg_stat_statements_info() for being used without the library loaded
On Thu, Jan 28, 2021 at 3:53 AM Jaime Casanova wrote: > > Hi, > > Attached is a small patch for ${subject} Good catch, and patch looks good to me.
Re: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM
On Wed, Jan 27, 2021 at 11:16:26PM +, Bossart, Nathan wrote: > On 1/27/21, 11:07 AM, "Justin Pryzby" wrote: > > This just came up for me: > > > > I have a daily maintenance script which pro-actively vacuums tables: > > freezing > > historic partitions, vacuuming current tables if the table's relfrozenxid is > > old, and to encourage indexonly scan. > > > > I'm checking the greatest(age(toast,main)) and vacuum the table (and > > implicitly > > its toast) whenever either is getting old. > > > > But it'd be more ideal if I could independently vacuum the main table if > > it's > > old, but not the toast table. > > Thanks for chiming in. > > It looks like we were leaning towards only adding the > TOAST_TABLE_CLEANUP option, which is already implemented internally > with VACOPT_SKIPTOAST. It's already possible to vacuum a TOAST table > directly, so we can probably do without the MAIN_RELATION_CLEANUP > option. > > I've attached a new patch that only adds TOAST_TABLE_CLEANUP. Thanks, I wrote my message after running into the issue and remembered this thread. I didn't necessarily mean to send another patch :) My only comment is on the name: TOAST_TABLE_CLEANUP. "Cleanup" suggests that the (main or only) purpose is to "clean" dead tuples to avoid bloat. But in my use case, the main purpose is to avoid XID wraparound (or its warnings). Okay, my second only comment is that this: | This option cannot be disabled when the FULL option is | used. Should it instead be ignored if FULL is also specified ? Currently only PARALLEL and DISABLE_PAGE_SKIPPING cause an error when used with FULL. That's documented for PARALLEL, but I think it should also be documented for DISABLE_PAGE_SKIPPING (which is however an advanced option). -- Justin
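For context, the kind of age check described above can be written roughly like this (the threshold and names are illustrative, and GREATEST simply ignores the NULL for tables without a TOAST relation); with the option under discussion, the main relation could then be vacuumed on its own:

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS main_age,
       age(t.relfrozenxid) AS toast_age,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS worst_age
FROM pg_class c
LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind = 'r'
ORDER BY worst_age DESC;

-- proposed syntax from this thread, skipping the TOAST table:
-- VACUUM (TOAST_TABLE_CLEANUP FALSE) some_table;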
Re: Thoughts on "killed tuples" index hint bits support on standby
On Wed, Jan 27, 2021 at 11:30 AM Michail Nikolaev wrote: > Sorry for necroposting, but if someone is interested - I hope the patch is > ready now and available in the other thread (1). I wonder if it would help to not actually use the LP_DEAD bit for this. Instead, you could use the currently-unused-in-indexes LP_REDIRECT bit. The general idea here is that this status bit is interpreted as a "recently dead" bit in nbtree. It is always something that is true *relative* to some *ephemeral* coarse-grained definition of recently dead. Whether or not "recently dead" means "dead to my particular MVCC snapshot" can be determined using some kind of in-memory state that won't survive a crash (or a per-index-page epoch?). We still have LP_DEAD bits in this world, which work in the same way as now (and so unambiguously indicate that the index tuple is dead-to-all, at least on the primary). I think that this idea of a "recently dead" bit might be able to accomplish a few goals all at once, including your specific goal of setting "hint bits" in indexes. The issue here is that it would also be nice to use a "recently dead" bit on the primary, but if you do that then maybe you're back to the original problem. OTOH, I also think that we could repurpose the LP_NORMAL bit in index AMs, so we could potentially have 3 different definitions of dead-ness without great difficulty! Perhaps this doesn't seem all that helpful, since I am expanding scope here for a project that is already very difficult. And maybe it just isn't helpful -- I don't know. But it is at least possible that expanding scope could actually *help* your case a lot, even if you only ever care about your original goal. My personal experience with nbtree patches is just that: sometimes *expanding* scope actually makes things *easier*, not harder. This is sometimes called "The Inventor's Paradox": https://en.wikipedia.org/wiki/Inventor%27s_paradox Consider the example of my work on nbtree in PostgreSQL 12. It was actually about 6 or 7 enhancements, not just 1 big enhancement -- it is easy to forget that now. I think that it was *necessary* to add at least 5 of these enhancements at the same time (maybe 2 or so really were optional). This is deeply counter-intuitive, but still seems to be true in my case. The specific reasons why I believe that this is true of the PostgreSQL 12 work are complicated, but it boils down to this: the ~5 related-though-distinct enhancements were individually not all that compelling (certainly not compelling enough to make on-disk changes for), but were much greater than the sum of their parts when considered together. Maybe I got lucky there. More generally, the nbtree stuff that I worked on in 12, 13, and now 14 actually feels like one big project to me. I will refrain from explaining exactly why that is right now, but you might be very surprised at how closely related it all is. I didn't exactly plan it that way, but trying to see things in the most general terms turned out to be a huge asset to me. If there are several independent reasons to move in one particular direction all at once, you can generally afford to be wrong about a lot of things without it truly harming anybody. Plus it's easy to see that you will not frustrate future work that is closely related but distinct when that future work is *directly enabled* by what you're doing. What's more, the extra stuff I'm talking about probably has a bunch of other benefits on the primary, if done well. 
Consider how the deletion stuff with LP_DEAD bits now considers "extra" index tuples to delete when they're close by. We could even do something like that with these LP_REDIRECT/recently dead bits on the primary. I understand that it's hard to have a really long term outlook. I think that that's almost necessary when working on a project like this, though. Don't forget that this works both ways. Maybe a person that wanted to do this "recently dead" stuff (which sounds really interesting to me right now) would similarly be compelled to consider the bigger picture, including the question of setting hint bits on standbys -- this other person had better not make that harder in the future, for the same reasons (obviously what you want to do here makes sense, it just isn't everything to everybody). This is yet another way in which expanding scope can help. -- Peter Geoghegan
Re: Thoughts on "killed tuples" index hint bits support on standby
On Wed, Jan 27, 2021 at 5:24 PM Peter Geoghegan wrote: > The issue here is that it would also be nice to use a "recently dead" > bit on the primary, but if you do that then maybe you're back to the > original problem. OTOH, I also think that we could repurpose the > LP_NORMAL bit in index AMs, so we could potentially have 3 different > definitions of dead-ness without great difficulty! To be clear, what I mean is that you currently have two bits in line pointers. In an nbtree leaf page we only currently use one -- the LP_DEAD bit. But bringing the second bit into it allows us to have a representation for two additional states (not just one), since of course the meaning of the second bit can be interpreted using the LP_DEAD bit. You could, for example, have encodings for each of the following distinct per-LP states, while still preserving on-disk compatibility: "Not known to be dead in any sense" (0), "Unambiguously dead to all" (what we now simply call LP_DEAD), "recently dead on standby" (currently-spare bit is set), and "recently dead on primary" (both 'lp_flags' bits set). Applying FPIs on the standby would have to be careful to preserve a standby-only bit. I'm probably not thinking of every subtlety, but "putting all of the pieces of the puzzle together for further consideration" is likely to be a useful exercise. -- Peter Geoghegan
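For readers following along, the two line-pointer bits in question are easy to look at with pageinspect; a small sketch (relation names are illustrative; the heap is shown because that is where all four lp_flags values are used today, while nbtree leaf pages currently only ever use the dead state):

-- 0 = LP_UNUSED, 1 = LP_NORMAL, 2 = LP_REDIRECT, 3 = LP_DEAD
SELECT lp, lp_off, lp_flags
FROM heap_page_items(get_raw_page('some_table', 0));

-- on an nbtree leaf page, pageinspect reports the LP_DEAD state as "dead":
SELECT itemoffset, dead FROM bt_page_items('some_table_pkey', 1);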
Re: row filtering for logical replication
Hi, On 2020-12-17 09:43:30 +0300, Önder Kalacı wrote: > The above part can be considered the core of the logic, executed per tuple. > As far as I can see, it has two downsides. > > First, calling `expression_planner()` for every tuple can be quite > expensive. I created a sample table, loaded data and ran a quick benchmark > to see its effect. I attached the very simple script that I used to > reproduce the issue on my laptop. I'm pretty sure you can find nicer ways > of doing similar perf tests, just sharing as a reference. > > The idea of the test is to add a WHERE clause to a table, but none of the > tuples are filtered out. They just go through this code-path and send it to > the remote node. > > #rows Patched| Master > 1M 00:00:25.067536| 00:00:16.633988 > 10M 00:04:50.770791| 00:02:40.945358 > > > So, it seems a significant overhead to me. What do you think? That seems almost prohibitively expensive. I think at the very least some of this work would need to be done in a cached manner, e.g. via get_rel_sync_entry(). > Secondly, probably more importantly, allowing any operator is as dangerous > as allowing any function as users can create/overload operator(s). That's not safe, indeed. It's not even just creating/overloading operators; as far as I can tell the expression can contain just plain function calls. The issue also isn't primarily that the user can overload functions, it's that logical decoding is a limited environment, and not everything is safe to do within it. E.g. only catalog tables can be accessed. Therefore I don't think we can allow arbitrary expressions. > The other problematic area was the performance, as calling > `expression_planner()` for every tuple can be very expensive. To avoid > that, it might be considered to ask users to provide a function instead of > a free form WHERE clause, such that if the function returns true, the tuple > is sent. The allowed functions need to be immutable SQL functions with bool > return type. As we can parse the SQL functions, we should be able to allow > only functions that rely on the above mentioned procs. We can apply as many > restrictions (such as no modification query) as possible. For example, see > below: > ``` I don't think that would get us very far. From a safety aspect: A function's body can be changed by the user at any time, therefore we cannot rely on analyses of the function's body. From a performance POV: SQL functions are planned at every invocation, so that'd not buy us much either. I think what you would have to do instead is to ensure that the expression is "simple enough", and then process it into a cheaply executable format in get_rel_sync_entry(). I'd suggest that in the first version you just allow a simple ANDed list of 'foo.bar op constant' expressions. Does that make sense? Greetings, Andres Freund
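To make that last suggestion concrete, a sketch of a filter that stays within a simple ANDed list of column-operator-constant clauses, written in the WHERE-clause syntax this thread's patch proposes (table and column names are illustrative):

CREATE PUBLICATION shipped_orders FOR TABLE orders
    WHERE (status = 'shipped' AND region_id < 100);

CREATE SUBSCRIPTION shipped_sub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION shipped_orders;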
RE: [POC] Fast COPY FROM command for the table with foreign partitions
Hi Andrey, > Some time ago I suggested an additional optimization [1] which can > additionally speed up COPY by 2-4 times. Maybe you can perform the > benchmark for this solution too? Sorry for the late reply, I only just had time to run this test now. However, the patch no longer applies; I tried to apply it on e42b3c3bd6 (2021/1/26) but failed. Can you send a rebased version? Regards, Tang
RE: [PoC] Non-volatile WAL buffer
From: Tomas Vondra > (c) As mentioned before, PMEM behaves differently with concurrent > access, i.e. it reaches peak throughput with a relatively low number of > threads writing data, and then the throughput drops quite quickly. I'm > not sure if the same thing applies to pmem_drain() too - if it does, we > may need something like we have for insertions, i.e. a handful of locks > allowing a limited number of concurrent inserts. > I think WALWriteLock itself (i.e. acquiring/releasing it) is not an > issue - the problem is that writing the WAL to persistent storage itself > is expensive, and we're waiting for that. > > So it's not clear to me if removing the lock (and allowing multiple > processes to do pmem_drain concurrently) can actually help, considering > pmem_drain() should flush writes from other processes anyway. I may be off track here, but HPE's benchmark using Oracle 18c, placing the REDO log file on Intel PMEM in App Direct mode, showed only a 27% performance increase compared to even a "SAS" SSD. https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=a00074230enw The just-released Oracle 21c has started to support placing data files on PMEM, eliminating the overhead of the buffer cache. It's interesting that this new feature is categorized under "Manageability", not "Performance and scalability." https://docs.oracle.com/en/database/oracle/oracle-database/21/nfcon/persistent-memory-database-258797846.html They recommend placing REDO logs on DAX-aware file systems. I wonder what's behind this. https://docs.oracle.com/en/database/oracle/oracle-database/21/admin/using-PMEM-db-support.html#GUID-D230B9CF-1845-4833-9BF7-43E9F15B7113 "You can use PMEM Filestore for database datafiles and control files. For performance reasons, Oracle recommends that you store redo log files as independent files in a DAX-aware filesystem such as EXT4/XFS." Regards Takayuki Tsunakawa
Re: On login trigger: take three
On Mon, Dec 28, 2020 at 5:46 PM Masahiko Sawada wrote: > > On Sat, Dec 26, 2020 at 4:04 PM Pavel Stehule wrote: > > > > > > > > so 26. 12. 2020 v 8:00 odesílatel Pavel Stehule > > napsal: > >> > >> Hi > >> > >> > >>> Thank you. > >>> I have applied all your fixes in on_connect_event_trigger-12.patch. > >>> > >>> Concerning enable_client_connection_trigger GUC, I think that it is > >>> really useful: it is the fastest and simplest way to disable login > >>> triggers in case > >>> of some problems with them (not only for superuser itself, but for all > >>> users). Yes, it can be also done using "ALTER EVENT TRIGGER DISABLE". > >>> But assume that you have a lot of databases with different login policies > >>> enforced by on-login event triggers. And you want temporary disable them > >>> all, for example for testing purposes. > >>> In this case GUC is most convenient way to do it. > >>> > >> > >> There was typo in patch > >> > >> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > >> index f810789..8861f1b 100644 > >> --- a/doc/src/sgml/config.sgml > >> +++ b/doc/src/sgml/config.sgml > >> @@ -1,4 +1,4 @@ > >> - > >> +\ > >> > >> I have not any objections against functionality or design. I tested the > >> performance, and there are no negative impacts when this feature is not > >> used. There is significant overhead related to plpgsql runtime > >> initialization, but when this trigger will be used, then probably some > >> other PLpgSQL procedures and functions will be used too, and then this > >> overhead can be ignored. > >> > >> * make without warnings > >> * make check-world passed > >> * doc build passed > >> > >> Possible ToDo: > >> > >> The documentation can contain a note so usage connect triggers in > >> environments with short life sessions and very short fast queries without > >> usage PLpgSQL functions or procedures can have negative impact on > >> performance due overhead of initialization of PLpgSQL engine. > >> > >> I'll mark this patch as ready for committers > > > > > > looks so this patch has not entry in commitfestapp 2021-01 > > > > Yeah, please register this patch before the next CommitFest[1] starts, > 2021-01-01 AoE[2]. > Konstantin, did you register this patch in any CF? Even though the reviewer seems to be happy with the patch, I am afraid that we might lose track of this unless we register it. -- With Regards, Amit Kapila.
Re: Wrong usage of RelationNeedsWAL
At Wed, 27 Jan 2021 02:48:48 -0800, Noah Misch wrote in > On Thu, Jan 21, 2021 at 01:23:36AM -0800, Noah Misch wrote: > > On Thu, Jan 21, 2021 at 06:02:11PM +0900, Kyotaro Horiguchi wrote: > > > Perhaps I'm missing something, but the patch doesn't pass the v5-0001 > > > test with wal_level=minimal? > > > > Correct. The case we must avoid is letting an old snapshot read an > > early-pruned page without error. v5-0001 expects "ERROR: snapshot too > > old". > > The patch suspends early pruning, so that error is not applicable. > > I think the attached version is ready. The changes since v6nm are cosmetic: > > - Wrote log messages > - Split into two patches, since the user-visible bugs are materially different > - Fixed typos > - Ran perltidy > > Is it okay if I push these on Saturday, or would you like more time to > investigate? Thank you for the new version. I studied the sto feature further and concluded that the checker side is fine, in that it always follows the changes of the page LSN. So what we can do for the issue is to set a seemingly correct page LSN at pruning, or to refrain from early pruning while we are skipping WAL. The reason I took the former is that I thought the latter might be a problem, since early pruning would be postponed by a long-running WAL-skipping transaction. So the patch looks fine to me. The commit message makes sense. However, is it OK that the existing tests (modules/snapshot_too_old) fail when wal_level=minimal? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Wed, Jan 27, 2021 at 7:35 PM Li Japin wrote: > > I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with > > refresh true refreshes only the newly added publications, because what > > we do in AlterSubscription_refresh() is that we fetch the tables > > associated with the publications from the publisher, compare them with > > the previously fetched tables from that publication and add the new > > tables or remove the table that don't exist in that publication > > anymore. > > > > For ALTER SUBSCRIPTION ... DROP PUBLICATION, also we can do the same > > thing i.e. refreshes only the dropped publications. > > > > Thoughts? > > Agreed. We just only need to refresh the added/dropped publications. > Furthermore, for publications that will be dropped, we do not need the > “copy_data” option, right? I think you are right. The copy_data option doesn't make sense for ALTER SUBSCRIPTION ... DROP PUBLICATION, maybe we should throw an error if the user specifies it. Of course, we need that option for ALTER SUBSCRIPTION ... ADD PUBLICATION. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
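Under that behaviour, the two commands would look roughly like this (names are illustrative; copy_data would be accepted only when adding):

ALTER SUBSCRIPTION mysub ADD PUBLICATION newpub WITH (refresh = true, copy_data = true);
ALTER SUBSCRIPTION mysub DROP PUBLICATION oldpub WITH (refresh = true);
-- specifying copy_data together with DROP PUBLICATION would raise an error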
Re: Is Recovery actually paused?
On Wed, Jan 27, 2021 at 2:28 PM Dilip Kumar wrote: > > On Wed, Jan 27, 2021 at 2:06 PM Yugo NAGATA wrote: > > > > On Wed, 27 Jan 2021 13:29:23 +0530 > > Dilip Kumar wrote: > > > > > On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada > > > wrote: > > > > > > > > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas > > > > wrote: > > > > > > > > > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy > > > > > wrote: > > > > > > +1 to just show the recovery pause state in the output of > > > > > > pg_is_wal_replay_paused. But, should the function name > > > > > > "pg_is_wal_replay_paused" be something like > > > > > > "pg_get_wal_replay_pause_state" or some other? To me, when "is" > > > > > > exists > > > > > > in a function, I expect a boolean output. Others may have better > > > > > > thoughts. > > > > > > > > > > Maybe we should leave the existing function pg_is_wal_replay_paused() > > > > > alone and add a new one with the name you suggest that returns text. > > > > > That would create less burden for tool authors. > > > > > > > > +1 > > > > > > > > > > Yeah, we can do that, I will send an updated patch soon. > > > > This means pg_is_wal_replay_paused is left without any change and this > > returns whether pause is requested or not? If so, it seems good to modify > > the documentation of this function in order to note that this could not > > return the actual pause state. > > Yes, we can say that it will return true if the replay pause is > requested. I am changing that in my new patch. I have modified the patch, changes - I have added a new interface pg_get_wal_replay_pause_state to get the pause request state - Now, we are not waiting for the recovery to actually get paused so I think it doesn't make sense to put a lot of checkpoints to check the pause requested so I have removed that check from the recoveryApplyDelay but I think it better we still keep that check in the WaitForWalToBecomeAvailable because it can wait forever before the next wal get available. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v8-0001-Provide-a-new-interface-to-get-the-recovery-pause.patch Description: Binary data
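With the v8 patch, a tool could then distinguish a requested pause from an effective one roughly like this (the exact labels returned by the new function are whatever the patch implements, e.g. 'pause requested' vs. 'paused'):

SELECT pg_wal_replay_pause();
SELECT pg_is_wal_replay_paused();        -- true as soon as the pause has been requested
SELECT pg_get_wal_replay_pause_state();  -- reports whether replay has actually stopped
SELECT pg_wal_replay_resume();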
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Thu, 28 Jan 2021 at 12:22, Bharath Rupireddy wrote: > On Wed, Jan 27, 2021 at 7:35 PM Li Japin wrote: >> > I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with >> > refresh true refreshes only the newly added publications, because what >> > we do in AlterSubscription_refresh() is that we fetch the tables >> > associated with the publications from the publisher, compare them with >> > the previously fetched tables from that publication and add the new >> > tables or remove the table that don't exist in that publication >> > anymore. >> > >> > For ALTER SUBSCRIPTION ... DROP PUBLICATION, also we can do the same >> > thing i.e. refreshes only the dropped publications. >> > >> > Thoughts? >> >> Agreed. We just only need to refresh the added/dropped publications. >> Furthermore, for publications that will be dropped, we do not need the >> “copy_data” option, right? > > I think you are right. The copy_data option doesn't make sense for > ALTER SUBSCRIPTION ... DROP PUBLICATION, maybe we should throw an > error if the user specifies it. Of course, we need that option for > ALTER SUBSCRIPTION ... ADD PUBLICATION. > Thanks for your confirm. Attached v3 patch fix it. * v3-0001 Only refresh the publications that will be added/dropped, also remove "copy_data" option from DROP PUBLICATION. * v3-0002 Add a new testcase for DROP PUBLICATION WITH (copy_data). * v3-0003 Remove the reference of REFRESH PUBLICATION in DROP PUBLICATION. * v3-0004 Do not change. Attaching v3 patches, please consider these for further review. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. >From bff45768b066ccf94de6dae9f687cf54212900b1 Mon Sep 17 00:00:00 2001 From: Japin Li Date: Tue, 26 Jan 2021 15:43:52 +0800 Subject: [PATCH v3 1/4] Introduce a new syntax to add/drop publications At present, if we want to update publications in subscription, we can use SET PUBLICATION, however, it requires supply all publications that exists and the new publications if we want to add new publications, it's inconvenient. The new syntax only supply the new publications. When the refresh is true, it only refresh the new publications. --- src/backend/commands/subscriptioncmds.c | 121 src/backend/parser/gram.y | 20 src/include/nodes/parsenodes.h | 2 + 3 files changed, 143 insertions(+) diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 082f7855b8..88fa7f1b3f 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -46,6 +46,8 @@ #include "utils/syscache.h" static List *fetch_table_list(WalReceiverConn *wrconn, List *publications); +static List *merge_subpublications(HeapTuple tuple, TupleDesc tupledesc, + List *publications, bool addpub); /* * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands. @@ -857,6 +859,51 @@ AlterSubscription(AlterSubscriptionStmt *stmt) break; } + case ALTER_SUBSCRIPTION_ADD_PUBLICATION: + case ALTER_SUBSCRIPTION_DROP_PUBLICATION: + { +bool copy_data = false; +bool isadd = stmt->kind == ALTER_SUBSCRIPTION_ADD_PUBLICATION; +bool refresh; +List *publist = NIL; + +publist = merge_subpublications(tup, RelationGetDescr(rel), +stmt->publication, isadd); + +parse_subscription_options(stmt->options, + NULL, /* no "connect" */ + NULL, NULL, /* no "enabled" */ + NULL, /* no "create_slot" */ + NULL, NULL, /* no "slot_name" */ + isadd ? 
&copy_data : NULL, + NULL, /* no "synchronous_commit" */ + &refresh, + NULL, NULL, /* no "binary" */ + NULL, NULL); /* no "streaming" */ +values[Anum_pg_subscription_subpublications - 1] = + publicationListToArray(publist); +replaces[Anum_pg_subscription_subpublications - 1] = true; + +update_tuple = true; + +/* Refresh if user asked us to. */ +if (refresh) +{ + if (!sub->enabled) + ereport(ERROR, +(errcode(ERRCODE_SYNTAX_ERROR), + errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"), + errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false)."))); + + /* Only refresh the added/dropped list of publications. */ + sub->publications = stmt->publication; + + AlterSubscription_refresh(sub, copy_data); +} + +break; + } + case ALTER_SUBSCRIPTION_REFRESH: { bool copy_data; @@ -1278,3 +1325,77 @@ fetch_table_list(WalReceiverConn *wrconn, List *publications) return tablelist; } + +/* + * Merge current subscription's publications and user specified publications + * by ADD/DROP PUBLICATIONS. + * + * If isadd == true, we will add the list of publications into current + * subscription's publications. Otherwise, we will delete the list of + * publications from current su
Re: On login trigger: take three
On Thu, Jan 28, 2021 at 8:17 AM Amit Kapila wrote: > > On Mon, Dec 28, 2020 at 5:46 PM Masahiko Sawada wrote: > > > > On Sat, Dec 26, 2020 at 4:04 PM Pavel Stehule > > wrote: > > > > > > > > > > > > On Sat, 26 Dec 2020 at 8:00, Pavel Stehule > > > wrote: > > >> > > >> Hi > > >> > > >> > > >>> Thank you. > > >>> I have applied all your fixes in on_connect_event_trigger-12.patch. > > >>> > > >>> Concerning the enable_client_connection_trigger GUC, I think that it is > > >>> really useful: it is the fastest and simplest way to disable login > > >>> triggers in case > > >>> of problems with them (not only for the superuser itself, but for all > > >>> users). Yes, it can also be done using "ALTER EVENT TRIGGER DISABLE". > > >>> But assume that you have a lot of databases with different login > > >>> policies enforced by on-login event triggers, and you want to temporarily > > >>> disable them all, for example for testing purposes. > > >>> In this case the GUC is the most convenient way to do it. > > >>> > > >> > > >> There was a typo in the patch > > >> > > >> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > > >> index f810789..8861f1b 100644 > > >> --- a/doc/src/sgml/config.sgml > > >> +++ b/doc/src/sgml/config.sgml > > >> @@ -1,4 +1,4 @@ > > >> - > > >> +\ > > >> > > >> I have no objections to the functionality or design. I tested the > > >> performance, and there are no negative impacts when this feature is not > > >> used. There is significant overhead related to PL/pgSQL runtime > > >> initialization, but when this trigger is used, probably some > > >> other PL/pgSQL procedures and functions will be used too, and then this > > >> overhead can be ignored. > > >> > > >> * make without warnings > > >> * make check-world passed > > >> * doc build passed > > >> > > >> Possible ToDo: > > >> > > >> The documentation could contain a note that using connect triggers in > > >> environments with short-lived sessions and very short, fast queries > > >> that don't otherwise use PL/pgSQL functions or procedures can have a negative impact > > >> on performance due to the overhead of initializing the PL/pgSQL engine. > > >> > > >> I'll mark this patch as ready for committers > > > > > > > > > It looks like this patch has no entry in the commitfest app for 2021-01 > > > > > > > Yeah, please register this patch before the next CommitFest[1] starts, > > 2021-01-01 AoE[2]. > > > > Konstantin, did you register this patch in any CF? Even though the > reviewer seems to be happy with the patch, I am afraid that we might > lose track of this unless we register it. > I see the CF entry (https://commitfest.postgresql.org/31/2900/) for this work. Sorry for the noise. -- With Regards, Amit Kapila.
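To make the feature under review here concrete, a minimal sketch of the kind of on-login event trigger being discussed follows. The event name client_connection is an assumption inferred from the enable_client_connection_trigger GUC mentioned above; the syntax in the final patch may differ:

CREATE FUNCTION log_client_connection() RETURNS event_trigger AS $$
BEGIN
    -- runs once at session start; keep it cheap, since every connection pays
    -- the PL/pgSQL initialization overhead noted in the review above
    RAISE NOTICE 'user % connected to database %', session_user, current_database();
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER on_login_trigger ON client_connection  -- event name assumed
    EXECUTE FUNCTION log_client_connection();

-- and, per the discussion, the proposed GUC would be the quick switch to
-- disable all such triggers at once, e.g. in postgresql.conf:
--     enable_client_connection_trigger = off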
Re: logical replication worker accesses catalogs in error context callback
On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy wrote: > > On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu wrote: > > Thanks for pointing to the changes in the commit message. I corrected > them. Attaching v4 patch set, please consider it for further review. > About 0001, have we tried to reproduce the actual bug here, that is, a case where the error_callback is called and we actually hit a problem? I feel that with the right test case we should hit the Assert (Assert(IsTransactionState());) in SearchCatCacheInternal, because there we expect the transaction to be in a valid state. I understand that the transaction is in a broken state at that time, but having a test case that hits the actual bug makes it easy to test the fix. -- With Regards, Amit Kapila.
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi Mark, On Wed, Jan 27, 2021, at 22:36, Mark Rofail wrote: > Vectors as refrencing columns are not supported and out of scope of this > patch. Try to use arrays. OK, not a biggie, but I think the user at least should get an error message immediately when trying to add the foreign key on an incompatible column, just like we would get if e.g. trying to add a fk on numeric[] -> smallint or some other incompatible combination. The first error message looks good: CREATE TABLE a ( a_id smallint NOT NULL, PRIMARY KEY (a_id) ); CREATE TABLE b ( b_id integer NOT NULL, a_ids numeric[] NOT NULL, PRIMARY KEY (b_id) ); joel=# ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a(a_id); ERROR: foreign key constraint "b_a_ids_fkey" cannot be implemented DETAIL: Key column "a_ids" has element type numeric which does not have a default btree operator class that's compatible with class "int2_ops". But you don't get any error message if a_ids is instead int2vector: CREATE TABLE b ( b_id integer NOT NULL, a_ids int2vector NOT NULL, PRIMARY KEY (b_id) ); ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a(a_id); It's silently added without any error. The error first comes when you try to insert data: INSERT INTO a (a_id) VALUES (1); INSERT INTO a (a_id) VALUES (2); INSERT INTO b (b_id, a_ids) VALUES (3, '1 2'::int2vector); ERROR: operator does not exist: smallint[] pg_catalog.<@ int2vector LINE 1: ..."."a" x WHERE ARRAY ["a_id"]::pg_catalog.anyarray OPERATOR(p... ^ HINT: No operator matches the given name and argument types. You might need to add explicit type casts. QUERY: SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "public"."a" x WHERE ARRAY ["a_id"]::pg_catalog.anyarray OPERATOR(pg_catalog.<@) $1::pg_catalog.anyarray FOR KEY SHARE OF x) z) This means, as a user, I might not be aware of the vector restriction when adding the foreign keys for my existing schema, and think everything is fine, and not realize there is a problem until new data arrives. /Joel
Re: Support for NSS as a libpq TLS backend
On Wed, Jan 27, 2021 at 06:47:17PM +, Jacob Champion wrote: > So to check understanding: the `openssl` config variable is still alive > for MSVC builds; it just turns that into `--with-ssl=openssl` in the > fake CONFIGURE_ARGS? Yeah, I think that keeping both variables separated in the MSVC scripts is the most straightforward option, as this passes down a path. Once there is a value for nss, we'd need to properly issue an error if both OpenSSL and NSS are specified. > Since SSL is an obsolete term, and the choice of OpenSSL vs NSS vs > [nothing] affects server operation (such as cryptohash) regardless of > whether or not connection-level TLS is actually used, what would you > all think about naming this option --with-crypto? I.e. > > --with-crypto=openssl > --with-crypto=nss Looking around, curl has multiple switches, one per library, with the one named --with-ssl used for OpenSSL, but it needs to be able to use multiple libraries at run time. I can spot that libssh2 uses what you are proposing. It seems to me that --with-ssl is a bit more popular, but not by that much: wget, wayland, some Apache stuff (it uses a path as the option value). Anyway, what you are suggesting sounds like a good idea in the context of Postgres. Daniel? -- Michael signature.asc Description: PGP signature
Re: Single transaction in the tablesync worker?
Hi Amit. PSA the v21 patch for the Tablesync Solution1. Main differences from v20: + Rebased to latest OSS HEAD @ 27/Jan + v21 is a merging of patches [v17] and [v20], which was made necessary when it was found [ak0127] that the v20 usage of TEMPORARY tablesync slots did not work correctly. v21 reverts to using PERMANENT tablesync slots same as implemented in v17, while retaining other review comment fixes made for v18, v19, v20. [v17] https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com [v20] https://www.postgresql.org/message-id/CAHut%2BPuNwSujoL_dwa%3DTtozJ_vF%3DCnJxjgQTCmNBkazd8J1m-A%40mail.gmail.com [ak0127] https://www.postgresql.org/message-id/CAA4eK1LDsj9kw4FbWAw3CMHyVsjafgDum03cYy-wpGmor%3D8-1w%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia v21-0002-Tablesync-extra-logging.patch Description: Binary data v21-0001-Tablesync-Solution1.patch Description: Binary data
RE: libpq debug log
On Mon, Jan 25, 2021 10:11 PM (JST), Alvaro Herrera wrote: > On 2021-Jan-25, tsunakawa.ta...@fujitsu.com wrote: > > > Iwata-san, > > [...] > > > Considering these and the compilation error Kirk-san found, I'd like > > you to do more self-review before I resume this review. > > Kindly note that these errors are all mine. > > Thanks for the review > Hello, To make the CFBot happy, I fixed the compiler errors. I have not included here the change proposal to move the tracing functions to a new file: fe-trace.c or something, so I retained the changes . Maybe Iwata-san can consider the proposal. Currently, both pqTraceInit() and pqTraceUninit() are called only by one function. Summary: 1. I changed the bits32 flags to int flags 2. I assumed INT_MAX value is the former MAX_FRONTEND_MSGS I defined it to solve the compile error. Please tell if the value was wrong. 3. Fixed the doc errors 4. Added freeing of be_msg in pqTraceUninit() 5. Addressed the following comments: > (25) > + conn->fe_msg = > malloc(sizeof(offsetof(pqFrontendMessage, fields)) + > + > DEF_FE_MSGFIELDS * sizeof(pqFrontendMessageField)); > > It's incorrect that offsetof() is nested in sizeof(). See my original > review comment. I removed the initial sizeof(). > (26) > +bool > +pqTraceInit(PGconn *conn, bits32 flags) { > ... > + conn->be_msg->state = LOG_FIRST_BYTE; > + conn->be_msg->length = 0; > + } > ... > + conn->be_msg->state = LOG_FIRST_BYTE; > + conn->be_msg->length = 0; > > The former is redundant? Right. I've removed the former. > (27) > +/* > + * Deallocate frontend-message tracking memory. We only do this > +because > + * it can grow arbitrarily large, and skip it in the initial state, > +because > + * it's likely pointless. > + */ > +void > +pqTraceUninit(PGconn *conn) > +{ > + if (conn->fe_msg && > + conn->fe_msg->num_fields != DEF_FE_MSGFIELDS) > + { > + pfree(conn->fe_msg); > + conn->fe_msg = NULL; > + } > +} > > I thought this function plays the reverse role of pqTraceInit(), but > it only frees memory for frontend messages. Plus, use free() instead > of pfree(), because > malloc() is used to allocate memory. Fixed to use free(). Also added the freeing of be_msg. > (28) > +void PQtrace(PGconn *conn, FILE *stream, bits32 flags); > > bits32 can't be used because it's a data type defined in c.h, which > should not be exposed to applications. I think you can just int or something. I used int. It still works to suppress the timestamp when flags is true. In my environment, when flags is false(0): 2021-01-28 06:26:11.368 > Query 27 "COPY country TO STDOUT" 2021-01-28 06:26:11.368 > Query -1 2021-01-28 06:26:11.369 < CopyOutResponse 13 \x00 #3 #0 #0 #0 2021-01-28 06:26:11.369 < CopyDone4 2021-01-28 06:26:11.369 < CopyDone4 2021-01-28 06:26:11.369 < CopyDone4 2021-01-28 06:26:11.369 < CommandComplete 11 "COPY 0" 2021-01-28 06:26:11.369 < ReadyForQuery 5 2021-01-28 06:26:11.369 > Terminate 4 2021-01-28 06:26:11.369 > Terminate -1 When flags is true(1), running the same query: > Query 27 "COPY country TO STDOUT" > Query -1 < CopyOutResponse 13 \x00 #3 #0 #0 #0 < CopyDone4 < CopyDone4 < CopyDone4 < CommandComplete 11 "COPY 0" < ReadyForQuery 5 > Terminate 4 > Terminate -1 Regards, Kirk Jamison v14-0001-libpq-trace.patch Description: v14-0001-libpq-trace.patch
Re: protect pg_stat_statements_info() for being used without the library loaded
On Thu, Jan 28, 2021 at 08:49:54AM +0800, Julien Rouhaud wrote: > Good catch, and patch looks good to me. This crashes the server, indeed. Looking at all the other modules in the tree, I am not seeing any other hole. This is new as of 9fbc3f3, and I will apply it on HEAD. -- Michael signature.asc Description: PGP signature
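For context, a minimal reproduction sketch of the problem fixed here; the exact error text is an assumption based on the message the other pg_stat_statements functions raise when the library is not preloaded:

-- on a server that does not list pg_stat_statements in shared_preload_libraries
CREATE EXTENSION pg_stat_statements;
SELECT * FROM pg_stat_statements_info();
-- before the fix: the call could crash the backend
-- after the fix:  ERROR:  pg_stat_statements must be loaded via shared_preload_libraries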
Re: protect pg_stat_statements_info() for being used without the library loaded
On 2021/01/28 15:42, Michael Paquier wrote: On Thu, Jan 28, 2021 at 08:49:54AM +0800, Julien Rouhaud wrote: Good catch, and patch looks good to me. This crashes the server, cash. Looking at all the other modules in the tree, I am not seeing any other hole. This is new as of 9fbc3f3, and I will apply it on HEAD. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: pg_upgrade fails with non-standard ACL
On Wed, Jan 27, 2021 at 07:32:42PM +0300, Anastasia Lubennikova wrote: > On 27.01.2021 14:21, Noah Misch wrote: > >On Mon, Jan 25, 2021 at 10:14:43PM +0300, Anastasia Lubennikova wrote: > > > >>1) Could you please clarify, what do you mean by REVOKE failures? > >> > >>I tried following example: > >> > >>Start 9.6 cluster. > >> > >>REVOKE ALL ON function pg_switch_xlog() FROM public; > >>REVOKE ALL ON function pg_switch_xlog() FROM backup; > >> > >>The upgrade to the current master passes with and without patch. > >>It seems that current implementation of pg_upgrade doesn't take into account > >>revoke ACLs. > >I think you can observe it by adding "revoke all on function > >pg_stat_get_subscription from public;" to test_add_acl_to_catalog_objects.sql > >and then rerunning your test script. pg_dump will reproduce that REVOKE, > >which would fail if postgresql.org had removed that function. That's fine, > >so > >long as a comment mentions the limitation. > > In the updated patch, I implemented generation of both GRANT ALL and REVOKE > ALL for problematic objects. If I understand it correctly, these calls will > clean object's ACL completely. And I see no harm in doing this, because the > objects don't exist in the new cluster anyway. > > To test it I attempted to reproduce the problem, using attached > test_revoke.sql in the test. Still, pg_upgrade works fine without any > adjustments. I'd be grateful if you test it some more. test_revoke.sql has "revoke all on function pg_stat_get_subscription() from test", which does nothing. You need something that causes a REVOKE in pg_dump output. Worked example: === shell script set -x createuser test pg_dump | grep -E 'GRANT|REVOKE' psql -Xc 'revoke all on function pg_stat_get_subscription() from test;' psql -Xc 'revoke all on function pg_stat_get_subscription from test;' pg_dump | grep -E 'GRANT|REVOKE' psql -Xc 'revoke all on function pg_stat_get_subscription from public;' pg_dump | grep -E 'GRANT|REVOKE' === output + createuser test + pg_dump + grep -E 'GRANT|REVOKE' + psql -Xc 'revoke all on function pg_stat_get_subscription() from test;' ERROR: function pg_stat_get_subscription() does not exist + psql -Xc 'revoke all on function pg_stat_get_subscription from test;' REVOKE + pg_dump + grep -E 'GRANT|REVOKE' + psql -Xc 'revoke all on function pg_stat_get_subscription from public;' REVOKE + pg_dump + grep -E 'GRANT|REVOKE' REVOKE ALL ON FUNCTION pg_catalog.pg_stat_get_subscription(subid oid, OUT subid oid, OUT relid oid, OUT pid integer, OUT received_lsn pg_lsn, OUT last_msg_send_time timestamp with time zone, OUT last_msg_receipt_time timestamp with time zone, OUT latest_end_lsn pg_lsn, OUT latest_end_time timestamp with time zone) FROM PUBLIC; That REVOKE is going to fail if the upgrade target cluster lacks the function in question, and your test_rename_catalog_objects_v13 does simulate pg_stat_get_subscription not existing.
Re: Single transaction in the tablesync worker?
On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila wrote: > > On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila wrote: > > > > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith wrote: > > > > > > PSA the v18 patch for the Tablesync Solution1. > > > > 7. Have you tested with the new patch the scenario where we crash > > after FINISHEDCOPY and before SYNCDONE, is it able to pick up the > > replication using the new temporary slot? Here, we need to test the > > case where during the catchup phase we have received few commits and > > then the tablesync worker is crashed/errored out? Basically, check if > > the replication is continued from the same point? > > > > I have tested this and it didn't work, see the below example. > > Publisher-side > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); > > BEGIN; > INSERT INTO mytbl1(somedata, text) VALUES (1, 1); > INSERT INTO mytbl1(somedata, text) VALUES (1, 2); > COMMIT; > > CREATE PUBLICATION mypublication FOR TABLE mytbl1; > > Subscriber-side > > - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync > worker stops. > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); > > > CREATE SUBSCRIPTION mysub > CONNECTION 'host=localhost port=5432 dbname=postgres' > PUBLICATION mypublication; > > During debug, stop after we mark FINISHEDCOPY state. > > Publisher-side > > INSERT INTO mytbl1(somedata, text) VALUES (1, 3); > INSERT INTO mytbl1(somedata, text) VALUES (1, 4); > > > Subscriber-side > > - Have a breakpoint in apply_dispatch > - continue in debugger; > - After we replay first commit (which will be for values(1,3), note > down the origin position in apply_handle_commit_internal and somehow > error out. I have forced the debugger to set to the last line in > apply_dispatch where the error is raised. > - After the error, again the tablesync worker is restarted and it > starts from the position noted in the previous step > - It exits without replaying the WAL for (1,4) > > So, on the subscriber-side, you will see 3 records. Fourth is missing. > Now, if you insert more records on the publisher, it will anyway > replay those but the fourth one got missing. > > The temporary slots didn't seem to work because we created again the > new temporary slot after the crash and ask it to start decoding from > the point we noted in origin_lsn. The publisher didn’t hold the > required WAL as our slot was temporary so it started sending from some > later point. We retain WAL based on the slots restart_lsn position and > wal_keep_size. For our case, the positions of the slots will matter > and as we have created temporary slots, there is no way for a > publisher to save that WAL. > > In this particular case, even if the WAL would have been there we only > pass the start_decoding_at position but didn’t pass restart_lsn, so it > picked a random location (current insert position in WAL) which is > ahead of start_decoding_at point so it never sent the required fourth > record. Now, I don’t think it will work even if somehow sent the > correct restart_lsn because of what I wrote earlier that there is no > guarantee that the earlier WAL would have been saved. > > At this point, I can't think of any way to fix this problem except for > going back to the previous approach of permanent slots but let me know > if you have any ideas to salvage this approach? > OK. The latest patch [v21] now restores the permanent slot (and slot cleanup) approach as it was implemented in an earlier version [v17]. 
Please note that this change also re-introduces some potential slot cleanup problems for some race scenarios. These will be addressed by future patches. [v17] https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com [v21] https://www.postgresql.org/message-id/CAHut%2BPvzHRRA_5O0R8KZCb1tVe1mBVPxFtmttXJnmuOmAegoWA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
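As an aside, the slot-level difference driving the switch back to permanent slots can be sketched as follows (the slot names here are made up for illustration and are not the patch's internal naming):

-- a temporary slot vanishes with the session that created it, so after a
-- tablesync worker crash the publisher is free to recycle the WAL the
-- restarted worker still needs; a permanent slot keeps pinning restart_lsn
SELECT pg_create_logical_replication_slot('tablesync_tmp_slot', 'pgoutput', temporary => true);
SELECT pg_create_logical_replication_slot('tablesync_perm_slot', 'pgoutput');
SELECT slot_name, temporary, restart_lsn FROM pg_replication_slots;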
Re: Wrong usage of RelationNeedsWAL
On Thu, Jan 28, 2021 at 12:06:27PM +0900, Kyotaro Horiguchi wrote: > At Wed, 27 Jan 2021 02:48:48 -0800, Noah Misch wrote in > > On Thu, Jan 21, 2021 at 01:23:36AM -0800, Noah Misch wrote: > > > On Thu, Jan 21, 2021 at 06:02:11PM +0900, Kyotaro Horiguchi wrote: > > > > Perhaps I'm missing something, but the patch doesn't pass the v5-0001 > > > > test with wal_level=minimal? > > > > > > Correct. The case we must avoid is letting an old snapshot read an > > > early-pruned page without error. v5-0001 expects "ERROR: snapshot too > > > old". > > > The patch suspends early pruning, so that error is not applicable. > I studied the sto feature further and concluded that the checker side > is fine, in that it always follows the changes of the page LSN. > > So what we can do for the issue is either set a seemingly correct page LSN > at pruning, or refrain from early pruning while we are skipping > WAL. The reason I took the former is that I thought the latter might > be a problem, since early pruning would be postponed by a long-running > WAL-skipping transaction. Yes, that's an accurate summary. > So the patch looks fine to me. The commit message makes sense. > > However, is it OK that the existing tests (modules/snapshot_too_old) > fail when wal_level=minimal? That would not be okay, but I'm not seeing that. How did your setup differ from the following? [nm@rfd 6:1 2021-01-27T23:06:33 postgresql 0]$ cat /nmscratch/minimal.conf log_statement = all wal_level = minimal max_wal_senders = 0 log_line_prefix = '%m [%p %l %x] %q%a ' [nm@rfd 6:1 2021-01-27T23:06:38 postgresql 0]$ make -C src/test/modules/snapshot_too_old check TEMP_CONFIG=/nmscratch/minimal.conf == creating temporary instance== == initializing database system == == starting postmaster== running on port 58080 with PID 2603099 == creating database "isolation_regression" == CREATE DATABASE ALTER DATABASE == running regression test queries== test sto_using_cursor ... ok30168 ms test sto_using_select ... ok24197 ms test sto_using_hash_index ... ok 6089 ms == shutting down postmaster == == removing temporary instance== = All 3 tests passed. =
Re: protect pg_stat_statements_info() for being used without the library loaded
On Thu, Jan 28, 2021 at 03:53:54PM +0900, Fujii Masao wrote: > Thanks! No problem. Applied as of bca96dd. -- Michael signature.asc Description: PGP signature
Re: Add SQL function for SHA1
On Tue, Jan 26, 2021 at 09:53:52PM -0500, Sehrope Sarkuni wrote: > The refactor patch looks good. It builds and passes make check. Thanks for double-checking! The refactoring has been just done as of f854c69. -- Michael signature.asc Description: PGP signature
Re: Two patches to speed up pg_rewind.
On Wed, Jan 27, 2021 at 09:18:48AM +, Paul Guo wrote: > The second one is to use copy_file_range() for the local rewind case to replace > read()+write(). > This introduces a copy_file_range() check and HAVE_COPY_FILE_RANGE so other > code could use copy_file_range() if needed. copy_file_range() was introduced > in newer Linux kernels; on older Linux or other Unix-like OSes mmap() > might be better than read()+write(), but copy_file_range() is more interesting > given that it can skip the data copying in some file systems - this could > benefit even more > on Linux filesystems on network-based block storage. Have you done some measurements? -- Michael signature.asc Description: PGP signature
Re: [PATCH] remove pg_standby
On Wed, Jan 27, 2021 at 05:08:56PM +0900, Fujii Masao wrote: > But one question is; shouldn't we follow "usual" way to retire the > feature instead of dropping that immediately? That is, mark > pg_standby as obsolete, announce that pg_standby will be dropped > after several releases, and then drop pg_standby. This seems safe > because there might be some users. While it's been marked as > obsolete, maybe WAL prefetch feature doesn't work with pg_standby, > but we can live with that because it's obsolete. Thanks. FWIW, at this stage, my take is just to move on and remove it. If we mark that as obsolete, it will stay around forever while annoying future development. -- Michael signature.asc Description: PGP signature
Re: Protect syscache from bloating with negative cache entries
At Wed, 27 Jan 2021 13:11:55 +0200, Heikki Linnakangas wrote in > On 27/01/2021 03:13, Kyotaro Horiguchi wrote: > > At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi > > wrote in > >> The commit 4656e3d668 (debug_invalidate_system_caches_always) > >> conflicted with this patch. Rebased. > > At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi > > wrote in > >> (I found a bug in a benchmark-aid function > >> (CatalogCacheFlushCatalog2), I will repost an updated version soon.) > > I noticed that a catcachebench-aid function > > CatalogCacheFlushCatalog2() allocates the bucket array wrongly in the > > current memory context, which leads to a crash. > > This is a fixed and rebased version. > > Thanks, with the scripts you provided, I was able to run the > performance tests on my laptop, and got very similar results as you > did. > > The impact of v7-0002-Remove-dead-flag-from-catcache-tuple.patch is > very small. I think I could see it in the tests, but only barely. And > the tests did nothing else than do syscache lookups; in any real world > scenario, it would be lost in noise. I think we can put that aside for > now, and focus on v6-0001-CatCache-expiration-feature.patch: I agree with that opinion. But it is a bit disappointing that the long struggle ended up in vain :p > The pruning is still pretty lethargic: > > - Entries created in the same transactions are never pruned away > > - The size of the hash table is never shrunk. So even though the patch > puts a backstop to the hash table growing indefinitely, if you run one > transaction that bloats the cache, it's bloated for the rest of the > session. Right. But more frequent checks impact performance. We could do more aggressive pruning at idle time. > I think that's OK. We might want to be more aggressive in the future, > but for now it seems reasonable to lean towards the current behavior > where nothing is pruned. Although I wonder if we should try to set > 'catcacheclock' more aggressively. I think we could set it whenever > GetCurrentTimestamp() is called, for example. Ah, I hadn't thought in that direction. global_last_acquired_timestamp or some such? > Given how unaggressive this mechanism is, I think it should be safe to > enable it by default. What would be a suitable default for > catalog_cache_prune_min_age? 30 seconds? Without having thought about it in detail, that seems a bit too short. The previously suggested value for this variable was 300-600s. That is, intermittent queries issued at roughly 5-10 minute intervals don't lose cache entries. In a bad case, two queries could alternately remove each other's cache entries: Q1: adds 100 entries <1 minute passed> Q2: adds 100 entries, but a rehash is going to happen at 150 entries and the existing 100 entries added by Q1 are removed. <1 minute passed> Q1: adds 100 entries, but a rehash is going to happen at 150 entries and the existing 100 entries added by Q2 are removed. Or a transaction sequence that persists longer than that many seconds may lose some of its catcache entries. > Documentation needs to be updated for the new GUC. > > Attached is a version with a few little cleanups: > - use TimestampTz instead of uint64 for the timestamps > - remove assign_catalog_cache_prune_min_age(). All it did was convert > the GUC's value from seconds to microseconds, and stored it in a > separate variable. Multiplication is cheap, so we can just do it when > we use the GUC's value instead. Yeah, the latter is a trace of the struggle to cut down CPU cycles in the normal paths. I don't object to doing so.
I found that some comments are apparently stale. cp->cc_oldest_ts is not used anywhere, even though it was added for deciding whether to scan or not. I fixed the following points in the attached. - Removed some comments that are obvious. ("Timestamp in us") - Added a cp->cc_oldest_ts check in CatCacheCleanupOldEntries. - Set the default value for catalog_cache_prune_min_age to 600s. - Added a doc entry for the new GUC in the resource/memory section. - Fixed some code comments. - Adjusted the pruning criterion from (ct->lastaccess < prune_threshold) to <=. I was going to write in the doc something like "you can inspect memory consumption by catalog caches using pg_backend_memory_contexts", but all the memory used by the catalog caches is in CacheMemoryContext. Would it make sense for each catalog cache to have its own context? regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 01d32ffa499ee9185e222fe6fa3d39ad6ac5ff37 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 27 Jan 2021 13:08:08 +0200 Subject: [PATCH v9] CatCache expiration feature Author: Kyotaro Horiguchi --- doc/src/sgml/config.sgml | 20 +++ src/backend/access/transam/xact.c | 3 ++ src/backend/utils/cache/catcache.c | 85 +- src/backend/utils/misc/guc.c | 12 + src/include/utils/catcache.h | 17 ++ 5 files changed, 136 insertions(+), 1 deletion(-) diff --git a/doc/src/sgm
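To make the knobs discussed in this thread concrete, a small usage sketch; the GUC name and the 600s default come from the discussion above, while the assumption that it can be set per session (and takes a time unit) is mine:

-- assumed to behave like other time-unit GUCs
SET catalog_cache_prune_min_age = '600s';
-- as noted above, catalog cache memory currently shows up under
-- CacheMemoryContext rather than per-cache contexts:
SELECT name, used_bytes
  FROM pg_backend_memory_contexts
 WHERE name = 'CacheMemoryContext';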