Re: pg_upgrade fails with non-standard ACL
On Thu, Jan 21, 2021 at 01:03:58AM +0300, Anastasia Lubennikova wrote:
> On 03.01.2021 14:29, Noah Misch wrote:
> > Overall, this patch predicts a subset of cases where pg_dump will emit a
> > failing GRANT or REVOKE that targets a pg_catalog object.  Can you write a
> > code comment stating what it does and does not detect?  I think it's okay
> > to not predict every failure, but let's record the intended coverage.
> > Here are a few examples I know; there may be more.  The above query only
> > finds GRANTs to non-pinned roles.  pg_dump dumps the following, but the
> > above query doesn't find them:
> >
> >   REVOKE ALL ON FUNCTION pg_last_wal_replay_lsn FROM public;
> >   GRANT EXECUTE ON FUNCTION pg_reload_conf() TO pg_signal_backend;

I see a new comment listing object types.  Please also explain the lack of
preventing REVOKE failures (first example query above) and the limitation
around non-pinned roles (second example).

> > The above query should test refclassid.

Please do so.

> +		/* Handle table column objects */
> +		if (strstr(aclinfo->obj_type, "column") != NULL)
> +		{
> +			char	   *name_pos = strstr(aclinfo->obj_ident,
> +										   aclinfo->obj_name);
> +			if (*name_pos == '\"')
> +				name_pos--;

This solves the problem affecting a column named "a.b", but it fails for a
column named "pg_catalog" or "a""b".  I recommend solving this by retrieving
all three elements of the pg_identify_object_as_address array, then quoting
each of them on the client side.
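For illustration, a rough sketch of that retrieval (the function and OIDs
used here are just an example, not taken from the patch):

-- pg_identify_object_as_address() returns the three address parts
-- separately, so each element can be quoted on the client side
-- (e.g. with PQescapeIdentifier()) instead of re-parsing obj_ident.
SELECT type, object_names, object_args
FROM pg_identify_object_as_address('pg_proc'::regclass,
                                   'pg_reload_conf'::regproc, 0);
-- type     | object_names                | object_args
-- function | {pg_catalog,pg_reload_conf} | {}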
Re: Is Recovery actually paused?
On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy wrote:
> On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar wrote:
> > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy wrote:
> > > On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar wrote:
> > > > Please find the patch for the same.  I haven't added a test case for
> > > > this yet.  I mean we can write a test case to pause the recovery and
> > > > get the status.  But I am not sure that we can really write a
> > > > reliable test case for 'pause requested' and 'paused'.
> > >
> > > +1 to just show the recovery pause state in the output of
> > > pg_is_wal_replay_paused.  But, should the function name
> > > "pg_is_wal_replay_paused" be something like
> > > "pg_get_wal_replay_pause_state" or some other?  To me, when "is" exists
> > > in a function, I expect a boolean output.  Others may have better
> > > thoughts.
> > >
> > > IIUC the above change, ensuring the recovery is paused after it's
> > > requested lies with the user.  IMHO, the main problem we are trying to
> > > solve is not met.
> >
> > Basically, earlier there was no way for the user to know whether the
> > recovery is actually paused or not, because it always returned true
> > after a pause was requested.  Now, we will return whether a pause has
> > been requested or recovery is actually paused.  So a tool designer who
> > wants to wait for recovery to get paused can have a loop and wait until
> > the recovery state reaches 'paused'.  That will give better control.
>
> I get it and I agree to have that change.  My point was whether we can
> have a new function pg_wal_replay_pause_and_wait that waits until
> recovery is actually paused (along with pg_is_wal_replay_paused
> returning the actual state rather than a true/false) so that tool
> developers don't need to have the waiting code outside, if at all they
> care about it?  Others may have better thoughts than me.

I think the previous patch was based on that idea, where we thought that we
could pass an argument to pg_is_wal_replay_paused to decide whether to wait
or to return without waiting.  I think this version, which returns the
status instead of waiting, looks better to me.  I am not sure whether we
want another version of pg_wal_replay_pause which will wait for replay to
actually get paused.  I mean there is always scope to include functionality
in the database that could be achieved by the tool, but this patch was
trying to solve the problem that there was no way to know the status, so I
think returning the correct status should be the scope of this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
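For reference, the tool-side wait loop described above could look roughly
like this (a sketch only, assuming the patched pg_is_wal_replay_paused()
returns the pause state as text rather than a boolean):

-- on the standby: request the pause, then poll until replay has
-- actually stopped
SELECT pg_wal_replay_pause();
DO $$
BEGIN
    WHILE pg_is_wal_replay_paused() <> 'paused' LOOP
        PERFORM pg_sleep(0.1);
    END LOOP;
END;
$$;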
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi Mark and others,

On Sun, Jan 24, 2021, at 09:22, Mark Rofail wrote:
> Changelog:
> - v13 (compatible with current master 2021-01-24, commit
>   7e57255f6189380d545e1df6a6b38827b213e3da)
...
> I encourage everyone to review this patch. After considerable reviews
> and performance testing, it will be ready for a commitfest.
...
> Array-ELEMENT-foreign-key-v13.patch

This is awesome, a million thanks for this!

I've tested the patch and tried to use it in the pg_catalog-diff-tool I'm
working on. I found one problem, described in Test #3 below.

*** Test #1 OK: Multi-key FK on (oid, smallint[])

Find a suitable row to do testing on:

joel=# SELECT oid,conrelid,conkey FROM catalog_clone.pg_constraint WHERE cardinality(conkey) > 1 LIMIT 1;
  oid  | conrelid |  conkey
-------+----------+----------
 12112 |     1255 | {2,20,3}
(1 row)

Corrupting the row will not be detected since no FK yet:

joel=# UPDATE catalog_clone.pg_constraint SET conkey = '{2,20,3,1234}' WHERE oid = 12112;
UPDATE 1

Trying to add a FK now will detect the corrupted row:

joel=# ALTER TABLE catalog_clone.pg_constraint ADD FOREIGN KEY (conrelid, EACH ELEMENT OF conkey) REFERENCES catalog_clone.pg_attribute (attrelid, attnum);
ERROR:  insert or update on table "pg_constraint" violates foreign key constraint "pg_constraint_conrelid_conkey_fkey"
DETAIL:  Key (conrelid, EACH ELEMENT OF conkey)=(1255, {2,20,3,1234}) is not present in table "pg_attribute".

OK, good, we got an error.

Fix row and try again:

joel=# UPDATE catalog_clone.pg_constraint SET conkey = '{2,20,3}' WHERE oid = 12112;
UPDATE 1
joel=# ALTER TABLE catalog_clone.pg_constraint ADD FOREIGN KEY (conrelid, EACH ELEMENT OF conkey) REFERENCES catalog_clone.pg_attribute (attrelid, attnum);
ALTER TABLE

OK, good, FK added.

Thanks to the FK, trying to corrupt the column again will now give an error:

joel=# UPDATE catalog_clone.pg_constraint SET conkey = '{2,20,3,1234}' WHERE oid = 12112;
ERROR:  insert or update on table "pg_constraint" violates foreign key constraint "pg_constraint_conrelid_conkey_fkey"
DETAIL:  Key (conrelid, EACH ELEMENT OF conkey)=(1255, {2,20,3,1234}) is not present in table "pg_attribute".

OK, good, we got an error.

*** Test #2 OK: FK on oid[]

Find a suitable row to do testing on:

joel=# \d catalog_clone.pg_proc
 proallargtypes | oid[] |  |  |

joel=# SELECT oid,proallargtypes FROM catalog_clone.pg_proc WHERE cardinality(proallargtypes) > 1 LIMIT 1;
 oid  | proallargtypes
------+----------------
 3059 | {25,2276}
(1 row)

Corrupting the row will not be detected since no FK yet:

joel=# UPDATE catalog_clone.pg_proc SET proallargtypes = '{25,2276,1234}' WHERE oid = 3059;
UPDATE 1

Trying to add a FK now will detect the corrupted row:

joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proallargtypes) REFERENCES catalog_clone.pg_type (oid);
ERROR:  insert or update on table "pg_proc" violates foreign key constraint "pg_proc_proallargtypes_fkey"
DETAIL:  Key (EACH ELEMENT OF proallargtypes)=({25,2276,1234}) is not present in table "pg_type".

OK, good, we got an error.

Fix row and try again:

joel=# UPDATE catalog_clone.pg_proc SET proallargtypes = '{25,2276}' WHERE oid = 3059;
UPDATE 1
joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proallargtypes) REFERENCES catalog_clone.pg_type (oid);
ALTER TABLE

OK, good, FK added.

Thanks to the FK, trying to corrupt the column again will now give an error:

joel=# UPDATE catalog_clone.pg_proc SET proallargtypes = '{25,2276,1234}' WHERE oid = 3059;
ERROR:  insert or update on table "pg_proc" violates foreign key constraint "pg_proc_proallargtypes_fkey"
DETAIL:  Key (EACH ELEMENT OF proallargtypes)=({25,2276,1234}) is not present in table "pg_type".

OK, good, we got an error.

*** Test #3 NOT OK: FK on oidvector

Find a suitable row to do testing on:

joel=# \d catalog_clone.pg_proc
 proargtypes | oidvector |  |  |

joel=# SELECT oid,proargtypes FROM catalog_clone.pg_proc WHERE cardinality(proargtypes) > 1 LIMIT 1;
 oid | proargtypes
-----+-------------
  79 | 19 25
(1 row)

Corrupting the row will not be detected since no FK yet:

joel=# UPDATE catalog_clone.pg_proc SET proargtypes = '19 25 12345'::oidvector WHERE oid = 79;
UPDATE 1

Trying to add a FK now will detect the corrupted row:

joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proargtypes) REFERENCES catalog_clone.pg_type (oid);
ERROR:  insert or update on table "pg_proc" violates foreign key constraint "pg_proc_proargtypes_fkey"
DETAIL:  Key (EACH ELEMENT OF proargtypes)=(19 25 12345) is not present in table "pg_type".

OK, good, we got an error.

Fix row and try again:

joel=# UPDATE catalog_clone.pg_proc SET proargtypes = '19 25'::oidvector WHERE oid = 79;
UPDATE 1
joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proargtypes) REFERENCES catalog_clone.pg_type (oid);
ALTER TABLE

OK, good,
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi Masahiko:

On Fri, Jan 22, 2021 at 9:15 PM Masahiko Sawada wrote:
> Hi Andy,
>
> On Mon, Dec 7, 2020 at 9:15 PM Andy Fan wrote:
> > On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen <jesper.peder...@redhat.com> wrote:
> > > Hi,
> > >
> > > On 12/5/20 10:38 PM, Andy Fan wrote:
> > > > Currently the UniqueKey is defined as a List of Expr, rather than
> > > > EquivalenceClasses.  A complete discussion until now can be found at
> > > > [1] (the messages I replied to also matter a lot and the information
> > > > there is complete).  This patch has stopped at this place for a
> > > > while; I'm planning to try EquivalenceClasses, but any suggestion
> > > > would be welcome.
> > >
> > > Unfortunately I think we need a RfC style patch of both versions in
> > > their minimum implementation.
> > >
> > > Hopefully this will make it easier for one or more committers to
> > > decide on the right direction since they can do a side-by-side
> > > comparison of the two solutions.
> >
> > I get the exact same idea.  Actually I made EquivalenceClasses work
> > with baserel last weekend, and then I realized it is hard to compare
> > the 2 situations without looking into the real/PoC code, even for very
> > experienced people.  I will submit a new patch after I get the
> > partitioned relation and subquery cases working.  Hope I can make it in
> > one week.
>
> Status update for a commitfest entry.
>
> Are you planning to submit a new patch?  Or is there any blocker for
> this work?  This patch entry on the CF app has been in state Waiting on
> Author for a while.  If there is any update on that, please reflect it
> on the CF app.

I agree that the current status is "Waiting on author", and there is no
blocking issue for others.  I plan to work on this in one month; I have to
get my current urgent case completed first.  Sorry for the delayed action
and thanks for asking.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)
Extend more usecase for planning time partition pruning and init partition pruning.
Hi:

I recently found a use case like this:

    SELECT * FROM p, q WHERE p.partkey = q.colx AND (q.colx = $1 OR q.colx = $2);

Here we can do neither planning-time partition pruning nor init partition
pruning.  Even though run-time partition pruning works in the end, it is too
late in some cases since we have to initialize all the plan nodes in
advance.  In my case, there are 10+ partitioned relations in one query and
the execution time is short, so the cost of initializing that many plan
nodes matters a lot.

The attached patches fix this issue.  They just pick up the "p.partkey =
q.colx" case from root->eq_classes or rel->joinlist (outer join), and then
check if there is some baserestrictinfo in the other relation which can be
used for partition pruning.  To make things easier, both partkey and colx
must be Var expressions in the implementation.

- v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch

  Just some existing refactoring, plus extending ChangeVarNodes to be able
  to change var->attno.

- v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch

  Does the real job.

Thoughts?

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)

v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch
Description: Binary data

v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch
Description: Binary data
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On Sun, Jan 24, 2021, at 11:21, Mark Rofail wrote:
> This seems to be a type casting problem indeed. The coercion part of the
> patch found in "ruleutils.c:11438-11491" is the culprit, is there a cleaner
> way to achieve this?
> I am aware of the problem and will try to fix it, but I assume this is
> because of "Array-containselem-gin-v1.patch" can you give the v13 patch a
> try alone?
> I will work on a fix and send it soon.

Actually, the error occurred without the "Array-containselem-gin-v1.patch".
However, I also had the
"v3-0001-Add-primary-keys-and-unique-constraints-to-system.patch", to get
PKs on the oids. So, I've retested to ensure it didn't cause this problem,
this time with only 7e57255f6189380d545e1df6a6b38827b213e3da +
"Array-ELEMENT-foreign-key-v13.patch", but I still get the same error.

Exact steps to reproduce the error from a clean installation:

$ cd postgresql
$ git checkout 7e57255f6189380d545e1df6a6b38827b213e3da
$ patch -p1 < ~/Array-ELEMENT-foreign-key-v13.patch
$ ./configure --prefix=~/pg-head
$ make -j16
$ make install
$ initdb -D ~/pghead
$ pg_ctl -D ~/pghead -l /tmp/logfile start
$ createdb $USER
$ psql

CREATE SCHEMA catalog_clone;
CREATE TABLE catalog_clone.pg_proc AS SELECT * FROM pg_catalog.pg_proc;
CREATE TABLE catalog_clone.pg_type AS SELECT * FROM pg_catalog.pg_type;
ALTER TABLE catalog_clone.pg_type ADD PRIMARY KEY (oid);
ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proargtypes) REFERENCES catalog_clone.pg_type (oid);
UPDATE catalog_clone.pg_proc SET proargtypes = '19 25 12345'::oidvector WHERE oid = 79;
ERROR:  operator does not exist: oidvector pg_catalog.@> oid[]
LINE 1: ... 1 FROM ONLY "catalog_clone"."pg_type" x WHERE $1 OPERATOR(p...
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "catalog_clone"."pg_type" x WHERE $1 OPERATOR(pg_catalog.@>) ARRAY["oid"] FOR KEY SHARE OF x) z)

/Joel
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hello Joel,

On Sun, Jan 24, 2021 at 12:11 PM Joel Jacobson wrote:
> HINT:  No operator matches the given name and argument types. You might
> need to add explicit type casts.
> QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM
> pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT
> pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "catalog_clone"."pg_type" x
> WHERE $1 OPERATOR(pg_catalog.@>) ARRAY["oid"] FOR KEY SHARE OF x) z)
>
> It seems to me there is some type conversion between oidvector and oid[]
> that isn't working properly?

This seems to be a type casting problem indeed. The coercion part of the
patch, found in "ruleutils.c:11438-11491", is the culprit; is there a
cleaner way to achieve this?

I am aware of the problem and will try to fix it, but I assume this is
because of "Array-containselem-gin-v1.patch". Can you give the v13 patch a
try alone? I will work on a fix and send it soon.

/Mark
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi again,

I found a similar problem with int2vector columns:

CREATE TABLE catalog_clone.pg_index AS SELECT * FROM pg_catalog.pg_index;
CREATE TABLE catalog_clone.pg_attribute AS SELECT attrelid,attnum FROM pg_catalog.pg_attribute;
ALTER TABLE catalog_clone.pg_attribute ADD UNIQUE (attrelid, attnum);
ALTER TABLE catalog_clone.pg_index ADD FOREIGN KEY (indrelid, EACH ELEMENT OF indkey) REFERENCES catalog_clone.pg_attribute (attrelid, attnum);

SELECT indexrelid,indrelid,indkey FROM catalog_clone.pg_index WHERE cardinality(indkey) > 1 LIMIT 1;
 indexrelid | indrelid | indkey
------------+----------+--------
       2837 |     2836 | 1 2
(1 row)

UPDATE catalog_clone.pg_index SET indkey = '1 2 12345'::int2vector WHERE indexrelid = 2837;
ERROR:  operator does not exist: int2vector pg_catalog.@> smallint[]
LINE 1: ...WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(p...
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM pg_catalog.unnest($2) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "catalog_clone"."pg_attribute" x WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(pg_catalog.@>) ARRAY["attnum"] FOR KEY SHARE OF x) z)

This made me wonder if there are any more of these "vector" columns than the
oidvector and int2vector. It appears they are the only two:

SELECT DISTINCT data_type, udt_name FROM information_schema.columns WHERE udt_name LIKE '%vector' ORDER BY 1,2;
 data_type |  udt_name
-----------+------------
 ARRAY     | int2vector
 ARRAY     | oidvector
(2 rows)

/Joel
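As a side note on a possible direction for the coercion fix (an assumption
on my part, not something verified against the patch): both vector types
have built-in binary-coercible casts to the corresponding array types, so
the generated RI query could apply those casts explicitly:

-- these casts exist in core, so an explicit cast makes @> applicable
SELECT '19 25'::oidvector::oid[] @> ARRAY[19::oid, 25::oid];    -- true
SELECT '1 2'::int2vector::int2[] @> ARRAY[1::int2, 2::int2];    -- true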
Re: simplifying foreign key/RI checks
On Sun, Jan 24, 2021 at 11:26 AM Corey Huinker wrote: > On Sat, Jan 23, 2021 at 12:52 PM Zhihong Yu wrote: >> >> Hi, Thanks for the review. >> + for (i = 0; i < riinfo->nkeys; i++) >> + { >> + Oid eq_opr = eq_oprs[i]; >> + Oid typeid = RIAttType(fk_rel, riinfo->fk_attnums[i]); >> + RI_CompareHashEntry *entry = ri_HashCompareOp(eq_opr, typeid); >> + >> + if (pk_nulls[i] != 'n' && >> OidIsValid(entry->cast_func_finfo.fn_oid)) >> >> It seems the pk_nulls[i] != 'n' check can be lifted ahead of the assignment >> to the three local variables. That way, ri_HashCompareOp wouldn't be called >> when pk_nulls[i] == 'n'. Good idea, so done. Although, there can't be nulls right now. >> + case TM_Updated: >> + if (IsolationUsesXactSnapshot()) >> ... >> + case TM_Deleted: >> + if (IsolationUsesXactSnapshot()) >> >> It seems the handling for TM_Updated and TM_Deleted is the same. The cases >> for these two values can be put next to each other (saving one block of >> code). Ah, yes. The TM_Updated case used to be handled a bit differently in earlier unposted versions of the patch, though at some point I concluded that the special handling was unnecessary, but didn't realize what you just pointed out. Fixed. > I'll pause on reviewing v4 until you've addressed the suggestions above. Here's v5. -- Amit Langote EDB: http://www.enterprisedb.com v5-0002-Avoid-using-SPI-for-some-RI-checks.patch Description: Binary data v5-0001-Export-get_partition_for_tuple.patch Description: Binary data
Re: POC: postgres_fdw insert batching
On Sun, Jan 24, 2021 at 2:17 AM Tomas Vondra wrote: > On 1/23/21 9:31 AM, Amit Langote wrote: > > I was looking at this and it looks like we've got a problematic case > > where postgresGetForeignModifyBatchSize() is called from > > ExecInitRoutingInfo(). > > > > That case is when the insert is performed as part of a cross-partition > > update of a partitioned table containing postgres_fdw foreign table > > partitions. Because we don't check the operation in > > ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such > > inserts attempt to use batching. However the ResultRelInfo may be one > > for the original update operation, so ri_FdwState contains a > > PgFdwModifyState with batch_size set to 0, because updates don't > > support batching. As things stand now, > > postgresGetForeignModifyBatchSize() simply returns that, tripping the > > following Assert in the caller. > > > > Assert(partRelInfo->ri_BatchSize >= 1); > > > > Use this example to see the crash: > > > > create table p (a int) partition by list (a); > > create table p1 (like p); > > create extension postgres_fdw; > > create server lb foreign data wrapper postgres_fdw ; > > create user mapping for current_user server lb; > > create foreign table fp1 (a int) server lb options (table_name 'p1'); > > alter table p attach partition fp1 for values in (1); > > create or replace function report_trig_details() returns trigger as $$ > > begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op = > > 'DELETE' then return old; end if; return new; end; $$ language > > plpgsql; > > create trigger trig before update on fp1 for each row execute function > > report_trig_details(); > > create table p2 partition of p for values in (2); > > insert into p values (2); > > update p set a = 1; -- crashes > > > > So we let's check mtstate->operation == CMD_INSERT in > > ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize() > > in cross-update situations where mtstate->operation would be > > CMD_UPDATE. > > > > I've attached a patch. > > Thanks for catching this. I think it'd be good if the fix included a > regression test. The example seems like a good starting point, not sure > if it can be simplified further. Yes, it can be simplified by using a local join to prevent the update of the foreign partition from being pushed to the remote server, for which my example in the previous email used a local trigger. Note that the update of the foreign partition to be done locally is a prerequisite for this bug to occur. I've added that simplified test case in the attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com v2-0001-Prevent-FDW-insert-batching-during-cross-partitio.patch Description: Binary data
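For reference, the simplified test takes roughly this shape (an illustrative
sketch reusing the tables from the earlier example, not necessarily the
exact test added to the patch):

-- joining the UPDATE against a local table prevents the foreign
-- partition's update from being pushed to the remote server, so the
-- cross-partition move goes through the local insert (routing) path;
-- without the fix, this tripped the Assert
CREATE TABLE loc (a int);
INSERT INTO loc VALUES (2);
UPDATE p SET a = 1 FROM loc WHERE p.a = loc.a;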
Re: mkid reference
On Fri, Jan 22, 2021 at 7:07 PM Tom Lane wrote: > > Daniel Gustafsson writes: > >> On 22 Jan 2021, at 12:56, Magnus Hagander wrote: > >> And maybe even more interestnig -- is there a point to this whole > >> make_diff directory at all in these days of git? Or should we just > >> remove it rather than try to fix it? > > > There's also src/tools/make_mkid which use this mkid tool. +1 for removing. > > If anything, it seems better replaced by extended documentation on the > > existing > > wiki article [0] on how to use "git format-patch". > > I found man pages for mkid online --- it's apparently a ctags-like > code indexing tool, not something for patches. So maybe Bruce still > uses it, or maybe not. But as long as we've also got make_ctags and > make_etags in there, I don't have a problem with leaving make_mkid. > > make_diff, on the other hand, certainly looks like technology whose > time has passed. I wonder about pgtest, too. I'll go kill make_diff then -- quicker than fixing the docs of it. As for pgtest, that one looks a bit interesting as well -- but it's been patched on as late as 9.5 and in 2018, so it seems at least Bruce uses it :) While at it, what point is "codelines" adding? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On 2021-Jan-24, Julien Rouhaud wrote:

> +	/*
> +	 * Do not allow tuples with invalid combinations of hint bits to be placed
> +	 * on a page.  These combinations are detected as corruption by the
> +	 * contrib/amcheck logic, so if you disable one or both of these
> +	 * assertions, make corresponding changes there.
> +	 */
> +	Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> +			 (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
>
> I attach a simple self contained script to reproduce the problem, the last
> UPDATE triggering the Assert.
>
> I'm not really familiar with this part of the code, so it's not exactly
> clear to me if some logic is missing in compute_new_xmax_infomask() /
> heap_prepare_insert(), or if this should actually be an allowed combination
> of hint bits.

Hmm, it's probably a bug in compute_new_xmax_infomask.  I don't think
the combination is sensible.

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"There is evil in the world. There are dark, awful things. Occasionally, we
get a glimpse of them. But there are dark corners; horrors almost impossible
to imagine... even in our worst nightmares." (Van Helsing, Dracula A.D. 1972)
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On Sun, 24 Jan 2021 at 11:48, Julien Rouhaud wrote:
>
> Hi,
>
> While working on pg14 compatibility for an extension relying on an
> apparently uncommon combination of FOR UPDATE and stored function calls, I
> hit some new Asserts introduced in 866e24d47db (Extend amcheck to check
> heap pages):
>
> +	/*
> +	 * Do not allow tuples with invalid combinations of hint bits to be placed
> +	 * on a page.  These combinations are detected as corruption by the
> +	 * contrib/amcheck logic, so if you disable one or both of these
> +	 * assertions, make corresponding changes there.
> +	 */
> +	Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> +			 (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
>
> I attach a simple self contained script to reproduce the problem, the last
> UPDATE triggering the Assert.
>
> I'm not really familiar with this part of the code, so it's not exactly
> clear to me if some logic is missing in compute_new_xmax_infomask() /
> heap_prepare_insert(), or if this should actually be an allowed combination
> of hint bits.

Thanks Julien for reporting this. I am also able to reproduce this assert.

*Small test case to reproduce:*

> DROP TABLE IF EXISTS t1;
> CREATE TABLE t1(id integer, val text);
> INSERT INTO t1 SELECT i, 'val' FROM generate_series(1, 2) i;
>
> BEGIN;
> SAVEPOINT s1;
> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE;
> UPDATE t1 SET val = 'hoho' WHERE id = 2;
> release s1;
> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE;
> UPDATE t1 SET val = 'hoho' WHERE id = 2;

If we remove the "release s1;" step from the test case, then we are not
getting this assert failure.

*Stack trace:*

> warning: Unexpected size of section `.reg-xstate/123318' in core file.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `postgres: mahendrathalor postgres [local] UPDATE '.
> Program terminated with signal SIGABRT, Aborted.
>
> warning: Unexpected size of section `.reg-xstate/123318' in core file.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> 51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 > #1 0x7fb50a7c88b1 in __GI_abort () at abort.c:79 > #2 0x5612a63f7c84 in ExceptionalCondition ( > conditionName=0x5612a64da470 "!((tuple->t_data->t_infomask & > HEAP_XMAX_LOCK_ONLY) && (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED))", > errorType=0x5612a64da426 "FailedAssertion", fileName=0x5612a64da420 > "hio.c", lineNumber=57) at assert.c:69 > #3 0x5612a597e76b in RelationPutHeapTuple (relation=0x7fb50ce37fc0, > buffer=163, tuple=0x5612a795de18, token=false) at hio.c:56 > #4 0x5612a5955d32 in heap_update (relation=0x7fb50ce37fc0, > otid=0x7ffc8b5e30d2, newtup=0x5612a795de18, cid=0, crosscheck=0x0, > wait=true, tmfd=0x7ffc8b5e3060, > lockmode=0x7ffc8b5e3028) at heapam.c:3791 > #5 0x5612a596ebdc in heapam_tuple_update (relation=0x7fb50ce37fc0, > otid=0x7ffc8b5e30d2, slot=0x5612a794d348, cid=3, snapshot=0x5612a793d620, > crosscheck=0x0, wait=true, > tmfd=0x7ffc8b5e3060, lockmode=0x7ffc8b5e3028, > update_indexes=0x7ffc8b5e3025) at heapam_handler.c:327 > #6 0x5612a5da745d in table_tuple_update (rel=0x7fb50ce37fc0, > otid=0x7ffc8b5e30d2, slot=0x5612a794d348, cid=3, snapshot=0x5612a793d620, > crosscheck=0x0, wait=true, > tmfd=0x7ffc8b5e3060, lockmode=0x7ffc8b5e3028, > update_indexes=0x7ffc8b5e3025) at ../../../src/include/access/tableam.h:1422 > #7 0x5612a5dab6ef in ExecUpdate (mtstate=0x5612a794bc20, > resultRelInfo=0x5612a794be58, tupleid=0x7ffc8b5e30d2, oldtuple=0x0, > slot=0x5612a794d348, planSlot=0x5612a794d1f8, > epqstate=0x5612a794bd18, estate=0x5612a794b9b0, canSetTag=true) at > nodeModifyTable.c:1498 > #8 0x5612a5dadb17 in ExecModifyTable (pstate=0x5612a794bc20) at > nodeModifyTable.c:2254 > #9 0x5612a5d4fdc5 in ExecProcNodeFirst (node=0x5612a794bc20) at > execProcnode.c:450 > #10 0x5612a5d3bd3a in ExecProcNode (node=0x5612a794bc20) at > ../../../src/include/executor/executor.h:247 > #11 0x5612a5d40764 in ExecutePlan (estate=0x5612a794b9b0, > planstate=0x5612a794bc20, use_parallel_mode=false, operation=CMD_UPDATE, > sendTuples=false, numberTuples=0, > direction=ForwardScanDirection, dest=0x5612a7946d38, > execute_once=true) at execMain.c:1542 > #12 0x5612a5d3c8e2 in standard_ExecutorRun (queryDesc=0x5612a79468c0, > direction=ForwardScanDirection, count=0, execute_once=true) at > execMain.c:364 > #13 0x5612a5d3c5aa in ExecutorRun (queryDesc=0x5612a79468c0, > direction=ForwardScanDirection, count=0, execute_once=true) at > execMain.c:308 > #14 0x5612a612d78a in ProcessQuery (plan=0x5612a7946c48, > sourceText=0x5612a787b570 "UPDATE t1 SET val = 'hoho' WHERE id = 2;", > params=0x0, queryEnv=0
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On Mon, Jan 25, 2021 at 12:01 AM Alvaro Herrera wrote:
>
> On 2021-Jan-24, Julien Rouhaud wrote:
>
> > +	/*
> > +	 * Do not allow tuples with invalid combinations of hint bits to be placed
> > +	 * on a page.  These combinations are detected as corruption by the
> > +	 * contrib/amcheck logic, so if you disable one or both of these
> > +	 * assertions, make corresponding changes there.
> > +	 */
> > +	Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > +			 (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> >
> > I attach a simple self contained script to reproduce the problem, the
> > last UPDATE triggering the Assert.
> >
> > I'm not really familiar with this part of the code, so it's not exactly
> > clear to me if some logic is missing in compute_new_xmax_infomask() /
> > heap_prepare_insert(), or if this should actually be an allowed
> > combination of hint bits.
>
> Hmm, it's probably a bug in compute_new_xmax_infomask.  I don't think
> the combination is sensible.

Yeah, the combination clearly doesn't make sense, but I'm wondering what to
do about existing data?  amcheck's verify_heapam() will report corruption
for those tuples, and at least all servers where the powa-archivist
extension is installed will be impacted.
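For anyone wanting to check an existing cluster, contrib/amcheck can be used
like this (a minimal example; the table name is a placeholder):

CREATE EXTENSION amcheck;
SELECT blkno, offnum, msg FROM verify_heapam('t1');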
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On Mon, Jan 25, 2021 at 12:06 AM Mahendra Singh Thalor wrote: > > On Sun, 24 Jan 2021 at 11:48, Julien Rouhaud wrote: > > > > Hi, > > > > I'm not really familiar with this part of the code, so it's not exactly > > clear > > to me if some logic is missing in compute_new_xmax_infomask() / > > heap_prepare_insert(), or if this should actually be an allowed combination > > of > > hint bit. > > Thanks Juliean for reporting this. I am also able to reproduce this assert. Thanks for looking at it! > > Small test case to reproduce: >> >> DROP TABLE IF EXISTS t1; >> CREATE TABLE t1(id integer, val text); >> INSERT INTO t1 SELECT i, 'val' FROM generate_series(1, 2) i; >> >> BEGIN; >> SAVEPOINT s1; >> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE; >> UPDATE t1 SET val = 'hoho' WHERE id = 2; >> release s1; >> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE; >> UPDATE t1 SET val = 'hoho' WHERE id = 2; > > > If we remove the "release s1;" step from the test case, then we are not > getting this assert failure. Yes, this is the smallest reproducer that could trigger the problem, and the release is required.
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hello Joel,

> UPDATE catalog_clone.pg_index SET indkey = '1 2 12345'::int2vector WHERE
> indexrelid = 2837;
> ERROR:  operator does not exist: int2vector pg_catalog.@> smallint[]
> LINE 1: ...WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(p...

In your example, you are using the notation '1 2 12345'::int2vector which
signifies a vector, as you have mentioned. However, in this patch we use the
array operator @> to help with the indexing; the incompatibility stems from
there. I do not think that postgres contains vector operators, correct me if
I am wrong. I feel that supporting vectors is out of the scope of this
patch; if you have an idea how to support it, please let me know.

/Mark
Re: Stronger safeguard for archive recovery not to miss data
On Thu, 2021-01-21 at 15:30 +0100, I wrote:
> On Thu, 2021-01-21 at 13:09 +0000, osumi.takami...@fujitsu.com wrote:
> > > My vote is that we should not have a GUC for such an unlikely event,
> > > and that stopping recovery is good enough.
> > OK. IIUC, my current patch for this fix doesn't need to be changed or
> > withdrawn.  Thank you for your explanation.
>
> Well, that's just my opinion.
>
> Fujii Masao seemed to disagree with the patch, and his voice carries weight.

I think you should post another patch where the second, now superfluous,
error message is removed.

Yours,
Laurenz Albe
RE: Parallel INSERT (INTO ... SELECT ...)
Hello Greg-san,

Second group of comments (I'll reply to (1) - (4) later):

(5)
@@ -790,7 +790,8 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
...
-	if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
+	if ((plannedstmt->commandType != CMD_SELECT &&
+		 !IsModifySupportedInParallelMode(plannedstmt->commandType)) || plannedstmt->hasModifyingCTE)
 		PreventCommandIfParallelMode(CreateCommandName((Node *) plannedstmt));
 }

Now that we're trying to allow parallel writes (INSERT), we should:

* use ExecCheckXactReadOnly() solely for checking read-only transactions, as
  the function name represents.  That is, move the call to
  PreventCommandIfParallelMode() up to standard_ExecutorStart().

* Update the comment above the call to ExecCheckXactReadOnly().

(6)
@@ -764,6 +777,22 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
...
+	else
+	{
+		pei->processed_count = NULL;
+	}

The braces can be deleted.

(7)
@@ -1400,6 +1439,16 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 											 true);
 	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
+	Assert(queryDesc->operation == CMD_SELECT || IsModifySupportedInParallelMode(queryDesc->operation));
+	if (IsModifySupportedInParallelMode(queryDesc->operation))
+	{
+		/*
+		 * Record that the CurrentCommandId is used, at the start of the
+		 * parallel operation.
+		 */
+		SetCurrentCommandIdUsedForWorker();
+	}
+
 	/* Setting debug_query_string for individual workers */
 	debug_query_string = queryDesc->sourceText;

@@ -765,12 +779,16 @@ GetCurrentCommandId(bool used)
 	if (used)
 	{
 		/*
-		 * Forbid setting currentCommandIdUsed in a parallel worker, because
-		 * we have no provision for communicating this back to the leader.  We
-		 * could relax this restriction when currentCommandIdUsed was already
-		 * true at the start of the parallel operation.
+		 * If in a parallel worker, only allow setting currentCommandIdUsed if
+		 * currentCommandIdUsed was already true at the start of the parallel
+		 * operation (by way of SetCurrentCommandIdUsed()), otherwise forbid
+		 * setting currentCommandIdUsed because we have no provision for
+		 * communicating this back to the leader.  Once currentCommandIdUsed is
+		 * set, the commandId used by leader and workers can't be changed,
+		 * because CommandCounterIncrement() then prevents any attempted
+		 * increment of the current commandId.
		 */
-		Assert(!IsParallelWorker());
+		Assert(!(IsParallelWorker() && !currentCommandIdUsed));
 		currentCommandIdUsed = true;
 	}
 	return currentCommandId;

What happens without these changes?  If this kind of change is really
necessary, it seems more natural to pass currentCommandIdUsed together with
currentCommandId through SerializeTransactionState() and
StartParallelWorkerTransaction(), instead of the above changes.

As an aside, SetCurrentCommandIdUsed() in the comment should be
SetCurrentCommandIdUsedForWorker().

(8)
+			/*
+			 * If the trigger type is RI_TRIGGER_FK, this indicates a FK exists in
+			 * the relation, and this would result in creation of new CommandIds
+			 * on insert/update/delete and this isn't supported in a parallel
+			 * worker (but is safe in the parallel leader).
+			 */
+			trigtype = RI_FKey_trigger_type(trigger->tgfoid);
+			if (trigtype == RI_TRIGGER_FK)
+			{
+				if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
+					return true;
+			}

Here, RI_TRIGGER_FK should instead be RI_TRIGGER_PK, because RI_TRIGGER_FK
triggers do not generate command IDs.  See RI_FKey_check() which is called
in the RI_TRIGGER_FK case.  In there, ri_PerformCheck() is called with the
detectNewRows argument set to false, which causes CommandCounterIncrement()
to not be called.

Plus, tables that have RI_TRIGGER_PK should allow parallel INSERT in a
parallel-safe manner, because those triggers only fire for UPDATE and
DELETE.  So, for the future parallel UPDATE/DELETE support, the above check
should be performed in the UPDATE and DELETE cases.

(In a data warehouse, fact tables, which store large amounts of historical
data, typically have foreign keys to smaller dimension tables.  Thus, it's
important to allow parallel INSERTs on tables with foreign keys.)

Regards
Takayuki Tsunakawa
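To illustrate the fact/dimension shape mentioned above (a hypothetical
sketch, not taken from the patch): an INSERT into the referencing table only
runs the RI_TRIGGER_FK check, which creates no new command IDs, so it should
be eligible for parallel insert; the RI_TRIGGER_PK triggers sit on the
referenced table and fire only when that table is updated or deleted from.

-- dim carries the RI_TRIGGER_PK triggers (fired by UPDATE/DELETE on dim);
-- fact carries the RI_TRIGGER_FK check trigger (fired by this INSERT)
CREATE TABLE dim  (id int PRIMARY KEY);
CREATE TABLE fact (dim_id int REFERENCES dim (id), payload text);
INSERT INTO dim VALUES (1);
INSERT INTO fact                 -- the candidate for parallel INSERT
    SELECT 1, 'x' FROM generate_series(1, 1000000);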
Re: About to add WAL write/fsync statistics to pg_stat_wal view
Hi, Japin Thanks for your comments. On 2021-01-23 01:46, japin wrote: Hi, Masahiro Thanks for you update the v4 patch. Here are some comments: (1) + char*msg = NULL; + boolsync_called;/* whether to sync data to the disk. */ + instr_time start; + instr_time duration; + + /* check whether to sync data to the disk is really occurred. */ + sync_called = false; Maybe we can initialize the "sync_called" variable when declare it. Yes, I fixed it. (2) + if (sync_called) + { + /* increment the i/o timing and the number of times to fsync WAL data */ + if (track_wal_io_timing) + { + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); + } + + WalStats.m_wal_sync++; + } There is an extra space before INSTR_TIME_GET_MICROSEC(duration). Yes, I removed it. In the issue_xlog_fsync(), the comment says that if sync_method is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced. Does that mean it synced when write the WAL data? And for those cases, we cannot get accurate write/sync timing and number of write/sync times, right? case SYNC_METHOD_OPEN: case SYNC_METHOD_OPEN_DSYNC: /* write synced it already */ break; Yes, I add the following comments in the document. @@ -3515,6 +3515,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i Total number of times WAL data was synced to disk + (if is open_datasync or + open_sync, this value is zero because WAL data is synced + when to write it). @@ -3525,7 +3528,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i Total amount of time that has been spent in the portion of WAL data was synced to disk, in milliseconds - (if is enabled, otherwise zero) + (if is enabled, otherwise zero. + if is open_datasync or + open_sync, this value is zero too because WAL data is synced + when to write it). I attached a modified patch. Regards, -- Masahiro Ikeda NTT DATA CORPORATIONFrom ee1b7d17391b9d9619f709afeacdd118973471d6 Mon Sep 17 00:00:00 2001 From: Masahiro Ikeda Date: Fri, 22 Jan 2021 21:38:31 +0900 Subject: [PATCH] Add statistics related to write/sync wal records. This patch adds following statistics to pg_stat_wal view to track WAL I/O activity. - the total number of writing/syncing WAL data. - the total amount of time that has been spent in writing/syncing WAL data. Since to track I/O timing may leads significant overhead, GUC parameter "track_wal_io_timing" is introduced. Only if this is on, the I/O timing is measured. The statistics related to sync are zero when "wal_sync_method" is "open_datasync" or "open_sync", because it leads syncing WAL data at same time when to write it. 
Author: Masahiro Ikeda Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75...@oss.nttdata.com (This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID) --- doc/src/sgml/config.sgml | 21 +++ doc/src/sgml/monitoring.sgml | 48 ++- src/backend/access/transam/xlog.c | 59 ++- src/backend/catalog/system_views.sql | 4 ++ src/backend/postmaster/pgstat.c | 4 ++ src/backend/postmaster/walwriter.c| 3 + src/backend/utils/adt/pgstatfuncs.c | 24 +++- src/backend/utils/misc/guc.c | 9 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 14 ++--- src/include/pgstat.h | 8 +++ src/test/regress/expected/rules.out | 6 +- 13 files changed, 189 insertions(+), 13 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 82864bbb24..43f3fbcaf8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7416,6 +7416,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + track_wal_io_timing (boolean) + + track_wal_io_timing configuration parameter + + + + +Enables timing of WAL I/O calls. This parameter is off by default, +because it will repeatedly query the operating system for +the current time, which may cause significant overhead on some +platforms. You can use the tool to +measure the overhead of timing on your system. +I/O timing information is +displayed in
Re: simplifying foreign key/RI checks
On Sun, Jan 24, 2021 at 6:51 AM Amit Langote wrote: > On Sun, Jan 24, 2021 at 11:26 AM Corey Huinker > wrote: > > On Sat, Jan 23, 2021 at 12:52 PM Zhihong Yu wrote: > >> > >> Hi, > > Thanks for the review. > > >> + for (i = 0; i < riinfo->nkeys; i++) > >> + { > >> + Oid eq_opr = eq_oprs[i]; > >> + Oid typeid = RIAttType(fk_rel, riinfo->fk_attnums[i]); > >> + RI_CompareHashEntry *entry = ri_HashCompareOp(eq_opr, > typeid); > >> + > >> + if (pk_nulls[i] != 'n' && > OidIsValid(entry->cast_func_finfo.fn_oid)) > >> > >> It seems the pk_nulls[i] != 'n' check can be lifted ahead of the > assignment to the three local variables. That way, ri_HashCompareOp > wouldn't be called when pk_nulls[i] == 'n'. > > Good idea, so done. Although, there can't be nulls right now. > > >> + case TM_Updated: > >> + if (IsolationUsesXactSnapshot()) > >> ... > >> + case TM_Deleted: > >> + if (IsolationUsesXactSnapshot()) > >> > >> It seems the handling for TM_Updated and TM_Deleted is the same. The > cases for these two values can be put next to each other (saving one block > of code). > > Ah, yes. The TM_Updated case used to be handled a bit differently in > earlier unposted versions of the patch, though at some point I > concluded that the special handling was unnecessary, but didn't > realize what you just pointed out. Fixed. > > > I'll pause on reviewing v4 until you've addressed the suggestions above. > > Here's v5. > v5 patches apply to master. Suggested If/then optimization is implemented. Suggested case merging is implemented. Passes make check and make check-world yet again. Just to confirm, we *don't* free the RI_CompareHashEntry because it points to an entry in a hash table which is TopMemoryContext aka lifetime of the session, correct? Anybody else want to look this patch over before I mark it Ready For Committer?
Re: Is Recovery actually paused?
On Sun, Jan 17, 2021 at 5:19 PM Dilip Kumar wrote: > > On Sat, Jan 16, 2021 at 8:59 AM Masahiko Sawada wrote: > > > > On Wed, Jan 13, 2021 at 9:20 PM Dilip Kumar wrote: > > > > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar wrote: > > > > > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA wrote: > > > > > > > > > > On Thu, 10 Dec 2020 11:25:23 +0530 > > > > > Dilip Kumar wrote: > > > > > > > > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused to > > > > > > > > wait. > > > > > > > > Especially, if max_standby_streaming_delay is -1, this will be > > > > > > > > blocked forever, > > > > > > > > although this setting may not be usual. In addition, some users > > > > > > > > may set > > > > > > > > recovery_min_apply_delay for a large. If such users call > > > > > > > > pg_is_wal_replay_paused, > > > > > > > > it could wait for a long time. > > > > > > > > > > > > > > > > At least, I think we need some descriptions on document to > > > > > > > > explain > > > > > > > > pg_is_wal_replay_paused could wait while a time. > > > > > > > > > > > > > > Ok > > > > > > > > > > > > Fixed this, added some comments in .sgml as well as in function > > > > > > header > > > > > > > > > > Thank you for fixing this. > > > > > > > > > > Also, is it better to fix the description of pg_wal_replay_pause from > > > > > "Pauses recovery." to "Request to pause recovery." in according with > > > > > pg_is_wal_replay_paused? > > > > > > > > Okay > > > > > > > > > > > > > > > > > Also, how about adding a new boolean argument to > > > > > > > > pg_is_wal_replay_paused to > > > > > > > > control whether this waits for recovery to get paused or not? > > > > > > > > By setting its > > > > > > > > default value to true or false, users can use the old format > > > > > > > > for calling this > > > > > > > > and the backward compatibility can be maintained. > > > > > > > > > > > > > > So basically, if the wait_recovery_pause flag is false then we > > > > > > > will > > > > > > > immediately return true if the pause is requested? I agree that > > > > > > > it is > > > > > > > good to have an API to know whether the recovery pause is > > > > > > > requested or > > > > > > > not but I am not sure is it good idea to make this API serve both > > > > > > > the > > > > > > > purpose? Anyone else have any thoughts on this? > > > > > > > > > > > > > > > > > I think the current pg_is_wal_replay_paused() already has another > > > > > purpose; > > > > > this waits recovery to actually get paused. If we want to limit this > > > > > API's > > > > > purpose only to return the pause state, it seems better to fix this > > > > > to return > > > > > the actual state at the cost of lacking the backward compatibility. > > > > > If we want > > > > > to know whether pause is requested, we may add a new API like > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait > > > > > recovery to actually > > > > > get paused, we may add an option to pg_wal_replay_pause() for this > > > > > purpose. > > > > > > > > > > However, this might be a bikeshedding. If anyone don't care that > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I don't > > > > > care either. > > > > > > > > I don't think that it will be blocked ever, because > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the > > > > recovery process will not be stuck on waiting for the WAL. 
> > > > > > > > > > > > As another comment, while pg_is_wal_replay_paused is blocking, > > > > > > > > I can not cancel > > > > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the > > > > > > > > waiting loop. > > > > > > > > > > How about this fix? I think users may want to cancel > > > > > pg_is_wal_replay_paused() during > > > > > this is blocking. > > > > > > > > Yeah, we can do this. I will send the updated patch after putting > > > > some more thought into these comments. Thanks again for the feedback. > > > > > > > > > > Please find the updated patch. > > > > I've looked at the patch. Here are review comments: > > > > + /* Recovery pause state */ > > + RecoveryPauseState recoveryPause; > > > > Now that the value can have tri-state, how about renaming it to > > recoveryPauseState? > > This makes sense to me. > > > --- > > bool > > RecoveryIsPaused(void) > > +{ > > + boolrecoveryPause; > > + > > + SpinLockAcquire(&XLogCtl->info_lck); > > + recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) ? > > true : false; > > + SpinLockRelease(&XLogCtl->info_lck); > > + > > + return recoveryPause; > > +} > > + > > +bool > > +RecoveryPauseRequested(void) > > { > > boolrecoveryPause; > > > > SpinLockAcquire(&XLogCtl->info_lck); > > - recoveryPause = XLogCtl->recoveryPause; > > + recoveryPause = (XLogCtl->recoveryPause != > > RECOVERY_IN_PROGRESS) ? true : false; > > SpinLockRelease(&XLogCtl->i
Re: Single transaction in the tablesync worker?
On Sun, Jan 24, 2021 at 5:54 PM Peter Smith wrote: > > 4. > > - /* > > - * To build a slot name for the sync work, we are limited to NAMEDATALEN - > > - * 1 characters. We cut the original slot name to NAMEDATALEN - 28 chars > > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0'). (It's actually the > > - * NAMEDATALEN on the remote that matters, but this scheme will also work > > - * reasonably if that is different.) > > - */ > > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for > > sanity */ > > - slotname = psprintf("%.*s_%u_sync_%u", > > - NAMEDATALEN - 28, > > - MySubscription->slotname, > > - MySubscription->oid, > > - MyLogicalRepWorker->relid); > > + /* Calculate the name of the tablesync slot. */ > > + slotname = ReplicationSlotNameForTablesync( > > +MySubscription->oid, > > +MyLogicalRepWorker->relid); > > > > What is the reason for changing the slot name calculation? If there is > > any particular reasons, then we can add a comment to indicate why we > > can't include the subscription's slotname in this calculation. > > > > The subscription slot name may be changed (e.g. ALTER SUBSCRIPTION) > and so including the subscription slot name as part of the tablesync > slot name was considered to be: > a) possibly risky/undefined, if the subscription slot_name = NONE > b) confusing, if we end up using 2 different slot names for the same > tablesync (e.g. if the subscription slot name is changed before a sync > worker is re-launched). > And since this subscription slot name part is not necessary for > uniqueness anyway, it was removed from the tablesync slot name to > eliminate those concerns. > > Also, the tablesync slot name calculation was encapsulated as a > separate function because previously (i.e. before v18) it was used by > various other cleanup codes. I still like it better as a function, but > now it is only called from one place so we could put that code back > inline if you prefer it how it was.. It turns out those (a/b) concerns I wrote above are maybe unfounded, because it seems not possible to alter the slot_name = NONE unless the subscription is first DISABLED. So probably I can revert all this tablesync slot name calculation back to how it originally was in the OSS HEAD if you want. Kind Regards, Peter Smith. Fujitsu Australia
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On 2021-Jan-24, Mark Rofail wrote:

> I do not think that postgres contains vector operators, correct me if I am
> wrong. I feel that supporting vectors is out of the scope of this patch;
> if you have an idea how to support it, please let me know.

I do not think that this patch needs to support oidvector and int2vector
types as if they were oid[] and int2[] respectively.  If you really want an
element FK there, you can alter the column types to the real array types in
your catalog clone.  oidvector and int2vector are internal implementation
artifacts.

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least an
order of magnitude." (Brian Kernighan)
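Following that suggestion, the catalog clone from the earlier tests could be
adjusted like this (a sketch, relying on the built-in binary-coercible
casts from the vector types to the real array types):

ALTER TABLE catalog_clone.pg_proc
    ALTER COLUMN proargtypes TYPE oid[] USING proargtypes::oid[];
ALTER TABLE catalog_clone.pg_index
    ALTER COLUMN indkey TYPE int2[] USING indkey::int2[];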
RE: About to add WAL write/fsync statistics to pg_stat_wal view
Dear Ikeda-san, Thank you for updating the patch. This can be applied to master, and can be used on my RHEL7. wal_write_time and wal_sync_time increase normally :-). ``` postgres=# select * from pg_stat_wal; -[ RECORD 1 ]+-- wal_records | 121781 wal_fpi | 2287 wal_bytes| 36055146 wal_buffers_full | 799 wal_write| 12770 wal_write_time | 4.469 wal_sync | 11962 wal_sync_time| 132.352 stats_reset | 2021-01-25 00:51:40.674412+00 ``` I put a further comment: ``` @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i wal_buffers_full bigint - Number of times WAL data was written to disk because WAL buffers became full + Total number of times WAL data was written to disk because WAL buffers became full + + + + + + wal_write bigint + + + Total number of times WAL data was written to disk + + + + + + wal_write_time double precision + + + Total amount of time that has been spent in the portion of + WAL data was written to disk, in milliseconds + (if is enabled, otherwise zero). + + + + + + wal_sync bigint + + + Total number of times WAL data was synced to disk + (if is open_datasync or + open_sync, this value is zero because WAL data is synced + when to write it). + + + + + + wal_sync_time double precision + + + Total amount of time that has been spent in the portion of + WAL data was synced to disk, in milliseconds + (if is enabled, otherwise zero. + if is open_datasync or + open_sync, this value is zero too because WAL data is synced + when to write it). ``` Maybe "Total amount of time" should be used, not "Total number of time." Other views use "amount." I have no comments anymore. Hayato Kuroda FUJITSU LIMITED
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: > > On 2021-01-22 14:50, Masahiko Sawada wrote: > > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > > wrote: > >> > >> Hi, > >> > >> I rebased the patch to the master branch. > > > > Thank you for working on this. I've read the latest patch. Here are > > comments: > > > > --- > > + if (track_wal_io_timing) > > + { > > + INSTR_TIME_SET_CURRENT(duration); > > + INSTR_TIME_SUBTRACT(duration, start); > > + WalStats.m_wal_write_time += > > INSTR_TIME_GET_MILLISEC(duration); > > + } > > > > * I think it should add the time in micro sec. > > After running pgbench with track_wal_io_timing = on for 30 sec, > > pg_stat_wal showed the following on my environment: > > > > postgres(1:61569)=# select * from pg_stat_wal; > > -[ RECORD 1 ]+- > > wal_records | 285947 > > wal_fpi | 53285 > > wal_bytes| 442008213 > > wal_buffers_full | 0 > > wal_write| 25516 > > wal_write_time | 0 > > wal_sync | 25437 > > wal_sync_time| 14490 > > stats_reset | 2021-01-22 10:56:13.29464+09 > > > > Since writes can complete less than a millisecond, wal_write_time > > didn't increase. I think sync_time could also have the same problem. > > Thanks for your comments. I didn't notice that. > I changed the unit from milliseconds to microseconds. > > > --- > > + /* > > +* Measure i/o timing to fsync WAL data. > > +* > > +* The wal receiver skip to collect it to avoid performance > > degradation of standy servers. > > +* If sync_method doesn't have its fsync method, to skip too. > > +*/ > > + if (!AmWalReceiverProcess() && track_wal_io_timing && > > fsyncMethodCalled()) > > + INSTR_TIME_SET_CURRENT(start); > > > > * Why does only the wal receiver skip it even if track_wal_io_timinig > > is true? I think the performance degradation is also true for backend > > processes. If there is another reason for that, I think it's better to > > mention in both the doc and comment. > > * How about checking track_wal_io_timing first? > > * s/standy/standby/ > > I fixed it. > As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? > > > --- > > + /* increment the i/o timing and the number of times to fsync WAL > > data */ > > + if (fsyncMethodCalled()) > > + { > > + if (!AmWalReceiverProcess() && track_wal_io_timing) > > + { > > + INSTR_TIME_SET_CURRENT(duration); > > + INSTR_TIME_SUBTRACT(duration, start); > > + WalStats.m_wal_sync_time += > > INSTR_TIME_GET_MILLISEC(duration); > > + } > > + > > + WalStats.m_wal_sync++; > > + } > > > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > > incrementing m_wal_sync after each sync operation? > > I think if syncing the disk does not occur, m_wal_sync should not be > incremented. > It depends enableFsync and sync_method. > > enableFsync is checked in each fsync method like > pg_fsync_no_writethrough(), > so if incrementing m_wal_sync after each sync operation, it should be > implemented > in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. > So, why don't you change the function to a flag whether to > sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync,
we can write the code without an additional variable as follows:

    if (enableFsync)
    {
        switch (sync_method)
        {
            case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
            case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
            case SYNC_METHOD_FDATASYNC:
#endif
                WalStats.m_wal_sync++;
                if (track_wal_io_timing)
                    INSTR_TIME_SET_CURRENT(start);
                break;
            default:
                break;
        }
    }

    (do fsync and error handling here)

    /* increment the i/o timing and the number of times to fsync WAL data */
    if (track_wal_io_timing)
    {
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);
        WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
    }

I think we can change the first switch-case to an if statement.

> > * As far as I read the code, issue_xlog_fsync() seems to do fsync even
> > if enableFsync is false. Why does the function return false in that
> > case? I might be missing something.
>
> IIUC, the reason is that I thought that each fsync functions like
> pg_fsync_no_writethrough() check enableFsync.
>
> If this code doesn't check
Re: Extend more usecase for planning time partition pruning and init partition pruning.
On Sun, Jan 24, 2021 at 6:34 PM Andy Fan wrote:
> Hi:
>
> I recently found a use case like this. SELECT * FROM p, q WHERE p.partkey =
> q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either
> planning-time partition pruning or init partition pruning. Even though
> run-time partition pruning still works in the end, it is too late in some
> cases since we have to init all the plan nodes in advance. In my case,
> there are 10+ partitioned relations in one query and the execution time is
> short, so initializing that many plan nodes costs a lot.
>
> The attached patches fix this issue. They just get the "p.partkey = q.colx"
> case from root->eq_classes or rel->joinlist (outer join), and then check if
> there is some baserestrictinfo in the other relation which can be used for
> partition pruning. To make things easier, both partkey and colx must be Var
> expressions in the implementation.
>
> - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch
>
> Just some existing refactoring and extending ChangeVarNodes to be able
> to change var->attno.
>
> - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch
>
> Does the real job.
>
> Thoughts?
>
> --
> Best Regards
> Andy Fan (https://www.aliyun.com/)

Some results from this patch.

create table p (a int, b int, c character varying(8)) partition by list(c);
create table p1 partition of p for values in ('01');
create table p2 partition of p for values in ('02');
create table p3 partition of p for values in ('03');
create table q (a int, c character varying(8), b int) partition by list(c);
create table q1 partition of q for values in ('01');
create table q2 partition of q for values in ('02');
create table q3 partition of q for values in ('03');

Before the patch:

postgres=# explain (costs off) select * from p inner join q on p.c = q.c and q.c > '02';
                  QUERY PLAN
-----------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
         ->  Seq Scan on p3 p_3
   ->  Hash
         ->  Seq Scan on q3 q
               Filter: ((c)::text > '02'::text)
(9 rows)

After the patch:

                  QUERY PLAN
-----------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Seq Scan on p3 p
   ->  Hash
         ->  Seq Scan on q3 q
               Filter: ((c)::text > '02'::text)
(6 rows)

Before the patch:

postgres=# explain (costs off) select * from p inner join q on p.c = q.c and (q.c = '02' or q.c = '01');
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
         ->  Seq Scan on p3 p_3
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
               ->  Seq Scan on q2 q_2
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
(12 rows)

After the patch:

                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
               ->  Seq Scan on q2 q_2
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
(11 rows)

Before the patch:

postgres=# explain (costs off) select * from p left join q on p.c = q.c where (q.c = '02' or q.c = '01');
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
         ->  Seq Scan on p3 p_3
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
               ->  Seq Scan on q2 q_2
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
(12 rows)

After the patch:

                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
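To illustrate the idea (this is not the patch itself): a hedged sketch of how implied pruning quals could be discovered by walking the equivalence classes. The struct and field names follow the PostgreSQL planner, but the loop body is simplified and illustrative.

```
/* Illustrative only: find Var = Var equivalences involving a partition key. */
foreach(lc, root->eq_classes)
{
    EquivalenceClass *ec = (EquivalenceClass *) lfirst(lc);
    ListCell   *lc2;

    foreach(lc2, ec->ec_members)
    {
        EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);

        if (!IsA(em->em_expr, Var))
            continue;           /* the patch restricts itself to plain Vars */

        /*
         * If this Var is a partition key of some rel, remember the other
         * Var members of the same EC; baserestrictinfo quals on their rels
         * (e.g. q.colx = $1 OR q.colx = $2) can then be translated onto
         * the partition key with ChangeVarNodes() and used for pruning.
         */
    }
}
```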
Re: Single transaction in the tablesync worker?
Hi Amit.

PSA the v19 patch for the Tablesync Solution1.

Main differences from v18:
+ Patch has been rebased off HEAD @ 24/Jan
+ Addressing some review comments [ak0123]

[ak0123] https://www.postgresql.org/message-id/CAA4eK1JhpuwujrV6ABMmZ3jXfW37ssZnJ3fikrY7rRdvoEmu_g%40mail.gmail.com

Features:
* The tablesync worker now allows multiple tx instead of a single tx.
* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a successful copy_table in the tablesync's LogicalRepSyncTableStart.
* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY then it will bypass the initial copy_table phase.
* Tablesync now sets up replication origin tracking in LogicalRepSyncTableStart (similar to what is done for the apply worker). The origin is advanced when first created.
* The tablesync replication origin tracking record is cleaned up by:
  - process_syncing_tables_for_apply
  - DropSubscription
  - AlterSubscription_refresh
* Updates to PG docs.
* New TAP test case.

Known Issues:
* None.

---
Kind Regards,
Peter Smith.
Fujitsu Australia

v19-0001-Tablesync-Solution1.patch
v19-0002-Tablesync-extra-logging.patch
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada wrote: > On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda > wrote: >> >> On 2021-01-22 14:50, Masahiko Sawada wrote: >> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda >> > wrote: >> >> >> >> Hi, >> >> >> >> I rebased the patch to the master branch. >> > >> > Thank you for working on this. I've read the latest patch. Here are >> > comments: >> > >> > --- >> > + if (track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_write_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > >> > * I think it should add the time in micro sec. >> > After running pgbench with track_wal_io_timing = on for 30 sec, >> > pg_stat_wal showed the following on my environment: >> > >> > postgres(1:61569)=# select * from pg_stat_wal; >> > -[ RECORD 1 ]+- >> > wal_records | 285947 >> > wal_fpi | 53285 >> > wal_bytes| 442008213 >> > wal_buffers_full | 0 >> > wal_write| 25516 >> > wal_write_time | 0 >> > wal_sync | 25437 >> > wal_sync_time| 14490 >> > stats_reset | 2021-01-22 10:56:13.29464+09 >> > >> > Since writes can complete less than a millisecond, wal_write_time >> > didn't increase. I think sync_time could also have the same problem. >> >> Thanks for your comments. I didn't notice that. >> I changed the unit from milliseconds to microseconds. >> >> > --- >> > + /* >> > +* Measure i/o timing to fsync WAL data. >> > +* >> > +* The wal receiver skip to collect it to avoid performance >> > degradation of standy servers. >> > +* If sync_method doesn't have its fsync method, to skip too. >> > +*/ >> > + if (!AmWalReceiverProcess() && track_wal_io_timing && >> > fsyncMethodCalled()) >> > + INSTR_TIME_SET_CURRENT(start); >> > >> > * Why does only the wal receiver skip it even if track_wal_io_timinig >> > is true? I think the performance degradation is also true for backend >> > processes. If there is another reason for that, I think it's better to >> > mention in both the doc and comment. >> > * How about checking track_wal_io_timing first? >> > * s/standy/standby/ >> >> I fixed it. >> As kuroda-san mentioned too, the skip is no need to be considered. > > I think you also removed the code to have the wal receiver report the > stats. So with the latest patch, the wal receiver tracks those > statistics but doesn't report. > > And maybe XLogWalRcvWrite() also needs to track I/O? > >> >> > --- >> > + /* increment the i/o timing and the number of times to fsync WAL >> > data */ >> > + if (fsyncMethodCalled()) >> > + { >> > + if (!AmWalReceiverProcess() && track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_sync_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > + >> > + WalStats.m_wal_sync++; >> > + } >> > >> > * I'd avoid always calling fsyncMethodCalled() in this path. How about >> > incrementing m_wal_sync after each sync operation? >> >> I think if syncing the disk does not occur, m_wal_sync should not be >> incremented. >> It depends enableFsync and sync_method. >> >> enableFsync is checked in each fsync method like >> pg_fsync_no_writethrough(), >> so if incrementing m_wal_sync after each sync operation, it should be >> implemented >> in each fsync method. It leads to many duplicated codes. > > Right. I missed that each fsync function checks enableFsync. > >> So, why don't you change the function to a flag whether to >> sync data to the disk will be occurred or not in issue_xlog_fsync()? 
>
> Looks better. Since we don't necessarily need to increment m_wal_sync
> after doing fsync we can write the code without an additional variable
> as follows:
>
> if (enableFsync)
> {
>     switch (sync_method)
>     {
>         case SYNC_METHOD_FSYNC:
> #ifdef HAVE_FSYNC_WRITETHROUGH
>         case SYNC_METHOD_FSYNC_WRITETHROUGH:
> #endif
> #ifdef HAVE_FDATASYNC
>         case SYNC_METHOD_FDATASYNC:
> #endif
>             WalStats.m_wal_sync++;
>             if (track_wal_io_timing)
>                 INSTR_TIME_SET_CURRENT(start);
>             break;
>         default:
>             break;
>     }
> }
>
> (do fsync and error handling here)
>
> /* increment the i/o timing and the number of times to fsync WAL data */
> if (track_wal_io_timing)
> {
>     INSTR_TIME_SET_CURRENT(duration);
>     INSTR_TIME_SUBTRACT(duration, start);
>     WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> }
>
> I think we can change the first switch-case to an if statement.
>

+1. We can also narrow the scope of "duration" into the "if (track_wal_io_timing)" branch.

--
Regards
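A hedged sketch of the scope-narrowing suggested above; this is illustrative, not a posted patch. Note the running total is accumulated with +=.

```
/* declare "duration" only in the branch that uses it */
if (track_wal_io_timing)
{
    instr_time  duration;

    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
}
```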
Re: Single transaction in the tablesync worker?
On Mon, Jan 25, 2021 at 6:15 AM Peter Smith wrote: > > On Sun, Jan 24, 2021 at 5:54 PM Peter Smith wrote: > > > 4. > > > - /* > > > - * To build a slot name for the sync work, we are limited to NAMEDATALEN > > > - > > > - * 1 characters. We cut the original slot name to NAMEDATALEN - 28 chars > > > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0'). (It's actually the > > > - * NAMEDATALEN on the remote that matters, but this scheme will also work > > > - * reasonably if that is different.) > > > - */ > > > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for > > > sanity */ > > > - slotname = psprintf("%.*s_%u_sync_%u", > > > - NAMEDATALEN - 28, > > > - MySubscription->slotname, > > > - MySubscription->oid, > > > - MyLogicalRepWorker->relid); > > > + /* Calculate the name of the tablesync slot. */ > > > + slotname = ReplicationSlotNameForTablesync( > > > +MySubscription->oid, > > > +MyLogicalRepWorker->relid); > > > > > > What is the reason for changing the slot name calculation? If there is > > > any particular reasons, then we can add a comment to indicate why we > > > can't include the subscription's slotname in this calculation. > > > > > > > The subscription slot name may be changed (e.g. ALTER SUBSCRIPTION) > > and so including the subscription slot name as part of the tablesync > > slot name was considered to be: > > a) possibly risky/undefined, if the subscription slot_name = NONE > > b) confusing, if we end up using 2 different slot names for the same > > tablesync (e.g. if the subscription slot name is changed before a sync > > worker is re-launched). > > And since this subscription slot name part is not necessary for > > uniqueness anyway, it was removed from the tablesync slot name to > > eliminate those concerns. > > > > Also, the tablesync slot name calculation was encapsulated as a > > separate function because previously (i.e. before v18) it was used by > > various other cleanup codes. I still like it better as a function, but > > now it is only called from one place so we could put that code back > > inline if you prefer it how it was.. > > It turns out those (a/b) concerns I wrote above are maybe unfounded, > because it seems not possible to alter the slot_name = NONE unless the > subscription is first DISABLED. > Yeah, but I think the user can still change to some other predefined slot_name. However, I guess it doesn't matter unless it can lead what you have mentioned in (a). As that can't happen, it is probably better to take out that change from the patch. I see your point of moving this calculation to a separate function but not sure if it is worth it unless we have to call it from multiple places or it simplifies the existing code. -- With Regards, Amit Kapila.
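For illustration only, a guess at the shape of the encapsulated calculation under discussion; the exact format string used by v19 is not shown in this mail, so treat both the body and the format as hypothetical.

```
char *
ReplicationSlotNameForTablesync(Oid suboid, Oid relid)
{
    /* unique per (subscription, table); needs no subscription slot name */
    return psprintf("pg_%u_sync_%u", suboid, relid);
}
```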
Re: Single transaction in the tablesync worker?
On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila wrote: > > 2. > @@ -98,11 +102,16 @@ > #include "miscadmin.h" > #include "parser/parse_relation.h" > #include "pgstat.h" > +#include "postmaster/interrupt.h" > #include "replication/logicallauncher.h" > #include "replication/logicalrelation.h" > +#include "replication/logicalworker.h" > #include "replication/walreceiver.h" > #include "replication/worker_internal.h" > +#include "replication/slot.h" > > I don't think the above includes are required. They seem to the > remnant of the previous approach. > OK. Fixed in the latest patch [v19]. > 3. > process_syncing_tables_for_sync(XLogRecPtr current_lsn) > { > - Assert(IsTransactionState()); > + bool sync_done = false; > > SpinLockAcquire(&MyLogicalRepWorker->relmutex); > + sync_done = MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP && > + current_lsn >= MyLogicalRepWorker->relstate_lsn; > + SpinLockRelease(&MyLogicalRepWorker->relmutex); > > - if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP && > - current_lsn >= MyLogicalRepWorker->relstate_lsn) > + if (sync_done) > { > TimeLineID tli; > > + /* > + * Change state to SYNCDONE. > + */ > + SpinLockAcquire(&MyLogicalRepWorker->relmutex); > > Why do we need these changes? If you have done it for the > code-readability purpose then we can consider this as a separate patch > because I don't see why these are required w.r.t this patch. > Yes it was for code readability in v17 when this function used to be much larger. But it is not very necessary anymore and has been reverted in the latest patch [v19]. > 4. > - /* > - * To build a slot name for the sync work, we are limited to NAMEDATALEN - > - * 1 characters. We cut the original slot name to NAMEDATALEN - 28 chars > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0'). (It's actually the > - * NAMEDATALEN on the remote that matters, but this scheme will also work > - * reasonably if that is different.) > - */ > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity > */ > - slotname = psprintf("%.*s_%u_sync_%u", > - NAMEDATALEN - 28, > - MySubscription->slotname, > - MySubscription->oid, > - MyLogicalRepWorker->relid); > + /* Calculate the name of the tablesync slot. */ > + slotname = ReplicationSlotNameForTablesync( > +MySubscription->oid, > +MyLogicalRepWorker->relid); > > What is the reason for changing the slot name calculation? If there is > any particular reasons, then we can add a comment to indicate why we > can't include the subscription's slotname in this calculation. > The tablesync slot name changes were not strictly necessary, so the code is all reverted to be the same as OSS HEAD now in the latest patch [v19]. > 5. > This is WAL > + * logged for for the purpose of recovery. Locks are to prevent the > + * replication origin from vanishing while advancing. > > /for for/for > OK. Fixed in the latest patch [v19]. > 6. > + /* Remove the tablesync's origin tracking if exists. */ > + snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid); > + originid = replorigin_by_name(originname, true); > + if (originid != InvalidRepOriginId) > + { > + elog(DEBUG1, "DropSubscription: dropping origin tracking for > \"%s\"", originname); > > I don't think we need this and the DEBUG1 message in > AlterSubscription_refresh. IT is fine to print this information for > background workers like in apply-worker but not sure if need it here. 
> The DropSubscription drops the origin of apply worker but it doesn't > use such a DEBUG message so I guess we don't it for tablesync origins > as well. > OK. These DEBUG1 logs are removed in the latest patch [v19]. [v19] https://www.postgresql.org/message-id/CAHut%2BPsj7Xm8C1LbqeAbk-3duyS8xXJtL9TiGaeu3P8g272mAA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
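A minimal sketch of the tablesync origin cleanup pattern discussed above, based on the snprintf shown in the review comment; the replorigin_drop() call is my assumption, not a quote from the patch.

```
char        originname[NAMEDATALEN];
RepOriginId originid;

/* the tablesync origin name is derived from (subscription oid, relation oid) */
snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);

originid = replorigin_by_name(originname, true);    /* missing_ok = true */
if (originid != InvalidRepOriginId)
    replorigin_drop(originid, false);               /* nowait = false */
```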
Re: [PoC] Non-volatile WAL buffer
On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra wrote:
>
> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra wrote:
> >>
> >> Hi,
> >>
> >> I think I've managed to get the 0002 patch [1] rebased to master and
> >> working (with help from Masahiko Sawada). It's not clear to me how it
> >> could have worked as submitted - my theory is that an incomplete patch
> >> was submitted by mistake, or something like that.
> >>
> >> Unfortunately, the benchmark results were kinda disappointing. For a
> >> pgbench on scale 500 (fits into shared buffers), an average of three
> >> 5-minute runs looks like this:
> >>
> >> branch                    1       16       32       64       96
> >> ----------------------------------------------------------------
> >> master                 7291    87704   165310   150437   224186
> >> ntt                    7912   106095   213206   212410   237819
> >> simple-no-buffers      7654    96544   115416    95828   103065
> >>
> >> NTT refers to the patch from September 10, pre-allocating a large WAL
> >> file on PMEM, and simple-no-buffers is the simpler patch simply removing
> >> the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
> >>
> >> Note: The patch is just replacing the old implementation with mmap.
> >> That's good enough for experiments like this, but we probably want to
> >> keep the old one for setups without PMEM. But it's good enough for
> >> testing, benchmarking etc.
> >>
> >> Unfortunately, the results for this simple approach are pretty bad. Not
> >> only compared to the "ntt" patch, but even to master. I'm not entirely
> >> sure what's the root cause, but I have a couple hypotheses:
> >>
> >> 1) bug in the patch - That's clearly a possibility, although I've
> >> tried to eliminate this possibility.
> >>
> >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
> >> NVMe storage, but still much slower than DRAM (both in terms of latency
> >> and bandwidth, see [2] for some data). It's not terrible, but the
> >> latency is maybe 2-3x higher - not a huge difference, but may matter for
> >> WAL buffers?
> >>
> >> 3) PMEM does not handle parallel writes well - If you look at [2],
> >> Figure 4(b), you'll see that the throughput actually *drops* as the
> >> number of threads increases. That's pretty strange / annoying, because
> >> that's how we write into WAL buffers - each thread writes its own data,
> >> so parallelism is not something we can get rid of.
> >>
> >> I've added some simple profiling, to measure the number of calls / time
> >> for each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates
> >> data for each backend, and logs the counts every 1M ops.
> >>
> >> Typical stats from a concurrent run look like this:
> >>
> >> xlog stats cnt 4300
> >>   map cnt 100 time 5448333 unmap cnt 100 time 3730963
> >>   memcpy cnt 985964 time 1550442272 len 15150499
> >>   memset cnt 0 time 0 len 0
> >>   persist cnt 13836 time 10369617 len 16292182
> >>
> >> The times are in nanoseconds, so this says the backend did 100 mmap and
> >> unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
> >> taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
> >> copying about 15MB of data. That's quite a lot :-(
> >
> > It might also be interesting if we can see how much time is spent on
> > each logging function, such as XLogInsert(), XLogWrite(), and
> > XLogFlush().
> >
>
> Yeah, we could extend it to that, that's a fairly mechanical thing. But
> maybe that could be visible in a regular perf profile. Also, I suppose
> most of the time will be used by the pmem calls, shown in the stats.

> >> My conclusion from this is that eliminating WAL buffers and writing WAL
> >> directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
> >> right approach.
> >>
> >> I suppose we should keep WAL buffers, and then just write the data to
> >> mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
> >> except that it allocates one huge file on PMEM and writes to that
> >> (instead of the traditional WAL segments).
> >>
> >> So I decided to try how it'd work with writing to regular WAL segments,
> >> mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
> >> and the results look a bit nicer:
> >>
> >> branch                    1       16       32       64       96
> >> ----------------------------------------------------------------
> >> master                 7291    87704   165310   150437   224186
> >> ntt                    7912   106095   213206   212410   237819
> >> simple-no-buffers      7654    96544   115416    95828   103065
> >> with-wal-buffers       7477    95454   181702   140167   214715
> >>
> >> So, much better than the version without WAL buffers, somewhat better
> >> than master (except for 64/96 clients), but still not
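For readers unfamiliar with the PMDK calls named in the profiling stats above, a minimal usage sketch; the path and sizes are made up, and this assumes libpmem is available.

```
#include <libpmem.h>

static void
pmem_sketch(void)
{
    size_t      mapped_len;
    int         is_pmem;
    char       *base;

    /* map a (hypothetical) 16MB segment-sized file on a PMEM-aware mount */
    base = pmem_map_file("/mnt/pmem/wal-segment", 16 * 1024 * 1024,
                         PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (base == NULL)
        return;

    /* copy into PMEM and flush to persistence (memcpy + persist combined) */
    pmem_memcpy_persist(base, "some WAL record bytes", 21);

    pmem_unmap(base, mapped_len);
}
```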
Re: Single transaction in the tablesync worker?
On Sun, Jan 24, 2021 at 12:24 PM Peter Smith wrote: > > On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila wrote: > > > > > > Few comments: > > = > > 1. > > - * So the state progression is always: INIT -> DATASYNC -> SYNCWAIT -> > > - * CATCHUP -> SYNCDONE -> READY. > > + * So the state progression is always: INIT -> DATASYNC -> > > + * (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> > > READY. > > > > I don't think we need to be specific here that sync worker sets > > FINISHEDCOPY state. > > > > This was meant to indicate that *only* the sync worker knows about the > FINISHEDCOPY state, whereas all the other states are either known > (assigned and/or used) by *both* kinds of workers. But, I can remove > it if you feel that distinction is not useful. > Okay, but I feel you can mention that in the description you have added for FINISHEDCOPY state. It looks a bit odd here and the message you want to convey is also not that clear. -- With Regards, Amit Kapila.
RE: Parallel Inserts in CREATE TABLE AS
> Thanks a lot for the tests. In your test case, parallel insertions are not
> being picked because the Gather node has some projections
> (floor(((random() * '1'::double precision) + '1'::double precision))) to
> perform. That's expected. Whenever parallel insertions are chosen for CTAS,
> we should see "Create target_table ''" under the Gather node [1] and also
> the actual row count for the Gather node being 0 (but in your test it is
> rows=16268) in the explain analyze output. Coming to your test case, if
> it's modified to something like [1], where the Gather node has no
> projections, then parallel insertions will be chosen.

Thanks for your explanation and test. Actually, I deliberately made my test case (with projection) pick serial insert so as to make the unbalanced tuple distribution (99% of tuples read by one worker) happen. This issue leads to a performance regression, but it's not introduced by your patch; it also happens on master (HEAD). Do you have any thoughts about this?

> [1] - I did this test on my development system, I will run on some
> performance system and post my observations.

Thank you, it would be very kind of you to do this. To reproduce the above issue, you need to use my case (with projection), because it won't occur with "parallel insert".

Regards,
Tang
Re: Is Recovery actually paused?
At Sun, 24 Jan 2021 14:26:08 +0530, Dilip Kumar wrote in > On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy > wrote: > > > > On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar wrote: > > > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy > > > wrote: > > >> > > >> On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar > > >> wrote: > > >> > Please find the patch for the same. I haven't added a test case for > > >> > this yet. I mean we can write a test case to pause the recovery and > > >> > get the status. But I am not sure that we can really write a reliable > > >> > test case for 'pause requested' and 'paused'. > > >> > > >> +1 to just show the recovery pause state in the output of > > >> pg_is_wal_replay_paused. But, should the function name > > >> "pg_is_wal_replay_paused" be something like > > >> "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists > > >> in a function, I expect a boolean output. Others may have better > > >> thoughts. > > >> > > >> IIUC the above change, ensuring the recovery is paused after it's > > >> requested lies with the user. IMHO, the main problem we are trying to > > >> solve is not met > > > > > > > > > Basically earlier their was no way for the user yo know whether the > > > recovery is actually paused or not because it was always returning true > > > after pause requested. Now, we will return whether pause requested or > > > actually paused. So > for tool designer who want to wait for recovery to > > > get paused can have a loop and wait until the recovery state reaches to > > > paused. That will give a better control. > > > > I get it and I agree to have that change. My point was whether we can > > have a new function pg_wal_replay_pause_and_wait that waits until > > recovery is actually paused ((along with pg_is_wal_replay_paused > > returning the actual state than a true/false) so that tool developers > > don't need to have the waiting code outside, if at all they care about > > it? Others may have better thoughts than me. > > I think the previous patch was based on that idea where we thought > that we can pass an argument to pg_is_wal_replay_paused which can > decide whether to wait or return without the wait. I think this > version looks better to me where we give the status instead of > waiting. I am not sure whether we want another version of > pg_wal_replay_pause which will wait for actually it to get paused. I > mean there is always a scope to include the functionality in the > database which can be achieved by the tool, but this patch was trying > to solve the problem that there was no way to know the status so I > think returning the correct status should be the scope of this. I understand that the requirement here is that no record is replayed after pg_wal_replay_pause() is returned, or pg_is_wal_replay_paused() returns true, and delays taken while recovery don't delay the state change. That requirements are really synchronous. On the other hand the machinery is designed to be asynchronous. >* Note that we intentionally don't take the info_lck spinlock >* here. We might therefore read a slightly stale value of >* the recoveryPause flag, but it can't be very stale (no >* worse than the last spinlock we did acquire). Since a >* pause request is a pretty asynchronous thing anyway, >* possibly responding to it one WAL record later than we >* otherwise would is a minor issue, so it doesn't seem worth >* adding another spinlock cycle to prevent that. 
As a result, this patch tries to introduce several new checkpoints at some delaying points so that those waits can notice a pause request in a timely manner. I think we had better use locking (or atomics) for that information instead of such scattered checkpoints if we expect the machinery to work in such a synchronous manner. That would make the tri-state variable and many of the checkpoints unnecessary. Maybe.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
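A hedged sketch of the locking/atomics idea: the enum matches the tri-state discussed in this thread, but the atomic variable and helpers are illustrative (in real code the state would live in shared memory, e.g. XLogCtl), not from any posted patch.

```
typedef enum RecoveryPauseState
{
    RECOVERY_NOT_PAUSED,
    RECOVERY_PAUSE_REQUESTED,
    RECOVERY_PAUSED
} RecoveryPauseState;

static pg_atomic_uint32 recoveryPauseState;     /* hypothetical */

/* requester side: ask for a pause and wake the startup process */
static void
RequestRecoveryPause(void)
{
    pg_atomic_write_u32(&recoveryPauseState, RECOVERY_PAUSE_REQUESTED);
    WakeupRecovery();
}

/* startup process: call at every wait/delay point */
static void
ConfirmRecoveryPaused(void)
{
    uint32      expected = RECOVERY_PAUSE_REQUESTED;

    /* promote "requested" to "paused" exactly once, without a spinlock */
    pg_atomic_compare_exchange_u32(&recoveryPauseState, &expected,
                                   RECOVERY_PAUSED);
}
```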
Re: Single transaction in the tablesync worker?
On Sat, Jan 23, 2021 at 11:08 AM Ajin Cherian wrote: > > On Sat, Jan 23, 2021 at 3:16 PM Amit Kapila wrote: > > > > > I think so. But do you have any reason to believe that it won't be > > required anymore? > > A temporary slot will not clash with a permanent slot of the same name, > I have tried below and it seems to be clashing: postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('test_slot2', 'test_decoding'); ?column? -- init (1 row) postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('test_slot2', 'test_decoding', true); ERROR: replication slot "test_slot2" already exists Note that the third parameter in the second statement above indicates whether it is a temporary slot or not. What am I missing? -- With Regards, Amit Kapila.
Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION
On Sat, Jan 23, 2021 at 1:44 PM Bharath Rupireddy wrote: > > On Sat, Jan 23, 2021 at 11:50 AM Amit Kapila wrote: > > > > On Fri, Jan 22, 2021 at 2:59 PM Bharath Rupireddy > > wrote: > > > > > > On Fri, Jan 22, 2021 at 2:00 PM Amit Kapila > > > wrote: > > > > > > Yes you are right. Looks like the above commit is causing the issue. I > > > reverted that commit and did not see the drop publication bug, see use > > > case [1]. > > > > > > > Okay, thanks for confirming the same. I am planning to push the > > attached where I made changes in few comments, combined the code and > > test patch, and made a few changes in the commit message. Let me know > > if you have any suggestions? > > Thanks Amit. Looks like there were some typos in the test case > description, I corrected them. > Thanks, I have pushed the patch. -- With Regards, Amit Kapila.
RE: Parallel INSERT (INTO ... SELECT ...)
Hi,

After doing some tests to cover the code paths in patch 0001, I have some suggestions for the 0002 test cases.

(1)
+ /* Check parallel-safety of any expressions in the partition key */
+ if (get_partition_col_attnum(pkey, i) == 0)
+ {
+     Node *check_expr = (Node *) lfirst(partexprs_item);
+
+     if (max_parallel_hazard_walker(check_expr, context))
+     {
+         table_close(rel, lockmode);
+         return true;
+     }

The test cases do not seem to cover the above code (i.e. test a table that has a parallel-unsafe expression in the partition key). Personally, I use the following SQL to cover this:

-----
create table partkey_unsafe_key_expr_t (a int4, b name) partition by range ((fullname_parallel_unsafe('',a::varchar)));
explain (costs off) insert into partkey_unsafe_key_expr_t select unique1, stringu1 from tenk1;
-----

(2)
I noticed that most test cases test all three cases (parallel safe/unsafe/restricted). But the index expression tests do not seem to cover the parallel-restricted case. How about adding a test case like:

-----
create or replace function fullname_parallel_restricted(f text, l text) returns text as $$
begin
    return f || l;
end;
$$ language plpgsql immutable parallel restricted;

create table names4(index int, first_name text, last_name text);
create index names4_fullname_idx on names4 (fullname_parallel_restricted(first_name, last_name));

--
-- Test INSERT with parallel-restricted index expression
-- (should create a parallel plan)
--
explain (costs off) insert into names4 select * from names;
-----

(3)
+ /* Recursively check each partition ... */
+ pdesc = RelationGetPartitionDesc(rel);
+ for (i = 0; i < pdesc->nparts; i++)
+ {
+     if (rel_max_parallel_hazard_for_modify(pdesc->oids[i],
+                                            command_type,
+                                            context,
+                                            AccessShareLock))
+     {
+         table_close(rel, lockmode);
+         return true;
+     }
+ }

It seems we do not have a test case for a parallel-unsafe expression (or similar) in a partition. How about adding one test case for a parallel-unsafe partition?

Best regards,
houzj
Re: PG vs LLVM 12 on seawasp, next round
On Mon, Jan 18, 2021 at 09:29:53PM +0100, Fabien COELHO wrote: > >>>The "no such file" error seems more like a machine local issue to me. > >> > >>I'll look into it when I have time, which make take some time. Hopefully > >>over the holidays. > > > >This is still happening ... Any chance you can have a look at it? > > Indeed. I'll try to look (again) into it soon. I had a look but did not find > anything obvious in the short time frame I had. Last two months were a > little overworked for me so I let slip quite a few things. If you want to > disable the animal as Tom suggests, do as you want. Perhaps he was suggesting that you (buildfarm owner) disable the cron job that initiates new runs. That's what I do when one of my animals needs my intervention.
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 25, 2021 at 10:23 AM tsunakawa.ta...@fujitsu.com <tsunakawa.ta...@fujitsu.com> wrote:
>
> Hello Greg-san,
>
> Second group of comments (I'll reply to (1) - (4) later):
>
> (5)
> @@ -790,7 +790,8 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
> ...
> - if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
> + if ((plannedstmt->commandType != CMD_SELECT &&
> + !IsModifySupportedInParallelMode(plannedstmt->commandType)) || plannedstmt->hasModifyingCTE)
> PreventCommandIfParallelMode(CreateCommandName((Node *) plannedstmt));
> }
>
> Now that we're trying to allow parallel writes (INSERT), we should:
>
> * use ExecCheckXactReadOnly() solely for checking read-only transactions,
> as the function name represents. That is, move the call to
> PreventCommandIfParallelMode() up to standard_ExecutorStart().
>
> * Update the comment above the call to ExecCheckXactReadOnly().
>

Hmmm, I'm not so sure. The patch changes just make the existing test for calling PreventCommandIfParallelMode() a bit more restrictive, to exclude the Parallel INSERT case. The code previously wasn't just checking read-only transactions anyway, so it's not as if the patch has changed something fundamental in this function. And by moving the PreventCommandIfParallelMode() call to a higher level, you're making a change to the existing order of error-handling (as ExecCheckXactReadOnly() is calling PreventCommandIfReadOnly() based on a few other range-table conditions, prior to testing whether to call PreventCommandIfParallelMode()). I don't want to introduce a bug by making the change that you're suggesting.

> (6)
> @@ -764,6 +777,22 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
> ...
> + else
> + {
> + pei->processed_count = NULL;
> + }
>
> The braces can be deleted.
>

Yes they can be deleted, and I guess I will, but for the record, I personally prefer the explicit brackets, even if just for one line, because:
- if more code ever needs to be added to the else, you'll need to add brackets anyway (and newbies might add extra lines tabbed in, thinking it's part of the else block ...).
- I think it looks better and slightly easier to read, especially when there's a mix of cases with multiple code lines and single code lines.
Of course, these kinds of things could be debated forever, but I don't think it's such a big deal.

> (7)
> @@ -1400,6 +1439,16 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
> true);
> queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
>
> + Assert(queryDesc->operation == CMD_SELECT || IsModifySupportedInParallelMode(queryDesc->operation));
> + if (IsModifySupportedInParallelMode(queryDesc->operation))
> + {
> + /*
> +  * Record that the CurrentCommandId is used, at the start of the
> +  * parallel operation.
> +  */
> + SetCurrentCommandIdUsedForWorker();
> + }
> +
> /* Setting debug_query_string for individual workers */
> debug_query_string = queryDesc->sourceText;
>
> @@ -765,12 +779,16 @@ GetCurrentCommandId(bool used)
> if (used)
> {
> /*
> -  * Forbid setting currentCommandIdUsed in a parallel worker, because
> -  * we have no provision for communicating this back to the leader. We
> -  * could relax this restriction when currentCommandIdUsed was already
> -  * true at the start of the parallel operation.
> +  * If in a parallel worker, only allow setting currentCommandIdUsed if
> +  * currentCommandIdUsed was already true at the start of the parallel
> +  * operation (by way of SetCurrentCommandIdUsed()), otherwise forbid
> +  * setting currentCommandIdUsed because we have no provision for
> +  * communicating this back to the leader. Once currentCommandIdUsed is
> +  * set, the commandId used by leader and workers can't be changed,
> +  * because CommandCounterIncrement() then prevents any attempted
> +  * increment of the current commandId.
> +  */
> - Assert(!IsParallelWorker());
> + Assert(!(IsParallelWorker() && !currentCommandIdUsed));
> currentCommandIdUsed = true;
> }
> return currentCommandId;
>
> What happens without these changes?

The change to the above comment explains why the change is needed. Without these changes, a call in a parallel worker to GetCurrentCommandId() will result in an Assert being fired because (prior to the patch) currentCommandIdUsed is forbidden to be set in a parallel worker, and calling GetCurrentCommandId(true) (to signify the intent to use the returned CommandId to mark inserted/updated/deleted tuples) will result in currentCommandIdUsed
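A hedged sketch of what SetCurrentCommandIdUsedForWorker() presumably looks like, inferred from the hunks quoted above; the body is my reconstruction, not quoted from the patch.

```
/*
 * Mark the current command id as used in a parallel worker, so that the
 * relaxed Assert in GetCurrentCommandId() permits later "used" calls.
 */
void
SetCurrentCommandIdUsedForWorker(void)
{
    Assert(IsParallelWorker() && !currentCommandIdUsed);

    currentCommandIdUsed = true;
}
```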
RE: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 10:34, kuroda.hay...@fujitsu.com wrote:

Dear Ikeda-san,

Thank you for updating the patch. This can be applied to master, and can be used on my RHEL7. wal_write_time and wal_sync_time increase normally :-).

```
postgres=# select * from pg_stat_wal;
-[ RECORD 1 ]----+------------------------------
wal_records      | 121781
wal_fpi          | 2287
wal_bytes        | 36055146
wal_buffers_full | 799
wal_write        | 12770
wal_write_time   | 4.469
wal_sync         | 11962
wal_sync_time    | 132.352
stats_reset      | 2021-01-25 00:51:40.674412+00
```

Thanks for checking.

I put a further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i

 wal_buffers_full bigint
-  Number of times WAL data was written to disk because WAL buffers became full
+  Total number of times WAL data was written to disk because WAL buffers became full

+ wal_write bigint
+  Total number of times WAL data was written to disk

+ wal_write_time double precision
+  Total amount of time that has been spent in the portion of WAL data was
+  written to disk, in milliseconds (if track_wal_io_timing is enabled,
+  otherwise zero).

+ wal_sync bigint
+  Total number of times WAL data was synced to disk (if wal_sync_method is
+  open_datasync or open_sync, this value is zero because WAL data is synced
+  when to write it).

+ wal_sync_time double precision
+  Total amount of time that has been spent in the portion of WAL data was
+  synced to disk, in milliseconds (if track_wal_io_timing is enabled,
+  otherwise zero; if wal_sync_method is open_datasync or open_sync, this
+  value is zero too because WAL data is synced when to write it).
```

Maybe "Total amount of time" should be used, not "Total number of time." Other views use "amount."

Thanks. I checked columns' descriptions of other views. There are "Number of xxx", "Total number of xxx", "Total amount of time that xxx" and "Total time spent xxx". Since the "time" is used for showing spending time, not count, I'll change it to "Total number of WAL data written/synced to disk". Thought?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
Re: Single transaction in the tablesync worker?
On Mon, Jan 25, 2021 at 8:23 AM Peter Smith wrote:
>
> On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila wrote:
> >
> > 2.
> > @@ -98,11 +102,16 @@
> > #include "miscadmin.h"
> > #include "parser/parse_relation.h"
> > #include "pgstat.h"
> > +#include "postmaster/interrupt.h"
> > #include "replication/logicallauncher.h"
> > #include "replication/logicalrelation.h"
> > +#include "replication/logicalworker.h"
> > #include "replication/walreceiver.h"
> > #include "replication/worker_internal.h"
> > +#include "replication/slot.h"
> >
> > I don't think the above includes are required. They seem to the
> > remnant of the previous approach.
> >
>
> OK. Fixed in the latest patch [v19].
>

You seem to have forgotten to remove #include "replication/slot.h". Check it; if it is not required then remove that as well.

--
With Regards,
Amit Kapila.
Re: PG vs LLVM 12 on seawasp, next round
Noah Misch writes: > On Mon, Jan 18, 2021 at 09:29:53PM +0100, Fabien COELHO wrote: >> ... Last two months were a >> little overworked for me so I let slip quite a few things. If you want to >> disable the animal as Tom suggests, do as you want. > Perhaps he was suggesting that you (buildfarm owner) disable the cron job that > initiates new runs. That's what I do when one of my animals needs my > intervention. Indeed. I'm not sure there even is a provision to block an animal on the buildfarm-server side. If there is, you'd have to request that it be manually undone after you get around to fixing the animal. Frankly, if I were the BF admin, I would be in about as much hurry to do that as you've been to fix it. regards, tom lane
RE: About to add WAL write/fsync statistics to pg_stat_wal view
Dear Ikeda-san,

> I checked columns' descriptions of other views.
> There are "Number of xxx", "Total number of xxx", "Total amount of time
> that xxx" and "Total time spent xxx".

Right.

> Since the "time" is used for showing spending time, not count,
> I'll change it to "Total number of WAL data written/synced to disk".
> Thought?

I misread your patch, sorry. I prefer your suggestion. Please fix it that way, along with the others.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 10:36, Masahiko Sawada wrote: On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: On 2021-01-22 14:50, Masahiko Sawada wrote: > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > wrote: >> >> Hi, >> >> I rebased the patch to the master branch. > > Thank you for working on this. I've read the latest patch. Here are > comments: > > --- > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > > * I think it should add the time in micro sec. > After running pgbench with track_wal_io_timing = on for 30 sec, > pg_stat_wal showed the following on my environment: > > postgres(1:61569)=# select * from pg_stat_wal; > -[ RECORD 1 ]+- > wal_records | 285947 > wal_fpi | 53285 > wal_bytes| 442008213 > wal_buffers_full | 0 > wal_write| 25516 > wal_write_time | 0 > wal_sync | 25437 > wal_sync_time| 14490 > stats_reset | 2021-01-22 10:56:13.29464+09 > > Since writes can complete less than a millisecond, wal_write_time > didn't increase. I think sync_time could also have the same problem. Thanks for your comments. I didn't notice that. I changed the unit from milliseconds to microseconds. > --- > + /* > +* Measure i/o timing to fsync WAL data. > +* > +* The wal receiver skip to collect it to avoid performance > degradation of standy servers. > +* If sync_method doesn't have its fsync method, to skip too. > +*/ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > > * Why does only the wal receiver skip it even if track_wal_io_timinig > is true? I think the performance degradation is also true for backend > processes. If there is another reason for that, I think it's better to > mention in both the doc and comment. > * How about checking track_wal_io_timing first? > * s/standy/standby/ I fixed it. As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? Thanks, I forgot to add them. I'll fix it. > --- > + /* increment the i/o timing and the number of times to fsync WAL > data */ > + if (fsyncMethodCalled()) > + { > + if (!AmWalReceiverProcess() && track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > incrementing m_wal_sync after each sync operation? I think if syncing the disk does not occur, m_wal_sync should not be incremented. It depends enableFsync and sync_method. enableFsync is checked in each fsync method like pg_fsync_no_writethrough(), so if incrementing m_wal_sync after each sync operation, it should be implemented in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. So, why don't you change the function to a flag whether to sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync we can write the code without an additional variable as follows: if (enableFsync) { switch (sync_method) { case SYNC_METHOD_FSYNC: #ifdef HAVE_FSYNC_WRITETHROUGH case SYNC_METHOD_FSYNC_WRITETHROUGH: #endif #ifdef HAVE_FDATASYNC case SYNC_METHOD_FDATASYNC: #endif WalStats.m_wal_sync++; if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); break; default: break; } } (do fsync and error handling here) /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); } IIUC, I think we can't handle the following case. When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and "track_wal_io_timing" is enabled, "start" doesn't be initialized. My understanding is something wrong, isn't it? I think we can change the first switch-case to an if statement. Yes, I'll change it. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
RE: POC: postgres_fdw insert batching
From: Amit Langote
> Yes, it can be simplified by using a local join to prevent the update of
> the foreign partition from being pushed to the remote server, for which my
> example in the previous email used a local trigger. Note that the update
> of the foreign partition to be done locally is a prerequisite for this bug
> to occur.

Thank you. I was aware that UPDATE calls ExecInsert() but forgot about it partway through. Good catch (and my bad for missing it.)

+ PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
+     (PgFdwModifyState *) resultRelInfo->ri_FdwState :
+     NULL;

This can be written as:

+ PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;

Regards
Takayuki Tsunakawa
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 11:47, japin wrote: On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada wrote: On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: On 2021-01-22 14:50, Masahiko Sawada wrote: > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > wrote: >> >> Hi, >> >> I rebased the patch to the master branch. > > Thank you for working on this. I've read the latest patch. Here are > comments: > > --- > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > > * I think it should add the time in micro sec. > After running pgbench with track_wal_io_timing = on for 30 sec, > pg_stat_wal showed the following on my environment: > > postgres(1:61569)=# select * from pg_stat_wal; > -[ RECORD 1 ]+- > wal_records | 285947 > wal_fpi | 53285 > wal_bytes| 442008213 > wal_buffers_full | 0 > wal_write| 25516 > wal_write_time | 0 > wal_sync | 25437 > wal_sync_time| 14490 > stats_reset | 2021-01-22 10:56:13.29464+09 > > Since writes can complete less than a millisecond, wal_write_time > didn't increase. I think sync_time could also have the same problem. Thanks for your comments. I didn't notice that. I changed the unit from milliseconds to microseconds. > --- > + /* > +* Measure i/o timing to fsync WAL data. > +* > +* The wal receiver skip to collect it to avoid performance > degradation of standy servers. > +* If sync_method doesn't have its fsync method, to skip too. > +*/ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > > * Why does only the wal receiver skip it even if track_wal_io_timinig > is true? I think the performance degradation is also true for backend > processes. If there is another reason for that, I think it's better to > mention in both the doc and comment. > * How about checking track_wal_io_timing first? > * s/standy/standby/ I fixed it. As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? > --- > + /* increment the i/o timing and the number of times to fsync WAL > data */ > + if (fsyncMethodCalled()) > + { > + if (!AmWalReceiverProcess() && track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > incrementing m_wal_sync after each sync operation? I think if syncing the disk does not occur, m_wal_sync should not be incremented. It depends enableFsync and sync_method. enableFsync is checked in each fsync method like pg_fsync_no_writethrough(), so if incrementing m_wal_sync after each sync operation, it should be implemented in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. So, why don't you change the function to a flag whether to sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync we can write the code without an additional variable as follows: if (enableFsync) { switch (sync_method) { case SYNC_METHOD_FSYNC: #ifdef HAVE_FSYNC_WRITETHROUGH case SYNC_METHOD_FSYNC_WRITETHROUGH: #endif #ifdef HAVE_FDATASYNC case SYNC_METHOD_FDATASYNC: #endif WalStats.m_wal_sync++; if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); break; default: break; } } (do fsync and error handling here) /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); } I think we can change the first switch-case to an if statement. +1. We can also narrow the scope of "duration" into "if (track_wal_io_timing)" branch. Thanks, I'll change it. -- Masahiro Ikeda NTT DATA CORPORATION
Re: Is Recovery actually paused?
On Mon, Jan 25, 2021 at 6:12 AM Masahiko Sawada wrote: > > On Sun, Jan 17, 2021 at 5:19 PM Dilip Kumar wrote: > > > > On Sat, Jan 16, 2021 at 8:59 AM Masahiko Sawada > > wrote: > > > > > > On Wed, Jan 13, 2021 at 9:20 PM Dilip Kumar wrote: > > > > > > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar > > > > wrote: > > > > > > > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA > > > > > wrote: > > > > > > > > > > > > On Thu, 10 Dec 2020 11:25:23 +0530 > > > > > > Dilip Kumar wrote: > > > > > > > > > > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused > > > > > > > > > to wait. > > > > > > > > > Especially, if max_standby_streaming_delay is -1, this will > > > > > > > > > be blocked forever, > > > > > > > > > although this setting may not be usual. In addition, some > > > > > > > > > users may set > > > > > > > > > recovery_min_apply_delay for a large. If such users call > > > > > > > > > pg_is_wal_replay_paused, > > > > > > > > > it could wait for a long time. > > > > > > > > > > > > > > > > > > At least, I think we need some descriptions on document to > > > > > > > > > explain > > > > > > > > > pg_is_wal_replay_paused could wait while a time. > > > > > > > > > > > > > > > > Ok > > > > > > > > > > > > > > Fixed this, added some comments in .sgml as well as in function > > > > > > > header > > > > > > > > > > > > Thank you for fixing this. > > > > > > > > > > > > Also, is it better to fix the description of pg_wal_replay_pause > > > > > > from > > > > > > "Pauses recovery." to "Request to pause recovery." in according with > > > > > > pg_is_wal_replay_paused? > > > > > > > > > > Okay > > > > > > > > > > > > > > > > > > > > Also, how about adding a new boolean argument to > > > > > > > > > pg_is_wal_replay_paused to > > > > > > > > > control whether this waits for recovery to get paused or not? > > > > > > > > > By setting its > > > > > > > > > default value to true or false, users can use the old format > > > > > > > > > for calling this > > > > > > > > > and the backward compatibility can be maintained. > > > > > > > > > > > > > > > > So basically, if the wait_recovery_pause flag is false then we > > > > > > > > will > > > > > > > > immediately return true if the pause is requested? I agree > > > > > > > > that it is > > > > > > > > good to have an API to know whether the recovery pause is > > > > > > > > requested or > > > > > > > > not but I am not sure is it good idea to make this API serve > > > > > > > > both the > > > > > > > > purpose? Anyone else have any thoughts on this? > > > > > > > > > > > > > > > > > > > > I think the current pg_is_wal_replay_paused() already has another > > > > > > purpose; > > > > > > this waits recovery to actually get paused. If we want to limit > > > > > > this API's > > > > > > purpose only to return the pause state, it seems better to fix this > > > > > > to return > > > > > > the actual state at the cost of lacking the backward compatibility. > > > > > > If we want > > > > > > to know whether pause is requested, we may add a new API like > > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait > > > > > > recovery to actually > > > > > > get paused, we may add an option to pg_wal_replay_pause() for this > > > > > > purpose. > > > > > > > > > > > > However, this might be a bikeshedding. If anyone don't care that > > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I > > > > > > don't care either. 
> > > > > > > > > > I don't think that it will be blocked ever, because > > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the > > > > > recovery process will not be stuck on waiting for the WAL. > > > > > > > > > > > > > > As another comment, while pg_is_wal_replay_paused is > > > > > > > > > blocking, I can not cancel > > > > > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the > > > > > > > > > waiting loop. > > > > > > > > > > > > How about this fix? I think users may want to cancel > > > > > > pg_is_wal_replay_paused() during > > > > > > this is blocking. > > > > > > > > > > Yeah, we can do this. I will send the updated patch after putting > > > > > some more thought into these comments. Thanks again for the feedback. > > > > > > > > > > > > > Please find the updated patch. > > > > > > I've looked at the patch. Here are review comments: > > > > > > + /* Recovery pause state */ > > > + RecoveryPauseState recoveryPause; > > > > > > Now that the value can have tri-state, how about renaming it to > > > recoveryPauseState? > > > > This makes sense to me. > > > > > --- > > > bool > > > RecoveryIsPaused(void) > > > +{ > > > + boolrecoveryPause; > > > + > > > + SpinLockAcquire(&XLogCtl->info_lck); > > > + recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) ? > > > true : false; > > > + SpinLockRelease(&XLogCtl->info_lck); > > > + > > > + return recoveryPause; > > > +} > > >
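For reference, a cleaned-up form of the RecoveryIsPaused() hunk quoted just above; the ternary is redundant because the comparison already yields a bool (this is a sketch of the reviewed code, not a new proposal).

```
bool
RecoveryIsPaused(void)
{
    bool        recoveryPause;

    SpinLockAcquire(&XLogCtl->info_lck);
    recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED);
    SpinLockRelease(&XLogCtl->info_lck);

    return recoveryPause;
}
```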
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 13:15, Masahiro Ikeda wrote: On 2021-01-25 10:36, Masahiko Sawada wrote: On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: On 2021-01-22 14:50, Masahiko Sawada wrote: > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > wrote: >> >> Hi, >> >> I rebased the patch to the master branch. > > Thank you for working on this. I've read the latest patch. Here are > comments: > > --- > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > > * I think it should add the time in micro sec. > After running pgbench with track_wal_io_timing = on for 30 sec, > pg_stat_wal showed the following on my environment: > > postgres(1:61569)=# select * from pg_stat_wal; > -[ RECORD 1 ]+- > wal_records | 285947 > wal_fpi | 53285 > wal_bytes| 442008213 > wal_buffers_full | 0 > wal_write| 25516 > wal_write_time | 0 > wal_sync | 25437 > wal_sync_time| 14490 > stats_reset | 2021-01-22 10:56:13.29464+09 > > Since writes can complete less than a millisecond, wal_write_time > didn't increase. I think sync_time could also have the same problem. Thanks for your comments. I didn't notice that. I changed the unit from milliseconds to microseconds. > --- > + /* > +* Measure i/o timing to fsync WAL data. > +* > +* The wal receiver skip to collect it to avoid performance > degradation of standy servers. > +* If sync_method doesn't have its fsync method, to skip too. > +*/ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > > * Why does only the wal receiver skip it even if track_wal_io_timinig > is true? I think the performance degradation is also true for backend > processes. If there is another reason for that, I think it's better to > mention in both the doc and comment. > * How about checking track_wal_io_timing first? > * s/standy/standby/ I fixed it. As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? Thanks, I forgot to add them. I'll fix it. > --- > + /* increment the i/o timing and the number of times to fsync WAL > data */ > + if (fsyncMethodCalled()) > + { > + if (!AmWalReceiverProcess() && track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > incrementing m_wal_sync after each sync operation? I think if syncing the disk does not occur, m_wal_sync should not be incremented. It depends enableFsync and sync_method. enableFsync is checked in each fsync method like pg_fsync_no_writethrough(), so if incrementing m_wal_sync after each sync operation, it should be implemented in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. So, why don't you change the function to a flag whether to sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync, we can write the code without an additional variable as follows: if (enableFsync) { switch (sync_method) { case SYNC_METHOD_FSYNC: #ifdef HAVE_FSYNC_WRITETHROUGH case SYNC_METHOD_FSYNC_WRITETHROUGH: #endif #ifdef HAVE_FDATASYNC case SYNC_METHOD_FDATASYNC: #endif WalStats.m_wal_sync++; if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); break; default: break; } } (do fsync and error handling here) /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration); } IIUC, I think we can't handle the following case. When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and "track_wal_io_timing" is enabled, "start" is not initialized. Is my understanding wrong? I thought the following is better. ``` /* Measure I/O timing to sync WAL data. */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); (do fsync and error handling here
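As a sketch of how these two points could be reconciled (assuming the patch's WalStats fields and the xlog.c context; this is illustrative, not the committed code): capture "start" only when an explicit fsync-style call will actually happen, and let the same flag guard the timing block, so that SYNC_METHOD_OPEN / SYNC_METHOD_OPEN_DSYNC never read an uninitialized "start".

	instr_time	start;
	bool		sync_called;

	/* Will issue_xlog_fsync() actually call an fsync-style method? */
	sync_called = enableFsync &&
		(sync_method == SYNC_METHOD_FSYNC ||
#ifdef HAVE_FSYNC_WRITETHROUGH
		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
#endif
#ifdef HAVE_FDATASYNC
		 sync_method == SYNC_METHOD_FDATASYNC ||
#endif
		 false);

	if (sync_called && track_wal_io_timing)
		INSTR_TIME_SET_CURRENT(start);

	/* ... do the fsync and error handling here, as in the existing code ... */

	if (sync_called)
	{
		WalStats.m_wal_sync++;
		if (track_wal_io_timing)
		{
			instr_time	duration;

			INSTR_TIME_SET_CURRENT(duration);
			INSTR_TIME_SUBTRACT(duration, start);
			WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
		}
	}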
Re: Is Recovery actually paused?
On Mon, Jan 25, 2021 at 8:42 AM Kyotaro Horiguchi wrote: > > At Sun, 24 Jan 2021 14:26:08 +0530, Dilip Kumar wrote > in > > On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy > > wrote: > > > > > > On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar wrote: > > > > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy > > > > wrote: > > > >> > > > >> On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar > > > >> wrote: > > > >> > Please find the patch for the same. I haven't added a test case for > > > >> > this yet. I mean we can write a test case to pause the recovery and > > > >> > get the status. But I am not sure that we can really write a > > > >> > reliable > > > >> > test case for 'pause requested' and 'paused'. > > > >> > > > >> +1 to just show the recovery pause state in the output of > > > >> pg_is_wal_replay_paused. But, should the function name > > > >> "pg_is_wal_replay_paused" be something like > > > >> "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists > > > >> in a function, I expect a boolean output. Others may have better > > > >> thoughts. > > > >> > > > >> IIUC the above change, ensuring the recovery is paused after it's > > > >> requested lies with the user. IMHO, the main problem we are trying to > > > >> solve is not met > > > > > > > > > > > > Basically earlier their was no way for the user yo know whether the > > > > recovery is actually paused or not because it was always returning true > > > > after pause requested. Now, we will return whether pause requested or > > > > actually paused. So > for tool designer who want to wait for recovery > > > > to get paused can have a loop and wait until the recovery state reaches > > > > to paused. That will give a better control. > > > > > > I get it and I agree to have that change. My point was whether we can > > > have a new function pg_wal_replay_pause_and_wait that waits until > > > recovery is actually paused ((along with pg_is_wal_replay_paused > > > returning the actual state than a true/false) so that tool developers > > > don't need to have the waiting code outside, if at all they care about > > > it? Others may have better thoughts than me. > > > > I think the previous patch was based on that idea where we thought > > that we can pass an argument to pg_is_wal_replay_paused which can > > decide whether to wait or return without the wait. I think this > > version looks better to me where we give the status instead of > > waiting. I am not sure whether we want another version of > > pg_wal_replay_pause which will wait for actually it to get paused. I > > mean there is always a scope to include the functionality in the > > database which can be achieved by the tool, but this patch was trying > > to solve the problem that there was no way to know the status so I > > think returning the correct status should be the scope of this. > > I understand that the requirement here is that no record is replayed > after pg_wal_replay_pause() is returned, or pg_is_wal_replay_paused() > returns true, and delays taken while recovery don't delay the state > change. That requirements are really synchronous. > > On the other hand the machinery is designed to be asynchronous. > > >* Note that we intentionally don't take the info_lck spinlock > >* here. We might therefore read a slightly stale value of > >* the recoveryPause flag, but it can't be very stale (no > >* worse than the last spinlock we did acquire). 
Since a > >* pause request is a pretty asynchronous thing anyway, > >* possibly responding to it one WAL record later than we > >* otherwise would is a minor issue, so it doesn't seem worth > >* adding another spinlock cycle to prevent that. > > As a result, this patch tries to introduce several new checkpoints > at some delaying points so that those waits can notice a pause request in a > timely manner. I think we had better use locking (or atomics) for the > information instead of such scattered checkpoints if we expect that > machinery to work in such a synchronous manner. > > That would make the tri-state state variable and many checkpoints > unnecessary. Maybe. I don't think the intention was to make it synchronous; I think the main intention was that pg_is_wal_replay_paused can return us the correct state; in short, the user can know whether any more WAL will be replayed after pg_is_wal_replay_paused returns true or some other state. I agree that along with that we have also introduced some extra checkpoints where the recovery process waits for WAL or the apply delay, and from pg_wal_replay_pause we send a signal to wake up the recovery process. So I am not sure whether it is worth adding a lock/atomic variable to make this synchronous. Any other thoughts on this? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
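To illustrate the tool-side wait loop described here, a sketch in libpq. It assumes, per this thread's patch, that pg_is_wal_replay_paused() returns a text state such as 'pause requested' or 'paused' rather than a boolean; the function name and labels are the ones under discussion, not a settled API:

#include <string.h>
#include <unistd.h>
#include "libpq-fe.h"

/* Block until recovery is actually paused, not merely pause-requested. */
static void
wait_for_replay_pause(PGconn *conn)
{
	PQclear(PQexec(conn, "SELECT pg_wal_replay_pause()"));

	for (;;)
	{
		PGresult   *res = PQexec(conn, "SELECT pg_is_wal_replay_paused()");
		int			paused = (PQresultStatus(res) == PGRES_TUPLES_OK &&
							  strcmp(PQgetvalue(res, 0, 0), "paused") == 0);

		PQclear(res);
		if (paused)
			break;
		usleep(100 * 1000);		/* poll every 100 ms */
	}
}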
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On Mon, Jan 25, 2021 at 1:28 PM Masahiro Ikeda wrote: > > On 2021-01-25 13:15, Masahiro Ikeda wrote: > > On 2021-01-25 10:36, Masahiko Sawada wrote: > >> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda > >> wrote: > >>> > >>> On 2021-01-22 14:50, Masahiko Sawada wrote: > >>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > >>> > wrote: > >>> >> > >>> >> Hi, > >>> >> > >>> >> I rebased the patch to the master branch. > >>> > > >>> > Thank you for working on this. I've read the latest patch. Here are > >>> > comments: > >>> > > >>> > --- > >>> > + if (track_wal_io_timing) > >>> > + { > >>> > + INSTR_TIME_SET_CURRENT(duration); > >>> > + INSTR_TIME_SUBTRACT(duration, start); > >>> > + WalStats.m_wal_write_time += > >>> > INSTR_TIME_GET_MILLISEC(duration); > >>> > + } > >>> > > >>> > * I think it should add the time in micro sec. > >>> > After running pgbench with track_wal_io_timing = on for 30 sec, > >>> > pg_stat_wal showed the following on my environment: > >>> > > >>> > postgres(1:61569)=# select * from pg_stat_wal; > >>> > -[ RECORD 1 ]+- > >>> > wal_records | 285947 > >>> > wal_fpi | 53285 > >>> > wal_bytes| 442008213 > >>> > wal_buffers_full | 0 > >>> > wal_write| 25516 > >>> > wal_write_time | 0 > >>> > wal_sync | 25437 > >>> > wal_sync_time| 14490 > >>> > stats_reset | 2021-01-22 10:56:13.29464+09 > >>> > > >>> > Since writes can complete less than a millisecond, wal_write_time > >>> > didn't increase. I think sync_time could also have the same problem. > >>> > >>> Thanks for your comments. I didn't notice that. > >>> I changed the unit from milliseconds to microseconds. > >>> > >>> > --- > >>> > + /* > >>> > +* Measure i/o timing to fsync WAL data. > >>> > +* > >>> > +* The wal receiver skip to collect it to avoid performance > >>> > degradation of standy servers. > >>> > +* If sync_method doesn't have its fsync method, to skip too. > >>> > +*/ > >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing && > >>> > fsyncMethodCalled()) > >>> > + INSTR_TIME_SET_CURRENT(start); > >>> > > >>> > * Why does only the wal receiver skip it even if track_wal_io_timinig > >>> > is true? I think the performance degradation is also true for backend > >>> > processes. If there is another reason for that, I think it's better to > >>> > mention in both the doc and comment. > >>> > * How about checking track_wal_io_timing first? > >>> > * s/standy/standby/ > >>> > >>> I fixed it. > >>> As kuroda-san mentioned too, the skip is no need to be considered. > >> > >> I think you also removed the code to have the wal receiver report the > >> stats. So with the latest patch, the wal receiver tracks those > >> statistics but doesn't report. > >> And maybe XLogWalRcvWrite() also needs to track I/O? > > > > Thanks, I forgot to add them. > > I'll fix it. > > > > > >>> > >>> > --- > >>> > + /* increment the i/o timing and the number of times to fsync WAL > >>> > data */ > >>> > + if (fsyncMethodCalled()) > >>> > + { > >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing) > >>> > + { > >>> > + INSTR_TIME_SET_CURRENT(duration); > >>> > + INSTR_TIME_SUBTRACT(duration, start); > >>> > + WalStats.m_wal_sync_time += > >>> > INSTR_TIME_GET_MILLISEC(duration); > >>> > + } > >>> > + > >>> > + WalStats.m_wal_sync++; > >>> > + } > >>> > > >>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about > >>> > incrementing m_wal_sync after each sync operation? > >>> > >>> I think if syncing the disk does not occur, m_wal_sync should not be > >>> incremented. 
> >>> It depends enableFsync and sync_method. > >>> > >>> enableFsync is checked in each fsync method like > >>> pg_fsync_no_writethrough(), > >>> so if incrementing m_wal_sync after each sync operation, it should be > >>> implemented > >>> in each fsync method. It leads to many duplicated codes. > >> > >> Right. I missed that each fsync function checks enableFsync. > >> > >>> So, why don't you change the function to a flag whether to > >>> sync data to the disk will be occurred or not in issue_xlog_fsync()? > >> > >> Looks better. Since we don't necessarily need to increment m_wal_sync > >> after doing fsync we can write the code without an additional variable > >> as follows: > >> > >> if (enableFsync) > >> { > >> switch (sync_method) > >> { > >> case SYNC_METHOD_FSYNC: > >> #ifdef HAVE_FSYNC_WRITETHROUGH > >> case SYNC_METHOD_FSYNC_WRITETHROUGH: > >> #endif > >> #ifdef HAVE_FDATASYNC > >> case SYNC_METHOD_FDATASYNC: > >> #endif > >> WalStats.m_wal_sync++; > >> if (track_wal_io_timing) > >> INSTR_TIME_SET_CURRENT(start); > >> break; > >> default: > >>
Re: Is it useful to record whether plans are generic or custom?
At Tue, 12 Jan 2021 20:36:58 +0900, torikoshia wrote in > I suppose it would be normal practice to store past results of > pg_stat_statements for future comparisons. > If this is the case, I think that if we only add the number of > generic plan execution, it will give us a hint to notice the cause > of performance degradation due to changes in the plan between > generic and custom. Agreed. > Attached a patch that just adds a generic call counter to > pg_stat_statements. > > Any thoughts? Note that ActivePortal is the closest nested portal. So it gives the wrong result for nested portals. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Is Recovery actually paused?
At Mon, 25 Jan 2021 10:05:19 +0530, Dilip Kumar wrote in > On Mon, Jan 25, 2021 at 8:42 AM Kyotaro Horiguchi > wrote: > > > > At Sun, 24 Jan 2021 14:26:08 +0530, Dilip Kumar > > wrote in > > > On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy > > > wrote: > > > > > > > > On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar > > > > wrote: > > > > > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy > > > > > wrote: > > > > >> > > > > >> On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar > > > > >> wrote: > > > > >> > Please find the patch for the same. I haven't added a test case > > > > >> > for > > > > >> > this yet. I mean we can write a test case to pause the recovery > > > > >> > and > > > > >> > get the status. But I am not sure that we can really write a > > > > >> > reliable > > > > >> > test case for 'pause requested' and 'paused'. > > > > >> > > > > >> +1 to just show the recovery pause state in the output of > > > > >> pg_is_wal_replay_paused. But, should the function name > > > > >> "pg_is_wal_replay_paused" be something like > > > > >> "pg_get_wal_replay_pause_state" or some other? To me, when "is" > > > > >> exists > > > > >> in a function, I expect a boolean output. Others may have better > > > > >> thoughts. > > > > >> > > > > >> IIUC the above change, ensuring the recovery is paused after it's > > > > >> requested lies with the user. IMHO, the main problem we are trying to > > > > >> solve is not met > > > > > > > > > > > > > > > Basically earlier their was no way for the user yo know whether the > > > > > recovery is actually paused or not because it was always returning > > > > > true after pause requested. Now, we will return whether pause > > > > > requested or actually paused. So > for tool designer who want to > > > > > wait for recovery to get paused can have a loop and wait until the > > > > > recovery state reaches to paused. That will give a better control. > > > > > > > > I get it and I agree to have that change. My point was whether we can > > > > have a new function pg_wal_replay_pause_and_wait that waits until > > > > recovery is actually paused ((along with pg_is_wal_replay_paused > > > > returning the actual state than a true/false) so that tool developers > > > > don't need to have the waiting code outside, if at all they care about > > > > it? Others may have better thoughts than me. > > > > > > I think the previous patch was based on that idea where we thought > > > that we can pass an argument to pg_is_wal_replay_paused which can > > > decide whether to wait or return without the wait. I think this > > > version looks better to me where we give the status instead of > > > waiting. I am not sure whether we want another version of > > > pg_wal_replay_pause which will wait for actually it to get paused. I > > > mean there is always a scope to include the functionality in the > > > database which can be achieved by the tool, but this patch was trying > > > to solve the problem that there was no way to know the status so I > > > think returning the correct status should be the scope of this. > > > > I understand that the requirement here is that no record is replayed > > after pg_wal_replay_pause() is returned, or pg_is_wal_replay_paused() > > returns true, and delays taken while recovery don't delay the state > > change. That requirements are really synchronous. > > > > On the other hand the machinery is designed to be asynchronous. > > > > >* Note that we intentionally don't take the info_lck spinlock > > >* here. 
We might therefore read a slightly stale value of > > >* the recoveryPause flag, but it can't be very stale (no > > >* worse than the last spinlock we did acquire). Since a > > >* pause request is a pretty asynchronous thing anyway, > > >* possibly responding to it one WAL record later than we > > >* otherwise would is a minor issue, so it doesn't seem worth > > >* adding another spinlock cycle to prevent that. > > > > As a result, this patch tries to introduce several new checkpoints > > at some delaying points so that those waits can notice a pause request in a > > timely manner. I think we had better use locking (or atomics) for the > > information instead of such scattered checkpoints if we expect that > > machinery to work in such a synchronous manner. > > > > That would make the tri-state state variable and many checkpoints > > unnecessary. Maybe. > > I don't think the intention was to make it synchronous; I think > the main intention was that pg_is_wal_replay_paused can return us the > correct state; in short, the user can know whether any more WAL will > be replayed after pg_is_wal_replay_paused returns true or some other > state. I agree that along with that we have also introduced some I meant that kind of correctness in a time-series by using the word "synchronous". So it can be implemented both by adopting many checkpoints and by just making the state change synchronous.
Re: ModifyTable overheads in generic plans
On Tue, Dec 22, 2020 at 5:16 PM Amit Langote wrote: > On Mon, Dec 7, 2020 at 3:53 PM Amit Langote wrote: > > On Thu, Nov 12, 2020 at 5:04 PM Amit Langote > > wrote: > > > Attached new 0002 which does these adjustments. I went with > > > ri_RootTargetDesc to go along with ri_RelationDesc. > > > > > > Also, I have updated the original 0002 (now 0003) to make > > > GetChildToRootMap() use ri_RootTargetDesc instead of > > > ModifyTableState.rootResultRelInfo.ri_RelationDesc, so that even > > > AfterTriggerSaveEvent() can now use that function. This allows us to > > > avoid having to initialize ri_ChildToRootMap anywhere but inside > > > GetChildRootMap(), with that long comment defending doing so. :-) > > > > These needed to be rebased due to recent copy.c upheavals. Attached. > > Needed to be rebased again. And again, this time over the recent batch insert API related patches. -- Amit Langote EDB: http://www.enterprisedb.com v13-0001-Set-ForeignScanState.resultRelInfo-lazily.patch Description: Binary data v13-0002-Rethink-ResultRelInfo.ri_PartitionRoot.patch Description: Binary data v13-0003-Initialize-result-relation-information-lazily.patch Description: Binary data
RE: Parallel INSERT (INTO ... SELECT ...)
From: Greg Nancarrow > > (1) > Yes, you're right the wording is not right (and I don't really like > the wording used before the patch). > > Perhaps it could say: > > (Note that we do allow CREATE TABLE AS, INSERT INTO...SELECT, SELECT > INTO, and CREATE MATERIALIZED VIEW to use parallel plans. However, as > of now, other than in the case of INSERT INTO...SELECT, only the leader > backend > writes into a completely new table. In the future, we can extend it to > allow workers for the > other commands to write into the table. However, to allow parallel > updates and deletes, we > have to solve other problems, especially around combo CIDs.) That looks good to me, thanks. > > (4) > You could, but I'm not sure it would make the code easier to read, > especially for those who don't know !isParallelWorker() means it's a > parallel leader. ... > I really can't see a problem. PrepareParallelMode() is only needed > prior to execution of a parallel plan, so it's not needed for "other > call sites" using EnterParallelMode(). My frank first impressions were (and are): * Why do we have to call a separate function for preparation when the actual entering follows immediately? We can do the necessary preparation in the entering function. * Those who read the parallel index build and parallel VACUUM code for the first time might be startled at the missing PrepareParallelMode() call: "Oh, EnterParallelMode() is called without preparation, unlike the other site I saw the other day. Isn't this a bug?" > Perhaps the name can be changed to disassociate it from generic > EnterParallelMode() usage. So far, I've only thought of long names > like: PrepareParallelModePlanExec(). > Ideas? What PrepareParallelMode() handles is the XID and command ID, which are managed by the access/transam/ module and are not executor-specific. It's natural (or at least not unnatural) that EnterParallelMode() prepares them, because EnterParallelMode() is part of access/transam/. Regards Takayuki Tsunakawa
Re: Single transaction in the tablesync worker?
On Mon, Jan 25, 2021 at 8:03 AM Peter Smith wrote: > > Hi Amit. > > PSA the v19 patch for the Tablesync Solution1. > I see one race condition in this patch where we try to drop the origin via the apply process and DropSubscription. I think it can lead to the error "cache lookup failed for replication origin with oid %u". The same problem can happen via the exposed API pg_replication_origin_drop, but probably because it is not used concurrently, nobody has faced this issue. I think for the matter of this patch we can try to suppress such an error either via try..catch, or by adding a missing_ok argument to the replorigin_drop API, or we can just add to the comments that such a race exists. Additionally, we should try to start a new thread about the existence of this problem in pg_replication_origin_drop. What do you think? -- With Regards, Amit Kapila.
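As a sketch of the missing_ok idea (the third argument is hypothetical; the in-tree replorigin_drop() takes only roident and nowait), the lookup at the top of the function would become something like:

	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
	if (!HeapTupleIsValid(tuple))
	{
		if (missing_ok)
			return;				/* origin concurrently dropped: not an error */
		elog(ERROR, "cache lookup failed for replication origin with oid %u",
			 roident);
	}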
Re: Release SPI plans for referential integrity with DISCARD ALL
Hello, Thank you for your comments. Following Corey's advice, I applied Amit's patches proposed in this email [1], and confirmed our memory pressure problem was solved. So dropping cached plans with DISCARD is not necessary anymore. [1] https://www.postgresql.org/message-id/CA%2BHiwqG1qQuBwApueaUfA855UJ4TiSgFkPF34hQDWx3tOChV5w%40mail.gmail.com -- Best regards, Yuzuko Hosoya NTT Open Source Software Center
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 25, 2021 at 2:22 PM Hou, Zhijie wrote: > Hi, > > After doing some test to cover the code path in the PATCH 0001. > I have some suggestions for the 0002 testcase. > > > (1) > + /* Check parallel-safety of any expressions in the > partition key */ > + if (get_partition_col_attnum(pkey, i) == 0) > + { > + Node *check_expr = (Node *) > lfirst(partexprs_item); > + > + if (max_parallel_hazard_walker(check_expr, > context)) > + { > + table_close(rel, lockmode); > + return true; > + } > > The testcase seems does not cover the above code(test when the table have > parallel unsafe expression in the partition key). > > Personally, I use the following sql to cover this: > - > create table partkey_unsafe_key_expr_t (a int4, b name) partition by range > ((fullname_parallel_unsafe('',a::varchar))); > explain (costs off) insert into partkey_unsafe_key_expr_t select unique1, > stringu1 from tenk1; > - > > Thanks. It looks like that test case was accidentally missed (since the comment said to test the index expressions, but it actually tested the support functions). I'll update the test code (and comments) accordingly, using your suggestion. > > (2) > I noticed that most of testcase test both (parallel > safe/unsafe/restricted). > But the index expression seems does not test the parallel restricted. > How about add a testcase like: > - > create or replace function fullname_parallel_restricted(f text, l text) > returns text as $$ > begin > return f || l; > end; > $$ language plpgsql immutable parallel restricted; > > create table names4(index int, first_name text, last_name text); > create index names4_fullname_idx on names4 > (fullname_parallel_restricted(first_name, last_name)); > > -- > -- Test INSERT with parallel-restricted index expression > -- (should create a parallel plan) > -- > explain (costs off) insert into names4 select * from names; > - > > Thanks, looks like that test case is missing, I'll add it as you suggest. > (3) > + /* Recursively check each partition ... */ > + pdesc = RelationGetPartitionDesc(rel); > + for (i = 0; i < pdesc->nparts; i++) > + { > + if > (rel_max_parallel_hazard_for_modify(pdesc->oids[i], > + > command_type, > + > context, > + > AccessShareLock)) > + { > + table_close(rel, lockmode); > + return true; > + } > + } > > It seems we do not have a testcase to test (some parallel unsafe > expression or.. in partition) > How about add one testcase to test parallel unsafe partition ? > > > OK, I have to create a more complex table to test those other potential parallel-safety issues of partitions (other than what was tested before the recursive call, or support functions and expressions in index keys), but since it's a recursive call invoking code that's already been tested, I would not anticipate any problems. Thanks, Greg Nancarrow Fujitsu Australia
The mysterious pg_proc.protrftypes
Hi, I'm trying to understand how pg_proc.protrftypes works. The documentation says "Data type OIDs for which to apply transforms.". For this column, there is no reference to any catalog table? I would guess it should be "(references pg_type.oid)", right? I tried to generate a value for this column to verify my hypothesis, but I struggle to find an example that produces a not null value here. I grepped the sources and found the "CREATE TRANSFORM FOR type_name" command, and found an extension using it named "bool_plperl" which I installed. I assumed this would cause a value, but no. Both of bool_plperl's two functions get null pg_proc.protrftypes values. I've tried running the full regression "make installcheck", but protrftypes doesn't seem to be covered: $ cd postgresql $ make installcheck ... === All 203 tests passed. === $ psql regression regression=# SELECT COUNT(*) FROM pg_proc WHERE protrftypes IS NOT NULL; count --- 0 (1 row) Can someone please show me how to generate a function with a not null pg_proc.protrftypes value? Many thanks. /Joel
Re: Heap's backwards scan scans the incorrect pages with heap_setscanlimits()
On Thu, 21 Jan 2021 at 13:16, David Rowley wrote: > Proposed patch attached. I ended up pushing a slightly revised version of this which kept the code the same as before when rs_numblocks had not been changed. I backpatched to 9.5 as it seemed low risk, and worth doing to prevent some head-scratching and a future bug report from any extension authors that make use of heap_setscanlimits() with backwards scans at some point in the future. David
RE: libpq debug log
First, some possibly major questions: (23) From: 'Alvaro Herrera' > Maybe we can create a new file specifically for this to avoid mixing > unrelated stuff in fe-misc.c -- say fe-trace.c where all this new > tracing stuff goes. What do you think about this suggestion? I think this is reasonable, including moving PQtrace/PQuntrace to the new file. OTOH, fe-misc.c also has modest reasons to keep containing them. One is that the file header comment is as follows, accepting miscellaneous stuff. Another is that although most functions in fe-misc.c are related to protocol data transmission and receipt, several functions at the end of the file are already not related. * fe-misc.c * * DESCRIPTION * miscellaneous useful functions (24) -void PQtrace(PGconn *conn, FILE *stream); +void PQtrace(PGconn *conn, FILE *stream, bits32 flags); + + flags contains flag bits describing the operating mode + of tracing. If (flags & TRACE_SUPPRESS_TIMESTAMPS) is + true, then timestamps are not printed with each message. + As I asked in the previous mail, I'm afraid we cannot change the signature of existing API functions. If we want these flag bits, we have to add something like PQtraceEx(), don't we? The flag name differs between the manual and the source code: +#define PQTRACE_SUPPRESS_TIMESTAMPS (1 << 0) P.S. Also, please note this: > Also, why don't you try running the regression tests with a temporary > modification to PQtrace() to output the trace to a file? The sole purpose is > to confirm that this patch doesn't make the test crash (core dump). Regards Takayuki Tsunakawa
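To make the compatibility point concrete, a sketch of what such an addition could look like. PQtraceEx is this thread's idea, not an existing libpq function; the flag name follows the patch's source code, and bits32 follows the quoted doc change:

/* existing API, unchanged */
extern void PQtrace(PGconn *conn, FILE *debug_port);
extern void PQuntrace(PGconn *conn);

/* hypothetical extended variant carrying the flag bits */
#define PQTRACE_SUPPRESS_TIMESTAMPS	(1 << 0)
extern void PQtraceEx(PGconn *conn, FILE *stream, bits32 flags);

/* usage: trace without timestamps, e.g. for stable regression-test output */
PQtraceEx(conn, trace_fp, PQTRACE_SUPPRESS_TIMESTAMPS);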
Re: The mysterious pg_proc.protrftypes
On Mon, Jan 25, 2021 at 8:05 AM, Joel Jacobson wrote: > Hi, > > I'm trying to understand how pg_proc.protrftypes works. > > The documentation says "Data type OIDs for which to apply transforms.". > For this column, there is no reference to any catalog table? > I would guess it should be "(references pg_type.oid)", right? > > I tried to generate a value for this column to verify my hypothesis, > but I struggle to find an example that produces a not null value here. > > I grepped the sources and found the "CREATE TRANSFORM FOR type_name" > command, > and found an extension using it named "bool_plperl" which I installed. > > I assumed this would cause a value, but no. > > Both of bool_plperl's two functions get null pg_proc.protrftypes values. > > I've tried running the full regression "make installcheck", > but protrftypes doesn't seem to be covered: > > $ cd postgresql > $ make installcheck > ... > === > All 203 tests passed. > === > $ psql regression > regression=# SELECT COUNT(*) FROM pg_proc WHERE protrftypes IS NOT NULL; > count > --- > 0 > (1 row) > > Can someone please show me how to generate a function with a not null > pg_proc.protrftypes value? > you should use the TRANSFORM clause in the CREATE FUNCTION statement https://www.postgresql.org/docs/current/sql-createfunction.html CREATE EXTENSION hstore_plperl CASCADE; CREATE FUNCTION test2() RETURNS hstore LANGUAGE plperl TRANSFORM FOR TYPE hstore AS $$ $val = {a => 1, b => 'boo', c => undef}; return $val; $$; Regards Pavel > Many thanks. > > /Joel > > > > > >
Re: Identify missing publications from publisher while create/alter subscription.
On Thu, Jan 21, 2021 at 10:21 PM Bharath Rupireddy wrote: > > On Thu, Jan 21, 2021 at 6:56 PM vignesh C wrote: > > > > Hi, > > > > Creating/altering subscription is successful when we specify a > > publication which does not exist in the publisher. I felt we should > > throw an error in this case, that will help the user to check if there > > is any typo in the create subscription command or to create the > > publication before the subscription is created. > > If the above analysis looks correct, then please find a patch that > > checks if the specified publications are present in the publisher and > > throws an error if any of the publications is missing in the > > publisher. > > Thoughts? > > I was having similar thoughts (while working on the logical > replication bug on alter publication...drop table behaviour) on why > create subscription succeeds without checking the publication > existence. I checked in documentation, to find if there's a strong > reason for that, but I couldn't. Maybe it's because of the principle > "first let users create subscriptions, later the publications can be > created on the publisher system", similar to this behaviour > "publications can be created without any tables attached to it > initially, later they can be added". > > Others may have better thoughts. > > If we do check publication existence for CREATE SUBSCRIPTION, I think > we should also do it for ALTER SUBSCRIPTION ... SET PUBLICATION. > > I wonder, why isn't dropping a publication from a list of publications > that are with subscription is not allowed? > > Some comments so far on the patch: > > 1) I see most of the code in the new function check_publications() and > existing fetch_table_list() is the same. Can we have a generic > function, with maybe a flag to separate out the logic specific for > checking publication and fetching table list from the publisher. I have moved the common code between check_publications and fetch_table_list into a common function, get_appended_publications_query. I felt the rest of the code is better off kept as it is. The attached patch has the changes for the same, and also the change to check that the publication exists during ALTER SUBSCRIPTION ... SET PUBLICATION. Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com From 6ef5a35d82065ec7e497ebe38e575afa45ad9469 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Thu, 21 Jan 2021 18:38:54 +0530 Subject: [PATCH v2] Identify missing publications from publisher while create/alter subscription. Creating/altering subscription is successful when we specify a publication which does not exist in the publisher. This patch checks if the specified publications are present in the publisher and throws an error if any of the publications is missing in the publisher. One of the Alter Subscription tests from regression was moved to a TAP test as the error message "/tmp/pg_regress-P6KQrj/.s.PGSQL.58080" varies from execution to execution. 
--- src/backend/commands/subscriptioncmds.c| 158 - src/test/regress/expected/subscription.out | 33 +++--- src/test/regress/sql/subscription.sql | 1 - src/test/subscription/t/100_bugs.pl| 78 +- 4 files changed, 224 insertions(+), 46 deletions(-) diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 082f785..59076b5 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -46,6 +46,7 @@ #include "utils/syscache.h" static List *fetch_table_list(WalReceiverConn *wrconn, List *publications); +static void check_publications(WalReceiverConn *wrconn, List *publications); /* * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands. @@ -490,6 +491,9 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel) PG_TRY(); { + /* Verify specified publications exists in the publisher. */ + check_publications(wrconn, publications); + /* * Set sync state based on if we were asked to do data copy or * not. @@ -557,7 +561,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel) } static void -AlterSubscription_refresh(Subscription *sub, bool copy_data) +AlterSubscription_refresh(Subscription *sub, bool copy_data, bool check_pub) { char *err; List *pubrel_names; @@ -576,6 +580,10 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data) ereport(ERROR, (errmsg("could not connect to the publisher: %s", err))); + /* Verify specified publications exists in the publisher. */ + if (check_pub) + check_publications(wrconn, sub->publications); + /* Get the table list from publisher. */ pubrel_names = fetch_table_list(wrconn, sub->publications); @@ -822,6 +830,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt) { bool copy_data; bool refresh; +char *err; +WalReceiverConn *wrconn; parse_subscription_options(stmt->options,
Re: The mysterious pg_proc.protrftypes
On Mon, Jan 25, 2021, at 08:14, Pavel Stehule wrote: >you should use the TRANSFORM clause in the CREATE FUNCTION statement Thanks, it worked, and as expected it references the pg_type.oid of the transformed type. The attached patch adds "(references pg_type.oid)" to the documentation for pg_proc.protrftypes. Suggested commit message: "Document the fact that pg_proc.protrftypes references pg_type.oid" /Joel pg-proc-protrftypes-references-pg-type-oid.patch Description: Binary data
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/23 13:40, Bharath Rupireddy wrote: On Fri, Jan 22, 2021 at 6:43 PM Fujii Masao wrote: Please review the v16 patch set further. Thanks! Will review that later. + /* +* For the given server, if we closed connection or it is still in +* use, then no need of scanning the cache further. We do this +* because the cache can not have multiple cache entries for a +* single foreign server. +*/ On second thought, ISTM that a single foreign server can have multiple cache entries. For example, CREATE ROLE foo1 SUPERUSER; CREATE ROLE foo2 SUPERUSER; CREATE EXTENSION postgres_fdw; CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw OPTIONS (port '5432'); CREATE USER MAPPING FOR foo1 SERVER loopback OPTIONS (user 'postgres'); CREATE USER MAPPING FOR foo2 SERVER loopback OPTIONS (user 'postgres'); CREATE TABLE t (i int); CREATE FOREIGN TABLE ft (i int) SERVER loopback OPTIONS (table_name 't'); SET SESSION AUTHORIZATION foo1; SELECT * FROM ft; SET SESSION AUTHORIZATION foo2; SELECT * FROM ft; Then you can see there are multiple open connections for the same server as follows. So we need to scan all the entries even when the serverid is specified. SELECT * FROM postgres_fdw_get_connections(); server_name | valid -------------+------- loopback | t loopback | t (2 rows) This is a great finding. Thanks a lot. I will remove hash_seq_term(&scan); in disconnect_cached_connections and add this as a test case for the postgres_fdw_get_connections function, just to show there can be multiple connections with a single server name. This means that a user (even a non-superuser) can disconnect a connection established by another user (a superuser) by using postgres_fdw_disconnect_all(). Is this really OK? Yeah, connections can be discarded by non-super users using postgres_fdw_disconnect_all and postgres_fdw_disconnect. Given the fact that a non-super user requires a password to access foreign tables [1], IMO a non-super user changing something related to a super user makes no sense at all. If that's okay, we can have a check in disconnect_cached_connections, something like below: Also, like pg_terminate_backend(), should we disallow a non-superuser from disconnecting connections established by another non-superuser if the requesting user is not a member of the other role? Or is that overkill, because the target to discard is just a connection and it can be established again if necessary? For now I'm thinking that it might be better to add the restriction like pg_terminate_backend() at first and relax that later if possible. But I'd like to hear more opinions about this. +static bool +disconnect_cached_connections(Oid serverid) +{ + HASH_SEQ_STATUS scan; + ConnCacheEntry *entry; + bool all = !OidIsValid(serverid); + bool result = false; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be superuser to discard open connections"))); + + if (!ConnectionHash) Having said that, it looks like dblink_disconnect doesn't perform superuser checks. Also, a non-superuser (set by SET ROLE or SET SESSION AUTHORIZATION) seems to be able to run SQL using a dblink connection established by a superuser. If we don't think that is a problem, we might not need to care about this issue even for postgres_fdw. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
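For reference, a pg_terminate_backend-style membership check could look roughly like the sketch below. "entry_owner" is a stand-in for however the owning role would be derived from the cached user mapping; that lookup is not shown here, and this is only an illustration of the restriction being discussed:

	if (!superuser() &&
		!has_privs_of_role(GetUserId(), entry_owner))
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("must be a member of the role whose connection is being discarded")));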
Re: About to add WAL write/fsync statistics to pg_stat_wal view
Hi, thanks for the reviews. I updated the attached patch. The summary of the changes is following. 1. fix document I followed another view's comments. 2. refactor issue_xlog_fsync() I removed "sync_called" variables, narrowed the "duration" scope and change the switch statement to if statement. 3. make wal-receiver report WAL statistics I add the code to collect the statistics for a written operation in XLogWalRcvWrite() and to report stats in WalReceiverMain(). Since WalReceiverMain() can loop fast, to avoid loading stats collector, I add "force" argument to the pgstat_send_wal function. If "force" is false, it can skip reporting until at least 500 msec since it last reported. WalReceiverMain() almost calls pgstat_send_wal() with "force" as false. Regards, -- Masahiro Ikeda NTT DATA CORPORATION--- v5-0001-Add-statistics-related-to-write-sync-wal-records.patch 2021-01-23 09:26:01.919248712 +0900 +++ v6-0001-Add-statistics-related-to-write-sync-wal-records.patch 2021-01-25 16:27:50.749429666 +0900 @@ -1,6 +1,6 @@ -From ee1b7d17391b9d9619f709afeacdd118973471d6 Mon Sep 17 00:00:00 2001 +From e9aad92097c5cff5565b67ce1a8ec6d7b4c8a4d9 Mon Sep 17 00:00:00 2001 From: Masahiro Ikeda -Date: Fri, 22 Jan 2021 21:38:31 +0900 +Date: Mon, 25 Jan 2021 16:26:04 +0900 Subject: [PATCH] Add statistics related to write/sync wal records. This patch adds following statistics to pg_stat_wal view @@ -24,20 +24,22 @@ (This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID) --- - doc/src/sgml/config.sgml | 21 +++ - doc/src/sgml/monitoring.sgml | 48 ++- - src/backend/access/transam/xlog.c | 59 ++- + doc/src/sgml/config.sgml | 21 + doc/src/sgml/monitoring.sgml | 48 - + src/backend/access/transam/xlog.c | 51 ++- src/backend/catalog/system_views.sql | 4 ++ - src/backend/postmaster/pgstat.c | 4 ++ - src/backend/postmaster/walwriter.c| 3 + - src/backend/utils/adt/pgstatfuncs.c | 24 +++- - src/backend/utils/misc/guc.c | 9 +++ + src/backend/postmaster/checkpointer.c | 2 +- + src/backend/postmaster/pgstat.c | 26 -- + src/backend/postmaster/walwriter.c| 3 ++ + src/backend/replication/walreceiver.c | 30 +++ + src/backend/utils/adt/pgstatfuncs.c | 24 +++-- + src/backend/utils/misc/guc.c | 9 src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 14 ++--- - src/include/pgstat.h | 8 +++ - src/test/regress/expected/rules.out | 6 +- - 13 files changed, 189 insertions(+), 13 deletions(-) + src/include/pgstat.h | 10 +++- + src/test/regress/expected/rules.out | 6 ++- + 15 files changed, 232 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 82864bbb24..43f3fbcaf8 100644 @@ -72,7 +74,7 @@ track_functions (enum) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml -index f05140dd42..36764dc64f 100644 +index f05140dd42..5a8fc4eb0c 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i @@ -80,7 +82,7 @@ - Number of times WAL data was written to disk because WAL buffers became full -+ Total number of times WAL data was written to disk because WAL buffers became full ++ Total number of WAL data written to disk because WAL buffers became full + + + @@ -89,7 +91,7 @@ + wal_write bigint + + -+ Total number of times WAL data was written to disk ++ Total number of WAL data written to disk + + + @@ -109,7 +111,7 @@ + wal_sync bigint + + -+ Total number of times WAL 
data was synced to disk ++ Total number of WAL data synced to disk + (if is open_datasync or + open_sync, this value is zero because WAL data is synced + when to write it). @@ -131,7 +133,7 @@ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c -index 470e113b33..b3890f97a2 100644 +index 470e113b33..1c4860bee7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -110,6 +110,7 @@ int CommitDelay = 0; /* precommit delay in microseconds */ @@ -142,16 +144,15 @@ #ifdef WAL_DEBUG bool XLOG_DEBUG = false; -@@ -2540,6 +2541,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) +@@ -2540,6 +2541,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) Size nbytes; Size nleft; int
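For context, the reporting throttle described in the mail can be sketched as follows. The 500 msec interval and the "force" flag follow the description above; the interval constant's name is invented for this illustration, and this is not the patch's exact code:

#define PGSTAT_WAL_REPORT_INTERVAL_MS	500

void
pgstat_send_wal(bool force)
{
	static TimestampTz last_report = 0;
	TimestampTz	now = GetCurrentTimestamp();

	/* Unless forced, skip if we reported within the last 500 msec. */
	if (!force &&
		!TimestampDifferenceExceeds(last_report, now,
									PGSTAT_WAL_REPORT_INTERVAL_MS))
		return;

	last_report = now;

	/* ... fill in and send the PgStat_MsgWal message as before ... */
}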
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 25, 2021 at 4:37 PM tsunakawa.ta...@fujitsu.com wrote: > > > > > (4) > > You could, but I'm not sure it would make the code easier to read, > > especially for those who don't know !isParallelWorker() means it's a > > parallel leader. > ... > > I really can't see a problem. PrepareParallelMode() is only needed > > prior to execution of a parallel plan, so it's not needed for "other > > call sites" using EnterParallelMode(). > My frank first impressions were (and are): > > * Why do we have to call a separate function for preparation despite the > actual entering follows immediately? We can do necessary preparation in the > entering function. > > * Those who read the parallel index build and parallel VACUUM code for the > first time might be startled at the missing PrepareParallelMode() call: "Oh, > EnterParallelMode() is called without preparation unlike the other site I saw > the other day. Isn't this a but?" > > > > Perhaps the name can be changed to disassociate it from generic > > EnterParallelMode() usage. So far, I've only thought of long names > > like: PrepareParallelModePlanExec(). > > Ideas? > > What PrepareParallelMode() handles is the XID and command ID, which are > managed by access/transam/ module and are not executor-specific. It's > natural (or at least not unnatural) that EnterParallelMode() prepares them, > because EnterParallelMode() is part of access/transam/. > > EnterParallelMode() is part of a generic interface for execution of a parallel operation, and EnterParallelMode() is called in several different places to enter parallel mode prior to execution of different parallel operations. At the moment it is assumed that EnterParallelMode() just essentially sets a flag to prohibit certain unsafe operations when doing the parallel operation. If I move PrepareParallelMode() into EnterParallelMode() then I need to pass in contextual information to distinguish who the caller is, and possibly extra information needed by that caller - and change the function call for each caller, and probably update the comments for each, and in other places, etc. etc. I think that it just complicates things doing this. The other callers of EnterParallelMode() are obviously currently doing their own "pre" parallel-mode code themselves, specific to whatever parallel operation they are doing - but nobody has thought it necessary to have to hook this code into EnterParallelMode(). I think the "PrepareParallelMode()" name can just be changed to something specific to plan execution, so nobody gets confused with a name like "PrepareParallelMode()", which as you point out sounds generic to all callers of EnterParallelMode(). Regards, Greg Nancarrow Fujitsu Australia
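Concretely, the flow being debated would look something like the sketch below. PrepareParallelModePlanExec is only a candidate name from this thread, not an existing function:

	/* leader only: plan-execution-specific setup (XID, command ID) */
	if (!IsParallelWorker())
		PrepareParallelModePlanExec();

	/* generic: just flips the parallel-mode flag, as today */
	EnterParallelMode();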
[PATCH] Document the fact that pg_proc.protrftypes references pg_type.oid
Hi, This patch fixes $SUBJECT. /Joel pg-proc-protrftypes-references-pg-type-oid.patch Description: Binary data