Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Feb 1, 2021 at 8:19 PM Hou, Zhijie wrote: > > Hi, > > When developing the reloption patch, I noticed some issues in the patch. > > 1). > > - Reduce Insert parallel-safety checks required for some SQL, by noting > > that the subquery must operate on a relation (check for RTE_RELATION in > > subquery range-table) > > + foreach(lcSub, rte->subquery->rtable) > + { > + rteSub = lfirst_node(RangeTblEntry, lcSub); > + if (rteSub->rtekind == RTE_RELATION) > + { > + hasSubQueryOnRelation = true; > + break; > + } > + } > It seems we can not only search RTE_RELATION in rtable, > because RTE_RELATION may exist in other place like: > > --- > --** explain insert into target select (select * from test); > Subplan's subplan > > --** with cte as (select * from test) insert into target select * from cte; > In query's ctelist. > --- > > Maybe we should use a walker function [1] to > search the subquery and ctelist. > > > > [1] > static bool > relation_walker(Node *node) > { > if (node == NULL) > return false; > > else if (IsA(node, RangeTblEntry)) > { > RangeTblEntry *rte = (RangeTblEntry *) node; > if (rte->rtekind == RTE_RELATION) > return true; > > return false; > } > > else if (IsA(node, Query)) > { > Query *query = (Query *) node; > > /* Recurse into subselects */ > return query_tree_walker(query, relation_walker, > NULL, > QTW_EXAMINE_RTES_BEFORE); > } > > /* Recurse to check arguments */ > return expression_tree_walker(node, > > relation_walker, > NULL); > } > I've had a further look at this, and this walker function is doing a lot of work recursing the parse tree, and I'm not sure that it reliably retrieves the information that we're looking for, for all cases of different SQL queries. Unless it can be made much more efficient and specific to our needs, I think we should not try to do this optimization, because there's too much overhead. Also, keep in mind that for the current parallel SELECT functionality in Postgres, I don't see any similar optimization being attempted (and such optimization should be attempted at the SELECT level). So I don't think we should be attempting such optimization in this patch (but could be attempted in a separate patch, just related to current parallel SELECT functionality). Regards, Greg Nancarrow Fujitsu Australia
Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)
Hi Euler, I've tried to log in to the CF interface a couple of times now, but I seem to have lost my password. I've tried to use the "Password reset" form [1], but I don't get any email. The email is correct, because when I try to re-register it says it's taken. Not sure who I should ask for help. Anyone? /Joel [1] https://www.postgresql.org/account/reset/ On Tue, Feb 2, 2021, at 03:10, Euler Taveira wrote: > On Sun, Jan 31, 2021, at 10:27 AM, Joel Jacobson wrote: >> Here comes the patch again, now including data. > Joel, register this patch into the next CF [1] so we don't lose track of it. > > > [1] https://commitfest.postgresql.org/32/ > > > -- > Euler Taveira > EDB https://www.enterprisedb.com/ > Kind regards, Joel
Re: Transactions involving multiple postgres foreign servers, take 2
On 2021/01/27 14:08, Masahiko Sawada wrote: On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao wrote: You fixed some issues. But maybe you forgot to attach the latest patches? Yes, I've attached the updated patches. Thanks for updating the patch! I tried to review 0001 and 0002 as a self-contained change. + * An FDW that implements both commit and rollback APIs can request to register + * the foreign transaction by FdwXactRegisterXact() to participate it to a + * group of distributed tranasction. The registered foreign transactions are + * identified by OIDs of server and user. I'm afraid that the combination of OIDs of server and user is not unique. IOW, more than one foreign transaction can have the same combination of OIDs of server and user. For example, the following two SELECT queries start different foreign transactions but their user OID is the same. Should the OID of the user mapping be used instead of the OID of the user? CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw; CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres'); CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres'); CREATE TABLE t(i int); CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't'); BEGIN; SELECT * FROM ft; DROP USER MAPPING FOR postgres SERVER loopback ; SELECT * FROM ft; COMMIT; + /* Commit foreign transactions if any */ + AtEOXact_FdwXact(true); Don't we need to pass XACT_EVENT_PARALLEL_PRE_COMMIT or XACT_EVENT_PRE_COMMIT flag? Probably we don't need to do this if postgres_fdw is the only user of this new API. But if we make this new API a generic one, such flags seem necessary so that some foreign data wrappers might have different behaviors for those flags. Because of the same reason as above, AtEOXact_FdwXact() should also be called after CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_COMMIT : XACT_EVENT_COMMIT)? + /* +* Abort foreign transactions if any. This needs to be done before marking +* this transaction as not running since FDW's transaction callbacks might +* assume this transaction is still in progress. +*/ + AtEOXact_FdwXact(false); Same as above. +/* + * This function is called at PREPARE TRANSACTION. Since we don't support + * preparing foreign transactions yet, raise an error if the local transaction + * has any foreign transaction. + */ +void +AtPrepare_FdwXact(void) +{ + if (FdwXactParticipants != NIL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +errmsg("cannot PREPARE a transaction that has operated on foreign tables"))); +} This means that some foreign data wrappers supporting the prepare transaction (though I'm not sure if such wrappers actually exist or not) cannot use the new API? If we want to allow those wrappers to use the new API, AtPrepare_FdwXact() should call the prepare callback and each wrapper should emit an error within the callback if necessary. + foreach(lc, FdwXactParticipants) + { + FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc); + + if (fdw_part->server->serverid == serverid && + fdw_part->usermapping->userid == userid) Isn't this inefficient when starting lots of foreign transactions because we need to scan all the entries in the list every time?
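For illustration, one way to avoid that linear scan, and to identify participants by user mapping OID rather than by (server OID, user OID), would be to keep the participants in a dynahash table keyed by the user mapping OID. A minimal sketch follows; the names FdwXactParticipantHash and lookup_fdwxact_participant() are hypothetical and not part of the patch:

/*
 * Illustrative sketch only: look up a foreign transaction participant by
 * its user mapping OID instead of scanning the FdwXactParticipants list.
 */
typedef struct FdwXactParticipantEntry
{
	Oid			umid;			/* hash key: user mapping OID */
	FdwXactParticipant *participant;
} FdwXactParticipantEntry;

static HTAB *FdwXactParticipantHash = NULL;

static FdwXactParticipant *
lookup_fdwxact_participant(Oid umid)
{
	FdwXactParticipantEntry *entry;

	if (FdwXactParticipantHash == NULL)
	{
		HASHCTL		ctl;

		ctl.keysize = sizeof(Oid);
		ctl.entrysize = sizeof(FdwXactParticipantEntry);
		FdwXactParticipantHash = hash_create("FdwXact participants", 8, &ctl,
											 HASH_ELEM | HASH_BLOBS);
	}

	entry = (FdwXactParticipantEntry *)
		hash_search(FdwXactParticipantHash, &umid, HASH_FIND, NULL);
	return entry ? entry->participant : NULL;
}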
+static ConnCacheEntry * +GetConnectionCacheEntry(Oid umid) +{ + boolfound; + ConnCacheEntry *entry; + ConnCacheKey key; + + /* First time through, initialize connection cache hashtable */ + if (ConnectionHash == NULL) + { + HASHCTL ctl; + + ctl.keysize = sizeof(ConnCacheKey); + ctl.entrysize = sizeof(ConnCacheEntry); + ConnectionHash = hash_create("postgres_fdw connections", 8, +&ctl, + HASH_ELEM | HASH_BLOBS); Currently ConnectionHash is created under TopMemoryContext. With the patch, since GetConnectionCacheEntry() can be called in other places, ConnectionHash may be created under the memory context other than TopMemoryContext? If so, that's safe? - if (PQstatus(entry->conn) != CONNECTION_OK || - PQtransactionStatus(entry->conn) != PQTRANS_IDLE || - entry->changing_xact_state || - entry->invalidated) ... + if (PQstatus(entry->conn) != CONNECTION_OK || + PQtransactionStatus(entry->conn) != PQTRANS_IDLE || + entry->changing_xact_state) Why did you get rid of the condition "entry->invalidated"? I'm reading 0001 and 0002 patches to pick up the changes for postgres_fdw that worth applying independ
RE: Parallel INSERT (INTO ... SELECT ...)
> > I've had a further look at this, and this walker function is doing a lot > of work recursing the parse tree, and I'm not sure that it reliably retrieves > the information that we're looking for, for all cases of different SQL > queries. Unless it can be made much more efficient and specific to our needs, > I think we should not try to do this optimization, because there's too much > overhead. Also, keep in mind that for the current parallel SELECT > functionality in Postgres, I don't see any similar optimization being > attempted (and such optimization should be attempted at the SELECT level). > So I don't think we should be attempting such optimization in this patch > (but could be attempted in a separate patch, just related to current parallel > SELECT functionality). Yes, agreed. I was worried about the overhead it may bring too; we can remove this from the current patch. Best regards, houzj
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 2/2/21 11:57, tsunakawa.ta...@fujitsu.com wrote: Hello, Andrey-san, From: Tang, Haiying Sometimes before i suggested additional optimization [1] which can additionally speed up COPY by 2-4 times. Maybe you can perform the benchmark for this solution too? ... But the patch no longer applies, I tried to apply on e42b3c3bd6(2021/1/26) but failed. Can you send a rebased version? When do you think you can submit the rebased version of them? (IIUC at the off-list HighGo meeting, you're planning to post them late this week after the global snapshot patch.) Just in case you are not going to do them for the moment, can we rebase and/or further modify them so that they can be committed in PG 14? Of course, you can rebase it. -- regards, Andrey Lepikhov Postgres Professional
Re: Single transaction in the tablesync worker?
On Tue, Feb 2, 2021 at 10:34 AM Ajin Cherian wrote: > > On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila wrote: > > I have updated the patch to display WARNING for each of the tablesync > > slots during DropSubscription. As discussed, I have moved the drop > > slot related code towards the end in AlterSubscription_refresh. Apart > > from this, I have fixed one more issue in tablesync code where in > > after catching the exception we were not clearing the transaction > > state on the publisher, see changes in LogicalRepSyncTableStart. I > > have also fixed other comments raised by you. Additionally, I have > > removed the test because it was creating the same name slot as the > > tablesync worker and tablesync worker removed the same due to new > > logic in LogicalRepSyncStart. Earlier, it was not failing because of > > the bug in that code which I have fixed in the attached. > > > > I was testing this patch. I had a table on the subscriber which had a > row that would cause a PK constraint > violation during the table copy. This is resulting in the subscriber > trying to rollback the table copy and failing. > I am not getting this error. I have tried by below test: Publisher === CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); BEGIN; INSERT INTO mytbl1(somedata, text) VALUES (1, 1); INSERT INTO mytbl1(somedata, text) VALUES (1, 2); COMMIT; CREATE PUBLICATION mypublication FOR TABLE mytbl1; Subscriber = CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); BEGIN; INSERT INTO mytbl1(somedata, text) VALUES (1, 1); INSERT INTO mytbl1(somedata, text) VALUES (1, 2); COMMIT; CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION mypublication; It generates the PK violation the first time and then I removed the conflicting rows in the subscriber and it passed. See logs below. 2021-02-02 13:51:34.316 IST [20796] LOG: logical replication table synchronization worker for subscription "mysub", table "mytbl1" has started 2021-02-02 13:52:43.625 IST [20796] ERROR: duplicate key value violates unique constraint "mytbl1_pkey" 2021-02-02 13:52:43.625 IST [20796] DETAIL: Key (id)=(1) already exists. 2021-02-02 13:52:43.625 IST [20796] CONTEXT: COPY mytbl1, line 1 2021-02-02 13:52:43.695 IST [27840] LOG: background worker "logical replication worker" (PID 20796) exited with exit code 1 2021-02-02 13:52:43.884 IST [6260] LOG: logical replication table synchronization worker for subscription "mysub", table "mytbl1" has started 2021-02-02 13:53:54.680 IST [6260] LOG: logical replication table synchronization worker for subscription "mysub", table "mytbl1" has finished Also, a similar test exists in 0004_sync.pl, is that also failing for you? Can you please provide detailed steps that led to this failure? > > And one more thing I see is that now we error out in PG_CATCH() in > LogicalRepSyncTableStart() with the above error and as a result, the > tablesync slot is not dropped. Hence causing the slot create to fail > in the next restart. > I think this can be avoided. We could either attempt a rollback only > on specific failures and drop slot prior to erroring out. > Hmm, we have to first rollback before attempting any other operation because the transaction on the publisher is in an errored state. -- With Regards, Amit Kapila.
Re: proposal: schema variables
Hi rebase and set PK for pg_variable table Regards Pavel schema-variables-20210202.patch.gz Description: application/gzip
join plan with unexpected var clauses
Hi, At a customer we came across a curious plan (see attached testcase). Given the testcase we see that the outer semi join tries to join the outer with the inner table id columns, even though the middle table id column is also there. Is this expected behavior? The reason I'm asking is two-fold: - the inner hash table now is bigger than I'd expect and has columns that you would normally not select on. - the middle join now projects the inner as result, which is quite surprising and seems invalid from a SQL standpoint. Plan: Finalize Aggregate Output: count(*) -> Gather Output: (PARTIAL count(*)) Workers Planned: 4 -> Partial Aggregate Output: PARTIAL count(*) -> Parallel Hash Semi Join Hash Cond: (_outer.id3 = _inner.id2) -> Parallel Seq Scan on public._outer Output: _outer.id3, _outer.extra1 -> Parallel Hash Output: middle.id1, _inner.id2 -> Parallel Hash Semi Join Output: middle.id1, _inner.id2 Hash Cond: (middle.id1 = _inner.id2) -> Parallel Seq Scan on public.middle Output: middle.id1 -> Parallel Hash Output: _inner.id2 -> Parallel Seq Scan on public._inner Output: _inner.id2 Kind regards, Luc Swarm64 testcase.sql Description: application/sql
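For reference, a schema and query of roughly the shape implied by the plan; this is a guessed reconstruction from the EXPLAIN output above, not the contents of the attached testcase.sql:

-- Hypothetical reconstruction; table and column names taken from the plan.
CREATE TABLE _outer (id3 int, extra1 int);
CREATE TABLE middle (id1 int);
CREATE TABLE _inner (id2 int);

-- The nested IN subqueries give the two semi joins seen in the plan.
SELECT count(*)
FROM _outer
WHERE id3 IN (SELECT id1 FROM middle
              WHERE id1 IN (SELECT id2 FROM _inner));

The extra1 column in the plan output suggests the real query references at least one more column than this sketch does.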
RE: [POC] Fast COPY FROM command for the table with foreign partitions
From: Andrey Lepikhov > Of course, you can rebase it. Thank you. I might modify the basic part to incorporate my past proposal about improving the layering or modularity related to ri_useMultiInsert. (But I may end up giving up due to lack of energy.) Also, I might defer working on the extended part (v9 0003 and 0004) and further separate them in a different thread, if it seems to take longer. Regards Takayuki Tsunakawa
Re: memory leak in auto_explain
On Tue, 02 Feb 2021 at 07:12, Jeff Janes wrote: > On Mon, Feb 1, 2021 at 6:09 PM Jeff Janes wrote: > >> >> >> create or replace function gibberish(int) returns text language SQL as $_$ >> select left(string_agg(md5(random()::text),),$1) from >> generate_series(0,$1/32) $_$; >> >> create table j1 as select x, md5(random()::text) as t11, gibberish(1500) >> as t12 from generate_series(1,20e6) f(x); >> > > I should have added, found it on HEAD, verified it also in 12.5. > Here's my analysis: 1) In explain_ExecutorEnd(), it creates an ExplainState in the SQL function's memory context, which is long-lived, causing the memory to grow. /* * Switch to context in which the fcache lives. This ensures that our * tuplestore etc will have sufficient lifetime. The sub-executor is * responsible for deleting per-tuple information. (XXX in the case of a * long-lived FmgrInfo, this policy represents more memory leakage, but * it's not entirely clear where to keep stuff instead.) */ oldcontext = MemoryContextSwitchTo(fcache->fcontext); 2) I tried to call pfree() to release the ExplainState memory, but it did not help, and I do not know why. So I tried to create it in the queryDesc->estate->es_query_cxt memory context, like queryDesc->totaltime, and it works. Attached is a fix for the memory leak in auto_explain. Any thoughts? -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. >From 2589cf9c78bbd308b6f300277500c920152ee6f2 Mon Sep 17 00:00:00 2001 From: Japin Li Date: Tue, 2 Feb 2021 17:27:21 +0800 Subject: [PATCH v1] Fix memory leak in auto_explain --- contrib/auto_explain/auto_explain.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index faa6231d87..033540a3f4 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -383,6 +383,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc) msec = queryDesc->totaltime->total * 1000.0; if (msec >= auto_explain_log_min_duration) { + MemoryContext oldctx = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt); ExplainState *es = NewExplainState(); es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze); @@ -426,6 +427,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc) errhidestmt(true))); pfree(es->str->data); + MemoryContextSwitchTo(oldctx); } } -- 2.30.0
Re: Two patches to speed up pg_rewind.
On Jan 28, 2021, at 3:31 PM, Michael Paquier wrote: On Wed, Jan 27, 2021 at 09:18:48AM +, Paul Guo wrote: The second one is to use copy_file_range() for the local rewind case to replace read()+write(). This introduces a copy_file_range() check and HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed. copy_file_range() was introduced in newer Linux kernels; on older Linux or other Unix-like OSes mmap() might be better than read()+write(), but copy_file_range() is more interesting given that it could skip the data copying in some file systems - this could benefit more on Linux filesystems on network-based block storage. Have you done some measurements? I did not test pg_rewind, but for patch 2 I tested copy_file_range() vs read()+write() on XFS in Ubuntu 20.04.1 when working on the patches. Here is the test time of a 1G file (fully populated with random data) copy. The test is a simple C program. copy_file_range() loop (actually it finished after one call) + fsync() 0m0.048s For read()+write() loop with read/write buffer size 32K + fsync() 0m5.004s For patch 1, fewer files need to be synced, so it surely benefits the performance.
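For reference, a minimal sketch of the kind of comparison program described above (an illustrative reconstruction, not the actual test program; Linux 4.5+ and glibc 2.27+ assumed for copy_file_range()):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy argv[1] to argv[2] with copy_file_range() in a loop, then fsync. */
int
main(int argc, char **argv)
{
	int			srcfd;
	int			dstfd;
	struct stat st;
	off_t		remaining;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s src dst\n", argv[0]);
		return 1;
	}
	srcfd = open(argv[1], O_RDONLY);
	dstfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (srcfd < 0 || dstfd < 0 || fstat(srcfd, &st) < 0)
	{
		perror("setup");
		return 1;
	}
	remaining = st.st_size;
	while (remaining > 0)
	{
		/* The kernel may copy less than requested, so loop until done. */
		ssize_t		n = copy_file_range(srcfd, NULL, dstfd, NULL,
										remaining, 0);

		if (n <= 0)
		{
			perror("copy_file_range");
			return 1;
		}
		remaining -= n;
	}
	if (fsync(dstfd) < 0)
	{
		perror("fsync");
		return 1;
	}
	close(srcfd);
	close(dstfd);
	return 0;
}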
Re: Single transaction in the tablesync worker?
After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also tried a simple test where I do a DROP TABLE with very bad timing for the tablesync worker. It seems that doing this can cause the sync worker's MyLogicalRepWorker->relid to become invalid. In my test this caused a stack trace within some logging, but I imagine other bad things can happen if the tablesync worker can be executed with an invalid relid. Possibly this is an existing PG bug which has just never been seen before; The ereport which has failed here is not new code. PSA the log for the test steps and the stack trace details. [ac0202] https://www.postgresql.org/message-id/CAFPTHDYzjaNfzsFHpER9idAPB8v5j%3DSUbWL0AKj5iVy0BKbTpg%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia Test Scenario 1. INSERT data so tablesync should copy something 2. While paused in LogicalRepSyncTableStart do a DROP TABLE to rip the rug out from under the sync worker! 3. Continue the sync worker see what happens == ## ## Insert data ## [postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');" INSERT 0 1 ## ## SUBSCRIBE and continue to breakpoint at top of tablesync function LogicalRepSyncTableStart ## [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub;" 2021-02-02 19:29:16.578 AEDT [26402] LOG: logical decoding found consistent point at 0/165F800 2021-02-02 19:29:16.578 AEDT [26402] DETAIL: There are no running transactions. 2021-02-02 19:29:16.578 AEDT [26402] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION 2021-02-02 19:29:16.587 AEDT [26405] LOG: logical replication apply worker for subscription "tap_sub" has started 2021-02-02 19:29:16.587 AEDT [26405] LOG: !!>> The apply worker process has PID = 26405 2021-02-02 19:29:16.597 AEDT [26411] LOG: starting logical decoding for slot "tap_sub" 2021-02-02 19:29:16.597 AEDT [26411] DETAIL: Streaming transactions committing after 0/165F838, reading WAL from 0/165F800. 2021-02-02 19:29:16.597 AEDT [26411] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-02-02 19:29:16.598 AEDT [26411] LOG: logical decoding found consistent point at 0/165F800 2021-02-02 19:29:16.598 AEDT [26411] DETAIL: There are no running transactions. 2021-02-02 19:29:16.598 AEDT [26411] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-02-02 19:29:16.598 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:16.598 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables [postgres@CentOS7-x64 ~]$ 2021-02-02 19:29:16.602 AEDT [26415] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2021-02-02 19:29:16.602 AEDT [26415] LOG: !!>> The tablesync worker process has PID = 26415 2021-02-02 19:29:16.602 AEDT [26415] LOG: !!>> Sleeping 30 secs. For debugging, attach to process 26415 now! 
2021-02-02 19:29:17.610 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:17.610 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:18.611 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:18.611 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:19.612 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:19.612 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:20.613 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:20.613 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:21.614 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:21.614 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:22.615 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:22.615 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:23.615 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:23.615 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:24.658 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:24.658 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:25.661 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:25.661 AEDT [26405] LOG: !!>> apply worker: called process_syncing_tables 2021-02-02 19:29:26.662 AEDT [26405] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-02 19:29:26.662 AEDT [26405
Re: Single transaction in the tablesync worker?
On Tue, Feb 2, 2021 at 7:40 PM Amit Kapila wrote: > > On Tue, Feb 2, 2021 at 10:34 AM Ajin Cherian wrote: > > > > On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila wrote: > > > I have updated the patch to display WARNING for each of the tablesync > > > slots during DropSubscription. As discussed, I have moved the drop > > > slot related code towards the end in AlterSubscription_refresh. Apart > > > from this, I have fixed one more issue in tablesync code where in > > > after catching the exception we were not clearing the transaction > > > state on the publisher, see changes in LogicalRepSyncTableStart. I > > > have also fixed other comments raised by you. Additionally, I have > > > removed the test because it was creating the same name slot as the > > > tablesync worker and tablesync worker removed the same due to new > > > logic in LogicalRepSyncStart. Earlier, it was not failing because of > > > the bug in that code which I have fixed in the attached. > > > > > > > I was testing this patch. I had a table on the subscriber which had a > > row that would cause a PK constraint > > violation during the table copy. This is resulting in the subscriber > > trying to rollback the table copy and failing. > > > > I am not getting this error. I have tried by below test: I am sorry, my above steps were not correct. I think the reason for the failure I was seeing were some other steps I did prior to this. I will recreate this and update you with the appropriate steps. regards, Ajin Cherian Fujitsu Australia
Improve new hash partition bound check error messages
I had a bit of trouble parsing the error message "every hash partition modulus must be a factor of the next larger modulus", so I went into the code, added some comments and added errdetail messages for each case. I think it's a bit clearer now. From e7f392e2f8950236a22f007cc3aed36729da22e1 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 2 Feb 2021 11:21:21 +0100 Subject: [PATCH] Improve new hash partition bound check error messages For the error message "every hash partition modulus must be a factor of the next larger modulus", add a detail message that shows the particular numbers involved. Also comment the code more. --- src/backend/partitioning/partbounds.c | 58 +++--- src/test/regress/expected/alter_table.out | 1 + src/test/regress/expected/create_table.out | 2 + 3 files changed, 43 insertions(+), 18 deletions(-) diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c index 0c3f212ff2..7425e34040 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -2832,14 +2832,9 @@ check_new_partition_bound(char *relname, Relation parent, if (partdesc->nparts > 0) { - Datum **datums = boundinfo->datums; - int ndatums = boundinfo->ndatums; int greatest_modulus; int remainder; int offset; - boolvalid_modulus = true; - int prev_modulus, /* Previous largest modulus */ - next_modulus; /* Next largest modulus */ /* * Check rule that every modulus must be a factor of the @@ -2849,7 +2844,9 @@ check_new_partition_bound(char *relname, Relation parent, * modulus 15, but you cannot add both a partition with * modulus 10 and a partition with modulus 15, because 10 * is not a factor of 15. -* +*/ + + /* * Get the greatest (modulus, remainder) pair contained in * boundinfo->datums that is less than or equal to the * (spec->modulus, spec->remainder) pair. @@ -2859,26 +2856,51 @@ check_new_partition_bound(char *relname, Relation parent, spec->remainder); if (offset < 0) { - next_modulus = DatumGetInt32(datums[0][0]); - valid_modulus = (next_modulus % spec->modulus) == 0; + int next_modulus; + + /* +* All existing moduli are greater or equal, so the +* new one must be a factor of the smallest one, which +* is first in the boundinfo. +*/ + next_modulus = DatumGetInt32(boundinfo->datums[0][0]); + if (next_modulus % spec->modulus != 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("every hash partition modulus must be a factor of the next larger modulus"), + errdetail("The new modulus %d is not a factor of the existing modulus %d.", + spec->modulus, next_modulus))); } else { - prev_modulus = DatumGetInt32(datums[offset][0]); - valid_modulus = (spec->modulus % prev_modulus) == 0; +
Re: Add primary keys to system catalogs
On 2021-02-01 15:24, Tom Lane wrote: Peter Eisentraut writes: On 2021-01-30 22:56, Tom Lane wrote: Hmm, shouldn't there have been a catversion bump in there? I suppose yes on the grounds that it introduces something new in a freshly initdb-ed database. But I thought it wasn't necessary because there is no dependency between the binaries and the on-disk state. I've generally worked on the theory that a catversion bump is indicated if you need to initdb in order to pass the updated regression tests. Which one did in this case. Yeah, that's a good way of looking at it. I'll keep that in mind. -- Peter Eisentraut 2ndQuadrant, an EDB company https://www.2ndquadrant.com/
Re: Add primary keys to system catalogs
On 2021-01-22 16:42, Tom Lane wrote: pg_depend pg_shdepend Yeah, this is noted in the patch's own regression tests. Wouldn't it be possible to add primary keys to these two as well? Neither of the existing indexes is suitable, not being unique. We could imagine adding a unique index across the whole column set, but that would be an awfully large price to pay for neatnik-ism. Also, at least for pg_depend (less sure about pg_shdepend), some code cleanup would be required, because I don't think that we try very hard to avoid making duplicate dependency entries. On the whole I feel this'd be counterproductive. I did attempt to make a six- or seven-column unique constraint on pg_depend a while ago. This fails pretty much immediately during initdb. None of the code that adds dependencies, in particular recordMultipleDependencies(), checks if the dependency is already there. I do wonder, however, under what circumstances code would be put into a situation where it would add the exact same dependency again, and also under what circumstances it would add a dependency between the same objects but a different deptype, and how that would work during recursive deletion. Now that we have the refobjversion column, the presence of duplicate dependencies might be even more dubious. I think that could be worth analyzing. -- Peter Eisentraut 2ndQuadrant, an EDB company https://www.2ndquadrant.com/
Re: row filtering for logical replication
On Tue, 02 Feb 2021 at 13:02, Michael Paquier wrote: > On Mon, Feb 01, 2021 at 04:11:50PM -0300, Euler Taveira wrote: >> After the commit 3696a600e2, the last patch does not apply cleanly. I'm >> attaching another version to address the documentation issues. > > I have bumped into this thread, and applied 0001. My guess is that > one of the patches developed originally for logical replication > defined atttypmod in LogicalRepRelation, but has finished by not using > it. Nice catch. Since the 0001 patch has already been committed (4ad31bb2ef), we can remove it. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Typo in tablesync comment
On Tue, Feb 2, 2021, at 2:19 AM, Michael Paquier wrote: > On Tue, Feb 02, 2021 at 10:38:31AM +1100, Peter Smith wrote: > > PSA a trivial patch to correct what seems like a typo in the tablesync > > comment. > > - * subscribed tables and their state. Some transient state during data > - * synchronization is kept in shared memory. The states SYNCWAIT and > + * subscribed tables and their state. Some transient states during > data > + * synchronization are kept in shared memory. The states SYNCWAIT and > > This stuff refers to SUBREL_STATE_* in pg_subscription_rel.h, and FWIW > I find confusing the term "transient" in this context as a state may > last for a rather long time, depending on the time it takes to > synchronize the relation, no? I am wondering if we could do better > here, say: > "The state tracking the progress of the relation synchronization is > additionally stored in shared memory, with SYNCWAIT and CATCHUP only > appearing in memory." WFM. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)
On Tue, Feb 2, 2021, at 5:13 AM, Joel Jacobson wrote: > I've tried to log in to the CF interface a couple of times now, but I seem to > have lost my password. > I've tried to use the "Password reset" form [1], but I don't get any email. > The email is correct, because when I try to re-register it says it's taken. > > Not sure who I should ask for help. Anyone? You should probably email: webmaster (at) postgresql (dot) org -- Euler Taveira EDB https://www.enterprisedb.com/
Re: row filtering for logical replication
On Tue, 02 Feb 2021 at 19:16, japin wrote: > On Tue, 02 Feb 2021 at 13:02, Michael Paquier wrote: >> On Mon, Feb 01, 2021 at 04:11:50PM -0300, Euler Taveira wrote: >>> After the commit 3696a600e2, the last patch does not apply cleanly. I'm >>> attaching another version to address the documentation issues. >> >> I have bumped into this thread, and applied 0001. My guess is that >> one of the patches developed originally for logical replication >> defined atttypmod in LogicalRepRelation, but has finished by not using >> it. Nice catch. > > Since the 0001 patch has already been committed (4ad31bb2ef), we can remove it. In the 0003 patch, the function GetPublicationRelationQuals() is defined but never used. So why should we define it? $ grep 'GetPublicationRelationQuals' -rn src/ src/include/catalog/pg_publication.h:116:extern List *GetPublicationRelationQuals(Oid pubid); src/backend/catalog/pg_publication.c:347:GetPublicationRelationQuals(Oid pubid) If we must keep it, here are some comments on it. (1) value_datum = heap_getattr(tup, Anum_pg_publication_rel_prqual, RelationGetDescr(pubrelsrel), &isnull); It looks too long; we can split it into two lines. (2) Since qual_value is only used in the "if (!isnull)" branch, we can narrow its scope. (3) Should we free the memory for qual_value? -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Improve new hash partition bound check error messages
On Tue, Feb 2, 2021 at 7:36 PM Peter Eisentraut wrote: > I had a bit of trouble parsing the error message "every hash partition > modulus must be a factor of the next larger modulus", so I went into the > code, added some comments and added errdetail messages for each case. I > think it's a bit clearer now. That is definitely an improvement, thanks. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Improve new hash partition bound check error messages
On 02/02/2021 12:35, Peter Eisentraut wrote: I had a bit of trouble parsing the error message "every hash partition modulus must be a factor of the next larger modulus", so I went into the code, added some comments and added errdetail messages for each case. I think it's a bit clearer now. Yeah, that error message is hard to understand. This is an improvement, but it still took me a while to understand it. Let's look at the example in the regression test: -- check partition bound syntax for the hash partition CREATE TABLE hash_parted ( a int ) PARTITION BY HASH (a); CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 10, REMAINDER 0); CREATE TABLE hpart_2 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 50, REMAINDER 1); CREATE TABLE hpart_3 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 200, REMAINDER 2); With this patch, you get this: CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 25, REMAINDER 3); ERROR: every hash partition modulus must be a factor of the next larger modulus DETAIL: The existing modulus 10 is not a factor of the new modulus 25. CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 150, REMAINDER 3); ERROR: every hash partition modulus must be a factor of the next larger modulus DETAIL: The new modulus 150 is not factor of the existing modulus 200. How about this? CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 25, REMAINDER 3); ERROR: every hash partition modulus must be a factor of the next larger modulus DETAIL: 25 is not divisible by 10, the modulus of existing partition "hpart_1" CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 150, REMAINDER 3); ERROR: every hash partition modulus must be a factor of the next larger modulus DETAIL: 150 is not a factor of 200, the modulus of existing partition "hpart_3" Calling the existing partition by name seems good. And this phrasing puts the focus on the new modulus in both variants; presumably the existing partition is OK and the problem is in the new definition. - Heikki
Re: row filtering for logical replication
On Tue, Feb 2, 2021, at 8:38 AM, japin wrote: > In the 0003 patch, the function GetPublicationRelationQuals() is defined but > never used. So why should we define it? Thanks for taking a look again. It is an oversight. It was introduced in an attempt to refactor ALTER PUBLICATION SET TABLE. In AlterPublicationTables, we could possibly keep some publication-table mappings that do not change; however, since commit 3696a600e2, it is required to find the qual for all inheritors (see GetPublicationRelations). I explain this decision in the following comment: /* * Remove all publication-table mappings. We could possibly * remove (i) tables that are not found in the new table list and * (ii) tables that are being re-added with a different qual * expression. For (ii), simply updating the existing tuple is not * enough, because of qual expression dependencies. */ I will post a new patch set later. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)
On Tue, Feb 2, 2021, at 12:34, Euler Taveira wrote: >You should probably email: webmaster (at) postgresql (dot) org Thanks, done. /Joel
Re: adding wait_start column to pg_locks
On 2021-01-25 23:44, Fujii Masao wrote: Another comment is; Doesn't the change of MyProc->waitStart need the lock table's partition lock? If yes, we can do that by moving LWLockRelease(partitionLock) just after the change of MyProc->waitStart, but which causes the time that lwlock is being held to be long. So maybe we need another way to do that. Thanks for your comments! It would be ideal for the consistency of the view to record "waitstart" during holding the table partition lock. However, as you pointed out, it would give non-negligible performance impacts. I may miss something, but as far as I can see, the influence of not holding the lock is that "waitstart" can be NULL even though "granted" is false. I think people want to know the start time of the lock when locks are held for a long time. In that case, "waitstart" should have already been recorded. If this is true, I think the current implementation may be enough on the condition that users understand it can happen that "waitStart" is NULL and "granted" is false. Attached a patch describing this in the doc and comments. Any Thoughts? Regards, -- Atsushi TorikoshiFrom 03c6e1ed6ffa215ee898b5a6a75d77277fb8e672 Mon Sep 17 00:00:00 2001 From: Atsushi Torikoshi Date: Tue, 2 Feb 2021 21:32:36 +0900 Subject: [PATCH v6] To examine the duration of locks, we did join on pg_locks and pg_stat_activity and used columns such as query_start or state_change. However, since they are the moment when queries have started or their state has changed, we could not get the lock duration in this way. This patch adds a new field "waitstart" preserving lock acquisition wait start time. Note that updating this field and lock acquisition are not performed synchronously for performance reasons. Therefore, depending on the timing, it can happen that waitstart is NULL even though granted is false. 
Author: Atsushi Torikoshi Reviewed-by: Ian Lawrence Barwick, Robert Haas, Fujii Masao Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a...@oss.nttdata.com --- contrib/amcheck/expected/check_btree.out | 4 ++-- doc/src/sgml/catalogs.sgml | 14 ++ src/backend/storage/ipc/standby.c| 16 ++-- src/backend/storage/lmgr/lock.c | 8 src/backend/storage/lmgr/proc.c | 10 ++ src/backend/utils/adt/lockfuncs.c| 9 - src/include/catalog/pg_proc.dat | 6 +++--- src/include/storage/lock.h | 2 ++ src/include/storage/proc.h | 1 + src/test/regress/expected/rules.out | 5 +++-- 10 files changed, 65 insertions(+), 10 deletions(-) diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out index 13848b7449..5a3f1ef737 100644 --- a/contrib/amcheck/expected/check_btree.out +++ b/contrib/amcheck/expected/check_btree.out @@ -97,8 +97,8 @@ SELECT bt_index_parent_check('bttest_b_idx'); SELECT * FROM pg_locks WHERE relation = ANY(ARRAY['bttest_a', 'bttest_a_idx', 'bttest_b', 'bttest_b_idx']::regclass[]) AND pid = pg_backend_pid(); - locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath ---+--+--+--+---++---+-+---+--++-+--+-+-- + locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart +--+--+--+--+---++---+-+---+--++-+--+-+--+--- (0 rows) COMMIT; diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 865e826fb0..d81d6e1c52 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -10592,6 +10592,20 @@ SCRAM-SHA-256$:&l lock table + + + + waitstart timestamptz + + + Lock acquisition wait start time. + Note that updating this field and lock acquisition are not performed + synchronously for performance reasons. Therefore, depending on the + timing, it can happen that waitstart is + NULL even though + granted is false. + + diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c index 39a30c00f7..2282229568 100644 --- a/src/backend/storage/ipc/standby.c +++ b/src/backend/storage/ipc/standby.c @@ -539,13 +539,25 @@ ResolveRecoveryConflictWithDatabase(Oid dbid) void ResolveRecoveryConflictWithLock(LOCKTAG locktag, bool logging_conflict) { - TimestampTz ltime; + TimestampTz ltime, now; Assert(InHotStandby); ltime = GetStandbyLimitTime(); + now = GetCurrentTimestamp(); - if (GetCurrentTimestamp() >= ltime && ltime != 0
Re: Single transaction in the tablesync worker?
On Tue, Feb 2, 2021 at 11:35 AM Ajin Cherian wrote: > > Another failure I see in my testing > The problem here is that we are allowing to drop the table when table synchronization is still in progress and then we don't have any way to know the corresponding slot or origin. I think we can try to drop the slot and origin as well but that is not a good idea because slots once dropped won't be rolled back. So, I have added a fix to disallow the drop of the table when table synchronization is still in progress. Apart from that, I have fixed comments raised by Peter as discussed above and made some additional changes in comments, code (code changes are cosmetic), and docs. Let me know if the issue reported is fixed or not? -- With Regards, Amit Kapila. v25-0001-Tablesync-Solution1.patch Description: Binary data
Re: Single transaction in the tablesync worker?
On Tue, Feb 2, 2021 at 3:31 PM Peter Smith wrote: > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also > tried a simple test where I do a DROP TABLE with very bad timing for > the tablesync worker. It seems that doing this can cause the sync > worker's MyLogicalRepWorker->relid to become invalid. > I think this should be fixed by latest patch because I have disallowed drop of a table when its synchronization is in progress. You can check once and let me know if the issue still exists? -- With Regards, Amit Kapila.
RE: libpq debug log
Hi all, Thank you Kirk san for creating the v14 patch. I update the patch. I fixed all of Tsunakawa san's review comments. I am trying to solve three bugs. Two bags were pointed out by Alvaro san in a previous e-mail. And I found one bug. > From: Alvaro Herrera > Sent: Friday, January 22, 2021 6:53 AM ... > > (5) > > @@ -966,10 +966,6 @@ pqSaveParameterStatus(PGconn *conn, const char > *name, const char *value) > > pgParameterStatus *pstatus; > > pgParameterStatus *prev; > > > > - if (conn->Pfdebug) > > - fprintf(conn->Pfdebug, "pqSaveParameterStatus: '%s' = > '%s'\n", > > - name, value); > > - > > > > Where is this information output instead? > > Hmm, seems to have been lost. I restored it, but didn't verify > the resulting behavior carefully. I checked log output using Matsumura san's application app.c; --- 2021-01-27 11:29:49.718 < ParameterStatus 22 "application_name" "" pqSaveParameterStatus: 'application_name' = '' 2021-01-27 11:29:49.718 < ParameterStatus 25 "client_encoding" "UTF8" pqSaveParameterStatus: 'client_encoding' = 'UTF8' --- New trace log prints "ParameterStatus" which report run-time parameter status. And new log also prints parameter name and value in " StartupMessage " message. We can know the parameter status using these protocol messages. So I think "pqSaveParameterStatus:" is not necessary to output. In StartupMessage, the protocol message is output as "UnknownMessage" like this. [...] > UnknownMessage 38 '\x00\x03\x00\x00user\x00iwata\x00database\x00postgres\x00\x00' To fix this, I want to move a protocol message without the first byte1 to a frontend message logging function. I will work on this. > From: 'Alvaro Herrera' > Sent: Friday, January 22, 2021 10:38 PM > > Ah, that was the reason for separation. Then, I'm fine with either > > keeping the separation, or integrating the two functions in fe-misc.c > > with PQuntrace() being also there. I kind of think the latter is > > better, because of code readability and, because tracing facility is > > not a connection-related reature so categorizing it as "misc" feels > > natural. > > Maybe we can create a new file specifically for this to avoid mixing > unrelated stuff in fe-misc.c -- say fe-trace.c where all this new > tracing stuff goes. I moved PQtrace() from fe-connect.c to fe-misc.c. And I agree with creating new file for new tracing functions. I will do this. > From: 'Alvaro Herrera' > Sent: Thursday, January 28, 2021 11:14 PM ... > On 2021-Jan-28, k.jami...@fujitsu.com wrote: > > > I realized that existing libpq regression tests in src/test/examples do not > > test PQtrace(). At the same time, no other functions Is calling PQtrace(), > > so it's expected that by default if there are no compilation errors, then it > > will pass all the tests. And it did. > > > > So to check that the tracing feature is enabled and, as requested, outputs > > a trace file, I temporarily modified a bit of testlibpq.c instead **based > > from > > my current environment settings**, and ran the regression test. > > Yeah. It seems useful to build a real test harness based on the new > tracing functionality. And that is precisely why I added the option to > suppress timestamps: so that we can write trace files that do not vary > from run to run. That allows us to have an "expected" trace to which we > can compare the trace file produced by the actual run. I had the idea > that instead of a timestamp we would print a message counter. I didn't > implement that, however. I think it is a useful suggestion. 
However I couldn't think of how to find out if the logs were really in the correct order. And implementing this change looks very complicated. So I would like to brush up this patch at first. > (29) > -void PQtrace(PGconn *conn, FILE *stream); > +void PQtrace(PGconn *conn, FILE *stream, int flags); > > As I said before, I'm afraid we cannot change the signature of existing API > functions. Please leave the signature of this function unchanged, and add a > new function like PQtraceEx() that adds the flags argument. > > I guess the purpose of adding the flag argument is to not output the timestamp > by default, because getting timestamps would incur significant overhead > (however, I'm not sure if that overhead is worth worrying about given the > already incurred logging overhead.) So, I think it's better to have a flag > like > PQTRACE_OUTPUT_TIMESTAMPS instead of > PQTRACE_SUPPRESS_TIMESTAMPS, and the functions would look like: > > void > PQtrace(PGconn *conn, FILE *stream) > { > PQtraceEx(conn, stream, 0); > } > > void > PQtraceEx(PGconn *conn, FILE *stream, int flags) > { > ... the new implementation in the patch > } > > Remember to add PQtraceEx in exports.txt and make sure it builds on > Windows with Visual Studio. I fixed just add new function. I will look into similar changes so far and give PQtraceEx() a better name. > But..
Re: Typo in tablesync comment
On Tue, Feb 2, 2021 at 10:49 AM Michael Paquier wrote: > > On Tue, Feb 02, 2021 at 10:38:31AM +1100, Peter Smith wrote: > > PSA a trivial patch to correct what seems like a typo in the tablesync > > comment. > > - * subscribed tables and their state. Some transient state during data > - * synchronization is kept in shared memory. The states SYNCWAIT and > + * subscribed tables and their state. Some transient states during > data > + * synchronization are kept in shared memory. The states SYNCWAIT and > > This stuff refers to SUBREL_STATE_* in pg_subscription_rel.h, and FWIW > I find confusing the term "transient" in this context as a state may > last for a rather long time, depending on the time it takes to > synchronize the relation, no? > These in-memory states are used after the initial copy is done. So, these are just for the time the tablesync worker is synced-up with apply worker. In some cases, they could be for a longer period of time when apply worker is quite ahead of tablesync worker then we will be in the CATCHUP state for a long time but SYNCWAIT will still be for a shorter period of time. > I am wondering if we could do better > here, say: > "The state tracking the progress of the relation synchronization is > additionally stored in shared memory, with SYNCWAIT and CATCHUP only > appearing in memory." > I don't mind changing to your proposed text but I think the current wording is also okay and seems clear to me. -- With Regards, Amit Kapila.
Re: [PATCH] remove deprecated v8.2 containment operators
On Mon, Nov 30, 2020 at 02:09:10PM -0500, Tom Lane wrote: > Justin Pryzby writes: > > I think this is waiting on me to provide a patch for the contrib/ modules > > with > > update script removing the SQL operators, and documentating their > > deprecation. > > Right. We can remove the SQL operators, but not (yet) the C code support. > > I'm not sure that the docs change would do more than remove any existing > mentions of the operators. I've finally returned to this. RFC. -- Justin >From 98a870449bdf3670e2e190d8bdfccd59dae42452 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 11 Apr 2020 22:57:06 -0500 Subject: [PATCH] remove deprecated v8.2 containment operators See also: ba920e1c9182eac55d5f1327ab0d29b721154277 684ad6a92fcc33adebdab65c4e7d72a68ba05408 3165426e54df02a6199c0a216546e5095e875a0a 2f70fdb0644c32c4154236c2b5c241bec92eac5e 591d282e8d3e0448ec1339c6b066e10953f040a2 --- contrib/cube/Makefile | 2 +- contrib/cube/cube--1.4--1.5.sql | 10 ++ contrib/cube/cube.control | 2 +- contrib/hstore/Makefile | 1 + contrib/hstore/hstore--1.0--1.1.sql | 7 +++ contrib/hstore/hstore--1.8--1.9.sql | 8 contrib/hstore/hstore.control | 2 +- contrib/intarray/Makefile | 2 +- contrib/intarray/intarray--1.4--1.5.sql | 18 ++ contrib/intarray/intarray.control | 2 +- contrib/seg/Makefile| 2 +- contrib/seg/seg--1.3--1.4.sql | 11 +++ contrib/seg/seg.control | 2 +- doc/src/sgml/cube.sgml | 8 doc/src/sgml/hstore.sgml| 10 -- doc/src/sgml/intarray.sgml | 8 doc/src/sgml/seg.sgml | 8 17 files changed, 62 insertions(+), 41 deletions(-) create mode 100644 contrib/cube/cube--1.4--1.5.sql create mode 100644 contrib/hstore/hstore--1.0--1.1.sql create mode 100644 contrib/hstore/hstore--1.8--1.9.sql create mode 100644 contrib/intarray/intarray--1.4--1.5.sql create mode 100644 contrib/seg/seg--1.3--1.4.sql diff --git a/contrib/cube/Makefile b/contrib/cube/Makefile index 54f609db17..cf195506c7 100644 --- a/contrib/cube/Makefile +++ b/contrib/cube/Makefile @@ -7,7 +7,7 @@ OBJS = \ cubeparse.o EXTENSION = cube -DATA = cube--1.2.sql cube--1.2--1.3.sql cube--1.3--1.4.sql \ +DATA = cube--1.2.sql cube--1.2--1.3.sql cube--1.3--1.4.sql cube--1.4--1.5.sql \ cube--1.1--1.2.sql cube--1.0--1.1.sql PGFILEDESC = "cube - multidimensional cube data type" diff --git a/contrib/cube/cube--1.4--1.5.sql b/contrib/cube/cube--1.4--1.5.sql new file mode 100644 index 00..07a5f0755d --- /dev/null +++ b/contrib/cube/cube--1.4--1.5.sql @@ -0,0 +1,10 @@ +/* contrib/cube/cube--1.4--1.5.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION cube UPDATE TO '1.5'" to load this file. 
\quit + +-- Remove @ and ~ +ALTER OPERATOR FAMILY gist_cube_ops USING gist +DROP OPERATOR 13 (cube, cube); +ALTER OPERATOR FAMILY gist_cube_ops USING gist +DROP OPERATOR 14 (cube, cube); diff --git a/contrib/cube/cube.control b/contrib/cube/cube.control index 3e238fc937..50427ec117 100644 --- a/contrib/cube/cube.control +++ b/contrib/cube/cube.control @@ -1,6 +1,6 @@ # cube extension comment = 'data type for multidimensional cubes' -default_version = '1.4' +default_version = '1.5' module_pathname = '$libdir/cube' relocatable = true trusted = true diff --git a/contrib/hstore/Makefile b/contrib/hstore/Makefile index c4e339b57c..97b228b65f 100644 --- a/contrib/hstore/Makefile +++ b/contrib/hstore/Makefile @@ -12,6 +12,7 @@ OBJS = \ EXTENSION = hstore DATA = hstore--1.4.sql \ + hstore--1.8--1.9.sql \ hstore--1.7--1.8.sql \ hstore--1.6--1.7.sql \ hstore--1.5--1.6.sql \ diff --git a/contrib/hstore/hstore--1.0--1.1.sql b/contrib/hstore/hstore--1.0--1.1.sql new file mode 100644 index 00..4e32a575c5 --- /dev/null +++ b/contrib/hstore/hstore--1.0--1.1.sql @@ -0,0 +1,7 @@ +/* contrib/hstore/hstore--1.0--1.1.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION hstore UPDATE TO '1.1'" to load this file. \quit + +ALTER EXTENSION hstore DROP OPERATOR => (text, text); +DROP OPERATOR => (text, text); diff --git a/contrib/hstore/hstore--1.8--1.9.sql b/contrib/hstore/hstore--1.8--1.9.sql new file mode 100644 index 00..2220fee444 --- /dev/null +++ b/contrib/hstore/hstore--1.8--1.9.sql @@ -0,0 +1,8 @@ +/* contrib/hstore/hstore--1.8--1.9.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION hstore UPDATE TO '1.9'" to load this file. \quit + +-- Remove @ +ALTER OPERATOR FAMILY gist_hstore_ops USING gist +DROP OPERATOR 13 (hstore, hstore); diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control index 89e3c746c4..b73c28aa4d 100644 --- a/contrib/hstore/h
Re: libpq debug log
On 2021-Jan-29, tsunakawa.ta...@fujitsu.com wrote: > (30) > +/* > + * Deallocate FE/BE message tracking memory. We only do this because > + * FE message can grow arbitrarily large, and skip it in the initial state, > + * because it's likely pointless. > + */ > +void > +pqTraceUninit(PGconn *conn) > +{ > + if (conn->fe_msg && > + conn->fe_msg->num_fields != DEF_FE_MSGFIELDS) > + { > + free(conn->fe_msg); > + conn->fe_msg = NULL; > + } > > What does the second if condition mean? If it's not necessary, change the > comment accordingly. The rationale for that second condition is this: if the memory allocated is the initial size, we don't free memory, because it would just be allocated of the same size next time, and that size is not very big, so it's not a big deal if we just let it be, so that it is reused if we call PQtrace() again later. However, if the allocated size is larger than default, then it is possible that some previous tracing run has enlarged the trace struct to a very large amount of memory, and we don't want to leave that in place. -- Álvaro Herrera Valdivia, Chile
Re: New IndexAM API controlling index vacuum strategies
вт, 2 февр. 2021 г. в 05:27, Peter Geoghegan : > And now here is the second thing I thought of, which is much better: > > Sometimes 1% of the dead tuples in a heap relation will be spread > across 90%+ of the pages. With other workloads 1% of dead tuples might > be highly concentrated, and appear in no more than 1% of all heap > pages. Obviously the distinction between these two cases/workloads > matters a lot. And so the triggering criteria must be quantitative > *and* qualitative. It should not be based on counting dead tuples, > since that alone won't differentiate these two extreme cases - both of > which are probably quite common (in the real world extremes are > actually the normal and common case IME). > > I like the idea of basing it on counting *heap blocks*, not dead > tuples. We can count heap blocks that have *at least* one dead tuple > (of course it doesn't matter how they're dead, whether it was this > VACUUM operation or some earlier opportunistic pruning). Note in > particular that it should not matter if it's a heap block that has > only one LP_DEAD line pointer or a heap page that is near the > MaxHeapTuplesPerPage limit for the page -- we count either type of > page towards the heap-page based limit used to decide if index > vacuuming goes ahead for all indexes during VACUUM. > I really like this idea! It resembles the approach used in bottom-up index deletion, block-based accounting provides a better estimate for the usefulness of the operation. I suppose that 1% threshold should be configurable as a cluster-wide GUC and also as a table storage parameter? -- Victor Yegorov
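For illustration, the block-based trigger Peter describes could look roughly like the sketch below; the function name, the threshold parameter, and its suggested 1% default are hypothetical placeholders for whatever GUC/reloption ends up being chosen:

/*
 * Illustrative sketch only: decide whether this VACUUM should skip index
 * vacuuming, based on the fraction of heap pages containing at least one
 * dead item rather than on the raw dead-tuple count.
 */
static bool
should_skip_index_vacuuming(BlockNumber rel_pages,
							BlockNumber pages_with_dead_items,
							double skip_threshold)	/* e.g. 0.01 for 1% */
{
	if (rel_pages == 0)
		return true;			/* empty relation, nothing to do */

	/*
	 * If dead items are confined to a small fraction of the heap, scanning
	 * every index in this cycle may not be worth the cost.
	 */
	return (double) pages_with_dead_items / rel_pages < skip_threshold;
}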
Re: Add primary keys to system catalogs
On Tue, Feb 2, 2021 at 6:49 PM Peter Eisentraut wrote: > > I do wonder, however, under what circumstances code would be put into a > situation where it would add the exact same dependency again, and also > under what circumstances it would add a dependency between the same > objects but a different deptype, and how that would work during > recursive deletion. Now that we have the refobjversion column, the > presence of duplicate dependencies might be even more dubious. I think > that could be worth analyzing. There's at least dependencies from indexes to pg_class (underlying table) and pg_collation that can be entirely duplicated (if the same column and/or collation is used in both key and predicate for instance), and this was the case before refobjversion was introduced.
Race between KeepFileRestoredFromArchive() and restartpoint
A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail, as seen once on the buildfarm[1]. The attached patch adds a test case; it applies atop the "stop events" patch[2]. We have two systems for adding long-term pg_wal directory entries. KeepFileRestoredFromArchive() adds them during archive recovery, while InstallXLogFileSegment() does so at all times. Unfortunately, InstallXLogFileSegment() happens throughout archive recovery, via the checkpointer recycling segments and calling PreallocXlogFiles(). Multiple processes can run InstallXLogFileSegment(), which uses ControlFileLock to represent the authority to modify the directory listing of pg_wal. KeepFileRestoredFromArchive() just assumes it controls pg_wal. Recycling and preallocation are wasteful during archive recovery, because KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to fix the race by adding an XLogCtl flag indicating which regime currently owns the right to add long-term pg_wal directory entries. In the archive recovery regime, the checkpointer will not preallocate and will unlink old segments instead of recycling them (like wal_recycle=off). XLogFileInit() will fail. Notable alternatives: - Release ControlFileLock at the end of XLogFileInit(), not at the end of InstallXLogFileSegment(). Add ControlFileLock acquisition to KeepFileRestoredFromArchive(). This provides adequate mutual exclusion, but XLogFileInit() could return a file descriptor for an unlinked file. That's fine for PreallocXlogFiles(), but it feels wrong. - During restartpoints, never preallocate or recycle segments. (Just delete obsolete WAL.) By denying those benefits, this presumably makes streaming recovery less efficient. - Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment, then copy bytes. This is simple, but it multiplies I/O. That might be negligible on account of caching, or it might not be. A variant, incurring extra fsyncs, would be to use durable_rename() to replace the segment we get from XLogFileInit(). - Make KeepFileRestoredFromArchive() rename without first unlinking. This avoids checkpoint failure, but a race could trigger noise from the LOG message in InstallXLogFileSegment -> durable_rename_excl. Does anyone prefer some alternative? It's probably not worth back-patching anything for a restartpoint failure this rare, because most restartpoint outcomes are not user-visible. Thanks, nm [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17 [2] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com Author: Noah Misch Commit: Noah Misch Not for commit; for demonstration only. 
Before applying this, apply https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 199d911..1c0a988 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -71,6 +71,7 @@ #include "storage/reinit.h" #include "storage/smgr.h" #include "storage/spin.h" +#include "storage/stopevent.h" #include "storage/sync.h" #include "utils/builtins.h" #include "utils/guc.h" @@ -3583,6 +3584,23 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno, elog(ERROR, "InstallXLogFileSegment should not have failed"); } +static Jsonb * +InstallXLogFileSegmentEndStopEventParams(XLogSegNo segno) +{ + MemoryContext oldCxt; + JsonbParseState *state = NULL; + Jsonb *res; + + stopevents_make_cxt(); + oldCxt = MemoryContextSwitchTo(stopevents_cxt); + pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL); + jsonb_push_int8_key(&state, "segno", segno); + res = JsonbValueToJsonb(pushJsonbValue(&state, WJB_END_OBJECT, NULL)); + MemoryContextSwitchTo(oldCxt); + + return res; +} + /* * Install a new XLOG segment file as a current or future log segment. * @@ -3664,6 +3682,9 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, if (use_lock) LWLockRelease(ControlFileLock); + STOPEVENT(InstallXLogFileSegmentEndStopEvent, + InstallXLogFileSegmentEndStopEventParams(*segno)); + return true; } diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 1c5a4f8..d9c5a3a 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -30,6 +30,7 @@ #include "storage/ipc.h" #include "storage/lwlock.h" #include "storage/pmsignal.h" +#include "storage/stopevent.h" /* * Attempt to retrieve the specified file from off-line archival storage. @@ -372,6 +373,23 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, bool failOn } +static Jsonb * +KeepFileRestoredFr
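To illustrate the shape of the proposed fix (one regime at a time owns the right to add long-term pg_wal entries), here is a minimal standalone sketch. The enum, the flag, and the stub functions are invented for illustration; in the real proposal the flag would live in XLogCtl and the refusal would surface as an XLogFileInit() failure.

/*
 * Minimal sketch: a single shared flag records which mechanism owns the
 * pg_wal directory listing, and the other mechanism backs off.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum WalDirOwner
{
	WALDIR_OWNER_INSTALL,		/* normal operation: InstallXLogFileSegment() path */
	WALDIR_OWNER_ARCHIVE		/* archive recovery: KeepFileRestoredFromArchive() path */
} WalDirOwner;

/* stands in for a field in shared memory */
static WalDirOwner wal_dir_owner = WALDIR_OWNER_INSTALL;

static bool
install_xlog_segment(const char *tmppath)
{
	if (wal_dir_owner != WALDIR_OWNER_INSTALL)
	{
		/* recycling/preallocation is wasteful during archive recovery anyway */
		printf("install of %s refused: archive recovery owns pg_wal\n", tmppath);
		return false;
	}
	printf("installed %s\n", tmppath);
	return true;
}

static void
keep_file_restored_from_archive(const char *path)
{
	/* only legal while the archive-recovery regime owns the directory */
	if (wal_dir_owner == WALDIR_OWNER_ARCHIVE)
		printf("kept restored segment %s\n", path);
}

int
main(void)
{
	wal_dir_owner = WALDIR_OWNER_ARCHIVE;		/* entering archive recovery */
	keep_file_restored_from_archive("00000001000000000000002A");
	install_xlog_segment("xlogtemp.1234");		/* checkpointer backs off */

	wal_dir_owner = WALDIR_OWNER_INSTALL;		/* recovery finished */
	install_xlog_segment("xlogtemp.1234");
	return 0;
}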
Re: Add primary keys to system catalogs
Peter Eisentraut writes: > I do wonder, however, under what circumstances code would be put into a > situation where it would add the exact same dependency again, and also > under what circumstances it would add a dependency between the same > objects but a different deptype, and how that would work during > recursive deletion. Now that we have the refobjversion column, the > presence of duplicate dependencies might be even more dubious. I think > that could be worth analyzing. The duplicate-deps case occurs if, say, you use the same function more than once in a view. While we could probably get rid of dups generated within the same recordMultipleDependencies call (and maybe already do?), it's harder to deal with dups made by retail calls. For example, makeOperatorDependencies doesn't trouble to avoid making duplicate dependencies when the operator's left and right input types are the same, or the same as the result type. The case with almost-dup-but-not-same-deptype *does* occur, for instance it's very possible to have both normal and automatic dependencies. As I recall, the policy is to perform the deletion if any of the deps would allow it. Dunno about refobjversion; I have my doubts that putting that info in pg_depend was a sane design choice at all. But from what I understand of it, wouldn't all deps on a given object necessarily have the same version? regards, tom lane
Re: Add primary keys to system catalogs
On Tue, Feb 2, 2021 at 11:17 PM Tom Lane wrote: > > Dunno about refobjversion; I have my doubts that putting that info in > pg_depend was a sane design choice at all. But from what I understand > of it, wouldn't all deps on a given object necessarily have the same > version? Correct, assuming of course that the dependencies are from the same object.
Re: Key management with tests
On Mon, Feb 1, 2021 at 07:47:57PM -0500, Bruce Momjian wrote: > On Mon, Feb 1, 2021 at 06:31:32PM -0500, Stephen Frost wrote: > > * Bruce Momjian (br...@momjian.us) wrote: > > > The purpose of cluster file encryption is to prevent users with read > > > access to the directories used to store database files and write-ahead > > > log files from being able to access the data stored in those files. > > > For example, when using cluster file encryption, users who have read > > > access to the cluster directories for backup purposes will not be able > > > to decrypt the data stored in these files. It also protects against > > > decrypted data access after media theft. > > > > That's one valid use-case and it particularly makes sense to consider, > > now that we support group read-access to the data cluster. The last > > Do enough people use group read-access to be useful? I am thinking group read-access might be a requirement for cluster file encryption to be effective. > > line seems a bit unclear- I would update it to say: > > Cluster file encryption also provides data-at-rest security, protecting > > users from data loss should the physical media on which the cluster is > > stored be stolen, improperly deprovisioned (not wiped or destroyed), or > > otherwise ends up in the hands of an attacker. > > I have split the section into three paragraphs, trimmed down some of the > suggested text, and added it. Full version below. Here is an updated doc description of memory reading: This also does not protect against users who have read access to database process memory — all in-memory data pages and data encryption keys are stored unencrypted in memory, so an attacker who --> is able to read memory can decrypt the entire cluster. The Postgres --> operating system user and the operating system administrator, e.g., --> the root user, have such access. > > > File system write access can allow for unauthorized file system data > > > decryption if the writes can be used to weaken the system's security > > > and this weakened system is later supplied with externally-stored keys. > > > > This isn't very clear as to exactly what the concern is or how an > > attacker would be able to thwart the system if they had write access to > > it. An attacker with write access could possibly attempt to replace the > > existing keys, but with the key wrapping that we're using, that should > > result in just a decryption failure (unless, of course, the attacker has > > the actual KEK that was used, but that's not terribly interesting to > > worry about since then they could just go access the files directly). > > Uh, well, they could modify postgresql.conf to change the script to save > the secret returned by the script before returning it to the PG server. > We could require postgresql.conf to be somewhere secure, but then how do > we know that is secure? I just don't see a clean solution here, but the > idea that you write and then wait for the key to show up seems like a > very valid way of attack, and it took me a while to be able to > articulate it. Let's suppose you lock down your cluster --- the non-PGDATA files are owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA and are not writable by the database OS user, or we have the PGDATA directory on another server, so the adversary can only write to the remote PGDATA directory. What can they do? 
Well, they can't modify pg_proc to add a shared library since pg_proc is encrypted, so we have to focus on files needed before encryption starts or files that can't be easily encrypted. They could create postgresql.conf.auto in PGDATA, and modify cluster_key_command to capture the key, or they could modify preload libraries or archive command to call a command to read memory as the PG OS user and write the key out somewhere, or use the key to rewrite the database files --- those wouldn't even need a database restart, just a reload. They could also modify pg_xact files so that, even though the heap/index files are encrypted, how the contents of those files are interpreted would change. In summary, to detect malicious user writes, you would need to protect the files used before encryption starts (root owned or owned by another user?), and encrypt all files after encryption starts --- any other approach would probably leave open attack vectors, and I don't think there is sufficient community desire to add such boundaries. How do other database systems guarantee to detect malicious writes? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Recording foreign key relationships for the system catalogs
"Joel Jacobson" writes: > I could only find one minor error, > found by running the regression-tests, > and then using the query below to compare "is_opt" > with my own "zero_values" in my tool > that derives its value from pg_catalog content. > ... > I therefore think is_opt should be changed to true for this row: > fktable| fkcols| pktable| pkcols | > is_array | is_opt > ---+-+--+---+--+ > pg_constraint | {confrelid,confkey} | pg_attribute | {attrelid,attnum} | t > | f No, I think it's correct as-is (and this is one reason that I chose to have two separate FK entries for cases like this). confrelid can be zero in rows that are not FK constraints. However, such a row must also have empty confkey. The above entry states that for each element of confkey, the pair (confrelid,confkey[i]) must be nonzero and have a match in pg_attribute. It creates no constraint if confkey is empty. > If this is fixed, I also agree this is ready to be committed. Appreciate the review! Please confirm if you agree with above analysis. regards, tom lane
Re: adding wait_start column to pg_locks
On 2021/02/02 22:00, torikoshia wrote: On 2021-01-25 23:44, Fujii Masao wrote: Another comment is; Doesn't the change of MyProc->waitStart need the lock table's partition lock? If yes, we can do that by moving LWLockRelease(partitionLock) just after the change of MyProc->waitStart, but which causes the time that lwlock is being held to be long. So maybe we need another way to do that. Thanks for your comments! It would be ideal for the consistency of the view to record "waitstart" during holding the table partition lock. However, as you pointed out, it would give non-negligible performance impacts. I may miss something, but as far as I can see, the influence of not holding the lock is that "waitstart" can be NULL even though "granted" is false. I think people want to know the start time of the lock when locks are held for a long time. In that case, "waitstart" should have already been recorded. Sounds reasonable. If this is true, I think the current implementation may be enough on the condition that users understand it can happen that "waitStart" is NULL and "granted" is false. Attached a patch describing this in the doc and comments. Any Thoughts? 64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating "waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock when reading "waitStart"? + Lock acquisition wait start time. Isn't it better to describe this more clearly? What about the following? Time when the server process started waiting for this lock, or null if the lock is held. + Note that updating this field and lock acquisition are not performed + synchronously for performance reasons. Therefore, depending on the + timing, it can happen that waitstart is + NULL even though + granted is false. I agree that it's helpful to add the note about that NULL can be returned even when "granted" is false. But IMO we don't need to document why this behavior can happen internally. So what about the following? Note that this can be null for a very short period of time after the wait started even though granted is false. Since the document for pg_locks uses "null" instead of NULL (I'm not sure why, though), I used "null" for the sake of consistency. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
bugfix - plpgsql - statement counter is incremented 2x for FOR stmt
Hi When I fixed one plpgsql_check issue, I found another plpgsql issue. Now we have field nstatements that hold a number of plpgsql statements in function. Unfortunately I made an error when I wrote this functionality and for FOR statements, this counter is incremented 2x. Higher number than a real number is better than a lesser number, but it can be a source of problems too (inside plpgsql_check I iterate over 0 .. nstatements stmtid, and due this bug I had a problem with missing statements). Attached patch is pretty simple: Regards Pavel diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y index abf196d4be..34e0520719 100644 --- a/src/pl/plpgsql/src/pl_gram.y +++ b/src/pl/plpgsql/src/pl_gram.y @@ -1341,7 +1341,6 @@ stmt_for : opt_loop_label K_FOR for_control loop_body new = (PLpgSQL_stmt_fori *) $3; new->lineno = plpgsql_location_to_lineno(@2); - new->stmtid = ++plpgsql_curr_compile->nstatements; new->label = $1; new->body = $4.stmts; $$ = (PLpgSQL_stmt *) new; @@ -1356,7 +1355,6 @@ stmt_for : opt_loop_label K_FOR for_control loop_body /* forq is the common supertype of all three */ new = (PLpgSQL_stmt_forq *) $3; new->lineno = plpgsql_location_to_lineno(@2); - new->stmtid = ++plpgsql_curr_compile->nstatements; new->label = $1; new->body = $4.stmts; $$ = (PLpgSQL_stmt *) new;
Re: memory leak in auto_explain
japin writes: > Here's my analysis: > 1) In the explain_ExecutorEnd(), it will create a ExplainState on SQL function > memory context, which is a long-lived, cause the memory grow up. Yeah, agreed. I think people looking at this have assumed that the ExecutorEnd hook would automatically be running in the executor's per-query context, but that's not so; we haven't yet entered standard_ExecutorEnd where the context switch is. The caller's context is likely to be much longer-lived than the executor's. I think we should put the switch one level further out than you have it here, just to be sure that InstrEndLoop is covered too (that doesn't allocate memory, but auto_explain shouldn't assume that). Otherwise seems like a good fix. regards, tom lane
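A sketch of the shape of fix being described (switch into the executor's per-query context one level further out, so InstrEndLoop is covered too). This only compiles inside the PostgreSQL source tree, follows the usual auto_explain hook structure, and is an assumption of how the final patch might look rather than the committed code.

/*
 * Sketch: allocate auto_explain's ExecutorEnd work in the executor's
 * per-query context so nothing leaks into the caller's longer-lived
 * context (e.g. a SQL function's context).
 */
static void
explain_ExecutorEnd(QueryDesc *queryDesc)
{
	if (queryDesc->totaltime && auto_explain_enabled())
	{
		MemoryContext oldcxt;

		/* per-query context dies with the query, so nothing here can leak */
		oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);

		InstrEndLoop(queryDesc->totaltime);		/* covered by the switch as well */

		/* ... build the ExplainState and log the plan here ... */

		MemoryContextSwitchTo(oldcxt);
	}

	if (prev_ExecutorEnd)
		prev_ExecutorEnd(queryDesc);
	else
		standard_ExecutorEnd(queryDesc);
}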
Re: bugfix - plpgsql - statement counter is incremented 2x for FOR stmt
Pavel Stehule writes: > When I fixed one plpgsql_check issue, I found another plpgsql issue. Now we > have field nstatements that hold a number of plpgsql statements in > function. Unfortunately I made an error when I wrote this functionality and > for FOR statements, this counter is incremented 2x. Higher number than a > real number is better than a lesser number, but it can be a source of > problems too (inside plpgsql_check I iterate over 0 .. nstatements stmtid, > and due this bug I had a problem with missing statements). > Attached patch is pretty simple: Right, done. regards, tom lane
Re: bugfix - plpgsql - statement counter is incremented 2x for FOR stmt
út 2. 2. 2021 v 20:36 odesílatel Tom Lane napsal: > Pavel Stehule writes: > > When I fixed one plpgsql_check issue, I found another plpgsql issue. Now > we > > have field nstatements that hold a number of plpgsql statements in > > function. Unfortunately I made an error when I wrote this functionality > and > > for FOR statements, this counter is incremented 2x. Higher number than a > > real number is better than a lesser number, but it can be a source of > > problems too (inside plpgsql_check I iterate over 0 .. nstatements > stmtid, > > and due this bug I had a problem with missing statements). > > > Attached patch is pretty simple: > > Right, done. > Thank you for commit Regards Pavel > regards, tom lane >
Re: Recording foreign key relationships for the system catalogs
On Tue, Feb 2, 2021, at 17:00, Tom Lane wrote: >No, I think it's correct as-is (and this is one reason that I chose to >have two separate FK entries for cases like this). confrelid can be >zero in rows that are not FK constraints. However, such a row must >also have empty confkey. The above entry states that for each element >of confkey, the pair (confrelid,confkey[i]) must be nonzero and have >a match in pg_attribute. It creates no constraint if confkey is empty. Thanks for explaining, I get it now. >Appreciate the review! Please confirm if you agree with above >analysis. Yes, I agree with the analysis. /Joel
Re: Thoughts on "killed tuples" index hint bits support on standby
Hello, Peter.

> AFAICT that's not true, at least not in any practical sense. See the
> comment in the middle of MarkBufferDirtyHint() that begins with "If we
> must not write WAL, due to a relfilenode-specific...", and see the
> "Checksums" section at the end of src/backend/storage/page/README. The
> last paragraph in the README is particularly relevant:

I have attached a TAP test to demonstrate how easily checksums on standby and primary start to differ. The test shows two different scenarios - for both heap and index (and the bit is placed on both standby and primary). Yes, MarkBufferDirtyHint does not mark the page as dirty… so hint bits on the standby can easily be lost. But it leaves the page dirty if it already is (or it could be marked dirty by WAL replay later). So, hint bits can easily be flushed and taken into account during checksum calculation on both standby and primary.

> "We can set the hint, just not dirty the page as a result so the hint
> is lost when we evict the page or shutdown"

Yes, it is not allowed to mark a page as dirty because of hints on standby, because we could end up with this:

CHECKPOINT
SET HINT BIT
TORN FLUSH + CRASH = BROKEN CHECKSUM, SERVER FAULT

But this scenario is totally fine:

CHECKPOINT
FPI (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED, EVERYTHING IS OK

And, as a result, this is fine too:

CHECKPOINT
FPI WITH MASKED LP_DEAD (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED + LP_DEAD MASKED AGAIN IF STANDBY

So, my point here is that it is fine to mask LP_DEAD bits during replay because they are already different on standby and primary. And it is fine to set and flush hint bits (and LP_DEADs) on standby because they can already easily be flushed (we just need to consider minRecoveryPoint and, probably, OldestXmin from the primary in the LP_DEAD case to make promotion easy).

>> And `btree_mask` (and other mask functions) already used for consistency checks to exclude LP_DEAD.
> I don't see how that is relevant. btree_mask() is only used by
> wal_consistency_checking, which is mostly just for Postgres hackers.

I was thinking about the possibility of reusing these functions for masking during replay.

Thanks, Michail.

022_checksum_tests.pl Description: Perl program
Re: New IndexAM API controlling index vacuum strategies
On Tue, Feb 2, 2021 at 6:28 AM Victor Yegorov wrote: > I really like this idea! Cool! > It resembles the approach used in bottom-up index deletion, block-based > accounting provides a better estimate for the usefulness of the operation. It does resemble bottom-up index deletion, in one important general sense: it is somewhat qualitative (though *also* somewhat quantitive). This is new for vacuumlazy.c. But the idea now is to deemphasize bottom-up index deletion heavy workloads in the first version of this patch -- just to cut scope. The design I described yesterday centers around "~99.9% append-only table" workloads, where anti-wraparound vacuums that scan indexes are a big source of unnecessary work (in practice it is always anti-wraparound vacuums, simply because there will never be enough garbage to trigger a regular autovacuum run). But it now occurs to me that there is another very important case that it will also help, without making the triggering condition for index vacuuming any more complicated: it will help cases where HOT updates are expected (because all updates don't modify indexed columns). It's practically impossible for HOT updates to occur 100% of the time, even with workloads whose updates never modify indexed columns. You can clearly see this by looking at the stats from pg_stat_user_tables with a standard pgbench workload. It does get better with lower heap fill factor, but I think that HOT is never 100% effective (i.e. 100% of updates are HOT updates) in the real world -- unless maybe you set heap fillfactor as low as 50, which is very rare. HOT might well be 95% effective, or 99% effective, but it's never truly 100% effective. And so this is another important workload where the difference between "practically zero dead tuples" and "precisely zero dead tuples" *really* matters when deciding if a VACUUM operation needs to go ahead. Once again, a small difference, but also a big difference. Forgive me for repeating myself do much, but: paying attention to cost/benefit asymmetries like this one sometimes allow us to recognize an optimization that is an "excellent deal". We saw this with bottom-up index deletion. Seems good to keep an eye out for that. > I suppose that 1% threshold should be configurable as a cluster-wide GUC > and also as a table storage parameter? Possibly. I'm concerned about making any user-visible interface (say a GUC) compatible with an improved version that is smarter about bottom-up index deletion (in particular, one that can vacuum only a subset of the indexes on a table, which now seems too ambitious for Postgres 14). -- Peter Geoghegan
Re: [HACKERS] Custom compression methods
Even more review comments, still looking mostly at 0001: If there's a reason why parallel_schedule is arranging to run the compression test in parallel with nothing else, the comment in that file should explain the reason. If there isn't, it should be added to a parallel group that doesn't have the maximum number of tests yet, probably the last such group in the file. serial_schedule should add the test in a position that roughly corresponds to where it appears in parallel_schedule. I believe it's relatively standard practice to put variable declarations at the top of the file. compress_lz4.c and compress_pglz.c instead put those declarations nearer to the point of use. compressamapi.c has an awful lot of #include directives for the code it actually contains. I believe that we should cut that down to what is required by 0001, and other patches can add more later as required. In fact, it's tempting to just get rid of this .c file altogether and make the two functions it contains static inline functions in the header, but I'm not 100% sure that's a good idea. The copyright dates in a number of the file headers are out of date. binary_upgrade_next_pg_am_oid and the related changes to CreateAccessMethod don't belong in 0001, because it doesn't support non-built-in compression methods. These changes and the related pg_dump change should be moved to the patch that adds support for that. The comments added to dumpTableSchema() say that "compression is assigned by ALTER" but don't give a reason. I think they should. I don't know how much they need to explain about what the code does, but they definitely need to explain why it does it. Also, isn't this bad? If we create the column with the wrong compression setting initially and then ALTER it, we have to rewrite the table. If it's empty, that's cheap, but it'd still be better not to do it at all. I'm not sure it's a good idea for dumpTableSchema() to leave out specifying the compression method if it happens to be pglz. I think we definitely shouldn't do it in binary-upgrade mode. What if we changed the default in a future release? For that matter, even 0002 could make the current approach unsafe I think, anyway. The changes to pg_dump.h look like they haven't had a visit from pgindent. You should probably try to do that for the whole patch, though it's a bit annoying since you'll have to manually remove unrelated changes to the same files that are being modified by the patch. Also, why the extra blank line here? GetAttributeCompression() is hard to understand. I suggest changing the comment to "resolve column compression specification to an OID" and somehow rejigger the code so that you aren't using one not-NULL test and one NULL test on the same variable. Like maybe change the first part to if (!IsStorageCompressible(typstorage)) { if (compression == NULL) return InvalidOid; ereport(ERROR, ...); } It puzzles me that CompareCompressionMethodAndDecompress() calls slot_getallattrs() just before clearing the slot. It seems like this ought to happen before we loop over the attributes, so that we don't need to call slot_getattr() every time. See the comment for that function. But even if we didn't do that for some reason, why would we do it here? If it's already been done, it shouldn't do anything, and if it hasn't been done, it might overwrite some of the values we just poked into tts_values. It also seems suspicious that we can get away with clearing the slot and then again marking it valid. I'm not sure it really works like that. 
Like, can't clearing the slot invalidate pointers stored in tts_values[]? For instance, if they are pointers into an in-memory heap tuple, tts_heap_clear() is going to free the tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is going to unpin it. I think the supported procedure for this sort of thing is to have a second slot, set tts_values, tts_isnull etc. and then materialize the slot. After materializing the new slot, it's independent of the old slot, which can then be cleared. See for example tts_virtual_materialize(). The whole approach you've taken here might need to be rethought a bit. I think you are right to want to avoid copying everything over into a new slot if nothing needs to be done, and I think we should definitely keep that optimization, but I think if you need to copy stuff, you have to do the above procedure and then continue using the other slot instead of the original one. Some places I think we have functions that return either the original slot or a different one depending on how it goes; that might be a useful idea here. But, you also can't just spam-create slots; it's important that whatever ones we end up with get reused for every tuple. Doesn't the change to describeOneTableDetails() require declaring changing the declaration of char *headers[11] to char *headers[12]? How does this not fail Assert(cols <= lengthof(headers))? Why does describeOneTableDetais() arrange to truncate t
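To make the suggested two-slot procedure concrete, here is a sketch written against PostgreSQL's executor slot API (slot_getallattrs, ExecClearTuple, ExecStoreVirtualTuple, ExecMaterializeSlot). It is not the patch's code and only compiles inside the server source tree; the function name, the rewrite callbacks, and the assumption that the caller creates one reusable destination slot with MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual) are all for illustration.

/*
 * Return 'srcslot' unchanged if no column needs rewriting; otherwise build
 * the adjusted tuple in the reusable 'dstslot' and return that instead.
 * The original slot is left untouched, so clearing it later cannot
 * invalidate what we return.
 */
static TupleTableSlot *
adjust_tuple_columns(TupleTableSlot *srcslot, TupleTableSlot *dstslot,
					 bool (*needs_rewrite) (int attno, Datum val, bool isnull),
					 Datum (*rewrite_value) (int attno, Datum val))
{
	int			natts = srcslot->tts_tupleDescriptor->natts;
	bool		changed = false;

	/* deform the source tuple once, before we look at any attribute */
	slot_getallattrs(srcslot);

	ExecClearTuple(dstslot);
	memcpy(dstslot->tts_values, srcslot->tts_values, natts * sizeof(Datum));
	memcpy(dstslot->tts_isnull, srcslot->tts_isnull, natts * sizeof(bool));

	for (int attno = 0; attno < natts; attno++)
	{
		if (needs_rewrite(attno, dstslot->tts_values[attno],
						  dstslot->tts_isnull[attno]))
		{
			dstslot->tts_values[attno] =
				rewrite_value(attno, dstslot->tts_values[attno]);
			changed = true;
		}
	}

	if (!changed)
		return srcslot;			/* fast path: hand back the original slot */

	ExecStoreVirtualTuple(dstslot);

	/*
	 * Make the new slot own its data, so it stays valid even if srcslot is
	 * later cleared or its underlying buffer is unpinned.
	 */
	ExecMaterializeSlot(dstslot);

	return dstslot;
}

The key property, per the review comment, is that the destination slot is created once and reused for every tuple, and materialization makes it independent of the source slot before the source is cleared.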
Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)
"Joel Jacobson" writes: > Doc: improve documentation of oid columns that can be zero. Since this is pretty closely tied to the catalog-foreign-key work, I went ahead and reviewed/pushed it. The zero notations now match up with what we'd found in the other thread. regards, tom lane
Re: Perform COPY FROM encoding conversions in larger chunks
On Mon, Feb 1, 2021 at 12:15 PM Heikki Linnakangas wrote: > Thanks. I fixed it slightly differently, and also changed LocalToUtf() > to follow the same pattern, even though LocalToUtf() did not have the > same bug. Looks good to me. > I added a bunch of tests for various built-in conversions. Nice! I would like to have utf8 tests for every category of invalid byte (overlong, surrogate, 5 bytes, etc), but it's not necessary for this patch. > I spent some time refactoring and adding comments all around the patch, > hopefully making it all more clear. One notable difference is that I > renamed 'raw_buf' (which exists in master too) to 'input_buf', and > renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this > patch again another day with fresh eyes, and also try to add some tests > for the corner cases at buffer boundaries. The comments and renaming are really helpful in understanding that file! Although a new patch is likely forthcoming, I did take a brief look and found the following: In copyfromparse.c, this is now out of date: * Read the next input line and stash it in line_buf, with conversion to * server encoding. One of your FIXME comments seems to allude to this, but if we really need a difference here, maybe it should be explained: +#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */ +#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */ Lastly, it looks like pg_do_encoding_conversion_buf() ended up in 0003 accidentally? -- John Naylor EDB: http://www.enterprisedb.com
Re: Background writer and checkpointer in crash recovery
On Sat, Aug 29, 2020 at 8:13 PM Thomas Munro wrote: > Currently we don't run the bgwriter process during crash recovery. > I've CCed Simon and Heikki who established this in commit cdd46c76. > Based on that commit message, I think the bar to clear to change the > policy is to show that it's useful, and that it doesn't make crash > recovery less robust. See the other thread for some initial evidence > of usefulness from Jakub Wartak. I think it also just makes intuitive > sense that it's got to help bigger-than-buffer-pool crash recovery if > you can shift buffer eviction out of the recovery loop. As for > robustness, I suppose we could provide the option to turn it off just > in case that turns out to be useful in some scenarios, but I'm > wondering why we would expect something that we routinely run in > archive/streaming recovery to reduce robustness in only slightly > different circumstances. > > Here's an experiment-grade patch, comments welcome, though at this > stage it's primarily thoughts about the concept that I'm hoping to > solicit. I think the way it works right now is stupid and the proposed change is going in the right direction. We have ample evidence already that handing off fsyncs to a background process is a good idea, and there's no reason why that shouldn't be beneficial during crash recovery just as it is at other times. But even if it somehow failed to improve performance during recovery, there's another good reason to do this, which is that it would make the code simpler. Having the pendingOps stuff in the startup process in some recovery situations and in the checkpointer in other recovery situations makes this harder to reason about. As Tom said, the system state where bgwriter and checkpointer are not running is an uncommon one, and is probably more likely to have (or grow) bugs than the state where they are running. The rat's-nest of logic introduced by the comment "Perform a checkpoint to update all our recovery activity to disk." inside StartupXLOG() could really do with some simplification. Right now we have three cases: CreateEndOfRecoveryRecord(), RequestCheckpoint(), and CreateCheckpoint(). Maybe with this change we could get it down to just two, since RequestCheckpoint() already knows what to do about !IsUnderPostmaster. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Recording foreign key relationships for the system catalogs
"Joel Jacobson" writes: > On Tue, Feb 2, 2021, at 17:00, Tom Lane wrote: >> Appreciate the review! Please confirm if you agree with above >> analysis. > Yes, I agree with the analysis. Cool, thanks. I've pushed it now. regards, tom lane
Re: Proposal: Save user's original authenticated identity for logging
On Thu, 2021-01-28 at 18:22 +, Jacob Champion wrote:
> = Proposal =
>
> I propose that every auth method should store the string it uses to
> identify a user -- what I'll call an "authenticated identity" -- into
> one central location in Port, after authentication succeeds but before
> any pg_ident authorization occurs.

Thanks everyone for all of the feedback! Here's my summary of the conversation so far:

- The idea of storing the user's original identity consistently across all auth methods seemed to be positively received.
- Exposing this identity through log_line_prefix was not as well-received, landing somewhere between "meh" and "no thanks". The main concern was log bloat/expense.
- Exposing it through the CSV log got the same reception: if we expose it through log_line_prefix, we should expose it through CSV, but no one seemed particularly excited about either.
- The idea of logging this information once per session, as part of log_connection, got a more positive response. That way the information can still be obtained, but it doesn't clutter every log line.
- There was also some interest in exposing this through the statistics collector, either as a superuser-only feature or via the pg_read_all_stats role.
- There was some discussion around *which* string to choose as the identifier for more complicated cases, such as TLS client certificates.
- Other improvements around third-party authorization and role management were discussed, including the ability to auto-create nonexistent roles, to sync role definitions as a first-party feature, and to query an external system for role authorization.

(Let me know if there's something else I've missed.)

== My Plans ==

Given the feedback above, I'll continue to flesh out the PoC patch, focusing on 1) storing the identity in a single place for all auth methods and 2) exposing it consistently in the logs as part of log_connections. I'll drop the log_line_prefix format specifier from the patch and see what that does to the testing side of things. I also plan to write a follow-up patch to add the authenticated identity to the statistics collector, with pg_get_authenticated_identity() to retrieve it.

I'm excited to see where the third-party authz and role management conversations go, but I won't focus on those for my initial patchset. I think this patch has use even if those ideas are implemented too.

--Jacob
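As a hypothetical, standalone illustration of the plan (one central place to record what the auth method actually verified, reported once per session alongside connection logging): the struct, field, helper, and log wording below are all invented for this sketch and are not the PoC patch or the real Port definition.

/*
 * Toy sketch: each auth method records, exactly once, the identity string
 * it authenticated; connection logging reports it once per session.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyPort
{
	char	   *user_name;				/* role name requested in the startup packet */
	char	   *authenticated_identity;	/* what the auth method actually verified */
} ToyPort;

/* called by an auth method right after it succeeds, before pg_ident mapping */
static void
set_authenticated_identity(ToyPort *port, const char *id)
{
	assert(port->authenticated_identity == NULL);	/* set once, centrally */
	port->authenticated_identity = strdup(id);
}

/* emitted once per session, alongside the usual connection log line */
static void
log_connection_authorized(const ToyPort *port)
{
	printf("LOG:  connection authorized: user=%s authenticated identity=\"%s\"\n",
		   port->user_name,
		   port->authenticated_identity ? port->authenticated_identity : "");
}

int
main(void)
{
	ToyPort		port = {"alice", NULL};

	/* e.g. a client-certificate method would pass the certificate DN here */
	set_authenticated_identity(&port, "CN=alice,OU=engineering");
	log_connection_authorized(&port);

	free(port.authenticated_identity);
	return 0;
}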
Re: LogwrtResult contended spinlock
Hello, So I addressed about half of your comments in this version merely by fixing silly bugs. The problem I had which I described as "synchronization fails" was one of those silly bugs. So in further thinking, it seems simpler to make only LogwrtResult atomic, and leave LogwrtRqst as currently, using the spinlock. This should solve the contention problem we saw at the customer (but I've asked Jaime very nicely to do a test run, if possible, to confirm). For things like logical replication, which call GetFlushRecPtr() very frequently (causing the spinlock issue we saw) it is good, because we're no longer hitting the spinlock at all in that case. I have another (pretty mechanical) patch that renames LogwrtResult.Write to LogWriteResult and LogwrtResult.Flush to LogFlushResult. That more clearly shows that we're no longer updating them on unison. Didn't want to attach here because I didn't rebase on current one. But it seems logical: there's no longer any point in doing struct assignment, which is the only thing that stuff was good for. On 2021-Jan-29, Andres Freund wrote: > > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata, > > { > > /* advance global request to include new block(s) > > */ > > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, > > EndPos); > > + pg_memory_barrier(); > > That's not really useful - the path that actually updates already > implies a barrier. It'd probably be better to add a barrier to a "never > executed cmpxchg" fastpath. Got it. Do you mean in pg_atomic_monotonic_advance_u64() itself? I'm not sure which is the nicer semantics. (If it's got to be at the caller, then we'll need to return a boolean from there, which sounds worse.) > > @@ -2905,6 +2909,7 @@ XLogFlush(XLogRecPtr record) > > WriteRqstPtr = pg_atomic_read_u64(&XLogCtl->LogwrtRqst.Write); > > LogwrtResult.Write = > > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); > > LogwrtResult.Flush = > > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); > > + pg_read_barrier(); > > I'm not sure you really can get away with just a read barrier in these > places. We can't e.g. have later updates to other shared memory > variables be "moved" to before the barrier (which a read barrier > allows). Ah, that makes sense. I have not really studied the barrier locations terribly closely in this version of the patch. It probably misses some (eg. in GetFlushRecPtr and GetXLogWriteRecPtr). It is passing the tests for me, but alone that's probably not enough. I'm gonna try and study the generated assembly and see if I can make sense of things ... -- Álvaro Herrera39°49'30"S 73°17'W >From ca05447f41270d6b3ac3b6fcf816f86aba86d08f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 29 Jan 2021 22:34:23 -0300 Subject: [PATCH 1/2] add pg_atomic_monotonic_advance_u64 --- src/include/port/atomics.h | 21 + 1 file changed, 21 insertions(+) diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h index 856338f161..a1c4764d18 100644 --- a/src/include/port/atomics.h +++ b/src/include/port/atomics.h @@ -519,6 +519,27 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_) return pg_atomic_sub_fetch_u64_impl(ptr, sub_); } +/* + * Monotonically advance the given variable using only atomic operations until + * it's at least the target value. 
+ */ +static inline void +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_) +{ + uint64 currval; + +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION + AssertPointerAlignment(ptr, 8); +#endif + + currval = pg_atomic_read_u64(ptr); + while (currval < target_) + { + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) + break; + } +} + #undef INSIDE_ATOMICS_H #endif /* ATOMICS_H */ -- 2.20.1 >From d185cf50a47a0f9e346e49ccda038bb016ce246b Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH 2/2] make LogwrtResult atomic --- src/backend/access/transam/xlog.c | 66 +++ 1 file changed, 32 insertions(+), 34 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f03bd473e2..10802bd56f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -405,12 +405,9 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; * These structs are identical but are declared separately to indicate their * slightly different functions. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which
Re: Recording foreign key relationships for the system catalogs
I wrote: > * It would now be possible to remove the PGNSP and PGUID kluges > entirely in favor of plain BKI_LOOKUP references to pg_namespace > and pg_authid. The catalog header usage would get a little > more verbose: BKI_DEFAULT(PGNSP) becomes BKI_DEFAULT(pg_catalog) > and BKI_DEFAULT(PGUID) becomes BKI_DEFAULT(POSTGRES). I'm a bit > inclined to do it, simply to remove one bit of mechanism that has > to be documented; but it's material for a separate patch perhaps. Here's a patch for that part. I think this is probably a good idea not only because it removes magic, but because now that we have various predefined roles it's becoming more and more likely that some of those will need to be cross-referenced in other catalogs' initial data. With this change, nothing special will be needed for that. Multiple built-in schemas also become more feasible than they were. regards, tom lane diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml index 6d3c5be67f..db1b3d5e9a 100644 --- a/doc/src/sgml/bki.sgml +++ b/doc/src/sgml/bki.sgml @@ -540,17 +540,6 @@ expected to be in the pg_catalog schema. - - - - In addition to the generic lookup mechanisms, there is a special - convention that PGNSP is replaced by the OID of - the pg_catalog schema, - and PGUID is replaced by the OID of the bootstrap - superuser role. These usages are somewhat historical but so far - there hasn't been a need to generalize them. - - diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl index 5bdc7adc44..b159958112 100644 --- a/src/backend/catalog/genbki.pl +++ b/src/backend/catalog/genbki.pl @@ -184,15 +184,9 @@ my $GenbkiNextOid = $FirstGenbkiObjectId; # within a given Postgres release, such as fixed OIDs. Do not substitute # anything that could depend on platform or configuration. (The right place # to handle those sorts of things is in initdb.c's bootstrap_template1().) -my $BOOTSTRAP_SUPERUSERID = - Catalog::FindDefinedSymbolFromData($catalog_data{pg_authid}, - 'BOOTSTRAP_SUPERUSERID'); my $C_COLLATION_OID = Catalog::FindDefinedSymbolFromData($catalog_data{pg_collation}, 'C_COLLATION_OID'); -my $PG_CATALOG_NAMESPACE = - Catalog::FindDefinedSymbolFromData($catalog_data{pg_namespace}, - 'PG_CATALOG_NAMESPACE'); # Fill in pg_class.relnatts by looking at the referenced catalog's schema. @@ -213,11 +207,12 @@ foreach my $row (@{ $catalog_data{pg_am} }) $amoids{ $row->{amname} } = $row->{oid}; } -# There is only one authid at bootstrap time, and we handle it specially: -# the usually-defaulted symbol PGUID becomes the bootstrap superuser's OID. -# (We could drop this in favor of writing out BKI_DEFAULT(POSTGRES) ...) +# role OID lookup my %authidoids; -$authidoids{'PGUID'} = $BOOTSTRAP_SUPERUSERID; +foreach my $row (@{ $catalog_data{pg_authid} }) +{ + $authidoids{ $row->{rolname} } = $row->{oid}; +} # class (relation) OID lookup (note this only covers bootstrap catalogs!) my %classoids; @@ -240,11 +235,12 @@ foreach my $row (@{ $catalog_data{pg_language} }) $langoids{ $row->{lanname} } = $row->{oid}; } -# There is only one namespace at bootstrap time, and we handle it specially: -# the usually-defaulted symbol PGNSP becomes the pg_catalog namespace's OID. -# (We could drop this in favor of writing out BKI_DEFAULT(pg_catalog) ...) 
+# namespace (schema) OID lookup my %namespaceoids; -$namespaceoids{'PGNSP'} = $PG_CATALOG_NAMESPACE; +foreach my $row (@{ $catalog_data{pg_namespace} }) +{ + $namespaceoids{ $row->{nspname} } = $row->{oid}; +} # opclass OID lookup my %opcoids; diff --git a/src/include/catalog/pg_authid.dat b/src/include/catalog/pg_authid.dat index a643a09588..87d917ffc3 100644 --- a/src/include/catalog/pg_authid.dat +++ b/src/include/catalog/pg_authid.dat @@ -15,6 +15,10 @@ # The C code typically refers to these roles using the #define symbols, # so make sure every entry has an oid_symbol value. +# The bootstrap superuser is named POSTGRES according to this data and +# according to BKI_DEFAULT entries in other catalogs. However, initdb +# will replace that at database initialization time. + { oid => '10', oid_symbol => 'BOOTSTRAP_SUPERUSERID', rolname => 'POSTGRES', rolsuper => 't', rolinherit => 't', rolcreaterole => 't', rolcreatedb => 't', rolcanlogin => 't', diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h index bb6938caa2..3e37729436 100644 --- a/src/include/catalog/pg_class.h +++ b/src/include/catalog/pg_class.h @@ -38,7 +38,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat NameData relname; /* OID of namespace containing this class */ - Oid relnamespace BKI_DEFAULT(PGNSP) BKI_LOOKUP(pg_namespace); + Oid relnamespace BKI_DEFAULT(pg_catalog) BKI_LOOKUP(pg_namespace); /* OID of entry in pg_type for relation's implicit row type, if any */ Oid reltype BKI_LOOKUP_OPT(pg
Re: LogwrtResult contended spinlock
Hi, On 2021-02-02 20:19:19 -0300, Alvaro Herrera wrote: > > > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata, > > > { > > > /* advance global request to include new block(s) > > > */ > > > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, > > > EndPos); > > > + pg_memory_barrier(); > > > > That's not really useful - the path that actually updates already > > implies a barrier. It'd probably be better to add a barrier to a "never > > executed cmpxchg" fastpath. > > Got it. Do you mean in pg_atomic_monotonic_advance_u64() itself? Yes. > I'm not sure which is the nicer semantics. (If it's got to be at the > caller, then we'll need to return a boolean from there, which sounds > worse.) Nearly all other modifying atomic operations have full barrier semantics, so I think it'd be better to have it inside the pg_atomic_monotonic_advance_u64(). > +/* > + * Monotonically advance the given variable using only atomic operations > until > + * it's at least the target value. > + */ > +static inline void > +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 > target_) > +{ > + uint64 currval; > + > +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION > + AssertPointerAlignment(ptr, 8); > +#endif > + > + currval = pg_atomic_read_u64(ptr); > + while (currval < target_) > + { > + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) > + break; > + } > +} So I think it'd be static inline void pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_) { uint64 currval; #ifndef PG_HAVE_ATOMIC_U64_SIMULATION AssertPointerAlignment(ptr, 8); #endif currval = pg_atomic_read_u64(ptr); if (currval >= target_) { pg_memory_barrier(); return; } while (currval < target_) { if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) break; } } > @@ -1172,9 +1170,10 @@ XLogInsertRecord(XLogRecData *rdata, > /* advance global request to include new block(s) */ > if (XLogCtl->LogwrtRqst.Write < EndPos) > XLogCtl->LogwrtRqst.Write = EndPos; > - /* update local result copy while I have the chance */ > - LogwrtResult = XLogCtl->LogwrtResult; > SpinLockRelease(&XLogCtl->info_lck); > + /* update local result copy */ > + LogwrtResult.Write = > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); > + LogwrtResult.Flush = > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); > } As mentioned before - it's not clear to me why this is a valid thing to do without verifying all LogwrtResult.* usages. You can get updates completely out of order / independently. > @@ -2675,8 +2673,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) >* code in a couple of places. >*/ > { > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Write, > LogwrtResult.Write); > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Flush, > LogwrtResult.Flush); > + pg_memory_barrier(); > SpinLockAcquire(&XLogCtl->info_lck); > - XLogCtl->LogwrtResult = LogwrtResult; I still don't see why we need "locked" atomic operations here, rather than just a pg_atomic_write_u64(). They can only be modified with WALWriteLock held. There's two reasons for using a spinlock in this place: 1) it avoids torn reads of 64bit values - pg_atomic_write_u64()/pg_atomic_read_u64() avoid that already. 2) it ensures that Write/Flush are updated in unison - but that's not useful anymore, given that other places now read the variables separately. 
> @@ -3064,8 +3063,10 @@ XLogBackgroundFlush(void) > WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ; > > /* if we have already flushed that far, consider async commit records */ > + LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); > if (WriteRqst.Write <= LogwrtResult.Flush) > { > + pg_memory_barrier(); > SpinLockAcquire(&XLogCtl->info_lck); > WriteRqst.Write = XLogCtl->asyncXactLSN; > SpinLockRelease(&XLogCtl->info_lck); A SpinLockAcquire() is a full memory barrier on its own I think. This'd probably better solved by just making asyncXactLSN atomic... Greetings, Andres Freund
RE: Determine parallel-safety of partition relations for Inserts
Hi, Attaching v5 patches with the changes: * rebase the code on the greg's latest parallel insert patch * fix some code style. Please consider it for further review. Best regards, Houzj v5_0004-reloption-parallel_dml-test-and-doc.patch Description: v5_0004-reloption-parallel_dml-test-and-doc.patch v5_0001-guc-option-enable_parallel_dml-src.patch Description: v5_0001-guc-option-enable_parallel_dml-src.patch v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch v5_0003-reloption-parallel_dml-src.patch.patch Description: v5_0003-reloption-parallel_dml-src.patch.patch
RE: Determine parallel-safety of partition relations for Inserts
> For v3_0003-reloption-parallel_dml-src.patch : > + table_close(rel, NoLock); > Since the rel would always be closed, it seems the return value from > RelationGetParallelDML() can be assigned to a variable, followed by call to > table_close(), then the return statement. Thanks for the comment. Fixed in the latest patch. Best regards, houzj
Re: Single transaction in the tablesync worker?
On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila wrote: > > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith wrote: > > > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also > > tried a simple test where I do a DROP TABLE with very bad timing for > > the tablesync worker. It seems that doing this can cause the sync > > worker's MyLogicalRepWorker->relid to become invalid. > > > > I think this should be fixed by latest patch because I have disallowed > drop of a table when its synchronization is in progress. You can check > once and let me know if the issue still exists? > FYI - I confirmed that the problem scenario that I reported yesterday is no longer possible because now the V25 patch is disallowing the DROP TABLE while the tablesync is still running. PSA my test logs showing it is now working as expected. Kind Regards, Peter Smith. Fujitsu Australia Test Scenario 1. INSERT data so tablesync should copy something 2. While paused in LogicalRepSyncTableStart do a DROP TABLE to rip the rug out from under the sync worker! 3. Continue the sync worker see what happens == ## ## Insert data ## [postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');" INSERT 0 1 ## ## SUBSCRIBE and continue to breakpoint at top of tablesync function LogicalRepSyncTableStart ## [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub;" 2021-02-03 11:46:07.207 AEDT [8298] LOG: logical decoding found consistent point at 0/1654698 2021-02-03 11:46:07.207 AEDT [8298] DETAIL: There are no running transactions. 2021-02-03 11:46:07.207 AEDT [8298] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION 2021-02-03 11:46:07.218 AEDT [8302] LOG: logical replication apply worker for subscription "tap_sub" has started 2021-02-03 11:46:07.218 AEDT [8302] LOG: !!>> The apply worker process has PID = 8302 [postgres@CentOS7-x64 ~]$ 2021-02-03 11:46:07.224 AEDT [8306] LOG: starting logical decoding for slot "tap_sub" 2021-02-03 11:46:07.224 AEDT [8306] DETAIL: Streaming transactions committing after 0/16546D0, reading WAL from 0/1654698. 2021-02-03 11:46:07.224 AEDT [8306] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-02-03 11:46:07.225 AEDT [8306] LOG: logical decoding found consistent point at 0/1654698 2021-02-03 11:46:07.225 AEDT [8306] DETAIL: There are no running transactions. 2021-02-03 11:46:07.225 AEDT [8306] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-02-03 11:46:07.225 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:07.225 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:07.229 AEDT [8307] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2021-02-03 11:46:07.229 AEDT [8307] LOG: !!>> The tablesync worker process has PID = 8307 2021-02-03 11:46:07.229 AEDT [8307] LOG: !!>> Sleeping 30 secs. For debugging, attach to process 8307 now! 
## PID 8302 is apply worker ## PID 8307 is tablesync worker 2021-02-03 11:46:08.241 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:08.241 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:09.253 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:09.253 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:10.255 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:10.255 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:11.258 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:11.258 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:12.263 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:12.263 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:13.265 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:13.265 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:14.275 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:14.275 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:15.280 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:15.280 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03 11:46:16.282 AEDT [8302] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-02-03 11:46:16.282 AEDT [8302] LOG: !!>> apply worker: called process_syncing_tables 2021-02-03
RE: libpq debug log
(39) + of tracing. If (flags & PQTRACE_OUTPUT_TIMESTAMPS) is + true, then timestamp is not printed with each message. The flag name (OUTPUT) and its description (not printed) doesn't match. I think you can use less programmatical expression like "If flags contains PQTRACE_OUTPUT_TIMESTAMPS". (40) diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt +PQtraceEx 180 \ No newline at end of file What's the second line? Isn't the file missing an empty line at the end? (41) +void +PQtraceEx(PGconn *conn, FILE *debug_port, int flags) +{ + if (conn == NULL) + return; ... + if (!debug_port) + return; The if should better be as follows to match the style of existing surrounding code. + if (debug_port == NULL) (42) +pqLogFormatTimestamp(char *timestr, Size ts_len) I think you should use int or size_t instead of Size here, because other libpq code uses them. int should be enough. If the compiler gives warnings, prepend "(int)" before sizeof() at call sites. (43) + /* append microseconds */ + sprintf(timestr + strlen(timestr), ".%06d", (int) (tval.tv_usec / 1000)); "/ 1000" should be removed. (44) + if ((conn->traceFlags & PQTRACE_OUTPUT_TIMESTAMPS) == 0) + timestr_p = pqLogFormatTimestamp(timestr, sizeof(timestr)); == 0 should be removed. Regards Takayuki Tsunakawa
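Regarding items (42) and (43) above, here is a standalone sketch of the timestamp formatting in question: struct timeval's tv_usec already holds microseconds, so it can be printed with %06d directly and the "/ 1000" indeed has to go. The function and buffer names are illustrative only, not the patch's code.

/*
 * Standalone sketch of a trace-line timestamp with microsecond precision.
 */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static char *
format_trace_timestamp(char *buf, size_t buflen)
{
	struct timeval tval;
	struct tm	tmbuf;

	gettimeofday(&tval, NULL);
	localtime_r(&tval.tv_sec, &tmbuf);
	strftime(buf, buflen, "%Y-%m-%d %H:%M:%S", &tmbuf);

	/* append microseconds: no division, tv_usec is already in microseconds */
	snprintf(buf + strlen(buf), buflen - strlen(buf),
			 ".%06d", (int) tval.tv_usec);
	return buf;
}

int
main(void)
{
	char		timestr[64];

	printf("%s  >  some traced message\n",
		   format_trace_timestamp(timestr, sizeof(timestr)));
	return 0;
}

Dividing tv_usec by 1000 would instead print milliseconds zero-padded to six digits, which is why (43) asks for its removal.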
Re: Single transaction in the tablesync worker?
On Wed, Feb 3, 2021 at 12:24 AM Amit Kapila wrote: > The problem here is that we are allowing to drop the table when table > synchronization is still in progress and then we don't have any way to > know the corresponding slot or origin. I think we can try to drop the > slot and origin as well but that is not a good idea because slots once > dropped won't be rolled back. So, I have added a fix to disallow the > drop of the table when table synchronization is still in progress. > Apart from that, I have fixed comments raised by Peter as discussed > above and made some additional changes in comments, code (code changes > are cosmetic), and docs. > > Let me know if the issue reported is fixed or not? Yes, the issue is fixed, now the table drop results in an error. postgres=# drop table tab_rep ; ERROR: could not drop relation mapping for subscription "tap_sub" DETAIL: Table synchronization for relation "tab_rep" is in progress and is in state "f". HINT: Use ALTER SUBSCRIPTION ... ENABLE to enable subscription if not already enabled or use DROP SUBSCRIPTION ... to drop the subscription. regards, Ajin Cherian Fujitsu Australia
Re: adding wait_start column to pg_locks
On 2021/02/03 1:49, Fujii Masao wrote: On 2021/02/02 22:00, torikoshia wrote: On 2021-01-25 23:44, Fujii Masao wrote: Another comment is; Doesn't the change of MyProc->waitStart need the lock table's partition lock? If yes, we can do that by moving LWLockRelease(partitionLock) just after the change of MyProc->waitStart, but which causes the time that lwlock is being held to be long. So maybe we need another way to do that. Thanks for your comments! It would be ideal for the consistency of the view to record "waitstart" during holding the table partition lock. However, as you pointed out, it would give non-negligible performance impacts. I may miss something, but as far as I can see, the influence of not holding the lock is that "waitstart" can be NULL even though "granted" is false. I think people want to know the start time of the lock when locks are held for a long time. In that case, "waitstart" should have already been recorded. Sounds reasonable. If this is true, I think the current implementation may be enough on the condition that users understand it can happen that "waitStart" is NULL and "granted" is false. Attached a patch describing this in the doc and comments. Any Thoughts? 64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating "waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock when reading "waitStart"? Also it might be worth thinking to use 64-bit atomic operations like pg_atomic_read_u64(), for that. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
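As a rough illustration of the 64-bit atomic approach suggested above, the field could be declared as a pg_atomic_uint64 and accessed with the atomic read/write primitives. The struct and function names below are placeholders for the sketch, not the patch's actual code; in the real patch the field would live in PGPROC and be read from GetLockStatusData():

#include "postgres.h"
#include "datatype/timestamp.h"
#include "port/atomics.h"

/* stand-in for the PGPROC member under discussion */
typedef struct WaitStartSketch
{
    pg_atomic_uint64 waitStart;     /* TimestampTz stored as uint64; must be
                                     * set up once with pg_atomic_init_u64() */
} WaitStartSketch;

/* writer side (the lock waiter): no partition lock or spinlock required */
static void
set_wait_start(WaitStartSketch *p, TimestampTz now)
{
    pg_atomic_write_u64(&p->waitStart, (uint64) now);
}

/* reader side (e.g. GetLockStatusData for pg_locks): never sees a torn value */
static TimestampTz
get_wait_start(WaitStartSketch *p)
{
    return (TimestampTz) pg_atomic_read_u64(&p->waitStart);
}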
RE: libpq debug log
From: 'Alvaro Herrera' > > + conn->fe_msg->num_fields != DEF_FE_MSGFIELDS) > The rationale for that second condition is this: if the memory allocated > is the initial size, we don't free memory, because it would just be > allocated of the same size next time, and that size is not very big, so > it's not a big deal if we just let it be, so that it is reused if we > call PQtrace() again later. However, if the allocated size is larger > than default, then it is possible that some previous tracing run has > enlarged the trace struct to a very large amount of memory, and we don't > want to leave that in place. Ah, understood. In that case, num_fields should be max_fields. This has reminded me that freePGconn() should free be_msg and fe_msg. Regards Takayuki Tsunakawa
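To illustrate the allocation logic being discussed (keep the default-sized message buffer around for reuse, free it only if a previous tracing run enlarged it), here is a stand-alone sketch. The struct layout and the value of DEF_FE_MSGFIELDS are assumptions; only the member and macro names come from the patch under review:

#include <stdlib.h>

#define DEF_FE_MSGFIELDS 8              /* assumed default allocation size */

typedef struct pqFrontendMessageSketch
{
    int     max_fields;                 /* allocated capacity (not num_fields) */
    /* ... the real struct carries the field array here ... */
} pqFrontendMessageSketch;

static void
maybe_release_fe_msg(pqFrontendMessageSketch **fe_msg)
{
    /*
     * Keep a default-sized allocation so a later PQtrace() can reuse it;
     * free it only when it grew beyond the default, which is the case
     * worth reclaiming.
     */
    if (*fe_msg != NULL && (*fe_msg)->max_fields != DEF_FE_MSGFIELDS)
    {
        free(*fe_msg);
        *fe_msg = NULL;
    }
}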
Re: libpq debug log
At Wed, 3 Feb 2021 01:26:41 +, "tsunakawa.ta...@fujitsu.com" wrote in > (41) > +void > +PQtraceEx(PGconn *conn, FILE *debug_port, int flags) > +{ > + if (conn == NULL) > + return; > ... > + if (!debug_port) > + return; > > The if should better be as follows to match the style of existing surrounding > code. > > + if (debug_port == NULL)

I don't have a particular preference here, but FWIW the current PQuntrace is doing this: > void > PQuntrace(PGconn *conn) > { > if (conn == NULL) > return; > if (conn->Pfdebug) > { > fflush(conn->Pfdebug);

> (44) > + if ((conn->traceFlags & PQTRACE_OUTPUT_TIMESTAMPS) == 0) > + timestr_p = pqLogFormatTimestamp(timestr, sizeof(timestr)); > > == 0 should be removed.

Looking at the doc mentioned in comment #39: + flags contains flag bits describing the operating mode + of tracing. If (flags & PQTRACE_OUTPUT_TIMESTAMPS) is + true, then timestamp is not printed with each message. PQTRACE_OUTPUT_TIMESTAMPS is designed to *inhibit* timestamps from being prepended. If that is actually intended, the symbol name should be PQTRACE_NOOUTPUT_TIMESTAMP. Otherwise, the doc needs to be fixed.

By the way, removing "== 0" makes it difficult to tell whether the condition is correct or not; I recommend using "!= 0" rather than removing "== 0".

regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Single transaction in the tablesync worker?
On Wed, Feb 3, 2021 at 6:38 AM Peter Smith wrote: > > On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila wrote: > > > > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith wrote: > > > > > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also > > > tried a simple test where I do a DROP TABLE with very bad timing for > > > the tablesync worker. It seems that doing this can cause the sync > > > worker's MyLogicalRepWorker->relid to become invalid. > > > > > > > I think this should be fixed by latest patch because I have disallowed > > drop of a table when its synchronization is in progress. You can check > > once and let me know if the issue still exists? > > > > FYI - I confirmed that the problem scenario that I reported yesterday > is no longer possible because now the V25 patch is disallowing the > DROP TABLE while the tablesync is still running. > Thanks for the confirmation. BTW, can you please check if we can reproduce that problem without this patch? If so, we might want to apply this fix irrespective of this patch. If not, why not? -- With Regards, Amit Kapila.
Re: Single transaction in the tablesync worker?
On Wed, Feb 3, 2021 at 1:34 PM Amit Kapila wrote: > > On Wed, Feb 3, 2021 at 6:38 AM Peter Smith wrote: > > > > On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila wrote: > > > > > > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith wrote: > > > > > > > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also > > > > tried a simple test where I do a DROP TABLE with very bad timing for > > > > the tablesync worker. It seems that doing this can cause the sync > > > > worker's MyLogicalRepWorker->relid to become invalid. > > > > > > > > > > I think this should be fixed by latest patch because I have disallowed > > > drop of a table when its synchronization is in progress. You can check > > > once and let me know if the issue still exists? > > > > > > > FYI - I confirmed that the problem scenario that I reported yesterday > > is no longer possible because now the V25 patch is disallowing the > > DROP TABLE while the tablesync is still running. > > > > Thanks for the confirmation. BTW, can you please check if we can > reproduce that problem without this patch? If so, we might want to > apply this fix irrespective of this patch. If not, why not? > Yes, this was an existing postgres bug. It is independent of the patch. I can reproduce exactly the same stacktrace using the HEAD src pulled @ 3/Feb. PSA my test logs showing the details. Kind Regards, Peter Smith. Fujitsu Australia Test Scenario: 1. INSERT data so tablesync should copy something 2. While paused in LogicalRepSyncTableStart do a DROP TABLE to rip the rug out from under the sync worker! 3. Continue the sync worker see what happens All code is from PG HEAD (3/Feb) except the "!!>>" added to allow me to attech to debugger. == ## ## Insert data ## [postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');" INSERT 0 1 ## ## SUBSCRIBE and continue to breakpoint at top of tablesync function LogicalRepSyncTableStart ## [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub;" 2021-02-03 13:42:20.031 AEDT [9555] LOG: logical decoding found consistent point at 0/16605E8 2021-02-03 13:42:20.031 AEDT [9555] DETAIL: There are no running transactions. 2021-02-03 13:42:20.031 AEDT [9555] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION 2021-02-03 13:42:20.043 AEDT [9556] LOG: logical replication apply worker for subscription "tap_sub" has started 2021-02-03 13:42:20.043 AEDT [9556] LOG: !!>> The apply worker process has PID = 9556 [postgres@CentOS7-x64 ~]$ 2021-02-03 13:42:20.052 AEDT [9562] LOG: starting logical decoding for slot "tap_sub" 2021-02-03 13:42:20.052 AEDT [9562] DETAIL: Streaming transactions committing after 0/1660620, reading WAL from 0/16605E8. 2021-02-03 13:42:20.052 AEDT [9562] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-02-03 13:42:20.052 AEDT [9562] LOG: logical decoding found consistent point at 0/16605E8 2021-02-03 13:42:20.052 AEDT [9562] DETAIL: There are no running transactions. 
2021-02-03 13:42:20.052 AEDT [9562] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-02-03 13:42:20.056 AEDT [9564] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2021-02-03 13:42:20.056 AEDT [9564] LOG: !!>> The tablesync worker process has PID = 9564 2021-02-03 13:42:20.056 AEDT [9564] LOG: !!>> Sleeping 30 secs. For debugging, attach to process 9564 now! ## PID 9564 is sync worker ## PID 9556 is apply worker ## ## While paused in debugger for the tablesync worker do DROP TABLE on subscriber. ## Note we have not done any ALTER SUBSCRIPTION. ## psql -d test_sub -p 54321 -c "DROP TABLE test_tab;" DROP TABLE ## Following stacktrace LOG happens [postgres@CentOS7-x64 ~]$ TRAP: FailedAssertion("strvalue != NULL", File: "snprintf.c", Line: 442, PID: 9564) postgres: logical replication worker for subscription 16407 sync 16385 (ExceptionalCondition+0xb9)[0xad8f4b] postgres: logical replication worker for subscription 16407 sync 16385 [0xb4c6ee] postgres: logical replication worker for subscription 16407 sync 16385 (pg_vsnprintf+0x7c)[0xb4c072] postgres: logical replication worker for subscription 16407 sync 16385 (pvsnprintf+0x30)[0xb3f26f] postgres: logical replication worker for subscription 16407 sync 16385 (appendStringInfoVA+0x80)[0xb40cb6] postgres: logical replication worker for subscription 16407 sync 16385 (errmsg+0x183)[0xad9d33] postgres: logical replication worker for subscription 16407 sync 16385 [0x8c52f3] postgres: logical replication worker for subscription 16407 sync 16385 (LogicalRepSyncT
Re: memory leak in auto_explain
On Wed, 03 Feb 2021 at 02:13, Tom Lane wrote: > japin writes: >> Here's my analysis: >> 1) In the explain_ExecutorEnd(), it will create a ExplainState on SQL >> function >> memory context, which is a long-lived, cause the memory grow up. > > Yeah, agreed. I think people looking at this have assumed that the > ExecutorEnd hook would automatically be running in the executor's > per-query context, but that's not so; we haven't yet entered > standard_ExecutorEnd where the context switch is. The caller's > context is likely to be much longer-lived than the executor's. > > I think we should put the switch one level further out than you have > it here, just to be sure that InstrEndLoop is covered too (that doesn't > allocate memory, but auto_explain shouldn't assume that). Otherwise > seems like a good fix. > Thanks for your review and clarification. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
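For readers following along, the shape of the fix being discussed is roughly the following: a sketch of auto_explain's ExecutorEnd hook as it would appear inside auto_explain.c, with the context switch placed outside InstrEndLoop as suggested above. It relies on the file's existing prev_ExecutorEnd and auto_explain_enabled() helpers and is not the committed patch:

static void
explain_ExecutorEnd(QueryDesc *queryDesc)
{
    if (queryDesc->totaltime && auto_explain_enabled())
    {
        MemoryContext oldcxt;

        /*
         * Work in the executor's per-query memory context, so everything we
         * allocate here (including the ExplainState) is released when the
         * executor shuts down, instead of piling up in the caller's
         * longer-lived context.
         */
        oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);

        InstrEndLoop(queryDesc->totaltime);

        /* ... build the ExplainState and log the plan, as before ... */

        MemoryContextSwitchTo(oldcxt);
    }

    if (prev_ExecutorEnd)
        prev_ExecutorEnd(queryDesc);
    else
        standard_ExecutorEnd(queryDesc);
}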
Re: [HACKERS] Custom compression methods
On Wed, Feb 3, 2021 at 2:07 AM Robert Haas wrote: > > Even more review comments, still looking mostly at 0001: > > If there's a reason why parallel_schedule is arranging to run the > compression test in parallel with nothing else, the comment in that > file should explain the reason. If there isn't, it should be added to > a parallel group that doesn't have the maximum number of tests yet, > probably the last such group in the file. > > serial_schedule should add the test in a position that roughly > corresponds to where it appears in parallel_schedule. > > I believe it's relatively standard practice to put variable > declarations at the top of the file. compress_lz4.c and > compress_pglz.c instead put those declarations nearer to the point of > use. > > compressamapi.c has an awful lot of #include directives for the code > it actually contains. I believe that we should cut that down to what > is required by 0001, and other patches can add more later as required. > In fact, it's tempting to just get rid of this .c file altogether and > make the two functions it contains static inline functions in the > header, but I'm not 100% sure that's a good idea. > > The copyright dates in a number of the file headers are out of date. > > binary_upgrade_next_pg_am_oid and the related changes to > CreateAccessMethod don't belong in 0001, because it doesn't support > non-built-in compression methods. These changes and the related > pg_dump change should be moved to the patch that adds support for > that. > > The comments added to dumpTableSchema() say that "compression is > assigned by ALTER" but don't give a reason. I think they should. I > don't know how much they need to explain about what the code does, but > they definitely need to explain why it does it. Also, isn't this bad? > If we create the column with the wrong compression setting initially > and then ALTER it, we have to rewrite the table. If it's empty, that's > cheap, but it'd still be better not to do it at all. > > I'm not sure it's a good idea for dumpTableSchema() to leave out > specifying the compression method if it happens to be pglz. I think we > definitely shouldn't do it in binary-upgrade mode. What if we changed > the default in a future release? For that matter, even 0002 could make > the current approach unsafe I think, anyway. > > The changes to pg_dump.h look like they haven't had a visit from > pgindent. You should probably try to do that for the whole patch, > though it's a bit annoying since you'll have to manually remove > unrelated changes to the same files that are being modified by the > patch. Also, why the extra blank line here? > > GetAttributeCompression() is hard to understand. I suggest changing > the comment to "resolve column compression specification to an OID" > and somehow rejigger the code so that you aren't using one not-NULL > test and one NULL test on the same variable. Like maybe change the > first part to if (!IsStorageCompressible(typstorage)) { if > (compression == NULL) return InvalidOid; ereport(ERROR, ...); } > > It puzzles me that CompareCompressionMethodAndDecompress() calls > slot_getallattrs() just before clearing the slot. It seems like this > ought to happen before we loop over the attributes, so that we don't > need to call slot_getattr() every time. See the comment for that > function. But even if we didn't do that for some reason, why would we > do it here? 
If it's already been done, it shouldn't do anything, and > if it hasn't been done, it might overwrite some of the values we just > poked into tts_values. It also seems suspicious that we can get away > with clearing the slot and then again marking it valid. I'm not sure > it really works like that. Like, can't clearing the slot invalidate > pointers stored in tts_values[]? For instance, if they are pointers > into an in-memory heap tuple, tts_heap_clear() is going to free the > tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is > going to unpin it. I think the supported procedure for this sort of > thing is to have a second slot, set tts_values, tts_isnull etc. and > then materialize the slot. After materializing the new slot, it's > independent of the old slot, which can then be cleared. See for > example tts_virtual_materialize(). The whole approach you've taken > here might need to be rethought a bit. I think you are right to want > to avoid copying everything over into a new slot if nothing needs to > be done, and I think we should definitely keep that optimization, but > I think if you need to copy stuff, you have to do the above procedure > and then continue using the other slot instead of the original one. > Some places I think we have functions that return either the original > slot or a different one depending on how it goes; that might be a > useful idea here. But, you also can't just spam-create slots; it's > important that whatever ones we end up with get reused for every > tuple. > > Doesn't the change to describeO
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On Tue, Feb 2, 2021 at 9:45 AM Fujii Masao wrote: > One merit of keep_connections that I found is that we can use it even > when connecting to the older PostgreSQL that doesn't support > idle_session_timeout. Also it seems simpler to use keep_connections > rather than setting idle_session_timeout in multiple remote servers. > So I'm inclined to add this feature, but I'd like to hear more opinions. Thanks. > > And, just using idle_session_timeout on a remote server may not help > > us completely. Because the remote session may go away, while we are > > still using that cached connection in an explicit txn on the local > > session. Our connection retry will also not work because we are in the > > middle of an xact, so the local explicit txn gets aborted. > > Regarding idle_in_transaction_session_timeout, this seems true. But > I was thinking that idle_session_timeout doesn't cause this issue because > it doesn't close the connection in the middle of transaction. No? You are right. idle_session_timeout doesn't take effect when in the middle of an explicit txn. I missed this point. > Here are some review comments. > > - (used_in_current_xact && !keep_connections)) > + (used_in_current_xact && > + (!keep_connections || !entry->keep_connection))) > > The names of GUC and server-level option should be the same, > to make the thing less confusing? We can have GUC name keep_connections as there can be multiple connections within a local session and I can change the server level option keep_connection to keep_connections because a single foreign server can have multiple connections as we have seen that in the use case identified by you. I will change that in the next patch set. > IMO the server-level option should override GUC. IOW, GUC setting > should be used only when the server-level option is not specified. > But the above code doesn't seem to do that. Thought? Note that default values for GUC and server level option are on i.e. connections are cached. The main intention of the GUC is to not set server level options to false for all the foreign servers in case users don't want to keep any foreign server connections. If the server level option overrides GUC, then even if users set GUC to off, they have to set the server level option to false for all the foreign servers. So, the below code in the patch, first checks the GUC. If the GUC is off, then discards the connections. If the GUC is on, then it further checks the server level option. If it's off discards the connection, otherwise not. I would like it to keep this behaviour as is. Thoughts? if (PQstatus(entry->conn) != CONNECTION_OK || PQtransactionStatus(entry->conn) != PQTRANS_IDLE || entry->changing_xact_state || entry->invalidated || +(used_in_current_xact && +(!keep_connections || !entry->keep_connection))) { elog(DEBUG3, "discarding connection %p", entry->conn); disconnect_pg_server(entry); With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
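To restate the precedence being argued for as a standalone predicate (a hypothetical helper written only for illustration, not code from the patch): the keep_connections GUC acts as a global off switch, and the per-server keep_connection option is consulted only when the GUC is on.

#include <stdbool.h>

/* Both settings default to true (connections are cached). */
static bool
connection_should_be_kept(bool guc_keep_connections,
                          bool server_keep_connection)
{
    if (!guc_keep_connections)
        return false;               /* GUC off: discard regardless of option */

    return server_keep_connection;  /* GUC on: server-level option decides */
}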
Re: Identify missing publications from publisher while create/alter subscription.
On Mon, Jan 25, 2021 at 10:32 PM vignesh C wrote: > > I mean it doesn’t seem right to disallow to create the subscription if > > the publisher doesn't exist, and my reasoning was even though the > > publisher exists while creating the subscription you might drop it > > later right?. So basically, now also we can create the same scenario > > that a subscription may exist for the publication which does not > > exist. > > > > I would like to defer on documentation for this. > I feel we should have the behavior similar to publication tables as given > below, then it will be consistent and easier for the users: > > This is the behavior in case of table: > Step 1: > PUBLISHER SIDE: > create table t1(c1 int); > create table t2(c1 int); > CREATE PUBLICATION mypub1 for table t1,t2; > -- All above commands succeeds > Step 2: > SUBSCRIBER SIDE: > -- Create subscription without creating tables will result in error: > CREATE SUBSCRIPTION mysub1 CONNECTION 'dbname=source_rep host=localhost > user=vignesh port=5432' PUBLICATION mypub1; > ERROR: relation "public.t2" does not exist > create table t1(c1 int); > create table t2(c1 int); > > CREATE SUBSCRIPTION mysub1 CONNECTION 'dbname=source_rep host=localhost > user=vignesh port=5432' PUBLICATION mypub1; > > postgres=# select * from pg_subscription; > oid | subdbid | subname | subowner | subenabled | subbinary | substream | > subconninfo | subslotname | > subsynccommit | subpublications > ---+-+-+--++---+---+-+-+---+- > 16392 | 13756 | mysub1 | 10 | t | f | f | > dbname=source_rep host=localhost user=vignesh port=5432 | mysub1 | off > | {mypub1} > (1 row) > > postgres=# select *,srrelid::oid::regclass from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn | srrelid > -+-++---+- >16392 | 16389 | r | 0/1608BD0 | t2 >16392 | 16384 | r | 0/1608BD0 | t1 > > (2 rows) > Step 3: > PUBLISHER: > drop table t2; > create table t3; > CREATE PUBLICATION mypub2 for table t1,t3; > > Step 4: > SUBSCRIBER: > postgres=# select *,srrelid::oid::regclass from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn | srrelid > -+-++---+- >16392 | 16389 | r | 0/1608BD0 | t2 >16392 | 16384 | r | 0/1608BD0 | t1 > > (2 rows) > > postgres=# alter subscription mysub1 refresh publication ; > ALTER SUBSCRIPTION > > -- Subscription relation will be updated. > postgres=# select *,srrelid::oid::regclass from pg_subscription_rel; > srsubid | srrelid | srsubstate | srsublsn | srrelid > -+-++---+- >16392 | 16384 | r | 0/1608BD0 | t1 > (1 row) > > > -- Alter subscription fails while setting publication having a table that > does not exist > postgres=# alter subscription mysub1 set publication mysub2; > ERROR: relation "public.t3" does not exist > > To maintain consistency, we should have similar behavior in case of > publication too. > If a publication which does not exist is specified during create > subscription, then we should throw an error similar to step 2 behavior. > Similarly if a publication which does not exist is specified during alter > subscription, then we should throw an error similar to step 4 behavior. If > publication is dropped after subscription is created, this should be removed > when an alter subscription subname refresh publication is performed similar > to step 4. > Thoughts? IIUC, your idea is to check if the publications (that are associated with a subscription) are present in the publisher or not during ALTER SUBSCRIPTION ... REFRESH PUBLICATION;. 
If that's the case, then I have scenario: 1) subscription is created with pub1, pub2 and assume both the publications are present in the publisher 2) pub1 and pub2 tables data is replicated properly 3) pub2 is dropped on the publisher 4) run alter subscription .. refresh publication on the subscriber, so that the pub2 tables will be removed from the subscriber 5) for some reason, user creates pub2 again on the publisher and want to replicated some tables 6) run alter subscription .. refresh publication on the subscriber, so that the pub2 tables will be added to the subscriber table list Now, if we remove the dropped publication pub2 in step 4 from the subscription list(as per your above analysis and suggestion), then after step 5, users will need to add the publication pub2 to the subscription again. I feel this is a change in the current behaviour. The existing behaviour on master doesn't mandate this as the dropped publications are not removed from the subscription list at all. To not mandate any new behaviour, I would suggest to have a
Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax
On Thu, Jan 28, 2021 at 10:07 AM japin wrote: > Attaching v3 patches, please consider these for further review. I think we can add a commitfest entry for this feature, so that the patches will be tested on cfbot. Ignore if done already. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Can we have a new SQL callable function to get Postmaster PID?
Hi, Can we have a new function, say pg_postgres_pid(), to return the postmaster PID similar to pg_backend_pid()? At times, it can be difficult to use OS-level commands to get the pid of the postmaster that a backend is connected to. It's even worse if we have multiple postgres server instances running on the same system. I'm not sure whether it's safe to expose the postmaster pid this way, but it will be useful at least for debugging purposes on, say, Windows or other non-Linux platforms where it's a bit difficult to get the process id. Users can also look at the postmaster.pid file to figure out what the current postmaster pid is, if not using OS-level commands, but having a SQL-callable function makes life easier. The function can look like this:

Datum
pg_postgres_pid(PG_FUNCTION_ARGS)
{
    PG_RETURN_INT32(PostmasterPid);
}

Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
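For comparison, the existing pg_backend_pid() in the backend is essentially the same one-line pattern over MyProcPid, so the proposal mirrors an established precedent (shown below for reference; the pg_proc catalog entry and documentation the new function would also need are omitted):

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"          /* MyProcPid, PostmasterPid */

/* existing built-in function, shown for comparison */
Datum
pg_backend_pid(PG_FUNCTION_ARGS)
{
    PG_RETURN_INT32(MyProcPid);
}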
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Tue, Feb 02, 2021 at 10:32:19AM +0900, Michael Paquier wrote: > On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote: > > Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses > > index_create() with a proper tablespaceOid instead of > > SetRelationTableSpace(). And its checks structure is more restrictive even > > without tablespace change, so it doesn't use CheckRelationTableSpaceMove(). > > Sure. I have not checked the patch in details, but even with that it > would be much safer to me if we apply the same sanity checks > everywhere. That's less potential holes to worry about. Thanks Alexey for the new patch. I have been looking at the main patch in details. /* -* Don't allow reindex on temp tables of other backends ... their local -* buffer manager is not going to cope. +* We don't support moving system relations into different tablespaces +* unless allow_system_table_mods=1. */ If you remove the check on RELATION_IS_OTHER_TEMP() in reindex_index(), you would allow the reindex of a temp relation owned by a different session if its tablespace is not changed, so this cannot be removed. +!allowSystemTableMods && IsSystemRelation(iRel)) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot reindex temporary tables of other sessions"))); +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("permission denied: \"%s\" is a system catalog", +RelationGetRelationName(iRel; Indeed, a system relation with a relfilenode should be allowed to move under allow_system_table_mods. I think that we had better move this check into CheckRelationTableSpaceMove() instead of reindex_index() to centralize the logic. ALTER TABLE does this business in RangeVarCallbackForAlterRelation(), but our code path opening the relation is different for the non-concurrent case. + if (OidIsValid(params->tablespaceOid) && + IsSystemClass(relid, classtuple)) + { + if (!allowSystemTableMods) + { + /* Skip all system relations, if not allowSystemTableMods * I don't see the need for having two warnings here to say the same thing if a relation is mapped or not mapped, so let's keep it simple. +REINDEX (TABLESPACE regress_tblspace) SYSTEM CONCURRENTLY postgres; -- fail +ERROR: cannot reindex system catalogs concurrently [...] +REINDEX (TABLESPACE regress_tblspace) DATABASE regression; -- ok with warning +WARNING: cannot change tablespace of indexes on system relations, skipping all +REINDEX (TABLESPACE pg_default) DATABASE regression; -- ok with warning +WARNING: cannot change tablespace of indexes on system relations, skipping all Those tests are costly by design, so let's drop them. They have been useful to check the patch, but if tests are changed with objects remaining around this would cost a lot of resources. + /* It's not a shared catalog, so refuse to move it to shared tablespace */ + if (params->tablespaceOid == GLOBALTABLESPACE_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +errmsg("cannot move non-shared relation totablespace \"%s\"", +get_tablespace_name(params->tablespaceOid; There is no test coverage for this case with REINDEX CONCURRENTLY, and that's easy enough to stress. So I have added one. I have found that the test suite was rather messy in its organization. Table creations were done first with a set of tests not really ordered, so that was really hard to follow. This has also led to a set of tests that were duplicated, while other tests have been missed, mainly some cross checks for the concurrent and non-concurrent behaviors. 
I have reordered the whole so as tests on catalogs, normal tables and partitions are done separately with relations created and dropped for each set. Partitions use a global check for tablespaces and relfilenodes after one concurrent reindex (didn't see the point in doubling with the non-concurrent case as the same code path to select the relations from the partition tree is taken). An ACL test has been added at the end. The case of partitioned indexes was kind of interesting and I thought about that a couple of days, and I took the decision to ignore relations that have no storage as you did, documenting that ALTER TABLE can be used to update the references of the partitioned relations. The command is still useful with this behavior, and the tests I have added track that. Finally, I have reworked the docs, separating the limitations related to system catalogs and partitioned relations, to be more consistent with the notes at the end of the page. -- Michael From c021aeca905a738d38acb257cbc4724804273499 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 3 Feb 2021 15:26:21 +0900 Subject: [PATCH v11] Allow REINDEX to change tablespace REINDEX already
Re: Printing backtrace of postgres processes
Thanks Bharath for your comments. On Mon, Feb 1, 2021 at 6:14 AM Bharath Rupireddy wrote: > > On Fri, Jan 29, 2021 at 7:10 PM vignesh C wrote: > > > 4) How about following > > > + errmsg("must be a superuser to print backtrace > > > of backend process"))); > > > instead of > > > + errmsg("must be a superuser to print backtrace > > > of superuser query process"))); > > > > > > > Here the message should include superuser, we cannot remove it. Non > > super user can log non super user provided if user has permissions for > > it. > > > > > 5) How about following > > > errmsg("must be a member of the role whose backed > > > process's backtrace is being printed or member of > > > pg_signal_backend"))); > > > instead of > > > + errmsg("must be a member of the role whose > > > backtrace is being logged or member of pg_signal_backend"))); > > > > > > > Modified it. > > Maybe I'm confused here to understand the difference between > SIGNAL_BACKEND_NOSUPERUSER and SIGNAL_BACKEND_NOPERMISSION macros and > corresponding error messages. Some clarification/use case to know in > which scenarios we hit those error messages might help me. Did we try > to add test cases that show up these error messages for > pg_print_backtrace? If not, can we add? I have tested this manually: I have tested it manually, Here is the test I did: Create 2 users: create user test password 'test@123'; create user test1 password 'test@123'; Test1: Test print backtrace of a different user's session: ./psql -d postgres -U test psql (14devel) Type "help" for help. postgres=> select pg_backend_pid(); pg_backend_pid 30142 (1 row) -- ./psql -d postgres -U test1 psql (14devel) Type "help" for help. postgres=> select pg_print_backtrace(30142); ERROR: must be a member of the role whose backtrace is being logged or member of pg_signal_backend The above will be successful after: grant pg_signal_backend to test1; Test1: Test for non super user trying to print backtrace of a super user's session: ./psql -d postgres psql (14devel) Type "help" for help. postgres=# select pg_backend_pid(); pg_backend_pid 30211 (1 row) ./psql -d postgres -U test1 psql (14devel) Type "help" for help. postgres=> select pg_print_backtrace(30211); ERROR: must be a superuser to print backtrace of superuser process I have not added any tests for this as we required 2 active sessions and I did not see any existing framework for this. This test should help in relating the behavior. > > Attached v5 patch has the fixes for the same. > > Thoughts? > > Thanks. Here are some comments on v5 patch: > > 1) typo - it's "process" not "porcess" +a backend porcess. For example: > Modified. > 2) select * from pg_print_backtrace(NULL); > postgres=# select proname, proisstrict from pg_proc where proname = > 'pg_print_backtrace'; > proname | proisstrict > +- > pg_print_backtrace | t > > See the documentation: > "proisstrict bool > > Function returns null if any call argument is null. In that case the > function won't actually be called at all. Functions that are not > “strict” must be prepared to handle null inputs." > So below PG_ARGISNUL check is not needed, you can remove that, because > pg_print_backtrace will not get called with null input. > intbt_pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0); > Modified. > 3) Can we just set results = true instead of PG_RETURN_BOOL(true); so > that it will be returned from PG_RETURN_BOOL(result); just for > consistency? 
> if (!SendProcSignal(bt_pid, PROCSIG_BACKTRACE_PRINT, > InvalidBackendId)) > PG_RETURN_BOOL(true); > else > ereport(WARNING, > (errmsg("failed to send signal to postmaster: %m"))); > } > > PG_RETURN_BOOL(result); > I felt existing is better as it will reduce one instruction of setting first and then returning. There are only 2 places from where we return. I felt we could directly return true or false. > 4) Below is what happens if I request for a backtrace of the > postmaster process? 1388210 is pid of postmaster. > postgres=# select * from pg_print_backtrace(1388210); > WARNING: PID 1388210 is not a PostgreSQL server process > pg_print_backtrace > > f > > Does it make sense to have a postmaster's backtrace? If yes, can we > have something like below in sigusr1_handler()? > if (CheckPostmasterSignal(PMSIGNAL_BACKTRACE_EMIT)) > { > LogBackTrace(); > } > We had a discussion about this in [1] earlier and felt including this is not very useful. > 5) Can we have PROCSIG_PRINT_BACKTRACE instead of > PROCSIG_BACKTRACE_PRINT, just for readability and consistency with > function name? > PROCSIG_PRINT_BACKTRACE is better, I have modified it. > 6) I think it's n
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Wed, Feb 03, 2021 at 03:37:39PM +0900, Michael Paquier wrote: > index 627b36300c..4ee3951ca0 100644 > --- a/doc/src/sgml/ref/reindex.sgml > +++ b/doc/src/sgml/ref/reindex.sgml > @@ -293,8 +311,30 @@ REINDEX [ ( class="parameter">option [, ...] ) ] { IN > respectively. Each partition of the specified partitioned relation is > reindexed in a separate transaction. Those commands cannot be used inside > a transaction block when working on a partitioned table or index. > + If a REINDEX command fails when run on a partitioned > + relation, and TABLESPACE was specified, then it may not > + have moved all indexes to the new tablespace. Re-running the command > + will rebuild again all the partitions and move previously-unprocessed remove "again" > + indexes to the new tablespace. > + > + > + > + When using the TABLESPACE clause with > + REINDEX on a partitioned index or table, only the > + tablespace references of the partitions are updated. As partitioned > indexes I think you should say "of the LEAF partitions ..". The intermediate, partitioned tables are also "partitions" (partitioned partitions if you like). > + are not updated, it is recommended to separately use > + ALTER TABLE ONLY on them to achieve that. Maybe say: "..to set the default tablespace of any new partitions created in the future". -- Justin
Re: Printing backtrace of postgres processes
On Mon, Feb 1, 2021 at 11:04 AM Bharath Rupireddy wrote: > > On Mon, Feb 1, 2021 at 6:14 AM Bharath Rupireddy > wrote: > > On Fri, Jan 29, 2021 at 7:10 PM vignesh C wrote: > > > > 4) How about following > > > > + errmsg("must be a superuser to print backtrace > > > > of backend process"))); > > > > instead of > > > > + errmsg("must be a superuser to print backtrace > > > > of superuser query process"))); > > > > > > > > > > Here the message should include superuser, we cannot remove it. Non > > > super user can log non super user provided if user has permissions for > > > it. > > > > > > > 5) How about following > > > > errmsg("must be a member of the role whose backed > > > > process's backtrace is being printed or member of > > > > pg_signal_backend"))); > > > > instead of > > > > + errmsg("must be a member of the role whose > > > > backtrace is being logged or member of pg_signal_backend"))); > > > > > > > > > > Modified it. > > > > Maybe I'm confused here to understand the difference between > > SIGNAL_BACKEND_NOSUPERUSER and SIGNAL_BACKEND_NOPERMISSION macros and > > corresponding error messages. Some clarification/use case to know in > > which scenarios we hit those error messages might help me. Did we try > > to add test cases that show up these error messages for > > pg_print_backtrace? If not, can we add? > > Are these superuser and permission checks enough from a security > standpoint that we don't expose some sensitive information to the > user? Although I'm not sure, say from the backtrace printed and > attached to GDB, can users see the passwords or other sensitive > information from the system that they aren't supposed to see? > > I'm sure this point would have been discussed upthread. This will just print the backtrace of the current backend. Users cannot get password information from this. This backtrace will be sent from customer side to the customer support. Developers will use gdb to check the file and line number using the addresses. We are suggesting to use gdb to get the file and line number from the address. Users can attach gdb to the process even now without this feature, I think that behavior will be the same as is. That will not be impacted because of this feature. It was discussed to keep the checks similar to pg_terminate_backend as discussed in [1]. [1] https://www.postgresql.org/message-id/CA%2BTgmoZ8XeQVCCqfq826kAr804a1ZnYy46FnQB9r2n_iOofh8Q%40mail.gmail.com Regards, Vignesh
Re: Typo in tablesync comment
On Tue, Feb 02, 2021 at 07:23:37PM +0530, Amit Kapila wrote: > I don't mind changing to your proposed text but I think the current > wording is also okay and seems clear to me. Reading that again, I still find the word "transient" to be misleading in this context. Any extra opinions? -- Michael
Re: Printing backtrace of postgres processes
vignesh C writes: > On Mon, Feb 1, 2021 at 11:04 AM Bharath Rupireddy > wrote: >> Are these superuser and permission checks enough from a security >> standpoint that we don't expose some sensitive information to the >> user? > This will just print the backtrace of the current backend. Users > cannot get password information from this. Really? A backtrace normally exposes the text of the current query, for instance, which could contain very sensitive data (passwords in ALTER USER, customer credit card numbers in ordinary data, etc etc). We don't allow the postmaster log to be seen by any but very privileged users; it's not sane to think that this data is any less security-critical than the postmaster log. This point is entirely separate from the question of whether triggering stack traces at inopportune moments could cause system malfunctions, but that question is also not to be ignored. TBH, I'm leaning to the position that this should be superuser only. I do NOT agree with the idea that ordinary users should be able to trigger it, even against backends theoretically belonging to their own userid. (Do I need to point out that some levels of the call stack might be from security-definer functions with more privilege than the session's nominal user?) regards, tom lane
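If the superuser-only position prevails, the permission check in the proposed function reduces to something like the following. This is a sketch that borrows names proposed in this thread, such as PROCSIG_PRINT_BACKTRACE, and is not a committed design:

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"              /* superuser() */
#include "storage/procsignal.h"     /* SendProcSignal() */

Datum
pg_print_backtrace(PG_FUNCTION_ARGS)
{
    int         bt_pid = PG_GETARG_INT32(0);

    if (!superuser())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be a superuser to print backtrace of a backend process")));

    /*
     * PROCSIG_PRINT_BACKTRACE is the signal reason added by the patch under
     * discussion.  Success here means the signal was queued, not that the
     * backtrace has already been emitted.
     */
    PG_RETURN_BOOL(SendProcSignal(bt_pid, PROCSIG_PRINT_BACKTRACE,
                                  InvalidBackendId) == 0);
}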
Re: Typo in tablesync comment
On Wed, Feb 3, 2021 at 6:13 PM Michael Paquier wrote: > > On Tue, Feb 02, 2021 at 07:23:37PM +0530, Amit Kapila wrote: > > I don't mind changing to your proposed text but I think the current > > wording is also okay and seems clear to me. > > Reading that again, I still find the word "transient" to be misleading > in this context. Any extra opinions? OTOH I thought "additionally stored" made it seem like those states were in the catalog and "additionally" in shared memory. Maybe better to rewrite it more drastically? e.g

-
*The catalog pg_subscription_rel is used to keep information about
*subscribed tables and their state. The catalog holds all states
*except SYNCWAIT and CATCHUP which are only in shared memory.
-

Kind Regards, Peter Smith. Fujitsu Australia
Re: Single transaction in the tablesync worker?
On Tue, Feb 2, 2021 at 9:03 PM Ajin Cherian wrote: > I am sorry, my above steps were not correct. I think the reason for > the failure I was seeing were some other steps I did prior to this. I > will recreate this and update you with the appropriate steps. The correct steps are as follows: Publisher: postgres=# CREATE TABLE tab_rep (a int primary key); CREATE TABLE postgres=# INSERT INTO tab_rep SELECT generate_series(1,100); INSERT 0 100 postgres=# CREATE PUBLICATION tap_pub FOR ALL TABLES; CREATE PUBLICATION Subscriber: postgres=# CREATE TABLE tab_rep (a int primary key); CREATE TABLE postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false); NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION postgres=# ALTER SUBSCRIPTION tap_sub enable; ALTER SUBSCRIPTION Allow the tablesync to complete and then drop the subscription, the table remains full and restarting the subscription should fail with a constraint violation during tablesync but it does not. Subscriber: postgres=# drop subscription tap_sub ; NOTICE: dropped replication slot "tap_sub" on publisher DROP SUBSCRIPTION postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false); NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION postgres=# ALTER SUBSCRIPTION tap_sub enable; ALTER SUBSCRIPTION This takes the subscriber into an error loop but no mention of what the error was: 2021-02-02 05:01:34.698 EST [1549] LOG: logical replication table synchronization worker for subscription "tap_sub", table "tab_rep" has started 2021-02-02 05:01:34.739 EST [1549] ERROR: table copy could not rollback transaction on publisher 2021-02-02 05:01:34.739 EST [1549] DETAIL: The error was: another command is already in progress 2021-02-02 05:01:34.740 EST [8028] LOG: background worker "logical replication worker" (PID 1549) exited with exit code 1 2021-02-02 05:01:40.107 EST [1711] LOG: logical replication table synchronization worker for subscription "tap_sub", table "tab_rep" has started 2021-02-02 05:01:40.121 EST [1711] ERROR: could not create replication slot "pg_16479_sync_16435": ERROR: replication slot "pg_16479_sync_16435" already exists 2021-02-02 05:01:40.121 EST [8028] LOG: background worker "logical replication worker" (PID 1711) exited with exit code 1 2021-02-02 05:01:45.140 EST [1891] LOG: logical replication table synchronization worker for subscription "tap_sub", table "tab_rep" has started regards, Ajin Cherian Fujitsu Australia