Re: pg_upgrade fails with non-standard ACL
On Thu, Jan 21, 2021 at 01:03:58AM +0300, Anastasia Lubennikova wrote:
> On 03.01.2021 14:29, Noah Misch wrote:
> > Overall, this patch predicts a subset of cases where pg_dump will emit a
> > failing GRANT or REVOKE that targets a pg_catalog object.  Can you write a
> > code comment stating what it does and does not detect?  I think it's okay
> > to not predict every failure, but let's record the intended coverage.
> > Here are a few examples I know; there may be more.  The above query only
> > finds GRANTs to non-pinned roles.  pg_dump dumps the following, but the
> > above query doesn't find them:
> >
> >   REVOKE ALL ON FUNCTION pg_last_wal_replay_lsn FROM public;
> >   GRANT EXECUTE ON FUNCTION pg_reload_conf() TO pg_signal_backend;

I see a new comment listing object types.  Please also explain the lack of
preventing REVOKE failures (first example query above) and the limitation
around non-pinned roles (second example).

> > The above query should test refclassid.

Please do so.

> +		/* Handle table column objects */
> +		if (strstr(aclinfo->obj_type, "column") != NULL)
> +		{
> +			char	   *name_pos = strstr(aclinfo->obj_ident,
> +										   aclinfo->obj_name);
> +			if (*name_pos == '\"')
> +				name_pos--;

This solves the problem affecting a column named "a.b", but it fails for a
column named "pg_catalog" or "a""b".  I recommend solving this by retrieving
all three elements of the pg_identify_object_as_address array, then quoting
each of them on the client side.
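For illustration, a rough sketch of that retrieval (the function and OIDs
used here are just an example, not taken from the patch):

-- pg_identify_object_as_address() returns the three address parts
-- separately, so each element can be quoted on the client side
-- (e.g. with PQescapeIdentifier()) instead of re-parsing obj_ident.
SELECT type, object_names, object_args
FROM pg_identify_object_as_address('pg_proc'::regclass,
                                   'pg_reload_conf'::regproc, 0);
-- type     | object_names                | object_args
-- function | {pg_catalog,pg_reload_conf} | {}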
Re: Is Recovery actually paused?
On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy wrote:
> On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar wrote:
> > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy wrote:
> > > On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar wrote:
> > > > Please find the patch for the same.  I haven't added a test case for
> > > > this yet.  I mean we can write a test case to pause the recovery and
> > > > get the status.  But I am not sure that we can really write a
> > > > reliable test case for 'pause requested' and 'paused'.
> > >
> > > +1 to just show the recovery pause state in the output of
> > > pg_is_wal_replay_paused.  But, should the function name
> > > "pg_is_wal_replay_paused" be something like
> > > "pg_get_wal_replay_pause_state" or some other?  To me, when "is" exists
> > > in a function, I expect a boolean output.  Others may have better
> > > thoughts.
> > >
> > > IIUC the above change, ensuring the recovery is paused after it's
> > > requested lies with the user.  IMHO, the main problem we are trying to
> > > solve is not met.
> >
> > Basically, earlier there was no way for the user to know whether the
> > recovery is actually paused or not, because it always returned true
> > after a pause was requested.  Now, we will return whether a pause has
> > been requested or recovery is actually paused.  So a tool designer who
> > wants to wait for recovery to get paused can have a loop and wait until
> > the recovery state reaches 'paused'.  That will give better control.
>
> I get it and I agree to have that change.  My point was whether we can
> have a new function pg_wal_replay_pause_and_wait that waits until
> recovery is actually paused (along with pg_is_wal_replay_paused
> returning the actual state rather than a true/false) so that tool
> developers don't need to have the waiting code outside, if at all they
> care about it?  Others may have better thoughts than me.

I think the previous patch was based on that idea, where we thought that we
could pass an argument to pg_is_wal_replay_paused to decide whether to wait
or to return without waiting.  I think this version, which returns the
status instead of waiting, looks better to me.  I am not sure whether we
want another version of pg_wal_replay_pause which will wait for replay to
actually get paused.  I mean there is always scope to include functionality
in the database that could be achieved by the tool, but this patch was
trying to solve the problem that there was no way to know the status, so I
think returning the correct status should be the scope of this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
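For reference, the tool-side wait loop described above could look roughly
like this (a sketch only, assuming the patched pg_is_wal_replay_paused()
returns the pause state as text rather than a boolean):

-- on the standby: request the pause, then poll until replay has
-- actually stopped
SELECT pg_wal_replay_pause();
DO $$
BEGIN
    WHILE pg_is_wal_replay_paused() <> 'paused' LOOP
        PERFORM pg_sleep(0.1);
    END LOOP;
END;
$$;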
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi Mark and others,

On Sun, Jan 24, 2021, at 09:22, Mark Rofail wrote:
> Changelog:
> - v13 (compatible with current master 2021-01-24, commit
>   7e57255f6189380d545e1df6a6b38827b213e3da)
...
> I encourage everyone to review this patch. After considerable reviews
> and performance testing, it will be ready for a commitfest.
...
> Array-ELEMENT-foreign-key-v13.patch

This is awesome, a million thanks for this!

I've tested the patch and tried to use it in the pg_catalog-diff-tool I'm
working on. I found one problem, described in Test #3 below.

*** Test #1 OK: Multi-key FK on (oid, smallint[])

Find a suitable row to do testing on:

joel=# SELECT oid,conrelid,conkey FROM catalog_clone.pg_constraint WHERE cardinality(conkey) > 1 LIMIT 1;
  oid  | conrelid |  conkey
-------+----------+----------
 12112 |     1255 | {2,20,3}
(1 row)

Corrupting the row will not be detected since no FK yet:

joel=# UPDATE catalog_clone.pg_constraint SET conkey = '{2,20,3,1234}' WHERE oid = 12112;
UPDATE 1

Trying to add a FK now will detect the corrupted row:

joel=# ALTER TABLE catalog_clone.pg_constraint ADD FOREIGN KEY (conrelid, EACH ELEMENT OF conkey) REFERENCES catalog_clone.pg_attribute (attrelid, attnum);
ERROR:  insert or update on table "pg_constraint" violates foreign key constraint "pg_constraint_conrelid_conkey_fkey"
DETAIL:  Key (conrelid, EACH ELEMENT OF conkey)=(1255, {2,20,3,1234}) is not present in table "pg_attribute".

OK, good, we got an error.

Fix row and try again:

joel=# UPDATE catalog_clone.pg_constraint SET conkey = '{2,20,3}' WHERE oid = 12112;
UPDATE 1
joel=# ALTER TABLE catalog_clone.pg_constraint ADD FOREIGN KEY (conrelid, EACH ELEMENT OF conkey) REFERENCES catalog_clone.pg_attribute (attrelid, attnum);
ALTER TABLE

OK, good, FK added.

Thanks to the FK, trying to corrupt the column again will now give an error:

joel=# UPDATE catalog_clone.pg_constraint SET conkey = '{2,20,3,1234}' WHERE oid = 12112;
ERROR:  insert or update on table "pg_constraint" violates foreign key constraint "pg_constraint_conrelid_conkey_fkey"
DETAIL:  Key (conrelid, EACH ELEMENT OF conkey)=(1255, {2,20,3,1234}) is not present in table "pg_attribute".

OK, good, we got an error.

*** Test #2 OK: FK on oid[]

Find a suitable row to do testing on:

joel=# \d catalog_clone.pg_proc
 proallargtypes | oid[] |  |  |

joel=# SELECT oid,proallargtypes FROM catalog_clone.pg_proc WHERE cardinality(proallargtypes) > 1 LIMIT 1;
 oid  | proallargtypes
------+----------------
 3059 | {25,2276}
(1 row)

Corrupting the row will not be detected since no FK yet:

joel=# UPDATE catalog_clone.pg_proc SET proallargtypes = '{25,2276,1234}' WHERE oid = 3059;
UPDATE 1

Trying to add a FK now will detect the corrupted row:

joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proallargtypes) REFERENCES catalog_clone.pg_type (oid);
ERROR:  insert or update on table "pg_proc" violates foreign key constraint "pg_proc_proallargtypes_fkey"
DETAIL:  Key (EACH ELEMENT OF proallargtypes)=({25,2276,1234}) is not present in table "pg_type".

OK, good, we got an error.

Fix row and try again:

joel=# UPDATE catalog_clone.pg_proc SET proallargtypes = '{25,2276}' WHERE oid = 3059;
UPDATE 1
joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proallargtypes) REFERENCES catalog_clone.pg_type (oid);
ALTER TABLE

OK, good, FK added.

Thanks to the FK, trying to corrupt the column again will now give an error:

joel=# UPDATE catalog_clone.pg_proc SET proallargtypes = '{25,2276,1234}' WHERE oid = 3059;
ERROR:  insert or update on table "pg_proc" violates foreign key constraint "pg_proc_proallargtypes_fkey"
DETAIL:  Key (EACH ELEMENT OF proallargtypes)=({25,2276,1234}) is not present in table "pg_type".

OK, good, we got an error.

*** Test #3 NOT OK: FK on oidvector

Find a suitable row to do testing on:

joel=# \d catalog_clone.pg_proc
 proargtypes | oidvector |  |  |

joel=# SELECT oid,proargtypes FROM catalog_clone.pg_proc WHERE cardinality(proargtypes) > 1 LIMIT 1;
 oid | proargtypes
-----+-------------
  79 | 19 25
(1 row)

Corrupting the row will not be detected since no FK yet:

joel=# UPDATE catalog_clone.pg_proc SET proargtypes = '19 25 12345'::oidvector WHERE oid = 79;
UPDATE 1

Trying to add a FK now will detect the corrupted row:

joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proargtypes) REFERENCES catalog_clone.pg_type (oid);
ERROR:  insert or update on table "pg_proc" violates foreign key constraint "pg_proc_proargtypes_fkey"
DETAIL:  Key (EACH ELEMENT OF proargtypes)=(19 25 12345) is not present in table "pg_type".

OK, good, we got an error.

Fix row and try again:

joel=# UPDATE catalog_clone.pg_proc SET proargtypes = '19 25'::oidvector WHERE oid = 79;
UPDATE 1
joel=# ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proargtypes) REFERENCES catalog_clone.pg_type (oid);
ALTER TABLE

OK, good,
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi Masahiko:

On Fri, Jan 22, 2021 at 9:15 PM Masahiko Sawada wrote:
> Hi Andy,
>
> On Mon, Dec 7, 2020 at 9:15 PM Andy Fan wrote:
> > On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen <jesper.peder...@redhat.com> wrote:
> > > Hi,
> > >
> > > On 12/5/20 10:38 PM, Andy Fan wrote:
> > > > Currently the UniqueKey is defined as a List of Expr, rather than
> > > > EquivalenceClasses.  A complete discussion until now can be found at
> > > > [1] (the messages I replied to also matter a lot and the information
> > > > there is complete).  This patch has stopped at this place for a
> > > > while; I'm planning to try EquivalenceClasses, but any suggestion
> > > > would be welcome.
> > >
> > > Unfortunately I think we need a RfC style patch of both versions in
> > > their minimum implementation.
> > >
> > > Hopefully this will make it easier for one or more committers to
> > > decide on the right direction since they can do a side-by-side
> > > comparison of the two solutions.
> >
> > I get the exact same idea.  Actually I made EquivalenceClasses work
> > with baserel last weekend, and then I realized it is hard to compare
> > the 2 situations without looking into the real/PoC code, even for very
> > experienced people.  I will submit a new patch after I get the
> > partitioned relation and subquery cases working.  Hope I can make it in
> > one week.
>
> Status update for a commitfest entry.
>
> Are you planning to submit a new patch?  Or is there any blocker for
> this work?  This patch entry on the CF app has been in state Waiting on
> Author for a while.  If there is any update on that, please reflect it
> on the CF app.

I agree that the current status is "Waiting on author", and there is no
blocking issue for others.  I plan to work on this in one month; I have to
get my current urgent case completed first.  Sorry for the delayed action
and thanks for asking.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)
Extend more usecase for planning time partition pruning and init partition pruning.
Hi:

I recently found a use case like this:

    SELECT * FROM p, q WHERE p.partkey = q.colx AND (q.colx = $1 OR q.colx = $2);

Here we can do neither planning-time partition pruning nor init partition
pruning.  Even though run-time partition pruning works in the end, it is too
late in some cases since we have to initialize all the plan nodes in
advance.  In my case, there are 10+ partitioned relations in one query and
the execution time is short, so the cost of initializing that many plan
nodes matters a lot.

The attached patches fix this issue.  They just pick up the "p.partkey =
q.colx" case from root->eq_classes or rel->joinlist (outer join), and then
check if there is some baserestrictinfo in the other relation which can be
used for partition pruning.  To make things easier, both partkey and colx
must be Var expressions in the implementation.

- v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch

  Just some existing refactoring, plus extending ChangeVarNodes to be able
  to change var->attno.

- v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch

  Does the real job.

Thoughts?

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)

v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch
Description: Binary data

v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch
Description: Binary data
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On Sun, Jan 24, 2021, at 11:21, Mark Rofail wrote:
> This seems to be a type casting problem indeed. The coercion part of the
> patch found in "ruleutils.c:11438-11491" is the culprit, is there a cleaner
> way to achieve this?
> I am aware of the problem and will try to fix it, but I assume this is
> because of "Array-containselem-gin-v1.patch" can you give the v13 patch a
> try alone?
> I will work on a fix and send it soon.

Actually, the error occurred without the "Array-containselem-gin-v1.patch".
However, I also had the
"v3-0001-Add-primary-keys-and-unique-constraints-to-system.patch", to get
PKs on the oids. So, I've retested to ensure it didn't cause this problem,
this time with only 7e57255f6189380d545e1df6a6b38827b213e3da +
"Array-ELEMENT-foreign-key-v13.patch", but I still get the same error.

Exact steps to reproduce the error from a clean installation:

$ cd postgresql
$ git checkout 7e57255f6189380d545e1df6a6b38827b213e3da
$ patch -p1 < ~/Array-ELEMENT-foreign-key-v13.patch
$ ./configure --prefix=~/pg-head
$ make -j16
$ make install
$ initdb -D ~/pghead
$ pg_ctl -D ~/pghead -l /tmp/logfile start
$ createdb $USER
$ psql

CREATE SCHEMA catalog_clone;
CREATE TABLE catalog_clone.pg_proc AS SELECT * FROM pg_catalog.pg_proc;
CREATE TABLE catalog_clone.pg_type AS SELECT * FROM pg_catalog.pg_type;
ALTER TABLE catalog_clone.pg_type ADD PRIMARY KEY (oid);
ALTER TABLE catalog_clone.pg_proc ADD FOREIGN KEY (EACH ELEMENT OF proargtypes) REFERENCES catalog_clone.pg_type (oid);
UPDATE catalog_clone.pg_proc SET proargtypes = '19 25 12345'::oidvector WHERE oid = 79;
ERROR:  operator does not exist: oidvector pg_catalog.@> oid[]
LINE 1: ... 1 FROM ONLY "catalog_clone"."pg_type" x WHERE $1 OPERATOR(p...
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "catalog_clone"."pg_type" x WHERE $1 OPERATOR(pg_catalog.@>) ARRAY["oid"] FOR KEY SHARE OF x) z)

/Joel
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hello Joel,

On Sun, Jan 24, 2021 at 12:11 PM Joel Jacobson wrote:
> HINT:  No operator matches the given name and argument types. You might
> need to add explicit type casts.
> QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM
> pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT
> pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "catalog_clone"."pg_type" x
> WHERE $1 OPERATOR(pg_catalog.@>) ARRAY["oid"] FOR KEY SHARE OF x) z)
>
> It seems to me there is some type conversion between oidvector and oid[]
> that isn't working properly?

This seems to be a type casting problem indeed. The coercion part of the
patch, found in "ruleutils.c:11438-11491", is the culprit; is there a
cleaner way to achieve this?

I am aware of the problem and will try to fix it, but I assume this is
because of "Array-containselem-gin-v1.patch". Can you give the v13 patch a
try alone? I will work on a fix and send it soon.

/Mark
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hi again,

I found a similar problem with int2vector columns:

CREATE TABLE catalog_clone.pg_index AS SELECT * FROM pg_catalog.pg_index;
CREATE TABLE catalog_clone.pg_attribute AS SELECT attrelid,attnum FROM pg_catalog.pg_attribute;
ALTER TABLE catalog_clone.pg_attribute ADD UNIQUE (attrelid, attnum);
ALTER TABLE catalog_clone.pg_index ADD FOREIGN KEY (indrelid, EACH ELEMENT OF indkey) REFERENCES catalog_clone.pg_attribute (attrelid, attnum);

SELECT indexrelid,indrelid,indkey FROM catalog_clone.pg_index WHERE cardinality(indkey) > 1 LIMIT 1;
 indexrelid | indrelid | indkey
------------+----------+--------
       2837 |     2836 | 1 2
(1 row)

UPDATE catalog_clone.pg_index SET indkey = '1 2 12345'::int2vector WHERE indexrelid = 2837;
ERROR:  operator does not exist: int2vector pg_catalog.@> smallint[]
LINE 1: ...WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(p...
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM pg_catalog.unnest($2) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) FROM (SELECT 1 FROM ONLY "catalog_clone"."pg_attribute" x WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(pg_catalog.@>) ARRAY["attnum"] FOR KEY SHARE OF x) z)

This made me wonder if there are any more of these "vector" columns than the
oidvector and int2vector. It appears they are the only two:

SELECT DISTINCT data_type, udt_name FROM information_schema.columns WHERE udt_name LIKE '%vector' ORDER BY 1,2;
 data_type |  udt_name
-----------+------------
 ARRAY     | int2vector
 ARRAY     | oidvector
(2 rows)

/Joel
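As a side note on a possible direction for the coercion fix (an assumption
on my part, not something verified against the patch): both vector types
have built-in binary-coercible casts to the corresponding array types, so
the generated RI query could apply those casts explicitly:

-- these casts exist in core, so an explicit cast makes @> applicable
SELECT '19 25'::oidvector::oid[] @> ARRAY[19::oid, 25::oid];    -- true
SELECT '1 2'::int2vector::int2[] @> ARRAY[1::int2, 2::int2];    -- true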
Re: simplifying foreign key/RI checks
On Sun, Jan 24, 2021 at 11:26 AM Corey Huinker wrote: > On Sat, Jan 23, 2021 at 12:52 PM Zhihong Yu wrote: >> >> Hi, Thanks for the review. >> + for (i = 0; i < riinfo->nkeys; i++) >> + { >> + Oid eq_opr = eq_oprs[i]; >> + Oid typeid = RIAttType(fk_rel, riinfo->fk_attnums[i]); >> + RI_CompareHashEntry *entry = ri_HashCompareOp(eq_opr, typeid); >> + >> + if (pk_nulls[i] != 'n' && >> OidIsValid(entry->cast_func_finfo.fn_oid)) >> >> It seems the pk_nulls[i] != 'n' check can be lifted ahead of the assignment >> to the three local variables. That way, ri_HashCompareOp wouldn't be called >> when pk_nulls[i] == 'n'. Good idea, so done. Although, there can't be nulls right now. >> + case TM_Updated: >> + if (IsolationUsesXactSnapshot()) >> ... >> + case TM_Deleted: >> + if (IsolationUsesXactSnapshot()) >> >> It seems the handling for TM_Updated and TM_Deleted is the same. The cases >> for these two values can be put next to each other (saving one block of >> code). Ah, yes. The TM_Updated case used to be handled a bit differently in earlier unposted versions of the patch, though at some point I concluded that the special handling was unnecessary, but didn't realize what you just pointed out. Fixed. > I'll pause on reviewing v4 until you've addressed the suggestions above. Here's v5. -- Amit Langote EDB: http://www.enterprisedb.com v5-0002-Avoid-using-SPI-for-some-RI-checks.patch Description: Binary data v5-0001-Export-get_partition_for_tuple.patch Description: Binary data
Re: POC: postgres_fdw insert batching
On Sun, Jan 24, 2021 at 2:17 AM Tomas Vondra wrote: > On 1/23/21 9:31 AM, Amit Langote wrote: > > I was looking at this and it looks like we've got a problematic case > > where postgresGetForeignModifyBatchSize() is called from > > ExecInitRoutingInfo(). > > > > That case is when the insert is performed as part of a cross-partition > > update of a partitioned table containing postgres_fdw foreign table > > partitions. Because we don't check the operation in > > ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such > > inserts attempt to use batching. However the ResultRelInfo may be one > > for the original update operation, so ri_FdwState contains a > > PgFdwModifyState with batch_size set to 0, because updates don't > > support batching. As things stand now, > > postgresGetForeignModifyBatchSize() simply returns that, tripping the > > following Assert in the caller. > > > > Assert(partRelInfo->ri_BatchSize >= 1); > > > > Use this example to see the crash: > > > > create table p (a int) partition by list (a); > > create table p1 (like p); > > create extension postgres_fdw; > > create server lb foreign data wrapper postgres_fdw ; > > create user mapping for current_user server lb; > > create foreign table fp1 (a int) server lb options (table_name 'p1'); > > alter table p attach partition fp1 for values in (1); > > create or replace function report_trig_details() returns trigger as $$ > > begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op = > > 'DELETE' then return old; end if; return new; end; $$ language > > plpgsql; > > create trigger trig before update on fp1 for each row execute function > > report_trig_details(); > > create table p2 partition of p for values in (2); > > insert into p values (2); > > update p set a = 1; -- crashes > > > > So we let's check mtstate->operation == CMD_INSERT in > > ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize() > > in cross-update situations where mtstate->operation would be > > CMD_UPDATE. > > > > I've attached a patch. > > Thanks for catching this. I think it'd be good if the fix included a > regression test. The example seems like a good starting point, not sure > if it can be simplified further. Yes, it can be simplified by using a local join to prevent the update of the foreign partition from being pushed to the remote server, for which my example in the previous email used a local trigger. Note that the update of the foreign partition to be done locally is a prerequisite for this bug to occur. I've added that simplified test case in the attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com v2-0001-Prevent-FDW-insert-batching-during-cross-partitio.patch Description: Binary data
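For reference, the simplified test takes roughly this shape (an illustrative
sketch reusing the tables from the earlier example, not necessarily the
exact test added to the patch):

-- joining the UPDATE against a local table prevents the foreign
-- partition's update from being pushed to the remote server, so the
-- cross-partition move goes through the local insert (routing) path;
-- without the fix, this tripped the Assert
CREATE TABLE loc (a int);
INSERT INTO loc VALUES (2);
UPDATE p SET a = 1 FROM loc WHERE p.a = loc.a;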
Re: mkid reference
On Fri, Jan 22, 2021 at 7:07 PM Tom Lane wrote: > > Daniel Gustafsson writes: > >> On 22 Jan 2021, at 12:56, Magnus Hagander wrote: > >> And maybe even more interestnig -- is there a point to this whole > >> make_diff directory at all in these days of git? Or should we just > >> remove it rather than try to fix it? > > > There's also src/tools/make_mkid which use this mkid tool. +1 for removing. > > If anything, it seems better replaced by extended documentation on the > > existing > > wiki article [0] on how to use "git format-patch". > > I found man pages for mkid online --- it's apparently a ctags-like > code indexing tool, not something for patches. So maybe Bruce still > uses it, or maybe not. But as long as we've also got make_ctags and > make_etags in there, I don't have a problem with leaving make_mkid. > > make_diff, on the other hand, certainly looks like technology whose > time has passed. I wonder about pgtest, too. I'll go kill make_diff then -- quicker than fixing the docs of it. As for pgtest, that one looks a bit interesting as well -- but it's been patched on as late as 9.5 and in 2018, so it seems at least Bruce uses it :) While at it, what point is "codelines" adding? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On 2021-Jan-24, Julien Rouhaud wrote:

> +	/*
> +	 * Do not allow tuples with invalid combinations of hint bits to be placed
> +	 * on a page.  These combinations are detected as corruption by the
> +	 * contrib/amcheck logic, so if you disable one or both of these
> +	 * assertions, make corresponding changes there.
> +	 */
> +	Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> +			 (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
>
> I attach a simple self contained script to reproduce the problem, the last
> UPDATE triggering the Assert.
>
> I'm not really familiar with this part of the code, so it's not exactly
> clear to me if some logic is missing in compute_new_xmax_infomask() /
> heap_prepare_insert(), or if this should actually be an allowed combination
> of hint bits.

Hmm, it's probably a bug in compute_new_xmax_infomask.  I don't think
the combination is sensible.

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"There is evil in the world. There are dark, awful things. Occasionally, we
get a glimpse of them. But there are dark corners; horrors almost impossible
to imagine... even in our worst nightmares." (Van Helsing, Dracula A.D. 1972)
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On Sun, 24 Jan 2021 at 11:48, Julien Rouhaud wrote:
>
> Hi,
>
> While working on pg14 compatibility for an extension relying on an
> apparently uncommon combination of FOR UPDATE and stored function calls, I
> hit some new Asserts introduced in 866e24d47db (Extend amcheck to check
> heap pages):
>
> +	/*
> +	 * Do not allow tuples with invalid combinations of hint bits to be placed
> +	 * on a page.  These combinations are detected as corruption by the
> +	 * contrib/amcheck logic, so if you disable one or both of these
> +	 * assertions, make corresponding changes there.
> +	 */
> +	Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> +			 (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
>
> I attach a simple self contained script to reproduce the problem, the last
> UPDATE triggering the Assert.
>
> I'm not really familiar with this part of the code, so it's not exactly
> clear to me if some logic is missing in compute_new_xmax_infomask() /
> heap_prepare_insert(), or if this should actually be an allowed combination
> of hint bits.

Thanks Julien for reporting this. I am also able to reproduce this assert.

*Small test case to reproduce:*

> DROP TABLE IF EXISTS t1;
> CREATE TABLE t1(id integer, val text);
> INSERT INTO t1 SELECT i, 'val' FROM generate_series(1, 2) i;
>
> BEGIN;
> SAVEPOINT s1;
> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE;
> UPDATE t1 SET val = 'hoho' WHERE id = 2;
> release s1;
> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE;
> UPDATE t1 SET val = 'hoho' WHERE id = 2;

If we remove the "release s1;" step from the test case, then we are not
getting this assert failure.

*Stack trace:*

> warning: Unexpected size of section `.reg-xstate/123318' in core file.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `postgres: mahendrathalor postgres [local] UPDATE '.
> Program terminated with signal SIGABRT, Aborted.
>
> warning: Unexpected size of section `.reg-xstate/123318' in core file.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> 51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 > #1 0x7fb50a7c88b1 in __GI_abort () at abort.c:79 > #2 0x5612a63f7c84 in ExceptionalCondition ( > conditionName=0x5612a64da470 "!((tuple->t_data->t_infomask & > HEAP_XMAX_LOCK_ONLY) && (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED))", > errorType=0x5612a64da426 "FailedAssertion", fileName=0x5612a64da420 > "hio.c", lineNumber=57) at assert.c:69 > #3 0x5612a597e76b in RelationPutHeapTuple (relation=0x7fb50ce37fc0, > buffer=163, tuple=0x5612a795de18, token=false) at hio.c:56 > #4 0x5612a5955d32 in heap_update (relation=0x7fb50ce37fc0, > otid=0x7ffc8b5e30d2, newtup=0x5612a795de18, cid=0, crosscheck=0x0, > wait=true, tmfd=0x7ffc8b5e3060, > lockmode=0x7ffc8b5e3028) at heapam.c:3791 > #5 0x5612a596ebdc in heapam_tuple_update (relation=0x7fb50ce37fc0, > otid=0x7ffc8b5e30d2, slot=0x5612a794d348, cid=3, snapshot=0x5612a793d620, > crosscheck=0x0, wait=true, > tmfd=0x7ffc8b5e3060, lockmode=0x7ffc8b5e3028, > update_indexes=0x7ffc8b5e3025) at heapam_handler.c:327 > #6 0x5612a5da745d in table_tuple_update (rel=0x7fb50ce37fc0, > otid=0x7ffc8b5e30d2, slot=0x5612a794d348, cid=3, snapshot=0x5612a793d620, > crosscheck=0x0, wait=true, > tmfd=0x7ffc8b5e3060, lockmode=0x7ffc8b5e3028, > update_indexes=0x7ffc8b5e3025) at ../../../src/include/access/tableam.h:1422 > #7 0x5612a5dab6ef in ExecUpdate (mtstate=0x5612a794bc20, > resultRelInfo=0x5612a794be58, tupleid=0x7ffc8b5e30d2, oldtuple=0x0, > slot=0x5612a794d348, planSlot=0x5612a794d1f8, > epqstate=0x5612a794bd18, estate=0x5612a794b9b0, canSetTag=true) at > nodeModifyTable.c:1498 > #8 0x5612a5dadb17 in ExecModifyTable (pstate=0x5612a794bc20) at > nodeModifyTable.c:2254 > #9 0x5612a5d4fdc5 in ExecProcNodeFirst (node=0x5612a794bc20) at > execProcnode.c:450 > #10 0x5612a5d3bd3a in ExecProcNode (node=0x5612a794bc20) at > ../../../src/include/executor/executor.h:247 > #11 0x5612a5d40764 in ExecutePlan (estate=0x5612a794b9b0, > planstate=0x5612a794bc20, use_parallel_mode=false, operation=CMD_UPDATE, > sendTuples=false, numberTuples=0, > direction=ForwardScanDirection, dest=0x5612a7946d38, > execute_once=true) at execMain.c:1542 > #12 0x5612a5d3c8e2 in standard_ExecutorRun (queryDesc=0x5612a79468c0, > direction=ForwardScanDirection, count=0, execute_once=true) at > execMain.c:364 > #13 0x5612a5d3c5aa in ExecutorRun (queryDesc=0x5612a79468c0, > direction=ForwardScanDirection, count=0, execute_once=true) at > execMain.c:308 > #14 0x5612a612d78a in ProcessQuery (plan=0x5612a7946c48, > sourceText=0x5612a787b570 "UPDATE t1 SET val = 'hoho' WHERE id = 2;", > params=0x0, queryEnv=0
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On Mon, Jan 25, 2021 at 12:01 AM Alvaro Herrera wrote:
>
> On 2021-Jan-24, Julien Rouhaud wrote:
>
> > +	/*
> > +	 * Do not allow tuples with invalid combinations of hint bits to be placed
> > +	 * on a page.  These combinations are detected as corruption by the
> > +	 * contrib/amcheck logic, so if you disable one or both of these
> > +	 * assertions, make corresponding changes there.
> > +	 */
> > +	Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > +			 (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> >
> > I attach a simple self contained script to reproduce the problem, the
> > last UPDATE triggering the Assert.
> >
> > I'm not really familiar with this part of the code, so it's not exactly
> > clear to me if some logic is missing in compute_new_xmax_infomask() /
> > heap_prepare_insert(), or if this should actually be an allowed
> > combination of hint bits.
>
> Hmm, it's probably a bug in compute_new_xmax_infomask.  I don't think
> the combination is sensible.

Yeah, the combination clearly doesn't make sense, but I'm wondering what to
do about existing data?  amcheck's verify_heapam() will report corruption
for those tuples, and at least all servers where the powa-archivist
extension is installed will be impacted.
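For anyone wanting to check an existing cluster, contrib/amcheck can be used
like this (a minimal example; the table name is a placeholder):

CREATE EXTENSION amcheck;
SELECT blkno, offnum, msg FROM verify_heapam('t1');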
Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination
On Mon, Jan 25, 2021 at 12:06 AM Mahendra Singh Thalor wrote: > > On Sun, 24 Jan 2021 at 11:48, Julien Rouhaud wrote: > > > > Hi, > > > > I'm not really familiar with this part of the code, so it's not exactly > > clear > > to me if some logic is missing in compute_new_xmax_infomask() / > > heap_prepare_insert(), or if this should actually be an allowed combination > > of > > hint bit. > > Thanks Juliean for reporting this. I am also able to reproduce this assert. Thanks for looking at it! > > Small test case to reproduce: >> >> DROP TABLE IF EXISTS t1; >> CREATE TABLE t1(id integer, val text); >> INSERT INTO t1 SELECT i, 'val' FROM generate_series(1, 2) i; >> >> BEGIN; >> SAVEPOINT s1; >> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE; >> UPDATE t1 SET val = 'hoho' WHERE id = 2; >> release s1; >> SELECT 1 FROM t1 WHERE id = 2 FOR UPDATE; >> UPDATE t1 SET val = 'hoho' WHERE id = 2; > > > If we remove the "release s1;" step from the test case, then we are not > getting this assert failure. Yes, this is the smallest reproducer that could trigger the problem, and the release is required.
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Hello Joel,

> UPDATE catalog_clone.pg_index SET indkey = '1 2 12345'::int2vector WHERE
> indexrelid = 2837;
> ERROR:  operator does not exist: int2vector pg_catalog.@> smallint[]
> LINE 1: ...WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(p...

In your example, you are using the notation '1 2 12345'::int2vector which
signifies a vector, as you have mentioned. However, in this patch we use the
array operator @> to help with the indexing; the incompatibility stems from
there. I do not think that postgres contains vector operators, correct me if
I am wrong. I feel that supporting vectors is out of the scope of this
patch; if you have an idea how to support it, please let me know.

/Mark
Re: Stronger safeguard for archive recovery not to miss data
On Thu, 2021-01-21 at 15:30 +0100, I wrote:
> On Thu, 2021-01-21 at 13:09 +0000, osumi.takami...@fujitsu.com wrote:
> > > My vote is that we should not have a GUC for such an unlikely event,
> > > and that stopping recovery is good enough.
> > OK. IIUC, my current patch for this fix doesn't need to be changed or
> > withdrawn.  Thank you for your explanation.
>
> Well, that's just my opinion.
>
> Fujii Masao seemed to disagree with the patch, and his voice carries weight.

I think you should post another patch where the second, now superfluous,
error message is removed.

Yours,
Laurenz Albe
RE: Parallel INSERT (INTO ... SELECT ...)
Hello Greg-san,

Second group of comments (I'll reply to (1) - (4) later):

(5)
@@ -790,7 +790,8 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
...
-	if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
+	if ((plannedstmt->commandType != CMD_SELECT &&
+		 !IsModifySupportedInParallelMode(plannedstmt->commandType)) || plannedstmt->hasModifyingCTE)
 		PreventCommandIfParallelMode(CreateCommandName((Node *) plannedstmt));
 }

Now that we're trying to allow parallel writes (INSERT), we should:

* use ExecCheckXactReadOnly() solely for checking read-only transactions, as
  the function name represents.  That is, move the call to
  PreventCommandIfParallelMode() up to standard_ExecutorStart().

* Update the comment above the call to ExecCheckXactReadOnly().

(6)
@@ -764,6 +777,22 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
...
+	else
+	{
+		pei->processed_count = NULL;
+	}

The braces can be deleted.

(7)
@@ -1400,6 +1439,16 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 											 true);
 	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
+	Assert(queryDesc->operation == CMD_SELECT || IsModifySupportedInParallelMode(queryDesc->operation));
+	if (IsModifySupportedInParallelMode(queryDesc->operation))
+	{
+		/*
+		 * Record that the CurrentCommandId is used, at the start of the
+		 * parallel operation.
+		 */
+		SetCurrentCommandIdUsedForWorker();
+	}
+
 	/* Setting debug_query_string for individual workers */
 	debug_query_string = queryDesc->sourceText;

@@ -765,12 +779,16 @@ GetCurrentCommandId(bool used)
 	if (used)
 	{
 		/*
-		 * Forbid setting currentCommandIdUsed in a parallel worker, because
-		 * we have no provision for communicating this back to the leader.  We
-		 * could relax this restriction when currentCommandIdUsed was already
-		 * true at the start of the parallel operation.
+		 * If in a parallel worker, only allow setting currentCommandIdUsed if
+		 * currentCommandIdUsed was already true at the start of the parallel
+		 * operation (by way of SetCurrentCommandIdUsed()), otherwise forbid
+		 * setting currentCommandIdUsed because we have no provision for
+		 * communicating this back to the leader.  Once currentCommandIdUsed is
+		 * set, the commandId used by leader and workers can't be changed,
+		 * because CommandCounterIncrement() then prevents any attempted
+		 * increment of the current commandId.
		 */
-		Assert(!IsParallelWorker());
+		Assert(!(IsParallelWorker() && !currentCommandIdUsed));
 		currentCommandIdUsed = true;
 	}
 	return currentCommandId;

What happens without these changes?  If this kind of change is really
necessary, it seems more natural to pass currentCommandIdUsed together with
currentCommandId through SerializeTransactionState() and
StartParallelWorkerTransaction(), instead of the above changes.

As an aside, SetCurrentCommandIdUsed() in the comment should be
SetCurrentCommandIdUsedForWorker().

(8)
+			/*
+			 * If the trigger type is RI_TRIGGER_FK, this indicates a FK exists in
+			 * the relation, and this would result in creation of new CommandIds
+			 * on insert/update/delete and this isn't supported in a parallel
+			 * worker (but is safe in the parallel leader).
+			 */
+			trigtype = RI_FKey_trigger_type(trigger->tgfoid);
+			if (trigtype == RI_TRIGGER_FK)
+			{
+				if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
+					return true;
+			}

Here, RI_TRIGGER_FK should instead be RI_TRIGGER_PK, because RI_TRIGGER_FK
triggers do not generate command IDs.  See RI_FKey_check() which is called
in the RI_TRIGGER_FK case.  In there, ri_PerformCheck() is called with the
detectNewRows argument set to false, which causes CommandCounterIncrement()
to not be called.

Plus, tables that have RI_TRIGGER_PK should allow parallel INSERT in a
parallel-safe manner, because those triggers only fire for UPDATE and
DELETE.  So, for the future parallel UPDATE/DELETE support, the above check
should be performed in the UPDATE and DELETE cases.

(In a data warehouse, fact tables, which store large amounts of historical
data, typically have foreign keys to smaller dimension tables.  Thus, it's
important to allow parallel INSERTs on tables with foreign keys.)

Regards
Takayuki Tsunakawa
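To illustrate the fact/dimension shape mentioned above (a hypothetical
sketch, not taken from the patch): an INSERT into the referencing table only
runs the RI_TRIGGER_FK check, which creates no new command IDs, so it should
be eligible for parallel insert; the RI_TRIGGER_PK triggers sit on the
referenced table and fire only when that table is updated or deleted from.

-- dim carries the RI_TRIGGER_PK triggers (fired by UPDATE/DELETE on dim);
-- fact carries the RI_TRIGGER_FK check trigger (fired by this INSERT)
CREATE TABLE dim  (id int PRIMARY KEY);
CREATE TABLE fact (dim_id int REFERENCES dim (id), payload text);
INSERT INTO dim VALUES (1);
INSERT INTO fact                 -- the candidate for parallel INSERT
    SELECT 1, 'x' FROM generate_series(1, 1000000);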
Re: About to add WAL write/fsync statistics to pg_stat_wal view
Hi, Japin Thanks for your comments. On 2021-01-23 01:46, japin wrote: Hi, Masahiro Thanks for you update the v4 patch. Here are some comments: (1) + char*msg = NULL; + boolsync_called;/* whether to sync data to the disk. */ + instr_time start; + instr_time duration; + + /* check whether to sync data to the disk is really occurred. */ + sync_called = false; Maybe we can initialize the "sync_called" variable when declare it. Yes, I fixed it. (2) + if (sync_called) + { + /* increment the i/o timing and the number of times to fsync WAL data */ + if (track_wal_io_timing) + { + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); + } + + WalStats.m_wal_sync++; + } There is an extra space before INSTR_TIME_GET_MICROSEC(duration). Yes, I removed it. In the issue_xlog_fsync(), the comment says that if sync_method is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced. Does that mean it synced when write the WAL data? And for those cases, we cannot get accurate write/sync timing and number of write/sync times, right? case SYNC_METHOD_OPEN: case SYNC_METHOD_OPEN_DSYNC: /* write synced it already */ break; Yes, I add the following comments in the document. @@ -3515,6 +3515,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i Total number of times WAL data was synced to disk + (if is open_datasync or + open_sync, this value is zero because WAL data is synced + when to write it). @@ -3525,7 +3528,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i Total amount of time that has been spent in the portion of WAL data was synced to disk, in milliseconds - (if is enabled, otherwise zero) + (if is enabled, otherwise zero. + if is open_datasync or + open_sync, this value is zero too because WAL data is synced + when to write it). I attached a modified patch. Regards, -- Masahiro Ikeda NTT DATA CORPORATIONFrom ee1b7d17391b9d9619f709afeacdd118973471d6 Mon Sep 17 00:00:00 2001 From: Masahiro Ikeda Date: Fri, 22 Jan 2021 21:38:31 +0900 Subject: [PATCH] Add statistics related to write/sync wal records. This patch adds following statistics to pg_stat_wal view to track WAL I/O activity. - the total number of writing/syncing WAL data. - the total amount of time that has been spent in writing/syncing WAL data. Since to track I/O timing may leads significant overhead, GUC parameter "track_wal_io_timing" is introduced. Only if this is on, the I/O timing is measured. The statistics related to sync are zero when "wal_sync_method" is "open_datasync" or "open_sync", because it leads syncing WAL data at same time when to write it. 
Author: Masahiro Ikeda Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75...@oss.nttdata.com (This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID) --- doc/src/sgml/config.sgml | 21 +++ doc/src/sgml/monitoring.sgml | 48 ++- src/backend/access/transam/xlog.c | 59 ++- src/backend/catalog/system_views.sql | 4 ++ src/backend/postmaster/pgstat.c | 4 ++ src/backend/postmaster/walwriter.c| 3 + src/backend/utils/adt/pgstatfuncs.c | 24 +++- src/backend/utils/misc/guc.c | 9 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 14 ++--- src/include/pgstat.h | 8 +++ src/test/regress/expected/rules.out | 6 +- 13 files changed, 189 insertions(+), 13 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 82864bbb24..43f3fbcaf8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7416,6 +7416,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + track_wal_io_timing (boolean) + + track_wal_io_timing configuration parameter + + + + +Enables timing of WAL I/O calls. This parameter is off by default, +because it will repeatedly query the operating system for +the current time, which may cause significant overhead on some +platforms. You can use the tool to +measure the overhead of timing on your system. +I/O timing information is +displayed in
Re: simplifying foreign key/RI checks
On Sun, Jan 24, 2021 at 6:51 AM Amit Langote wrote: > On Sun, Jan 24, 2021 at 11:26 AM Corey Huinker > wrote: > > On Sat, Jan 23, 2021 at 12:52 PM Zhihong Yu wrote: > >> > >> Hi, > > Thanks for the review. > > >> + for (i = 0; i < riinfo->nkeys; i++) > >> + { > >> + Oid eq_opr = eq_oprs[i]; > >> + Oid typeid = RIAttType(fk_rel, riinfo->fk_attnums[i]); > >> + RI_CompareHashEntry *entry = ri_HashCompareOp(eq_opr, > typeid); > >> + > >> + if (pk_nulls[i] != 'n' && > OidIsValid(entry->cast_func_finfo.fn_oid)) > >> > >> It seems the pk_nulls[i] != 'n' check can be lifted ahead of the > assignment to the three local variables. That way, ri_HashCompareOp > wouldn't be called when pk_nulls[i] == 'n'. > > Good idea, so done. Although, there can't be nulls right now. > > >> + case TM_Updated: > >> + if (IsolationUsesXactSnapshot()) > >> ... > >> + case TM_Deleted: > >> + if (IsolationUsesXactSnapshot()) > >> > >> It seems the handling for TM_Updated and TM_Deleted is the same. The > cases for these two values can be put next to each other (saving one block > of code). > > Ah, yes. The TM_Updated case used to be handled a bit differently in > earlier unposted versions of the patch, though at some point I > concluded that the special handling was unnecessary, but didn't > realize what you just pointed out. Fixed. > > > I'll pause on reviewing v4 until you've addressed the suggestions above. > > Here's v5. > v5 patches apply to master. Suggested If/then optimization is implemented. Suggested case merging is implemented. Passes make check and make check-world yet again. Just to confirm, we *don't* free the RI_CompareHashEntry because it points to an entry in a hash table which is TopMemoryContext aka lifetime of the session, correct? Anybody else want to look this patch over before I mark it Ready For Committer?
Re: Is Recovery actually paused?
On Sun, Jan 17, 2021 at 5:19 PM Dilip Kumar wrote: > > On Sat, Jan 16, 2021 at 8:59 AM Masahiko Sawada wrote: > > > > On Wed, Jan 13, 2021 at 9:20 PM Dilip Kumar wrote: > > > > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar wrote: > > > > > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA wrote: > > > > > > > > > > On Thu, 10 Dec 2020 11:25:23 +0530 > > > > > Dilip Kumar wrote: > > > > > > > > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused to > > > > > > > > wait. > > > > > > > > Especially, if max_standby_streaming_delay is -1, this will be > > > > > > > > blocked forever, > > > > > > > > although this setting may not be usual. In addition, some users > > > > > > > > may set > > > > > > > > recovery_min_apply_delay for a large. If such users call > > > > > > > > pg_is_wal_replay_paused, > > > > > > > > it could wait for a long time. > > > > > > > > > > > > > > > > At least, I think we need some descriptions on document to > > > > > > > > explain > > > > > > > > pg_is_wal_replay_paused could wait while a time. > > > > > > > > > > > > > > Ok > > > > > > > > > > > > Fixed this, added some comments in .sgml as well as in function > > > > > > header > > > > > > > > > > Thank you for fixing this. > > > > > > > > > > Also, is it better to fix the description of pg_wal_replay_pause from > > > > > "Pauses recovery." to "Request to pause recovery." in according with > > > > > pg_is_wal_replay_paused? > > > > > > > > Okay > > > > > > > > > > > > > > > > > Also, how about adding a new boolean argument to > > > > > > > > pg_is_wal_replay_paused to > > > > > > > > control whether this waits for recovery to get paused or not? > > > > > > > > By setting its > > > > > > > > default value to true or false, users can use the old format > > > > > > > > for calling this > > > > > > > > and the backward compatibility can be maintained. > > > > > > > > > > > > > > So basically, if the wait_recovery_pause flag is false then we > > > > > > > will > > > > > > > immediately return true if the pause is requested? I agree that > > > > > > > it is > > > > > > > good to have an API to know whether the recovery pause is > > > > > > > requested or > > > > > > > not but I am not sure is it good idea to make this API serve both > > > > > > > the > > > > > > > purpose? Anyone else have any thoughts on this? > > > > > > > > > > > > > > > > > I think the current pg_is_wal_replay_paused() already has another > > > > > purpose; > > > > > this waits recovery to actually get paused. If we want to limit this > > > > > API's > > > > > purpose only to return the pause state, it seems better to fix this > > > > > to return > > > > > the actual state at the cost of lacking the backward compatibility. > > > > > If we want > > > > > to know whether pause is requested, we may add a new API like > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait > > > > > recovery to actually > > > > > get paused, we may add an option to pg_wal_replay_pause() for this > > > > > purpose. > > > > > > > > > > However, this might be a bikeshedding. If anyone don't care that > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I don't > > > > > care either. > > > > > > > > I don't think that it will be blocked ever, because > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the > > > > recovery process will not be stuck on waiting for the WAL. 
> > > > > > > > > > > > As another comment, while pg_is_wal_replay_paused is blocking, > > > > > > > > I can not cancel > > > > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the > > > > > > > > waiting loop. > > > > > > > > > > How about this fix? I think users may want to cancel > > > > > pg_is_wal_replay_paused() during > > > > > this is blocking. > > > > > > > > Yeah, we can do this. I will send the updated patch after putting > > > > some more thought into these comments. Thanks again for the feedback. > > > > > > > > > > Please find the updated patch. > > > > I've looked at the patch. Here are review comments: > > > > + /* Recovery pause state */ > > + RecoveryPauseState recoveryPause; > > > > Now that the value can have tri-state, how about renaming it to > > recoveryPauseState? > > This makes sense to me. > > > --- > > bool > > RecoveryIsPaused(void) > > +{ > > + boolrecoveryPause; > > + > > + SpinLockAcquire(&XLogCtl->info_lck); > > + recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) ? > > true : false; > > + SpinLockRelease(&XLogCtl->info_lck); > > + > > + return recoveryPause; > > +} > > + > > +bool > > +RecoveryPauseRequested(void) > > { > > boolrecoveryPause; > > > > SpinLockAcquire(&XLogCtl->info_lck); > > - recoveryPause = XLogCtl->recoveryPause; > > + recoveryPause = (XLogCtl->recoveryPause != > > RECOVERY_IN_PROGRESS) ? true : false; > > SpinLockRelease(&XLogCtl->i
Re: Single transaction in the tablesync worker?
On Sun, Jan 24, 2021 at 5:54 PM Peter Smith wrote: > > 4. > > - /* > > - * To build a slot name for the sync work, we are limited to NAMEDATALEN - > > - * 1 characters. We cut the original slot name to NAMEDATALEN - 28 chars > > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0'). (It's actually the > > - * NAMEDATALEN on the remote that matters, but this scheme will also work > > - * reasonably if that is different.) > > - */ > > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for > > sanity */ > > - slotname = psprintf("%.*s_%u_sync_%u", > > - NAMEDATALEN - 28, > > - MySubscription->slotname, > > - MySubscription->oid, > > - MyLogicalRepWorker->relid); > > + /* Calculate the name of the tablesync slot. */ > > + slotname = ReplicationSlotNameForTablesync( > > +MySubscription->oid, > > +MyLogicalRepWorker->relid); > > > > What is the reason for changing the slot name calculation? If there is > > any particular reasons, then we can add a comment to indicate why we > > can't include the subscription's slotname in this calculation. > > > > The subscription slot name may be changed (e.g. ALTER SUBSCRIPTION) > and so including the subscription slot name as part of the tablesync > slot name was considered to be: > a) possibly risky/undefined, if the subscription slot_name = NONE > b) confusing, if we end up using 2 different slot names for the same > tablesync (e.g. if the subscription slot name is changed before a sync > worker is re-launched). > And since this subscription slot name part is not necessary for > uniqueness anyway, it was removed from the tablesync slot name to > eliminate those concerns. > > Also, the tablesync slot name calculation was encapsulated as a > separate function because previously (i.e. before v18) it was used by > various other cleanup codes. I still like it better as a function, but > now it is only called from one place so we could put that code back > inline if you prefer it how it was.. It turns out those (a/b) concerns I wrote above are maybe unfounded, because it seems not possible to alter the slot_name = NONE unless the subscription is first DISABLED. So probably I can revert all this tablesync slot name calculation back to how it originally was in the OSS HEAD if you want. Kind Regards, Peter Smith. Fujitsu Australia
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On 2021-Jan-24, Mark Rofail wrote:

> I do not think that postgres contains vector operators, correct me if I am
> wrong. I feel that supporting vectors is out of the scope of this patch;
> if you have an idea how to support it, please let me know.

I do not think that this patch needs to support oidvector and int2vector
types as if they were oid[] and int2[] respectively.  If you really want an
element FK there, you can alter the column types to the real array types in
your catalog clone.  oidvector and int2vector are internal implementation
artifacts.

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least an
order of magnitude." (Brian Kernighan)
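Following that suggestion, the catalog clone from the earlier tests could be
adjusted like this (a sketch, relying on the built-in binary-coercible
casts from the vector types to the real array types):

ALTER TABLE catalog_clone.pg_proc
    ALTER COLUMN proargtypes TYPE oid[] USING proargtypes::oid[];
ALTER TABLE catalog_clone.pg_index
    ALTER COLUMN indkey TYPE int2[] USING indkey::int2[];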
RE: About to add WAL write/fsync statistics to pg_stat_wal view
Dear Ikeda-san, Thank you for updating the patch. This can be applied to master, and can be used on my RHEL7. wal_write_time and wal_sync_time increase normally :-). ``` postgres=# select * from pg_stat_wal; -[ RECORD 1 ]+-- wal_records | 121781 wal_fpi | 2287 wal_bytes| 36055146 wal_buffers_full | 799 wal_write| 12770 wal_write_time | 4.469 wal_sync | 11962 wal_sync_time| 132.352 stats_reset | 2021-01-25 00:51:40.674412+00 ``` I put a further comment: ``` @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i wal_buffers_full bigint - Number of times WAL data was written to disk because WAL buffers became full + Total number of times WAL data was written to disk because WAL buffers became full + + + + + + wal_write bigint + + + Total number of times WAL data was written to disk + + + + + + wal_write_time double precision + + + Total amount of time that has been spent in the portion of + WAL data was written to disk, in milliseconds + (if is enabled, otherwise zero). + + + + + + wal_sync bigint + + + Total number of times WAL data was synced to disk + (if is open_datasync or + open_sync, this value is zero because WAL data is synced + when to write it). + + + + + + wal_sync_time double precision + + + Total amount of time that has been spent in the portion of + WAL data was synced to disk, in milliseconds + (if is enabled, otherwise zero. + if is open_datasync or + open_sync, this value is zero too because WAL data is synced + when to write it). ``` Maybe "Total amount of time" should be used, not "Total number of time." Other views use "amount." I have no comments anymore. Hayato Kuroda FUJITSU LIMITED
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: > > On 2021-01-22 14:50, Masahiko Sawada wrote: > > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > > wrote: > >> > >> Hi, > >> > >> I rebased the patch to the master branch. > > > > Thank you for working on this. I've read the latest patch. Here are > > comments: > > > > --- > > + if (track_wal_io_timing) > > + { > > + INSTR_TIME_SET_CURRENT(duration); > > + INSTR_TIME_SUBTRACT(duration, start); > > + WalStats.m_wal_write_time += > > INSTR_TIME_GET_MILLISEC(duration); > > + } > > > > * I think it should add the time in micro sec. > > After running pgbench with track_wal_io_timing = on for 30 sec, > > pg_stat_wal showed the following on my environment: > > > > postgres(1:61569)=# select * from pg_stat_wal; > > -[ RECORD 1 ]+- > > wal_records | 285947 > > wal_fpi | 53285 > > wal_bytes| 442008213 > > wal_buffers_full | 0 > > wal_write| 25516 > > wal_write_time | 0 > > wal_sync | 25437 > > wal_sync_time| 14490 > > stats_reset | 2021-01-22 10:56:13.29464+09 > > > > Since writes can complete less than a millisecond, wal_write_time > > didn't increase. I think sync_time could also have the same problem. > > Thanks for your comments. I didn't notice that. > I changed the unit from milliseconds to microseconds. > > > --- > > + /* > > +* Measure i/o timing to fsync WAL data. > > +* > > +* The wal receiver skip to collect it to avoid performance > > degradation of standy servers. > > +* If sync_method doesn't have its fsync method, to skip too. > > +*/ > > + if (!AmWalReceiverProcess() && track_wal_io_timing && > > fsyncMethodCalled()) > > + INSTR_TIME_SET_CURRENT(start); > > > > * Why does only the wal receiver skip it even if track_wal_io_timinig > > is true? I think the performance degradation is also true for backend > > processes. If there is another reason for that, I think it's better to > > mention in both the doc and comment. > > * How about checking track_wal_io_timing first? > > * s/standy/standby/ > > I fixed it. > As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? > > > --- > > + /* increment the i/o timing and the number of times to fsync WAL > > data */ > > + if (fsyncMethodCalled()) > > + { > > + if (!AmWalReceiverProcess() && track_wal_io_timing) > > + { > > + INSTR_TIME_SET_CURRENT(duration); > > + INSTR_TIME_SUBTRACT(duration, start); > > + WalStats.m_wal_sync_time += > > INSTR_TIME_GET_MILLISEC(duration); > > + } > > + > > + WalStats.m_wal_sync++; > > + } > > > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > > incrementing m_wal_sync after each sync operation? > > I think if syncing the disk does not occur, m_wal_sync should not be > incremented. > It depends enableFsync and sync_method. > > enableFsync is checked in each fsync method like > pg_fsync_no_writethrough(), > so if incrementing m_wal_sync after each sync operation, it should be > implemented > in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. > So, why don't you change the function to a flag whether to > sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync,
we can write the code without an additional variable as follows:

    if (enableFsync)
    {
        switch (sync_method)
        {
            case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
            case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
            case SYNC_METHOD_FDATASYNC:
#endif
                WalStats.m_wal_sync++;
                if (track_wal_io_timing)
                    INSTR_TIME_SET_CURRENT(start);
                break;
            default:
                break;
        }
    }

    (do fsync and error handling here)

    /* increment the i/o timing and the number of times to fsync WAL data */
    if (track_wal_io_timing)
    {
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);
        WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
    }

I think we can change the first switch-case to an if statement.

> > * As far as I read the code, issue_xlog_fsync() seems to do fsync even
> > if enableFsync is false. Why does the function return false in that
> > case? I might be missing something.
>
> IIUC, the reason is that I thought that each fsync functions like
> pg_fsync_no_writethrough() check enableFsync.
>
> If this code doesn't check
Re: Extend more usecase for planning time partition pruning and init partition pruning.
On Sun, Jan 24, 2021 at 6:34 PM Andy Fan wrote:
> Hi:
>
> I recently found a use case like this. SELECT * FROM p, q WHERE p.partkey =
> q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either
> planning-time partition pruning or init partition pruning. Even though
> run-time partition pruning still works in the end, it is too late in some
> cases since we have to init all the plan nodes in advance. In my case,
> there are 10+ partitioned relations in one query and the execution time is
> short, so initializing that many plan nodes costs a lot.
>
> The attached patches fix this issue. They just get the "p.partkey = q.colx"
> case from root->eq_classes or rel->joinlist (outer join), and then check if
> there is some baserestrictinfo in the other relation which can be used for
> partition pruning. To make things easier, both partkey and colx must be Var
> expressions in the implementation.
>
> - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch
>
> Just some existing refactoring and extending ChangeVarNodes to be able
> to change var->attno.
>
> - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch
>
> Does the real job.
>
> Thoughts?
>
> --
> Best Regards
> Andy Fan (https://www.aliyun.com/)

Some results from this patch.

create table p (a int, b int, c character varying(8)) partition by list(c);
create table p1 partition of p for values in ('01');
create table p2 partition of p for values in ('02');
create table p3 partition of p for values in ('03');
create table q (a int, c character varying(8), b int) partition by list(c);
create table q1 partition of q for values in ('01');
create table q2 partition of q for values in ('02');
create table q3 partition of q for values in ('03');

Before the patch:

postgres=# explain (costs off) select * from p inner join q on p.c = q.c and q.c > '02';
                  QUERY PLAN
-----------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
         ->  Seq Scan on p3 p_3
   ->  Hash
         ->  Seq Scan on q3 q
               Filter: ((c)::text > '02'::text)
(9 rows)

After the patch:

                  QUERY PLAN
-----------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Seq Scan on p3 p
   ->  Hash
         ->  Seq Scan on q3 q
               Filter: ((c)::text > '02'::text)
(6 rows)

Before the patch:

postgres=# explain (costs off) select * from p inner join q on p.c = q.c and (q.c = '02' or q.c = '01');
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
         ->  Seq Scan on p3 p_3
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
               ->  Seq Scan on q2 q_2
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
(12 rows)

After the patch:

                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
               ->  Seq Scan on q2 q_2
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
(11 rows)

Before the patch:

postgres=# explain (costs off) select * from p left join q on p.c = q.c where (q.c = '02' or q.c = '01');
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
         ->  Seq Scan on p3 p_3
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
               ->  Seq Scan on q2 q_2
                     Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text))
(12 rows)

After the patch:

                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join
   Hash Cond: ((p.c)::text = (q.c)::text)
   ->  Append
         ->  Seq Scan on p1 p_1
         ->  Seq Scan on p2 p_2
   ->  Hash
         ->  Append
               ->  Seq Scan on q1 q_1
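To illustrate the idea (this is not the patch itself): a hedged sketch of how implied pruning quals could be discovered by walking the equivalence classes. The struct and field names follow the PostgreSQL planner, but the loop body is simplified and illustrative.

```
/* Illustrative only: find Var = Var equivalences involving a partition key. */
foreach(lc, root->eq_classes)
{
    EquivalenceClass *ec = (EquivalenceClass *) lfirst(lc);
    ListCell   *lc2;

    foreach(lc2, ec->ec_members)
    {
        EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);

        if (!IsA(em->em_expr, Var))
            continue;           /* the patch restricts itself to plain Vars */

        /*
         * If this Var is a partition key of some rel, remember the other
         * Var members of the same EC; baserestrictinfo quals on their rels
         * (e.g. q.colx = $1 OR q.colx = $2) can then be translated onto
         * the partition key with ChangeVarNodes() and used for pruning.
         */
    }
}
```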
Re: Single transaction in the tablesync worker?
Hi Amit.

PSA the v19 patch for the Tablesync Solution1.

Main differences from v18:
+ Patch has been rebased off HEAD @ 24/Jan
+ Addressing some review comments [ak0123]

[ak0123] https://www.postgresql.org/message-id/CAA4eK1JhpuwujrV6ABMmZ3jXfW37ssZnJ3fikrY7rRdvoEmu_g%40mail.gmail.com

Features:
* The tablesync worker now allows multiple tx instead of a single tx.
* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a successful copy_table in the tablesync's LogicalRepSyncTableStart.
* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY then it will bypass the initial copy_table phase.
* Tablesync now sets up replication origin tracking in LogicalRepSyncTableStart (similar to what is done for the apply worker). The origin is advanced when first created.
* The tablesync replication origin tracking record is cleaned up by:
  - process_syncing_tables_for_apply
  - DropSubscription
  - AlterSubscription_refresh
* Updates to PG docs.
* New TAP test case.

Known Issues:
* None.

---
Kind Regards,
Peter Smith.
Fujitsu Australia

v19-0001-Tablesync-Solution1.patch
v19-0002-Tablesync-extra-logging.patch
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada wrote: > On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda > wrote: >> >> On 2021-01-22 14:50, Masahiko Sawada wrote: >> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda >> > wrote: >> >> >> >> Hi, >> >> >> >> I rebased the patch to the master branch. >> > >> > Thank you for working on this. I've read the latest patch. Here are >> > comments: >> > >> > --- >> > + if (track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_write_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > >> > * I think it should add the time in micro sec. >> > After running pgbench with track_wal_io_timing = on for 30 sec, >> > pg_stat_wal showed the following on my environment: >> > >> > postgres(1:61569)=# select * from pg_stat_wal; >> > -[ RECORD 1 ]+- >> > wal_records | 285947 >> > wal_fpi | 53285 >> > wal_bytes| 442008213 >> > wal_buffers_full | 0 >> > wal_write| 25516 >> > wal_write_time | 0 >> > wal_sync | 25437 >> > wal_sync_time| 14490 >> > stats_reset | 2021-01-22 10:56:13.29464+09 >> > >> > Since writes can complete less than a millisecond, wal_write_time >> > didn't increase. I think sync_time could also have the same problem. >> >> Thanks for your comments. I didn't notice that. >> I changed the unit from milliseconds to microseconds. >> >> > --- >> > + /* >> > +* Measure i/o timing to fsync WAL data. >> > +* >> > +* The wal receiver skip to collect it to avoid performance >> > degradation of standy servers. >> > +* If sync_method doesn't have its fsync method, to skip too. >> > +*/ >> > + if (!AmWalReceiverProcess() && track_wal_io_timing && >> > fsyncMethodCalled()) >> > + INSTR_TIME_SET_CURRENT(start); >> > >> > * Why does only the wal receiver skip it even if track_wal_io_timinig >> > is true? I think the performance degradation is also true for backend >> > processes. If there is another reason for that, I think it's better to >> > mention in both the doc and comment. >> > * How about checking track_wal_io_timing first? >> > * s/standy/standby/ >> >> I fixed it. >> As kuroda-san mentioned too, the skip is no need to be considered. > > I think you also removed the code to have the wal receiver report the > stats. So with the latest patch, the wal receiver tracks those > statistics but doesn't report. > > And maybe XLogWalRcvWrite() also needs to track I/O? > >> >> > --- >> > + /* increment the i/o timing and the number of times to fsync WAL >> > data */ >> > + if (fsyncMethodCalled()) >> > + { >> > + if (!AmWalReceiverProcess() && track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_sync_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > + >> > + WalStats.m_wal_sync++; >> > + } >> > >> > * I'd avoid always calling fsyncMethodCalled() in this path. How about >> > incrementing m_wal_sync after each sync operation? >> >> I think if syncing the disk does not occur, m_wal_sync should not be >> incremented. >> It depends enableFsync and sync_method. >> >> enableFsync is checked in each fsync method like >> pg_fsync_no_writethrough(), >> so if incrementing m_wal_sync after each sync operation, it should be >> implemented >> in each fsync method. It leads to many duplicated codes. > > Right. I missed that each fsync function checks enableFsync. > >> So, why don't you change the function to a flag whether to >> sync data to the disk will be occurred or not in issue_xlog_fsync()? 
>
> Looks better. Since we don't necessarily need to increment m_wal_sync
> after doing fsync we can write the code without an additional variable
> as follows:
>
> if (enableFsync)
> {
>     switch (sync_method)
>     {
>         case SYNC_METHOD_FSYNC:
> #ifdef HAVE_FSYNC_WRITETHROUGH
>         case SYNC_METHOD_FSYNC_WRITETHROUGH:
> #endif
> #ifdef HAVE_FDATASYNC
>         case SYNC_METHOD_FDATASYNC:
> #endif
>             WalStats.m_wal_sync++;
>             if (track_wal_io_timing)
>                 INSTR_TIME_SET_CURRENT(start);
>             break;
>         default:
>             break;
>     }
> }
>
> (do fsync and error handling here)
>
> /* increment the i/o timing and the number of times to fsync WAL data */
> if (track_wal_io_timing)
> {
>     INSTR_TIME_SET_CURRENT(duration);
>     INSTR_TIME_SUBTRACT(duration, start);
>     WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> }
>
> I think we can change the first switch-case to an if statement.
>

+1. We can also narrow the scope of "duration" into the "if (track_wal_io_timing)" branch.

--
Regards
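A hedged sketch of the scope-narrowing suggested above; this is illustrative, not a posted patch. Note the running total is accumulated with +=.

```
/* declare "duration" only in the branch that uses it */
if (track_wal_io_timing)
{
    instr_time  duration;

    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
}
```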
Re: Single transaction in the tablesync worker?
On Mon, Jan 25, 2021 at 6:15 AM Peter Smith wrote: > > On Sun, Jan 24, 2021 at 5:54 PM Peter Smith wrote: > > > 4. > > > - /* > > > - * To build a slot name for the sync work, we are limited to NAMEDATALEN > > > - > > > - * 1 characters. We cut the original slot name to NAMEDATALEN - 28 chars > > > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0'). (It's actually the > > > - * NAMEDATALEN on the remote that matters, but this scheme will also work > > > - * reasonably if that is different.) > > > - */ > > > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for > > > sanity */ > > > - slotname = psprintf("%.*s_%u_sync_%u", > > > - NAMEDATALEN - 28, > > > - MySubscription->slotname, > > > - MySubscription->oid, > > > - MyLogicalRepWorker->relid); > > > + /* Calculate the name of the tablesync slot. */ > > > + slotname = ReplicationSlotNameForTablesync( > > > +MySubscription->oid, > > > +MyLogicalRepWorker->relid); > > > > > > What is the reason for changing the slot name calculation? If there is > > > any particular reasons, then we can add a comment to indicate why we > > > can't include the subscription's slotname in this calculation. > > > > > > > The subscription slot name may be changed (e.g. ALTER SUBSCRIPTION) > > and so including the subscription slot name as part of the tablesync > > slot name was considered to be: > > a) possibly risky/undefined, if the subscription slot_name = NONE > > b) confusing, if we end up using 2 different slot names for the same > > tablesync (e.g. if the subscription slot name is changed before a sync > > worker is re-launched). > > And since this subscription slot name part is not necessary for > > uniqueness anyway, it was removed from the tablesync slot name to > > eliminate those concerns. > > > > Also, the tablesync slot name calculation was encapsulated as a > > separate function because previously (i.e. before v18) it was used by > > various other cleanup codes. I still like it better as a function, but > > now it is only called from one place so we could put that code back > > inline if you prefer it how it was.. > > It turns out those (a/b) concerns I wrote above are maybe unfounded, > because it seems not possible to alter the slot_name = NONE unless the > subscription is first DISABLED. > Yeah, but I think the user can still change to some other predefined slot_name. However, I guess it doesn't matter unless it can lead what you have mentioned in (a). As that can't happen, it is probably better to take out that change from the patch. I see your point of moving this calculation to a separate function but not sure if it is worth it unless we have to call it from multiple places or it simplifies the existing code. -- With Regards, Amit Kapila.
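For illustration only, a guess at the shape of the encapsulated calculation under discussion; the exact format string used by v19 is not shown in this mail, so treat both the body and the format as hypothetical.

```
char *
ReplicationSlotNameForTablesync(Oid suboid, Oid relid)
{
    /* unique per (subscription, table); needs no subscription slot name */
    return psprintf("pg_%u_sync_%u", suboid, relid);
}
```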
Re: Single transaction in the tablesync worker?
On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila wrote: > > 2. > @@ -98,11 +102,16 @@ > #include "miscadmin.h" > #include "parser/parse_relation.h" > #include "pgstat.h" > +#include "postmaster/interrupt.h" > #include "replication/logicallauncher.h" > #include "replication/logicalrelation.h" > +#include "replication/logicalworker.h" > #include "replication/walreceiver.h" > #include "replication/worker_internal.h" > +#include "replication/slot.h" > > I don't think the above includes are required. They seem to the > remnant of the previous approach. > OK. Fixed in the latest patch [v19]. > 3. > process_syncing_tables_for_sync(XLogRecPtr current_lsn) > { > - Assert(IsTransactionState()); > + bool sync_done = false; > > SpinLockAcquire(&MyLogicalRepWorker->relmutex); > + sync_done = MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP && > + current_lsn >= MyLogicalRepWorker->relstate_lsn; > + SpinLockRelease(&MyLogicalRepWorker->relmutex); > > - if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP && > - current_lsn >= MyLogicalRepWorker->relstate_lsn) > + if (sync_done) > { > TimeLineID tli; > > + /* > + * Change state to SYNCDONE. > + */ > + SpinLockAcquire(&MyLogicalRepWorker->relmutex); > > Why do we need these changes? If you have done it for the > code-readability purpose then we can consider this as a separate patch > because I don't see why these are required w.r.t this patch. > Yes it was for code readability in v17 when this function used to be much larger. But it is not very necessary anymore and has been reverted in the latest patch [v19]. > 4. > - /* > - * To build a slot name for the sync work, we are limited to NAMEDATALEN - > - * 1 characters. We cut the original slot name to NAMEDATALEN - 28 chars > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0'). (It's actually the > - * NAMEDATALEN on the remote that matters, but this scheme will also work > - * reasonably if that is different.) > - */ > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity > */ > - slotname = psprintf("%.*s_%u_sync_%u", > - NAMEDATALEN - 28, > - MySubscription->slotname, > - MySubscription->oid, > - MyLogicalRepWorker->relid); > + /* Calculate the name of the tablesync slot. */ > + slotname = ReplicationSlotNameForTablesync( > +MySubscription->oid, > +MyLogicalRepWorker->relid); > > What is the reason for changing the slot name calculation? If there is > any particular reasons, then we can add a comment to indicate why we > can't include the subscription's slotname in this calculation. > The tablesync slot name changes were not strictly necessary, so the code is all reverted to be the same as OSS HEAD now in the latest patch [v19]. > 5. > This is WAL > + * logged for for the purpose of recovery. Locks are to prevent the > + * replication origin from vanishing while advancing. > > /for for/for > OK. Fixed in the latest patch [v19]. > 6. > + /* Remove the tablesync's origin tracking if exists. */ > + snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid); > + originid = replorigin_by_name(originname, true); > + if (originid != InvalidRepOriginId) > + { > + elog(DEBUG1, "DropSubscription: dropping origin tracking for > \"%s\"", originname); > > I don't think we need this and the DEBUG1 message in > AlterSubscription_refresh. IT is fine to print this information for > background workers like in apply-worker but not sure if need it here. 
> The DropSubscription drops the origin of apply worker but it doesn't > use such a DEBUG message so I guess we don't it for tablesync origins > as well. > OK. These DEBUG1 logs are removed in the latest patch [v19]. [v19] https://www.postgresql.org/message-id/CAHut%2BPsj7Xm8C1LbqeAbk-3duyS8xXJtL9TiGaeu3P8g272mAA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
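A minimal sketch of the tablesync origin cleanup pattern discussed above, based on the snprintf shown in the review comment; the replorigin_drop() call is my assumption, not a quote from the patch.

```
char        originname[NAMEDATALEN];
RepOriginId originid;

/* the tablesync origin name is derived from (subscription oid, relation oid) */
snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);

originid = replorigin_by_name(originname, true);    /* missing_ok = true */
if (originid != InvalidRepOriginId)
    replorigin_drop(originid, false);               /* nowait = false */
```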
Re: [PoC] Non-volatile WAL buffer
On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra wrote:
>
> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra wrote:
> >>
> >> Hi,
> >>
> >> I think I've managed to get the 0002 patch [1] rebased to master and
> >> working (with help from Masahiko Sawada). It's not clear to me how it
> >> could have worked as submitted - my theory is that an incomplete patch
> >> was submitted by mistake, or something like that.
> >>
> >> Unfortunately, the benchmark results were kinda disappointing. For a
> >> pgbench on scale 500 (fits into shared buffers), an average of three
> >> 5-minute runs looks like this:
> >>
> >> branch                    1       16       32       64       96
> >> ----------------------------------------------------------------
> >> master                 7291    87704   165310   150437   224186
> >> ntt                    7912   106095   213206   212410   237819
> >> simple-no-buffers      7654    96544   115416    95828   103065
> >>
> >> NTT refers to the patch from September 10, pre-allocating a large WAL
> >> file on PMEM, and simple-no-buffers is the simpler patch simply removing
> >> the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
> >>
> >> Note: The patch is just replacing the old implementation with mmap.
> >> That's good enough for experiments like this, but we probably want to
> >> keep the old one for setups without PMEM. But it's good enough for
> >> testing, benchmarking etc.
> >>
> >> Unfortunately, the results for this simple approach are pretty bad. Not
> >> only compared to the "ntt" patch, but even to master. I'm not entirely
> >> sure what's the root cause, but I have a couple hypotheses:
> >>
> >> 1) bug in the patch - That's clearly a possibility, although I've
> >> tried to eliminate this possibility.
> >>
> >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
> >> NVMe storage, but still much slower than DRAM (both in terms of latency
> >> and bandwidth, see [2] for some data). It's not terrible, but the
> >> latency is maybe 2-3x higher - not a huge difference, but may matter for
> >> WAL buffers?
> >>
> >> 3) PMEM does not handle parallel writes well - If you look at [2],
> >> Figure 4(b), you'll see that the throughput actually *drops* as the
> >> number of threads increases. That's pretty strange / annoying, because
> >> that's how we write into WAL buffers - each thread writes its own data,
> >> so parallelism is not something we can get rid of.
> >>
> >> I've added some simple profiling, to measure the number of calls / time
> >> for each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates
> >> data for each backend, and logs the counts every 1M ops.
> >>
> >> Typical stats from a concurrent run look like this:
> >>
> >> xlog stats cnt 4300
> >>   map cnt 100 time 5448333 unmap cnt 100 time 3730963
> >>   memcpy cnt 985964 time 1550442272 len 15150499
> >>   memset cnt 0 time 0 len 0
> >>   persist cnt 13836 time 10369617 len 16292182
> >>
> >> The times are in nanoseconds, so this says the backend did 100 mmap and
> >> unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
> >> taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
> >> copying about 15MB of data. That's quite a lot :-(
> >
> > It might also be interesting if we can see how much time is spent on
> > each logging function, such as XLogInsert(), XLogWrite(), and
> > XLogFlush().
> >
>
> Yeah, we could extend it to that, that's a fairly mechanical thing. But
> maybe that could be visible in a regular perf profile. Also, I suppose
> most of the time will be used by the pmem calls, shown in the stats.

> >> My conclusion from this is that eliminating WAL buffers and writing WAL
> >> directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
> >> right approach.
> >>
> >> I suppose we should keep WAL buffers, and then just write the data to
> >> mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
> >> except that it allocates one huge file on PMEM and writes to that
> >> (instead of the traditional WAL segments).
> >>
> >> So I decided to try how it'd work with writing to regular WAL segments,
> >> mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
> >> and the results look a bit nicer:
> >>
> >> branch                    1       16       32       64       96
> >> ----------------------------------------------------------------
> >> master                 7291    87704   165310   150437   224186
> >> ntt                    7912   106095   213206   212410   237819
> >> simple-no-buffers      7654    96544   115416    95828   103065
> >> with-wal-buffers       7477    95454   181702   140167   214715
> >>
> >> So, much better than the version without WAL buffers, somewhat better
> >> than master (except for 64/96 clients), but still not
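For readers unfamiliar with the PMDK calls named in the profiling stats above, a minimal usage sketch; the path and sizes are made up, and this assumes libpmem is available.

```
#include <libpmem.h>

static void
pmem_sketch(void)
{
    size_t      mapped_len;
    int         is_pmem;
    char       *base;

    /* map a (hypothetical) 16MB segment-sized file on a PMEM-aware mount */
    base = pmem_map_file("/mnt/pmem/wal-segment", 16 * 1024 * 1024,
                         PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (base == NULL)
        return;

    /* copy into PMEM and flush to persistence (memcpy + persist combined) */
    pmem_memcpy_persist(base, "some WAL record bytes", 21);

    pmem_unmap(base, mapped_len);
}
```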
Re: Single transaction in the tablesync worker?
On Sun, Jan 24, 2021 at 12:24 PM Peter Smith wrote: > > On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila wrote: > > > > > > Few comments: > > = > > 1. > > - * So the state progression is always: INIT -> DATASYNC -> SYNCWAIT -> > > - * CATCHUP -> SYNCDONE -> READY. > > + * So the state progression is always: INIT -> DATASYNC -> > > + * (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> > > READY. > > > > I don't think we need to be specific here that sync worker sets > > FINISHEDCOPY state. > > > > This was meant to indicate that *only* the sync worker knows about the > FINISHEDCOPY state, whereas all the other states are either known > (assigned and/or used) by *both* kinds of workers. But, I can remove > it if you feel that distinction is not useful. > Okay, but I feel you can mention that in the description you have added for FINISHEDCOPY state. It looks a bit odd here and the message you want to convey is also not that clear. -- With Regards, Amit Kapila.
RE: Parallel Inserts in CREATE TABLE AS
> Thanks a lot for the tests. In your test case, parallel insertions are not
> being picked because the Gather node has some projections
> (floor(((random() * '1'::double precision) + '1'::double precision))) to
> perform. That's expected. Whenever parallel insertions are chosen for CTAS,
> we should see "Create target_table ''" under the Gather node [1] and also
> the actual row count for the Gather node being 0 (but in your test it is
> rows=16268) in the explain analyze output. Coming to your test case, if
> it's modified to something like [1], where the Gather node has no
> projections, then parallel insertions will be chosen.

Thanks for your explanation and test. Actually, I deliberately made my test case (with projection) pick serial insert so as to make the unbalanced tuple distribution (99% of tuples read by one worker) happen. This issue leads to a performance regression, but it's not introduced by your patch; it also happens on master (HEAD). Do you have any thoughts about this?

> [1] - I did this test on my development system, I will run on some
> performance system and post my observations.

Thank you, it would be very kind of you to do this. To reproduce the above issue, you need to use my case (with projection), because it won't occur with "parallel insert".

Regards,
Tang
Re: Is Recovery actually paused?
At Sun, 24 Jan 2021 14:26:08 +0530, Dilip Kumar wrote in > On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy > wrote: > > > > On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar wrote: > > > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy > > > wrote: > > >> > > >> On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar > > >> wrote: > > >> > Please find the patch for the same. I haven't added a test case for > > >> > this yet. I mean we can write a test case to pause the recovery and > > >> > get the status. But I am not sure that we can really write a reliable > > >> > test case for 'pause requested' and 'paused'. > > >> > > >> +1 to just show the recovery pause state in the output of > > >> pg_is_wal_replay_paused. But, should the function name > > >> "pg_is_wal_replay_paused" be something like > > >> "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists > > >> in a function, I expect a boolean output. Others may have better > > >> thoughts. > > >> > > >> IIUC the above change, ensuring the recovery is paused after it's > > >> requested lies with the user. IMHO, the main problem we are trying to > > >> solve is not met > > > > > > > > > Basically earlier their was no way for the user yo know whether the > > > recovery is actually paused or not because it was always returning true > > > after pause requested. Now, we will return whether pause requested or > > > actually paused. So > for tool designer who want to wait for recovery to > > > get paused can have a loop and wait until the recovery state reaches to > > > paused. That will give a better control. > > > > I get it and I agree to have that change. My point was whether we can > > have a new function pg_wal_replay_pause_and_wait that waits until > > recovery is actually paused ((along with pg_is_wal_replay_paused > > returning the actual state than a true/false) so that tool developers > > don't need to have the waiting code outside, if at all they care about > > it? Others may have better thoughts than me. > > I think the previous patch was based on that idea where we thought > that we can pass an argument to pg_is_wal_replay_paused which can > decide whether to wait or return without the wait. I think this > version looks better to me where we give the status instead of > waiting. I am not sure whether we want another version of > pg_wal_replay_pause which will wait for actually it to get paused. I > mean there is always a scope to include the functionality in the > database which can be achieved by the tool, but this patch was trying > to solve the problem that there was no way to know the status so I > think returning the correct status should be the scope of this. I understand that the requirement here is that no record is replayed after pg_wal_replay_pause() is returned, or pg_is_wal_replay_paused() returns true, and delays taken while recovery don't delay the state change. That requirements are really synchronous. On the other hand the machinery is designed to be asynchronous. >* Note that we intentionally don't take the info_lck spinlock >* here. We might therefore read a slightly stale value of >* the recoveryPause flag, but it can't be very stale (no >* worse than the last spinlock we did acquire). Since a >* pause request is a pretty asynchronous thing anyway, >* possibly responding to it one WAL record later than we >* otherwise would is a minor issue, so it doesn't seem worth >* adding another spinlock cycle to prevent that. 
As a result, this patch tries to introduce several new checkpoints at some delaying points so that those waits can notice a pause request in a timely manner. I think we had better use locking (or atomics) for that information instead of such scattered checkpoints if we expect the machinery to work in such a synchronous manner. That would make the tri-state variable and many of the checkpoints unnecessary. Maybe.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
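A hedged sketch of the locking/atomics idea: the enum matches the tri-state discussed in this thread, but the atomic variable and helpers are illustrative (in real code the state would live in shared memory, e.g. XLogCtl), not from any posted patch.

```
typedef enum RecoveryPauseState
{
    RECOVERY_NOT_PAUSED,
    RECOVERY_PAUSE_REQUESTED,
    RECOVERY_PAUSED
} RecoveryPauseState;

static pg_atomic_uint32 recoveryPauseState;     /* hypothetical */

/* requester side: ask for a pause and wake the startup process */
static void
RequestRecoveryPause(void)
{
    pg_atomic_write_u32(&recoveryPauseState, RECOVERY_PAUSE_REQUESTED);
    WakeupRecovery();
}

/* startup process: call at every wait/delay point */
static void
ConfirmRecoveryPaused(void)
{
    uint32      expected = RECOVERY_PAUSE_REQUESTED;

    /* promote "requested" to "paused" exactly once, without a spinlock */
    pg_atomic_compare_exchange_u32(&recoveryPauseState, &expected,
                                   RECOVERY_PAUSED);
}
```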
Re: Single transaction in the tablesync worker?
On Sat, Jan 23, 2021 at 11:08 AM Ajin Cherian wrote: > > On Sat, Jan 23, 2021 at 3:16 PM Amit Kapila wrote: > > > > > I think so. But do you have any reason to believe that it won't be > > required anymore? > > A temporary slot will not clash with a permanent slot of the same name, > I have tried below and it seems to be clashing: postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('test_slot2', 'test_decoding'); ?column? -- init (1 row) postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('test_slot2', 'test_decoding', true); ERROR: replication slot "test_slot2" already exists Note that the third parameter in the second statement above indicates whether it is a temporary slot or not. What am I missing? -- With Regards, Amit Kapila.
Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION
On Sat, Jan 23, 2021 at 1:44 PM Bharath Rupireddy wrote: > > On Sat, Jan 23, 2021 at 11:50 AM Amit Kapila wrote: > > > > On Fri, Jan 22, 2021 at 2:59 PM Bharath Rupireddy > > wrote: > > > > > > On Fri, Jan 22, 2021 at 2:00 PM Amit Kapila > > > wrote: > > > > > > Yes you are right. Looks like the above commit is causing the issue. I > > > reverted that commit and did not see the drop publication bug, see use > > > case [1]. > > > > > > > Okay, thanks for confirming the same. I am planning to push the > > attached where I made changes in few comments, combined the code and > > test patch, and made a few changes in the commit message. Let me know > > if you have any suggestions? > > Thanks Amit. Looks like there were some typos in the test case > description, I corrected them. > Thanks, I have pushed the patch. -- With Regards, Amit Kapila.
RE: Parallel INSERT (INTO ... SELECT ...)
Hi,

After doing some tests to cover the code paths in patch 0001, I have some suggestions for the 0002 test cases.

(1)
+ /* Check parallel-safety of any expressions in the partition key */
+ if (get_partition_col_attnum(pkey, i) == 0)
+ {
+     Node *check_expr = (Node *) lfirst(partexprs_item);
+
+     if (max_parallel_hazard_walker(check_expr, context))
+     {
+         table_close(rel, lockmode);
+         return true;
+     }

The test cases do not seem to cover the above code (i.e. test a table that has a parallel-unsafe expression in the partition key). Personally, I use the following SQL to cover this:

-----
create table partkey_unsafe_key_expr_t (a int4, b name) partition by range ((fullname_parallel_unsafe('',a::varchar)));
explain (costs off) insert into partkey_unsafe_key_expr_t select unique1, stringu1 from tenk1;
-----

(2)
I noticed that most test cases test all three cases (parallel safe/unsafe/restricted). But the index expression tests do not seem to cover the parallel-restricted case. How about adding a test case like:

-----
create or replace function fullname_parallel_restricted(f text, l text) returns text as $$
begin
    return f || l;
end;
$$ language plpgsql immutable parallel restricted;

create table names4(index int, first_name text, last_name text);
create index names4_fullname_idx on names4 (fullname_parallel_restricted(first_name, last_name));

--
-- Test INSERT with parallel-restricted index expression
-- (should create a parallel plan)
--
explain (costs off) insert into names4 select * from names;
-----

(3)
+ /* Recursively check each partition ... */
+ pdesc = RelationGetPartitionDesc(rel);
+ for (i = 0; i < pdesc->nparts; i++)
+ {
+     if (rel_max_parallel_hazard_for_modify(pdesc->oids[i],
+                                            command_type,
+                                            context,
+                                            AccessShareLock))
+     {
+         table_close(rel, lockmode);
+         return true;
+     }
+ }

It seems we do not have a test case for a parallel-unsafe expression (or similar) in a partition. How about adding one test case for a parallel-unsafe partition?

Best regards,
houzj
Re: PG vs LLVM 12 on seawasp, next round
On Mon, Jan 18, 2021 at 09:29:53PM +0100, Fabien COELHO wrote: > >>>The "no such file" error seems more like a machine local issue to me. > >> > >>I'll look into it when I have time, which make take some time. Hopefully > >>over the holidays. > > > >This is still happening ... Any chance you can have a look at it? > > Indeed. I'll try to look (again) into it soon. I had a look but did not find > anything obvious in the short time frame I had. Last two months were a > little overworked for me so I let slip quite a few things. If you want to > disable the animal as Tom suggests, do as you want. Perhaps he was suggesting that you (buildfarm owner) disable the cron job that initiates new runs. That's what I do when one of my animals needs my intervention.
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 25, 2021 at 10:23 AM tsunakawa.ta...@fujitsu.com <tsunakawa.ta...@fujitsu.com> wrote:
>
> Hello Greg-san,
>
> Second group of comments (I'll reply to (1) - (4) later):
>
> (5)
> @@ -790,7 +790,8 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
> ...
> - if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
> + if ((plannedstmt->commandType != CMD_SELECT &&
> + !IsModifySupportedInParallelMode(plannedstmt->commandType)) || plannedstmt->hasModifyingCTE)
> PreventCommandIfParallelMode(CreateCommandName((Node *) plannedstmt));
> }
>
> Now that we're trying to allow parallel writes (INSERT), we should:
>
> * use ExecCheckXactReadOnly() solely for checking read-only transactions,
> as the function name represents. That is, move the call to
> PreventCommandIfParallelMode() up to standard_ExecutorStart().
>
> * Update the comment above the call to ExecCheckXactReadOnly().
>

Hmmm, I'm not so sure. The patch changes just make the existing test for calling PreventCommandIfParallelMode() a bit more restrictive, to exclude the Parallel INSERT case. The code previously wasn't just checking read-only transactions anyway, so it's not as if the patch has changed something fundamental in this function. And by moving the PreventCommandIfParallelMode() call to a higher level, you're making a change to the existing order of error-handling (as ExecCheckXactReadOnly() is calling PreventCommandIfReadOnly() based on a few other range-table conditions, prior to testing whether to call PreventCommandIfParallelMode()). I don't want to introduce a bug by making the change that you're suggesting.

> (6)
> @@ -764,6 +777,22 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
> ...
> + else
> + {
> + pei->processed_count = NULL;
> + }
>
> The braces can be deleted.
>

Yes they can be deleted, and I guess I will, but for the record, I personally prefer the explicit brackets, even if just for one line, because:
- if more code ever needs to be added to the else, you'll need to add brackets anyway (and newbies might add extra lines tabbed in, thinking it's part of the else block ...).
- I think it looks better and slightly easier to read, especially when there's a mix of cases with multiple code lines and single code lines.
Of course, these kinds of things could be debated forever, but I don't think it's such a big deal.

> (7)
> @@ -1400,6 +1439,16 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
> true);
> queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
>
> + Assert(queryDesc->operation == CMD_SELECT || IsModifySupportedInParallelMode(queryDesc->operation));
> + if (IsModifySupportedInParallelMode(queryDesc->operation))
> + {
> + /*
> +  * Record that the CurrentCommandId is used, at the start of the
> +  * parallel operation.
> +  */
> + SetCurrentCommandIdUsedForWorker();
> + }
> +
> /* Setting debug_query_string for individual workers */
> debug_query_string = queryDesc->sourceText;
>
> @@ -765,12 +779,16 @@ GetCurrentCommandId(bool used)
> if (used)
> {
> /*
> -  * Forbid setting currentCommandIdUsed in a parallel worker, because
> -  * we have no provision for communicating this back to the leader. We
> -  * could relax this restriction when currentCommandIdUsed was already
> -  * true at the start of the parallel operation.
> +  * If in a parallel worker, only allow setting currentCommandIdUsed if
> +  * currentCommandIdUsed was already true at the start of the parallel
> +  * operation (by way of SetCurrentCommandIdUsed()), otherwise forbid
> +  * setting currentCommandIdUsed because we have no provision for
> +  * communicating this back to the leader. Once currentCommandIdUsed is
> +  * set, the commandId used by leader and workers can't be changed,
> +  * because CommandCounterIncrement() then prevents any attempted
> +  * increment of the current commandId.
> +  */
> - Assert(!IsParallelWorker());
> + Assert(!(IsParallelWorker() && !currentCommandIdUsed));
> currentCommandIdUsed = true;
> }
> return currentCommandId;
>
> What happens without these changes?

The change to the above comment explains why the change is needed. Without these changes, a call in a parallel worker to GetCurrentCommandId() will result in an Assert being fired because (prior to the patch) currentCommandIdUsed is forbidden to be set in a parallel worker, and calling GetCurrentCommandId(true) (to signify the intent to use the returned CommandId to mark inserted/updated/deleted tuples) will result in currentCommandIdUsed
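A hedged sketch of what SetCurrentCommandIdUsedForWorker() presumably looks like, inferred from the hunks quoted above; the body is my reconstruction, not quoted from the patch.

```
/*
 * Mark the current command id as used in a parallel worker, so that the
 * relaxed Assert in GetCurrentCommandId() permits later "used" calls.
 */
void
SetCurrentCommandIdUsedForWorker(void)
{
    Assert(IsParallelWorker() && !currentCommandIdUsed);

    currentCommandIdUsed = true;
}
```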
RE: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 10:34, kuroda.hay...@fujitsu.com wrote:

Dear Ikeda-san,

Thank you for updating the patch. This can be applied to master, and can be used on my RHEL7. wal_write_time and wal_sync_time increase normally :-).

```
postgres=# select * from pg_stat_wal;
-[ RECORD 1 ]----+------------------------------
wal_records      | 121781
wal_fpi          | 2287
wal_bytes        | 36055146
wal_buffers_full | 799
wal_write        | 12770
wal_write_time   | 4.469
wal_sync         | 11962
wal_sync_time    | 132.352
stats_reset      | 2021-01-25 00:51:40.674412+00
```

Thanks for checking.

I put a further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i

 wal_buffers_full bigint
-  Number of times WAL data was written to disk because WAL buffers became full
+  Total number of times WAL data was written to disk because WAL buffers became full

+ wal_write bigint
+  Total number of times WAL data was written to disk

+ wal_write_time double precision
+  Total amount of time that has been spent in the portion of WAL data was
+  written to disk, in milliseconds (if track_wal_io_timing is enabled,
+  otherwise zero).

+ wal_sync bigint
+  Total number of times WAL data was synced to disk (if wal_sync_method is
+  open_datasync or open_sync, this value is zero because WAL data is synced
+  when to write it).

+ wal_sync_time double precision
+  Total amount of time that has been spent in the portion of WAL data was
+  synced to disk, in milliseconds (if track_wal_io_timing is enabled,
+  otherwise zero; if wal_sync_method is open_datasync or open_sync, this
+  value is zero too because WAL data is synced when to write it).
```

Maybe "Total amount of time" should be used, not "Total number of time." Other views use "amount."

Thanks. I checked columns' descriptions of other views. There are "Number of xxx", "Total number of xxx", "Total amount of time that xxx" and "Total time spent xxx". Since the "time" is used for showing spending time, not count, I'll change it to "Total number of WAL data written/synced to disk". Thought?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
Re: Single transaction in the tablesync worker?
On Mon, Jan 25, 2021 at 8:23 AM Peter Smith wrote:
>
> On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila wrote:
> >
> > 2.
> > @@ -98,11 +102,16 @@
> > #include "miscadmin.h"
> > #include "parser/parse_relation.h"
> > #include "pgstat.h"
> > +#include "postmaster/interrupt.h"
> > #include "replication/logicallauncher.h"
> > #include "replication/logicalrelation.h"
> > +#include "replication/logicalworker.h"
> > #include "replication/walreceiver.h"
> > #include "replication/worker_internal.h"
> > +#include "replication/slot.h"
> >
> > I don't think the above includes are required. They seem to the
> > remnant of the previous approach.
> >
>
> OK. Fixed in the latest patch [v19].
>

You seem to have forgotten to remove #include "replication/slot.h". Check it; if it is not required then remove that as well.

--
With Regards,
Amit Kapila.
Re: PG vs LLVM 12 on seawasp, next round
Noah Misch writes: > On Mon, Jan 18, 2021 at 09:29:53PM +0100, Fabien COELHO wrote: >> ... Last two months were a >> little overworked for me so I let slip quite a few things. If you want to >> disable the animal as Tom suggests, do as you want. > Perhaps he was suggesting that you (buildfarm owner) disable the cron job that > initiates new runs. That's what I do when one of my animals needs my > intervention. Indeed. I'm not sure there even is a provision to block an animal on the buildfarm-server side. If there is, you'd have to request that it be manually undone after you get around to fixing the animal. Frankly, if I were the BF admin, I would be in about as much hurry to do that as you've been to fix it. regards, tom lane
RE: About to add WAL write/fsync statistics to pg_stat_wal view
Dear Ikeda-san,

> I checked columns' descriptions of other views.
> There are "Number of xxx", "Total number of xxx", "Total amount of time
> that xxx" and "Total time spent xxx".

Right.

> Since the "time" is used for showing spending time, not count,
> I'll change it to "Total number of WAL data written/synced to disk".
> Thought?

I misread your patch, sorry. I prefer your suggestion. Please fix it that way, along with the others.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 10:36, Masahiko Sawada wrote: On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: On 2021-01-22 14:50, Masahiko Sawada wrote: > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > wrote: >> >> Hi, >> >> I rebased the patch to the master branch. > > Thank you for working on this. I've read the latest patch. Here are > comments: > > --- > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > > * I think it should add the time in micro sec. > After running pgbench with track_wal_io_timing = on for 30 sec, > pg_stat_wal showed the following on my environment: > > postgres(1:61569)=# select * from pg_stat_wal; > -[ RECORD 1 ]+- > wal_records | 285947 > wal_fpi | 53285 > wal_bytes| 442008213 > wal_buffers_full | 0 > wal_write| 25516 > wal_write_time | 0 > wal_sync | 25437 > wal_sync_time| 14490 > stats_reset | 2021-01-22 10:56:13.29464+09 > > Since writes can complete less than a millisecond, wal_write_time > didn't increase. I think sync_time could also have the same problem. Thanks for your comments. I didn't notice that. I changed the unit from milliseconds to microseconds. > --- > + /* > +* Measure i/o timing to fsync WAL data. > +* > +* The wal receiver skip to collect it to avoid performance > degradation of standy servers. > +* If sync_method doesn't have its fsync method, to skip too. > +*/ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > > * Why does only the wal receiver skip it even if track_wal_io_timinig > is true? I think the performance degradation is also true for backend > processes. If there is another reason for that, I think it's better to > mention in both the doc and comment. > * How about checking track_wal_io_timing first? > * s/standy/standby/ I fixed it. As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? Thanks, I forgot to add them. I'll fix it. > --- > + /* increment the i/o timing and the number of times to fsync WAL > data */ > + if (fsyncMethodCalled()) > + { > + if (!AmWalReceiverProcess() && track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > incrementing m_wal_sync after each sync operation? I think if syncing the disk does not occur, m_wal_sync should not be incremented. It depends enableFsync and sync_method. enableFsync is checked in each fsync method like pg_fsync_no_writethrough(), so if incrementing m_wal_sync after each sync operation, it should be implemented in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. So, why don't you change the function to a flag whether to sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync we can write the code without an additional variable as follows: if (enableFsync) { switch (sync_method) { case SYNC_METHOD_FSYNC: #ifdef HAVE_FSYNC_WRITETHROUGH case SYNC_METHOD_FSYNC_WRITETHROUGH: #endif #ifdef HAVE_FDATASYNC case SYNC_METHOD_FDATASYNC: #endif WalStats.m_wal_sync++; if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); break; default: break; } } (do fsync and error handling here) /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); } IIUC, I think we can't handle the following case. When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and "track_wal_io_timing" is enabled, "start" doesn't be initialized. My understanding is something wrong, isn't it? I think we can change the first switch-case to an if statement. Yes, I'll change it. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
RE: POC: postgres_fdw insert batching
From: Amit Langote
> Yes, it can be simplified by using a local join to prevent the update of
> the foreign partition from being pushed to the remote server, for which my
> example in the previous email used a local trigger. Note that the update
> of the foreign partition to be done locally is a prerequisite for this bug
> to occur.

Thank you. I was aware that UPDATE calls ExecInsert() but forgot about it partway through. Good catch (and my bad for missing it.)

+ PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
+     (PgFdwModifyState *) resultRelInfo->ri_FdwState :
+     NULL;

This can be written as:

+ PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;

Regards
Takayuki Tsunakawa
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 11:47, japin wrote: On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada wrote: On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: On 2021-01-22 14:50, Masahiko Sawada wrote: > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > wrote: >> >> Hi, >> >> I rebased the patch to the master branch. > > Thank you for working on this. I've read the latest patch. Here are > comments: > > --- > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > > * I think it should add the time in micro sec. > After running pgbench with track_wal_io_timing = on for 30 sec, > pg_stat_wal showed the following on my environment: > > postgres(1:61569)=# select * from pg_stat_wal; > -[ RECORD 1 ]+- > wal_records | 285947 > wal_fpi | 53285 > wal_bytes| 442008213 > wal_buffers_full | 0 > wal_write| 25516 > wal_write_time | 0 > wal_sync | 25437 > wal_sync_time| 14490 > stats_reset | 2021-01-22 10:56:13.29464+09 > > Since writes can complete less than a millisecond, wal_write_time > didn't increase. I think sync_time could also have the same problem. Thanks for your comments. I didn't notice that. I changed the unit from milliseconds to microseconds. > --- > + /* > +* Measure i/o timing to fsync WAL data. > +* > +* The wal receiver skip to collect it to avoid performance > degradation of standy servers. > +* If sync_method doesn't have its fsync method, to skip too. > +*/ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > > * Why does only the wal receiver skip it even if track_wal_io_timinig > is true? I think the performance degradation is also true for backend > processes. If there is another reason for that, I think it's better to > mention in both the doc and comment. > * How about checking track_wal_io_timing first? > * s/standy/standby/ I fixed it. As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? > --- > + /* increment the i/o timing and the number of times to fsync WAL > data */ > + if (fsyncMethodCalled()) > + { > + if (!AmWalReceiverProcess() && track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > incrementing m_wal_sync after each sync operation? I think if syncing the disk does not occur, m_wal_sync should not be incremented. It depends enableFsync and sync_method. enableFsync is checked in each fsync method like pg_fsync_no_writethrough(), so if incrementing m_wal_sync after each sync operation, it should be implemented in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. So, why don't you change the function to a flag whether to sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync we can write the code without an additional variable as follows: if (enableFsync) { switch (sync_method) { case SYNC_METHOD_FSYNC: #ifdef HAVE_FSYNC_WRITETHROUGH case SYNC_METHOD_FSYNC_WRITETHROUGH: #endif #ifdef HAVE_FDATASYNC case SYNC_METHOD_FDATASYNC: #endif WalStats.m_wal_sync++; if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); break; default: break; } } (do fsync and error handling here) /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); } I think we can change the first switch-case to an if statement. +1. We can also narrow the scope of "duration" into "if (track_wal_io_timing)" branch. Thanks, I'll change it. -- Masahiro Ikeda NTT DATA CORPORATION
Re: Is Recovery actually paused?
On Mon, Jan 25, 2021 at 6:12 AM Masahiko Sawada wrote: > > On Sun, Jan 17, 2021 at 5:19 PM Dilip Kumar wrote: > > > > On Sat, Jan 16, 2021 at 8:59 AM Masahiko Sawada > > wrote: > > > > > > On Wed, Jan 13, 2021 at 9:20 PM Dilip Kumar wrote: > > > > > > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar > > > > wrote: > > > > > > > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA > > > > > wrote: > > > > > > > > > > > > On Thu, 10 Dec 2020 11:25:23 +0530 > > > > > > Dilip Kumar wrote: > > > > > > > > > > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused > > > > > > > > > to wait. > > > > > > > > > Especially, if max_standby_streaming_delay is -1, this will > > > > > > > > > be blocked forever, > > > > > > > > > although this setting may not be usual. In addition, some > > > > > > > > > users may set > > > > > > > > > recovery_min_apply_delay for a large. If such users call > > > > > > > > > pg_is_wal_replay_paused, > > > > > > > > > it could wait for a long time. > > > > > > > > > > > > > > > > > > At least, I think we need some descriptions on document to > > > > > > > > > explain > > > > > > > > > pg_is_wal_replay_paused could wait while a time. > > > > > > > > > > > > > > > > Ok > > > > > > > > > > > > > > Fixed this, added some comments in .sgml as well as in function > > > > > > > header > > > > > > > > > > > > Thank you for fixing this. > > > > > > > > > > > > Also, is it better to fix the description of pg_wal_replay_pause > > > > > > from > > > > > > "Pauses recovery." to "Request to pause recovery." in according with > > > > > > pg_is_wal_replay_paused? > > > > > > > > > > Okay > > > > > > > > > > > > > > > > > > > > Also, how about adding a new boolean argument to > > > > > > > > > pg_is_wal_replay_paused to > > > > > > > > > control whether this waits for recovery to get paused or not? > > > > > > > > > By setting its > > > > > > > > > default value to true or false, users can use the old format > > > > > > > > > for calling this > > > > > > > > > and the backward compatibility can be maintained. > > > > > > > > > > > > > > > > So basically, if the wait_recovery_pause flag is false then we > > > > > > > > will > > > > > > > > immediately return true if the pause is requested? I agree > > > > > > > > that it is > > > > > > > > good to have an API to know whether the recovery pause is > > > > > > > > requested or > > > > > > > > not but I am not sure is it good idea to make this API serve > > > > > > > > both the > > > > > > > > purpose? Anyone else have any thoughts on this? > > > > > > > > > > > > > > > > > > > > I think the current pg_is_wal_replay_paused() already has another > > > > > > purpose; > > > > > > this waits recovery to actually get paused. If we want to limit > > > > > > this API's > > > > > > purpose only to return the pause state, it seems better to fix this > > > > > > to return > > > > > > the actual state at the cost of lacking the backward compatibility. > > > > > > If we want > > > > > > to know whether pause is requested, we may add a new API like > > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait > > > > > > recovery to actually > > > > > > get paused, we may add an option to pg_wal_replay_pause() for this > > > > > > purpose. > > > > > > > > > > > > However, this might be a bikeshedding. If anyone don't care that > > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I > > > > > > don't care either. 
> > > > > > > > > > I don't think that it will be blocked ever, because > > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the > > > > > recovery process will not be stuck on waiting for the WAL. > > > > > > > > > > > > > > As another comment, while pg_is_wal_replay_paused is > > > > > > > > > blocking, I can not cancel > > > > > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the > > > > > > > > > waiting loop. > > > > > > > > > > > > How about this fix? I think users may want to cancel > > > > > > pg_is_wal_replay_paused() during > > > > > > this is blocking. > > > > > > > > > > Yeah, we can do this. I will send the updated patch after putting > > > > > some more thought into these comments. Thanks again for the feedback. > > > > > > > > > > > > > Please find the updated patch. > > > > > > I've looked at the patch. Here are review comments: > > > > > > + /* Recovery pause state */ > > > + RecoveryPauseState recoveryPause; > > > > > > Now that the value can have tri-state, how about renaming it to > > > recoveryPauseState? > > > > This makes sense to me. > > > > > --- > > > bool > > > RecoveryIsPaused(void) > > > +{ > > > + boolrecoveryPause; > > > + > > > + SpinLockAcquire(&XLogCtl->info_lck); > > > + recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) ? > > > true : false; > > > + SpinLockRelease(&XLogCtl->info_lck); > > > + > > > + return recoveryPause; > > > +} > > >
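For reference, a cleaned-up form of the RecoveryIsPaused() hunk quoted just above; the ternary is redundant because the comparison already yields a bool (this is a sketch of the reviewed code, not a new proposal).

```
bool
RecoveryIsPaused(void)
{
    bool        recoveryPause;

    SpinLockAcquire(&XLogCtl->info_lck);
    recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED);
    SpinLockRelease(&XLogCtl->info_lck);

    return recoveryPause;
}
```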
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-01-25 13:15, Masahiro Ikeda wrote: On 2021-01-25 10:36, Masahiko Sawada wrote: On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda wrote: On 2021-01-22 14:50, Masahiko Sawada wrote: > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > wrote: >> >> Hi, >> >> I rebased the patch to the master branch. > > Thank you for working on this. I've read the latest patch. Here are > comments: > > --- > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > > * I think it should add the time in micro sec. > After running pgbench with track_wal_io_timing = on for 30 sec, > pg_stat_wal showed the following on my environment: > > postgres(1:61569)=# select * from pg_stat_wal; > -[ RECORD 1 ]+- > wal_records | 285947 > wal_fpi | 53285 > wal_bytes| 442008213 > wal_buffers_full | 0 > wal_write| 25516 > wal_write_time | 0 > wal_sync | 25437 > wal_sync_time| 14490 > stats_reset | 2021-01-22 10:56:13.29464+09 > > Since writes can complete less than a millisecond, wal_write_time > didn't increase. I think sync_time could also have the same problem. Thanks for your comments. I didn't notice that. I changed the unit from milliseconds to microseconds. > --- > + /* > +* Measure i/o timing to fsync WAL data. > +* > +* The wal receiver skip to collect it to avoid performance > degradation of standy servers. > +* If sync_method doesn't have its fsync method, to skip too. > +*/ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > > * Why does only the wal receiver skip it even if track_wal_io_timinig > is true? I think the performance degradation is also true for backend > processes. If there is another reason for that, I think it's better to > mention in both the doc and comment. > * How about checking track_wal_io_timing first? > * s/standy/standby/ I fixed it. As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? Thanks, I forgot to add them. I'll fix it. > --- > + /* increment the i/o timing and the number of times to fsync WAL > data */ > + if (fsyncMethodCalled()) > + { > + if (!AmWalReceiverProcess() && track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > incrementing m_wal_sync after each sync operation? I think if syncing the disk does not occur, m_wal_sync should not be incremented. It depends enableFsync and sync_method. enableFsync is checked in each fsync method like pg_fsync_no_writethrough(), so if incrementing m_wal_sync after each sync operation, it should be implemented in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. So, why don't you change the function to a flag whether to sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. 
Since we don't necessarily need to increment m_wal_sync after doing fsync, we can write the code without an additional variable as follows: if (enableFsync) { switch (sync_method) { case SYNC_METHOD_FSYNC: #ifdef HAVE_FSYNC_WRITETHROUGH case SYNC_METHOD_FSYNC_WRITETHROUGH: #endif #ifdef HAVE_FDATASYNC case SYNC_METHOD_FDATASYNC: #endif WalStats.m_wal_sync++; if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); break; default: break; } } (do fsync and error handling here) /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration); } IIUC, I think we can't handle the following case. When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and "track_wal_io_timing" is enabled, "start" is not initialized. Is my understanding wrong? I thought the following is better. ``` /* Measure I/O timing to sync WAL data. */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); (do fsync and error handling here
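As a sketch of how these two points could be reconciled (assuming the patch's WalStats fields and the xlog.c context; this is illustrative, not the committed code): capture "start" only when an explicit fsync-style call will actually happen, and let the same flag guard the timing block, so that SYNC_METHOD_OPEN / SYNC_METHOD_OPEN_DSYNC never read an uninitialized "start".

	instr_time	start;
	bool		sync_called;

	/* Will issue_xlog_fsync() actually call an fsync-style method? */
	sync_called = enableFsync &&
		(sync_method == SYNC_METHOD_FSYNC ||
#ifdef HAVE_FSYNC_WRITETHROUGH
		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
#endif
#ifdef HAVE_FDATASYNC
		 sync_method == SYNC_METHOD_FDATASYNC ||
#endif
		 false);

	if (sync_called && track_wal_io_timing)
		INSTR_TIME_SET_CURRENT(start);

	/* ... do the fsync and error handling here, as in the existing code ... */

	if (sync_called)
	{
		WalStats.m_wal_sync++;
		if (track_wal_io_timing)
		{
			instr_time	duration;

			INSTR_TIME_SET_CURRENT(duration);
			INSTR_TIME_SUBTRACT(duration, start);
			WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
		}
	}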
Re: Is Recovery actually paused?
On Mon, Jan 25, 2021 at 8:42 AM Kyotaro Horiguchi wrote: > > At Sun, 24 Jan 2021 14:26:08 +0530, Dilip Kumar wrote > in > > On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy > > wrote: > > > > > > On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar wrote: > > > > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy > > > > wrote: > > > >> > > > >> On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar > > > >> wrote: > > > >> > Please find the patch for the same. I haven't added a test case for > > > >> > this yet. I mean we can write a test case to pause the recovery and > > > >> > get the status. But I am not sure that we can really write a > > > >> > reliable > > > >> > test case for 'pause requested' and 'paused'. > > > >> > > > >> +1 to just show the recovery pause state in the output of > > > >> pg_is_wal_replay_paused. But, should the function name > > > >> "pg_is_wal_replay_paused" be something like > > > >> "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists > > > >> in a function, I expect a boolean output. Others may have better > > > >> thoughts. > > > >> > > > >> IIUC the above change, ensuring the recovery is paused after it's > > > >> requested lies with the user. IMHO, the main problem we are trying to > > > >> solve is not met > > > > > > > > > > > > Basically earlier their was no way for the user yo know whether the > > > > recovery is actually paused or not because it was always returning true > > > > after pause requested. Now, we will return whether pause requested or > > > > actually paused. So > for tool designer who want to wait for recovery > > > > to get paused can have a loop and wait until the recovery state reaches > > > > to paused. That will give a better control. > > > > > > I get it and I agree to have that change. My point was whether we can > > > have a new function pg_wal_replay_pause_and_wait that waits until > > > recovery is actually paused ((along with pg_is_wal_replay_paused > > > returning the actual state than a true/false) so that tool developers > > > don't need to have the waiting code outside, if at all they care about > > > it? Others may have better thoughts than me. > > > > I think the previous patch was based on that idea where we thought > > that we can pass an argument to pg_is_wal_replay_paused which can > > decide whether to wait or return without the wait. I think this > > version looks better to me where we give the status instead of > > waiting. I am not sure whether we want another version of > > pg_wal_replay_pause which will wait for actually it to get paused. I > > mean there is always a scope to include the functionality in the > > database which can be achieved by the tool, but this patch was trying > > to solve the problem that there was no way to know the status so I > > think returning the correct status should be the scope of this. > > I understand that the requirement here is that no record is replayed > after pg_wal_replay_pause() is returned, or pg_is_wal_replay_paused() > returns true, and delays taken while recovery don't delay the state > change. That requirements are really synchronous. > > On the other hand the machinery is designed to be asynchronous. > > >* Note that we intentionally don't take the info_lck spinlock > >* here. We might therefore read a slightly stale value of > >* the recoveryPause flag, but it can't be very stale (no > >* worse than the last spinlock we did acquire). 
Since a > >* pause request is a pretty asynchronous thing anyway, > >* possibly responding to it one WAL record later than we > >* otherwise would is a minor issue, so it doesn't seem worth > >* adding another spinlock cycle to prevent that. > > As a result, this patch tries to introduce several new checkpoints > at some delaying points so that those waits can notice a pause request in a > timely manner. I think we had better use locking (or atomics) for the > information instead of such scattered checkpoints if we expect that > machinery to work in such a synchronous manner. > > That would make the tri-state state variable and many checkpoints > unnecessary. Maybe. I don't think the intention was to make it synchronous; I think the main intention was that pg_is_wal_replay_paused can return us the correct state; in short, the user can know whether any more WAL will be replayed after pg_is_wal_replay_paused returns true or some other state. I agree that along with that we have also introduced some extra checkpoints where the recovery process waits for WAL or the apply delay, and from pg_wal_replay_pause we send a signal to wake up the recovery process. So I am not sure whether it is worth adding a lock/atomic variable to make this synchronous. Any other thoughts on this? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
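To illustrate the tool-side wait loop described here, a sketch in libpq. It assumes, per this thread's patch, that pg_is_wal_replay_paused() returns a text state such as 'pause requested' or 'paused' rather than a boolean; the function name and labels are the ones under discussion, not a settled API:

#include <string.h>
#include <unistd.h>
#include "libpq-fe.h"

/* Block until recovery is actually paused, not merely pause-requested. */
static void
wait_for_replay_pause(PGconn *conn)
{
	PQclear(PQexec(conn, "SELECT pg_wal_replay_pause()"));

	for (;;)
	{
		PGresult   *res = PQexec(conn, "SELECT pg_is_wal_replay_paused()");
		int			paused = (PQresultStatus(res) == PGRES_TUPLES_OK &&
							  strcmp(PQgetvalue(res, 0, 0), "paused") == 0);

		PQclear(res);
		if (paused)
			break;
		usleep(100 * 1000);		/* poll every 100 ms */
	}
}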
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On Mon, Jan 25, 2021 at 1:28 PM Masahiro Ikeda wrote: > > On 2021-01-25 13:15, Masahiro Ikeda wrote: > > On 2021-01-25 10:36, Masahiko Sawada wrote: > >> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda > >> wrote: > >>> > >>> On 2021-01-22 14:50, Masahiko Sawada wrote: > >>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > >>> > wrote: > >>> >> > >>> >> Hi, > >>> >> > >>> >> I rebased the patch to the master branch. > >>> > > >>> > Thank you for working on this. I've read the latest patch. Here are > >>> > comments: > >>> > > >>> > --- > >>> > + if (track_wal_io_timing) > >>> > + { > >>> > + INSTR_TIME_SET_CURRENT(duration); > >>> > + INSTR_TIME_SUBTRACT(duration, start); > >>> > + WalStats.m_wal_write_time += > >>> > INSTR_TIME_GET_MILLISEC(duration); > >>> > + } > >>> > > >>> > * I think it should add the time in micro sec. > >>> > After running pgbench with track_wal_io_timing = on for 30 sec, > >>> > pg_stat_wal showed the following on my environment: > >>> > > >>> > postgres(1:61569)=# select * from pg_stat_wal; > >>> > -[ RECORD 1 ]+- > >>> > wal_records | 285947 > >>> > wal_fpi | 53285 > >>> > wal_bytes| 442008213 > >>> > wal_buffers_full | 0 > >>> > wal_write| 25516 > >>> > wal_write_time | 0 > >>> > wal_sync | 25437 > >>> > wal_sync_time| 14490 > >>> > stats_reset | 2021-01-22 10:56:13.29464+09 > >>> > > >>> > Since writes can complete less than a millisecond, wal_write_time > >>> > didn't increase. I think sync_time could also have the same problem. > >>> > >>> Thanks for your comments. I didn't notice that. > >>> I changed the unit from milliseconds to microseconds. > >>> > >>> > --- > >>> > + /* > >>> > +* Measure i/o timing to fsync WAL data. > >>> > +* > >>> > +* The wal receiver skip to collect it to avoid performance > >>> > degradation of standy servers. > >>> > +* If sync_method doesn't have its fsync method, to skip too. > >>> > +*/ > >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing && > >>> > fsyncMethodCalled()) > >>> > + INSTR_TIME_SET_CURRENT(start); > >>> > > >>> > * Why does only the wal receiver skip it even if track_wal_io_timinig > >>> > is true? I think the performance degradation is also true for backend > >>> > processes. If there is another reason for that, I think it's better to > >>> > mention in both the doc and comment. > >>> > * How about checking track_wal_io_timing first? > >>> > * s/standy/standby/ > >>> > >>> I fixed it. > >>> As kuroda-san mentioned too, the skip is no need to be considered. > >> > >> I think you also removed the code to have the wal receiver report the > >> stats. So with the latest patch, the wal receiver tracks those > >> statistics but doesn't report. > >> And maybe XLogWalRcvWrite() also needs to track I/O? > > > > Thanks, I forgot to add them. > > I'll fix it. > > > > > >>> > >>> > --- > >>> > + /* increment the i/o timing and the number of times to fsync WAL > >>> > data */ > >>> > + if (fsyncMethodCalled()) > >>> > + { > >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing) > >>> > + { > >>> > + INSTR_TIME_SET_CURRENT(duration); > >>> > + INSTR_TIME_SUBTRACT(duration, start); > >>> > + WalStats.m_wal_sync_time += > >>> > INSTR_TIME_GET_MILLISEC(duration); > >>> > + } > >>> > + > >>> > + WalStats.m_wal_sync++; > >>> > + } > >>> > > >>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about > >>> > incrementing m_wal_sync after each sync operation? > >>> > >>> I think if syncing the disk does not occur, m_wal_sync should not be > >>> incremented. 
> >>> It depends enableFsync and sync_method. > >>> > >>> enableFsync is checked in each fsync method like > >>> pg_fsync_no_writethrough(), > >>> so if incrementing m_wal_sync after each sync operation, it should be > >>> implemented > >>> in each fsync method. It leads to many duplicated codes. > >> > >> Right. I missed that each fsync function checks enableFsync. > >> > >>> So, why don't you change the function to a flag whether to > >>> sync data to the disk will be occurred or not in issue_xlog_fsync()? > >> > >> Looks better. Since we don't necessarily need to increment m_wal_sync > >> after doing fsync we can write the code without an additional variable > >> as follows: > >> > >> if (enableFsync) > >> { > >> switch (sync_method) > >> { > >> case SYNC_METHOD_FSYNC: > >> #ifdef HAVE_FSYNC_WRITETHROUGH > >> case SYNC_METHOD_FSYNC_WRITETHROUGH: > >> #endif > >> #ifdef HAVE_FDATASYNC > >> case SYNC_METHOD_FDATASYNC: > >> #endif > >> WalStats.m_wal_sync++; > >> if (track_wal_io_timing) > >> INSTR_TIME_SET_CURRENT(start); > >> break; > >> default: > >>
Re: Is it useful to record whether plans are generic or custom?
At Tue, 12 Jan 2021 20:36:58 +0900, torikoshia wrote in > I suppose it would be normal practice to store past results of > pg_stat_statements for future comparisons. > If this is the case, I think that if we only add the number of > generic plan execution, it will give us a hint to notice the cause > of performance degradation due to changes in the plan between > generic and custom. Agreed. > Attached a patch that just adds a generic call counter to > pg_stat_statements. > > Any thoughts? Note that ActivePortal is the closest nested portal. So it gives the wrong result for nested portals. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Is Recovery actually paused?
At Mon, 25 Jan 2021 10:05:19 +0530, Dilip Kumar wrote in > On Mon, Jan 25, 2021 at 8:42 AM Kyotaro Horiguchi > wrote: > > > > At Sun, 24 Jan 2021 14:26:08 +0530, Dilip Kumar > > wrote in > > > On Sun, Jan 24, 2021 at 12:16 PM Bharath Rupireddy > > > wrote: > > > > > > > > On Sun, Jan 24, 2021 at 7:17 AM Dilip Kumar > > > > wrote: > > > > > On Sat, 23 Jan 2021 at 4:40 PM, Bharath Rupireddy > > > > > wrote: > > > > >> > > > > >> On Sat, Jan 23, 2021 at 1:36 PM Dilip Kumar > > > > >> wrote: > > > > >> > Please find the patch for the same. I haven't added a test case > > > > >> > for > > > > >> > this yet. I mean we can write a test case to pause the recovery > > > > >> > and > > > > >> > get the status. But I am not sure that we can really write a > > > > >> > reliable > > > > >> > test case for 'pause requested' and 'paused'. > > > > >> > > > > >> +1 to just show the recovery pause state in the output of > > > > >> pg_is_wal_replay_paused. But, should the function name > > > > >> "pg_is_wal_replay_paused" be something like > > > > >> "pg_get_wal_replay_pause_state" or some other? To me, when "is" > > > > >> exists > > > > >> in a function, I expect a boolean output. Others may have better > > > > >> thoughts. > > > > >> > > > > >> IIUC the above change, ensuring the recovery is paused after it's > > > > >> requested lies with the user. IMHO, the main problem we are trying to > > > > >> solve is not met > > > > > > > > > > > > > > > Basically earlier their was no way for the user yo know whether the > > > > > recovery is actually paused or not because it was always returning > > > > > true after pause requested. Now, we will return whether pause > > > > > requested or actually paused. So > for tool designer who want to > > > > > wait for recovery to get paused can have a loop and wait until the > > > > > recovery state reaches to paused. That will give a better control. > > > > > > > > I get it and I agree to have that change. My point was whether we can > > > > have a new function pg_wal_replay_pause_and_wait that waits until > > > > recovery is actually paused ((along with pg_is_wal_replay_paused > > > > returning the actual state than a true/false) so that tool developers > > > > don't need to have the waiting code outside, if at all they care about > > > > it? Others may have better thoughts than me. > > > > > > I think the previous patch was based on that idea where we thought > > > that we can pass an argument to pg_is_wal_replay_paused which can > > > decide whether to wait or return without the wait. I think this > > > version looks better to me where we give the status instead of > > > waiting. I am not sure whether we want another version of > > > pg_wal_replay_pause which will wait for actually it to get paused. I > > > mean there is always a scope to include the functionality in the > > > database which can be achieved by the tool, but this patch was trying > > > to solve the problem that there was no way to know the status so I > > > think returning the correct status should be the scope of this. > > > > I understand that the requirement here is that no record is replayed > > after pg_wal_replay_pause() is returned, or pg_is_wal_replay_paused() > > returns true, and delays taken while recovery don't delay the state > > change. That requirements are really synchronous. > > > > On the other hand the machinery is designed to be asynchronous. > > > > >* Note that we intentionally don't take the info_lck spinlock > > >* here. 
We might therefore read a slightly stale value of > > >* the recoveryPause flag, but it can't be very stale (no > > >* worse than the last spinlock we did acquire). Since a > > >* pause request is a pretty asynchronous thing anyway, > > >* possibly responding to it one WAL record later than we > > >* otherwise would is a minor issue, so it doesn't seem worth > > >* adding another spinlock cycle to prevent that. > > > > As a result, this patch tries to introduce several new checkpoints > > at some delaying points so that those waits can notice a pause request in a > > timely manner. I think we had better use locking (or atomics) for the > > information instead of such scattered checkpoints if we expect that > > machinery to work in such a synchronous manner. > > > > That would make the tri-state state variable and many checkpoints > > unnecessary. Maybe. > > I don't think the intention was to make it synchronous; I think > the main intention was that pg_is_wal_replay_paused can return us the > correct state; in short, the user can know whether any more WAL will > be replayed after pg_is_wal_replay_paused returns true or some other > state. I agree that along with that we have also introduced some I meant that kind of correctness in a time-series by using the word "synchronous". So it can be implemented both by adopting many checkpoints and by just making the state change synchronous.
Re: ModifyTable overheads in generic plans
On Tue, Dec 22, 2020 at 5:16 PM Amit Langote wrote: > On Mon, Dec 7, 2020 at 3:53 PM Amit Langote wrote: > > On Thu, Nov 12, 2020 at 5:04 PM Amit Langote > > wrote: > > > Attached new 0002 which does these adjustments. I went with > > > ri_RootTargetDesc to go along with ri_RelationDesc. > > > > > > Also, I have updated the original 0002 (now 0003) to make > > > GetChildToRootMap() use ri_RootTargetDesc instead of > > > ModifyTableState.rootResultRelInfo.ri_RelationDesc, so that even > > > AfterTriggerSaveEvent() can now use that function. This allows us to > > > avoid having to initialize ri_ChildToRootMap anywhere but inside > > > GetChildRootMap(), with that long comment defending doing so. :-) > > > > These needed to be rebased due to recent copy.c upheavals. Attached. > > Needed to be rebased again. And again, this time over the recent batch insert API related patches. -- Amit Langote EDB: http://www.enterprisedb.com v13-0001-Set-ForeignScanState.resultRelInfo-lazily.patch Description: Binary data v13-0002-Rethink-ResultRelInfo.ri_PartitionRoot.patch Description: Binary data v13-0003-Initialize-result-relation-information-lazily.patch Description: Binary data
RE: Parallel INSERT (INTO ... SELECT ...)
From: Greg Nancarrow > > (1) > Yes, you're right the wording is not right (and I don't really like > the wording used before the patch). > > Perhaps it could say: > > (Note that we do allow CREATE TABLE AS, INSERT INTO...SELECT, SELECT > INTO, and CREATE MATERIALIZED VIEW to use parallel plans. However, as > of now, other than in the case of INSERT INTO...SELECT, only the leader > backend > writes into a completely new table. In the future, we can extend it to > allow workers for the > other commands to write into the table. However, to allow parallel > updates and deletes, we > have to solve other problems, especially around combo CIDs.) That looks good to me, thanks. > > (4) > You could, but I'm not sure it would make the code easier to read, > especially for those who don't know !isParallelWorker() means it's a > parallel leader. ... > I really can't see a problem. PrepareParallelMode() is only needed > prior to execution of a parallel plan, so it's not needed for "other > call sites" using EnterParallelMode(). My frank first impressions were (and are): * Why do we have to call a separate function for preparation when the actual entering follows immediately? We can do the necessary preparation in the entering function. * Those who read the parallel index build and parallel VACUUM code for the first time might be startled at the missing PrepareParallelMode() call: "Oh, EnterParallelMode() is called without preparation, unlike the other site I saw the other day. Isn't this a bug?" > Perhaps the name can be changed to disassociate it from generic > EnterParallelMode() usage. So far, I've only thought of long names > like: PrepareParallelModePlanExec(). > Ideas? What PrepareParallelMode() handles is the XID and command ID, which are managed by the access/transam/ module and are not executor-specific. It's natural (or at least not unnatural) that EnterParallelMode() prepares them, because EnterParallelMode() is part of access/transam/. Regards Takayuki Tsunakawa
Re: Single transaction in the tablesync worker?
On Mon, Jan 25, 2021 at 8:03 AM Peter Smith wrote: > > Hi Amit. > > PSA the v19 patch for the Tablesync Solution1. > I see one race condition in this patch where we try to drop the origin via the apply process and DropSubscription. I think it can lead to the error "cache lookup failed for replication origin with oid %u". The same problem can happen via the exposed API pg_replication_origin_drop, but probably because it is not used concurrently, nobody has faced this issue. I think for the matter of this patch we can try to suppress such an error either via try..catch, or by adding a missing_ok argument to the replorigin_drop API, or we can just add to the comments that such a race exists. Additionally, we should try to start a new thread about the existence of this problem in pg_replication_origin_drop. What do you think? -- With Regards, Amit Kapila.
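As a sketch of the missing_ok idea (the third argument is hypothetical; the in-tree replorigin_drop() takes only roident and nowait), the lookup at the top of the function would become something like:

	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
	if (!HeapTupleIsValid(tuple))
	{
		if (missing_ok)
			return;				/* origin concurrently dropped: not an error */
		elog(ERROR, "cache lookup failed for replication origin with oid %u",
			 roident);
	}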
Re: Release SPI plans for referential integrity with DISCARD ALL
Hello, Thank you for your comments. Following Corey's advice, I applied Amit's patches proposed in this email [1], and confirmed our memory pressure problem was solved. So dropping cached plans with DISCARD is not necessary anymore. [1] https://www.postgresql.org/message-id/CA%2BHiwqG1qQuBwApueaUfA855UJ4TiSgFkPF34hQDWx3tOChV5w%40mail.gmail.com -- Best regards, Yuzuko Hosoya NTT Open Source Software Center
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 25, 2021 at 2:22 PM Hou, Zhijie wrote: > Hi, > > After doing some test to cover the code path in the PATCH 0001. > I have some suggestions for the 0002 testcase. > > > (1) > + /* Check parallel-safety of any expressions in the > partition key */ > + if (get_partition_col_attnum(pkey, i) == 0) > + { > + Node *check_expr = (Node *) > lfirst(partexprs_item); > + > + if (max_parallel_hazard_walker(check_expr, > context)) > + { > + table_close(rel, lockmode); > + return true; > + } > > The testcase seems does not cover the above code(test when the table have > parallel unsafe expression in the partition key). > > Personally, I use the following sql to cover this: > - > create table partkey_unsafe_key_expr_t (a int4, b name) partition by range > ((fullname_parallel_unsafe('',a::varchar))); > explain (costs off) insert into partkey_unsafe_key_expr_t select unique1, > stringu1 from tenk1; > - > > Thanks. It looks like that test case was accidentally missed (since the comment said to test the index expressions, but it actually tested the support functions). I'll update the test code (and comments) accordingly, using your suggestion. > > (2) > I noticed that most of testcase test both (parallel > safe/unsafe/restricted). > But the index expression seems does not test the parallel restricted. > How about add a testcase like: > - > create or replace function fullname_parallel_restricted(f text, l text) > returns text as $$ > begin > return f || l; > end; > $$ language plpgsql immutable parallel restricted; > > create table names4(index int, first_name text, last_name text); > create index names4_fullname_idx on names4 > (fullname_parallel_restricted(first_name, last_name)); > > -- > -- Test INSERT with parallel-restricted index expression > -- (should create a parallel plan) > -- > explain (costs off) insert into names4 select * from names; > - > > Thanks, looks like that test case is missing, I'll add it as you suggest. > (3) > + /* Recursively check each partition ... */ > + pdesc = RelationGetPartitionDesc(rel); > + for (i = 0; i < pdesc->nparts; i++) > + { > + if > (rel_max_parallel_hazard_for_modify(pdesc->oids[i], > + > command_type, > + > context, > + > AccessShareLock)) > + { > + table_close(rel, lockmode); > + return true; > + } > + } > > It seems we do not have a testcase to test (some parallel unsafe > expression or.. in partition) > How about add one testcase to test parallel unsafe partition ? > > > OK, I have to create a more complex table to test those other potential parallel-safety issues of partitions (other than what was tested before the recursive call, or support functions and expressions in index keys), but since it's a recursive call invoking code that's already been tested, I would not anticipate any problems. Thanks, Greg Nancarrow Fujitsu Australia
The mysterious pg_proc.protrftypes
Hi, I'm trying to understand how pg_proc.protrftypes works. The documentation says "Data type OIDs for which to apply transforms.". For this column, there is no reference to any catalog table? I would guess it should be "(references pg_type.oid)", right? I tried to generate a value for this column to verify my hypothesis, but I struggle to find an example that produces a not null value here. I grepped the sources and found the "CREATE TRANSFORM FOR type_name" command, and found an extension using it named "bool_plperl" which I installed. I assumed this would cause a value, but no. Both of bool_plperl's two functions get null pg_proc.protrftypes values. I've tried running the full regression "make installcheck", but protrftypes doesn't seem to be covered: $ cd postgresql $ make installcheck ... === All 203 tests passed. === $ psql regression regression=# SELECT COUNT(*) FROM pg_proc WHERE protrftypes IS NOT NULL; count --- 0 (1 row) Can someone please show me how to generate a function with a not null pg_proc.protrftypes value? Many thanks. /Joel
Re: Heap's backwards scan scans the incorrect pages with heap_setscanlimits()
On Thu, 21 Jan 2021 at 13:16, David Rowley wrote: > Proposed patch attached. I ended up pushing a slightly revised version of this which kept the code the same as before when rs_numblocks had not been changed. I backpatched to 9.5 as it seemed low risk, and worth doing to prevent some head-scratching and a future bug report from any extension authors that make use of heap_setscanlimits() with backwards scans at some point in the future. David
RE: libpq debug log
First, some possibly major questions: (23) From: 'Alvaro Herrera' > Maybe we can create a new file specifically for this to avoid mixing > unrelated stuff in fe-misc.c -- say fe-trace.c where all this new > tracing stuff goes. What do you think about this suggestion? I think this is reasonable, including moving PQtrace/PQuntrace to the new file. OTOH, fe-misc.c also has modest reasons to keep containing them. One is that the file header comment is as follows, accepting miscellaneous stuff. Another is that although most functions in fe-misc.c are related to protocol data transmission and receipt, several functions at the end of the file are already not related. * fe-misc.c * * DESCRIPTION * miscellaneous useful functions (24) -void PQtrace(PGconn *conn, FILE *stream); +void PQtrace(PGconn *conn, FILE *stream, bits32 flags); + + flags contains flag bits describing the operating mode + of tracing. If (flags & TRACE_SUPPRESS_TIMESTAMPS) is + true, then timestamps are not printed with each message. + As I asked in the previous mail, I'm afraid we cannot change the signature of existing API functions. If we want these flag bits, we have to add something like PQtraceEx(), don't we? The flag name differs between the manual and the source code: +#define PQTRACE_SUPPRESS_TIMESTAMPS (1 << 0) P.S. Also, please note this: > Also, why don't you try running the regression tests with a temporary > modification to PQtrace() to output the trace to a file? The sole purpose is > to confirm that this patch doesn't make the test crash (core dump). Regards Takayuki Tsunakawa
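To make the compatibility point concrete, a sketch of what such an addition could look like. PQtraceEx is this thread's idea, not an existing libpq function; the flag name follows the patch's source code, and bits32 follows the quoted doc change:

/* existing API, unchanged */
extern void PQtrace(PGconn *conn, FILE *debug_port);
extern void PQuntrace(PGconn *conn);

/* hypothetical extended variant carrying the flag bits */
#define PQTRACE_SUPPRESS_TIMESTAMPS	(1 << 0)
extern void PQtraceEx(PGconn *conn, FILE *stream, bits32 flags);

/* usage: trace without timestamps, e.g. for stable regression-test output */
PQtraceEx(conn, trace_fp, PQTRACE_SUPPRESS_TIMESTAMPS);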
Re: The mysterious pg_proc.protrftypes
On Mon, Jan 25, 2021 at 8:05 AM, Joel Jacobson wrote: > Hi, > > I'm trying to understand how pg_proc.protrftypes works. > > The documentation says "Data type OIDs for which to apply transforms.". > For this column, there is no reference to any catalog table? > I would guess it should be "(references pg_type.oid)", right? > > I tried to generate a value for this column to verify my hypothesis, > but I struggle to find an example that produces a not null value here. > > I grepped the sources and found the "CREATE TRANSFORM FOR type_name" > command, > and found an extension using it named "bool_plperl" which I installed. > > I assumed this would cause a value, but no. > > Both of bool_plperl's two functions get null pg_proc.protrftypes values. > > I've tried running the full regression "make installcheck", > but protrftypes doesn't seem to be covered: > > $ cd postgresql > $ make installcheck > ... > === > All 203 tests passed. > === > $ psql regression > regression=# SELECT COUNT(*) FROM pg_proc WHERE protrftypes IS NOT NULL; > count > --- > 0 > (1 row) > > Can someone please show me how to generate a function with a not null > pg_proc.protrftypes value? > you should use the TRANSFORM clause in the CREATE FUNCTION statement https://www.postgresql.org/docs/current/sql-createfunction.html CREATE EXTENSION hstore_plperl CASCADE; CREATE FUNCTION test2() RETURNS hstore LANGUAGE plperl TRANSFORM FOR TYPE hstore AS $$ $val = {a => 1, b => 'boo', c => undef}; return $val; $$; Regards Pavel > Many thanks. > > /Joel > > > > > >
Re: Identify missing publications from publisher while create/alter subscription.
On Thu, Jan 21, 2021 at 10:21 PM Bharath Rupireddy wrote: > > On Thu, Jan 21, 2021 at 6:56 PM vignesh C wrote: > > > > Hi, > > > > Creating/altering subscription is successful when we specify a > > publication which does not exist in the publisher. I felt we should > > throw an error in this case, that will help the user to check if there > > is any typo in the create subscription command or to create the > > publication before the subscription is created. > > If the above analysis looks correct, then please find a patch that > > checks if the specified publications are present in the publisher and > > throws an error if any of the publications is missing in the > > publisher. > > Thoughts? > > I was having similar thoughts (while working on the logical > replication bug on alter publication...drop table behaviour) on why > create subscription succeeds without checking the publication > existence. I checked in documentation, to find if there's a strong > reason for that, but I couldn't. Maybe it's because of the principle > "first let users create subscriptions, later the publications can be > created on the publisher system", similar to this behaviour > "publications can be created without any tables attached to it > initially, later they can be added". > > Others may have better thoughts. > > If we do check publication existence for CREATE SUBSCRIPTION, I think > we should also do it for ALTER SUBSCRIPTION ... SET PUBLICATION. > > I wonder, why isn't dropping a publication from a list of publications > that are with subscription is not allowed? > > Some comments so far on the patch: > > 1) I see most of the code in the new function check_publications() and > existing fetch_table_list() is the same. Can we have a generic > function, with maybe a flag to separate out the logic specific for > checking publication and fetching table list from the publisher. I have moved the common code between check_publications and fetch_table_list into a common function, get_appended_publications_query. I felt the rest of the code is better off kept as it is. The attached patch has the changes for the same, and also the change to check that the publication exists during ALTER SUBSCRIPTION ... SET PUBLICATION. Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com From 6ef5a35d82065ec7e497ebe38e575afa45ad9469 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Thu, 21 Jan 2021 18:38:54 +0530 Subject: [PATCH v2] Identify missing publications from publisher while create/alter subscription. Creating/altering subscription is successful when we specify a publication which does not exist in the publisher. This patch checks if the specified publications are present in the publisher and throws an error if any of the publications is missing in the publisher. One of the Alter Subscription tests from regression was moved to a TAP test as the error message "/tmp/pg_regress-P6KQrj/.s.PGSQL.58080" varies from execution to execution. 
--- src/backend/commands/subscriptioncmds.c| 158 - src/test/regress/expected/subscription.out | 33 +++--- src/test/regress/sql/subscription.sql | 1 - src/test/subscription/t/100_bugs.pl| 78 +- 4 files changed, 224 insertions(+), 46 deletions(-) diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 082f785..59076b5 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -46,6 +46,7 @@ #include "utils/syscache.h" static List *fetch_table_list(WalReceiverConn *wrconn, List *publications); +static void check_publications(WalReceiverConn *wrconn, List *publications); /* * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands. @@ -490,6 +491,9 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel) PG_TRY(); { + /* Verify specified publications exists in the publisher. */ + check_publications(wrconn, publications); + /* * Set sync state based on if we were asked to do data copy or * not. @@ -557,7 +561,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel) } static void -AlterSubscription_refresh(Subscription *sub, bool copy_data) +AlterSubscription_refresh(Subscription *sub, bool copy_data, bool check_pub) { char *err; List *pubrel_names; @@ -576,6 +580,10 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data) ereport(ERROR, (errmsg("could not connect to the publisher: %s", err))); + /* Verify specified publications exists in the publisher. */ + if (check_pub) + check_publications(wrconn, sub->publications); + /* Get the table list from publisher. */ pubrel_names = fetch_table_list(wrconn, sub->publications); @@ -822,6 +830,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt) { bool copy_data; bool refresh; +char *err; +WalReceiverConn *wrconn; parse_subscription_options(stmt->options,
Re: The mysterious pg_proc.protrftypes
On Mon, Jan 25, 2021, at 08:14, Pavel Stehule wrote: >you should use the TRANSFORM clause in the CREATE FUNCTION statement Thanks, it worked, and as expected it references the pg_type.oid of the transformed type. The attached patch adds "(references pg_type.oid)" to the documentation for pg_proc.protrftypes. Suggested commit message: "Document the fact that pg_proc.protrftypes references pg_type.oid" /Joel pg-proc-protrftypes-references-pg-type-oid.patch Description: Binary data
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/23 13:40, Bharath Rupireddy wrote: On Fri, Jan 22, 2021 at 6:43 PM Fujii Masao wrote: Please review the v16 patch set further. Thanks! Will review that later. + /* +* For the given server, if we closed connection or it is still in +* use, then no need of scanning the cache further. We do this +* because the cache can not have multiple cache entries for a +* single foreign server. +*/ On second thought, ISTM that a single foreign server can have multiple cache entries. For example, CREATE ROLE foo1 SUPERUSER; CREATE ROLE foo2 SUPERUSER; CREATE EXTENSION postgres_fdw; CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw OPTIONS (port '5432'); CREATE USER MAPPING FOR foo1 SERVER loopback OPTIONS (user 'postgres'); CREATE USER MAPPING FOR foo2 SERVER loopback OPTIONS (user 'postgres'); CREATE TABLE t (i int); CREATE FOREIGN TABLE ft (i int) SERVER loopback OPTIONS (table_name 't'); SET SESSION AUTHORIZATION foo1; SELECT * FROM ft; SET SESSION AUTHORIZATION foo2; SELECT * FROM ft; Then you can see there are multiple open connections for the same server as follows. So we need to scan all the entries even when the serverid is specified. SELECT * FROM postgres_fdw_get_connections(); server_name | valid -------------+------- loopback | t loopback | t (2 rows) This is a great finding. Thanks a lot. I will remove hash_seq_term(&scan); in disconnect_cached_connections and add this as a test case for the postgres_fdw_get_connections function, just to show there can be multiple connections with a single server name. This means that a user (even a non-superuser) can disconnect a connection established by another user (a superuser) by using postgres_fdw_disconnect_all(). Is this really OK? Yeah, connections can be discarded by non-super users using postgres_fdw_disconnect_all and postgres_fdw_disconnect. Given the fact that a non-super user requires a password to access foreign tables [1], IMO a non-super user changing something related to a super user makes no sense at all. If that's okay, we can have a check in disconnect_cached_connections, something like below: Also, like pg_terminate_backend(), should we disallow a non-superuser from disconnecting connections established by another non-superuser if the requesting user is not a member of the other role? Or is that overkill, because the target to discard is just a connection and it can be established again if necessary? For now I'm thinking that it might be better to add the restriction like pg_terminate_backend() at first and relax that later if possible. But I'd like to hear more opinions about this. +static bool +disconnect_cached_connections(Oid serverid) +{ + HASH_SEQ_STATUS scan; + ConnCacheEntry *entry; + bool all = !OidIsValid(serverid); + bool result = false; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be superuser to discard open connections"))); + + if (!ConnectionHash) Having said that, it looks like dblink_disconnect doesn't perform superuser checks. Also, a non-superuser (set by SET ROLE or SET SESSION AUTHORIZATION) seems to be able to run SQL using a dblink connection established by a superuser. If we don't think that is a problem, we might not need to care about this issue even for postgres_fdw. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
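For reference, a pg_terminate_backend-style membership check could look roughly like the sketch below. "entry_owner" is a stand-in for however the owning role would be derived from the cached user mapping; that lookup is not shown here, and this is only an illustration of the restriction being discussed:

	if (!superuser() &&
		!has_privs_of_role(GetUserId(), entry_owner))
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("must be a member of the role whose connection is being discarded")));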
Re: About to add WAL write/fsync statistics to pg_stat_wal view
Hi, thanks for the reviews. I updated the attached patch. The summary of the changes is following. 1. fix document I followed another view's comments. 2. refactor issue_xlog_fsync() I removed "sync_called" variables, narrowed the "duration" scope and change the switch statement to if statement. 3. make wal-receiver report WAL statistics I add the code to collect the statistics for a written operation in XLogWalRcvWrite() and to report stats in WalReceiverMain(). Since WalReceiverMain() can loop fast, to avoid loading stats collector, I add "force" argument to the pgstat_send_wal function. If "force" is false, it can skip reporting until at least 500 msec since it last reported. WalReceiverMain() almost calls pgstat_send_wal() with "force" as false. Regards, -- Masahiro Ikeda NTT DATA CORPORATION--- v5-0001-Add-statistics-related-to-write-sync-wal-records.patch 2021-01-23 09:26:01.919248712 +0900 +++ v6-0001-Add-statistics-related-to-write-sync-wal-records.patch 2021-01-25 16:27:50.749429666 +0900 @@ -1,6 +1,6 @@ -From ee1b7d17391b9d9619f709afeacdd118973471d6 Mon Sep 17 00:00:00 2001 +From e9aad92097c5cff5565b67ce1a8ec6d7b4c8a4d9 Mon Sep 17 00:00:00 2001 From: Masahiro Ikeda -Date: Fri, 22 Jan 2021 21:38:31 +0900 +Date: Mon, 25 Jan 2021 16:26:04 +0900 Subject: [PATCH] Add statistics related to write/sync wal records. This patch adds following statistics to pg_stat_wal view @@ -24,20 +24,22 @@ (This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID) --- - doc/src/sgml/config.sgml | 21 +++ - doc/src/sgml/monitoring.sgml | 48 ++- - src/backend/access/transam/xlog.c | 59 ++- + doc/src/sgml/config.sgml | 21 + doc/src/sgml/monitoring.sgml | 48 - + src/backend/access/transam/xlog.c | 51 ++- src/backend/catalog/system_views.sql | 4 ++ - src/backend/postmaster/pgstat.c | 4 ++ - src/backend/postmaster/walwriter.c| 3 + - src/backend/utils/adt/pgstatfuncs.c | 24 +++- - src/backend/utils/misc/guc.c | 9 +++ + src/backend/postmaster/checkpointer.c | 2 +- + src/backend/postmaster/pgstat.c | 26 -- + src/backend/postmaster/walwriter.c| 3 ++ + src/backend/replication/walreceiver.c | 30 +++ + src/backend/utils/adt/pgstatfuncs.c | 24 +++-- + src/backend/utils/misc/guc.c | 9 src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 14 ++--- - src/include/pgstat.h | 8 +++ - src/test/regress/expected/rules.out | 6 +- - 13 files changed, 189 insertions(+), 13 deletions(-) + src/include/pgstat.h | 10 +++- + src/test/regress/expected/rules.out | 6 ++- + 15 files changed, 232 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 82864bbb24..43f3fbcaf8 100644 @@ -72,7 +74,7 @@ track_functions (enum) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml -index f05140dd42..36764dc64f 100644 +index f05140dd42..5a8fc4eb0c 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i @@ -80,7 +82,7 @@ - Number of times WAL data was written to disk because WAL buffers became full -+ Total number of times WAL data was written to disk because WAL buffers became full ++ Total number of WAL data written to disk because WAL buffers became full + + + @@ -89,7 +91,7 @@ + wal_write bigint + + -+ Total number of times WAL data was written to disk ++ Total number of WAL data written to disk + + + @@ -109,7 +111,7 @@ + wal_sync bigint + + -+ Total number of times WAL 
data was synced to disk ++ Total number of WAL data synced to disk + (if is open_datasync or + open_sync, this value is zero because WAL data is synced + when to write it). @@ -131,7 +133,7 @@ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c -index 470e113b33..b3890f97a2 100644 +index 470e113b33..1c4860bee7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -110,6 +110,7 @@ int CommitDelay = 0; /* precommit delay in microseconds */ @@ -142,16 +144,15 @@ #ifdef WAL_DEBUG bool XLOG_DEBUG = false; -@@ -2540,6 +2541,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) +@@ -2540,6 +2541,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) Size nbytes; Size nleft; int
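For context, the reporting throttle described in the mail can be sketched as follows. The 500 msec interval and the "force" flag follow the description above; the interval constant's name is invented for this illustration, and this is not the patch's exact code:

#define PGSTAT_WAL_REPORT_INTERVAL_MS	500

void
pgstat_send_wal(bool force)
{
	static TimestampTz last_report = 0;
	TimestampTz	now = GetCurrentTimestamp();

	/* Unless forced, skip if we reported within the last 500 msec. */
	if (!force &&
		!TimestampDifferenceExceeds(last_report, now,
									PGSTAT_WAL_REPORT_INTERVAL_MS))
		return;

	last_report = now;

	/* ... fill in and send the PgStat_MsgWal message as before ... */
}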
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 25, 2021 at 4:37 PM tsunakawa.ta...@fujitsu.com wrote: > > > > > (4) > > You could, but I'm not sure it would make the code easier to read, > > especially for those who don't know !isParallelWorker() means it's a > > parallel leader. > ... > > I really can't see a problem. PrepareParallelMode() is only needed > > prior to execution of a parallel plan, so it's not needed for "other > > call sites" using EnterParallelMode(). > My frank first impressions were (and are): > > * Why do we have to call a separate function for preparation despite the > actual entering follows immediately? We can do necessary preparation in the > entering function. > > * Those who read the parallel index build and parallel VACUUM code for the > first time might be startled at the missing PrepareParallelMode() call: "Oh, > EnterParallelMode() is called without preparation unlike the other site I saw > the other day. Isn't this a but?" > > > > Perhaps the name can be changed to disassociate it from generic > > EnterParallelMode() usage. So far, I've only thought of long names > > like: PrepareParallelModePlanExec(). > > Ideas? > > What PrepareParallelMode() handles is the XID and command ID, which are > managed by access/transam/ module and are not executor-specific. It's > natural (or at least not unnatural) that EnterParallelMode() prepares them, > because EnterParallelMode() is part of access/transam/. > > EnterParallelMode() is part of a generic interface for execution of a parallel operation, and EnterParallelMode() is called in several different places to enter parallel mode prior to execution of different parallel operations. At the moment it is assumed that EnterParallelMode() just essentially sets a flag to prohibit certain unsafe operations when doing the parallel operation. If I move PrepareParallelMode() into EnterParallelMode() then I need to pass in contextual information to distinguish who the caller is, and possibly extra information needed by that caller - and change the function call for each caller, and probably update the comments for each, and in other places, etc. etc. I think that it just complicates things doing this. The other callers of EnterParallelMode() are obviously currently doing their own "pre" parallel-mode code themselves, specific to whatever parallel operation they are doing - but nobody has thought it necessary to have to hook this code into EnterParallelMode(). I think the "PrepareParallelMode()" name can just be changed to something specific to plan execution, so nobody gets confused with a name like "PrepareParallelMode()", which as you point out sounds generic to all callers of EnterParallelMode(). Regards, Greg Nancarrow Fujitsu Australia
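Concretely, the flow being debated would look something like the sketch below. PrepareParallelModePlanExec is only a candidate name from this thread, not an existing function:

	/* leader only: plan-execution-specific setup (XID, command ID) */
	if (!IsParallelWorker())
		PrepareParallelModePlanExec();

	/* generic: just flips the parallel-mode flag, as today */
	EnterParallelMode();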
[PATCH] Document the fact that pg_proc.protrftypes references pg_type.oid
Hi, This patch fixes $SUBJECT. /Joel pg-proc-protrftypes-references-pg-type-oid.patch Description: Binary data