Re: shared-memory based stats collector

2020-10-08 Thread Kyotaro Horiguchi
At Tue, 06 Oct 2020 10:06:44 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> The previous version failed to flush local database stats for certain
> condition. That behavior caused useless retries and finally a forced
> flush that leads to contention. I fixed that and will measure
> performance with this version.

I (we) got some performance numbers.

- Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
- Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.

Those showed a speed of over 400,000 TPS at maximum, and no significant
difference is seen between patched and unpatched over the whole range of
the test. I tried 5 seconds as PGSTAT_MIN_INTERVAL (10s in the patch)
but that made no difference.

- Fetching 1 tuple from 1 table from 800 clients.

No graph is attached for this case, but the test shows a speed of over 42
TPS with or without the v39 patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Dilip Kumar
On Fri, Oct 2, 2020 at 12:26 PM Keisuke Kuroda
 wrote:
>
> Hi Dilip, Amit,
>
> > > 5. Can you please once repeat the performance test done by Keisuke-San
> > > to see if you have similar observations? Additionally, see if you are
> > > also seeing the inconsistency related to the Truncate message reported
> > > by him and if so why?
> > >
> >
> > Okay, I will set up and do this test early next week.  Keisuke-San,
> > can you send me your complete test script?
>
> Yes, I've attached a test script(test_pg_recvlogical.sh)
>
> Sorry, the issue with TRUNCATE not being output was due to a build
> mistake on my part.
> Even before the patch, TRUNCATE is decoded and output correctly.
> So, please check the performance only.
>
> I have tested it again and will share the results with you.
>
> Also, the argument of palloc was still MemoryContextAlloc,
> which prevented me from applying the patch, so I've only fixed that part.
>
> # test script
>
> Please set PGHOME and CLUSTER_PUB before run.
>
> sh test_pg_recvlogical.sh
>
> # perf command
>
> perf record --call-graph dwarf -p [walsender pid]
> perf report -i perf.data --no-children
>
> # before patch
>
> decode + invalidation = 222s
>
> 2020-10-02 14:55:50 BEGIN 509
> 2020-10-02 14:59:42 table nsp_001.tbl_001, nsp_001.part_0001 ...
> nsp_001.part_0999, nsp_001.part_1000: TRUNCATE: (no-flags)
> 2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09)

I could not see this issue; it is taking less than 1s even without the
patch.  See the results below:

2020-10-08 13:00:49 BEGIN 509
2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:...
2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30)

Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




dynamic result sets support in extended query protocol

2020-10-08 Thread Peter Eisentraut
I want to progress work on stored procedures returning multiple result 
sets.  Examples of how this could work on the SQL side have previously 
been shown [0].  We also have ongoing work to make psql show multiple 
result sets [1].  This appears to work fine in the simple query
protocol.  But the extended query protocol doesn't support multiple
result sets at the moment [2].  Supporting that would be desirable in
order to be able to use parameter binding, and also because one of the
higher-level goals is to support the use case of stored procedures
returning multiple result sets via JDBC.


[0]: 
https://www.postgresql.org/message-id/flat/4580ff7b-d610-eaeb-e06f-4d686896b93b%402ndquadrant.com

[1]: https://commitfest.postgresql.org/29/2096/
[2]: https://www.postgresql.org/message-id/9507.1534370765%40sss.pgh.pa.us

(Terminology: I'm calling this project "dynamic result sets", which 
includes several concepts: 1) multiple result sets, 2) those result sets 
can have different structures, 3) the structure of the result sets is 
decided at run time, not declared in the schema/procedure definition/etc.)


One possibility I rejected was to invent a third query protocol besides
the simple and extended ones.  This wouldn't really match the
requirements of JDBC and similar APIs, because the APIs for sending
queries don't indicate whether dynamic result sets are expected or
required; you only indicate that later by how you process the result
sets.  So we really need to use the existing ways of sending off the
queries.  Also, avoiding a third query protocol is probably desirable in
general to avoid extra code and APIs.


So here is my sketch on how this functionality could be woven into the 
extended query protocol.  I'll go through how the existing protocol 
exchange works and then point out the additions that I have in mind.


These additions could be enabled by a _pq_ startup parameter sent by the 
client.  Alternatively, it might also work without that because the 
client would just reject protocol messages it doesn't understand, but 
that's probably less desirable behavior.


So here is how it goes:

C: Parse
S: ParseComplete

At this point, the server would know whether the statement it has parsed 
can produce dynamic result sets.  For a stored procedure, this would be 
declared with the procedure definition, so when the CALL statement is 
parsed, this can be noticed.  I don't actually plan any other cases, but 
for the sake of discussion, perhaps some variant of EXPLAIN could also 
return multiple result sets, and that could also be detected from 
parsing the EXPLAIN invocation.


At this point a client would usually do

C: Describe (statement)
S: ParameterDescription
S: RowDescription

New would be that the server would now also respond with a new message, say,

S: DynamicResultInfo

that indicates that dynamic result sets will follow later.  The message 
would otherwise be empty.  (We could perhaps include the number of 
result sets, but this might not actually be useful, and perhaps it's 
better not to spend effort on counting things that don't need to be 
counted.)


(If we don't guard this by a _pq_ startup parameter from the client, an 
old client would now error out because of an unexpected protocol message.)


Now the normal bind and execute sequence follows:

C: Bind
S: BindComplete
(C: Describe (portal))
(S: RowDescription)
C: Execute
S: ... (DataRows)
S: CommandComplete

In the case of a CALL with output parameters, this "primary" result set 
contains one row with the output parameters (existing behavior).


Now, if the client has seen DynamicResultInfo earlier, it should now go 
into a new subsequence to get the remaining result sets, like this 
(naming obviously to be refined):


C: NextResult
S: NextResultReady
C: Describe (portal)
S: RowDescription
C: Execute

S: CommandComplete
C: NextResult
...
C: NextResult
S: NoNextResult
C: Sync
S: ReadyForQuery

I think this would all have to use the unnamed portal, but perhaps there 
could be other uses with named portals.  Some details to be worked out.
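
To make the client-side handling more concrete, here is a rough sketch
of the new subsequence only.  The helpers send_msg(), recv_msg() and
consume_result_set() and the MSG_* constants are made up purely for
illustration (they are not libpq API); the message names follow the
proposal above.

    /*
     * Illustrative sketch only: send_msg/recv_msg/consume_result_set and
     * the MSG_* constants are hypothetical helpers, not libpq functions.
     */
    if (have_dynamic_results)   /* set when DynamicResultInfo was seen */
    {
        for (;;)
        {
            send_msg(conn, MSG_NEXT_RESULT);

            if (recv_msg(conn)->type == MSG_NO_NEXT_RESULT)
                break;          /* no more dynamic result sets */

            /* NextResultReady: next result set is in the unnamed portal */
            send_msg(conn, MSG_DESCRIBE_PORTAL);
            send_msg(conn, MSG_EXECUTE);

            /* read RowDescription, DataRows, CommandComplete as usual */
            consume_result_set(conn);
        }
    }

    send_msg(conn, MSG_SYNC);   /* server answers with ReadyForQuery */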


One could perhaps also do without the DynamicResultInfo message and just 
put extra information into the CommandComplete message indicating "there 
are more result sets after this one".


(Following the model from the simple query protocol, CommandComplete 
really means one result set complete, not the whole top-level command. 
ReadyForQuery means the whole command is complete.  This is perhaps 
debatable, and interesting questions could also arise when considering 
what should happen in the simple query protocol when a query string 
consists of multiple commands each returning multiple result sets.  But 
it doesn't really seem sensible to cater to that.)


One thing that's missing in this sequence is a way to specify the 
desired output format (text/binary) for each result set.  This could be 
added to the NextResult message, but at that point the client doesn't 
yet know the number of columns in the

Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-08 Thread Greg Nancarrow
On Wed, Oct 7, 2020 at 7:25 PM Greg Nancarrow  wrote:
>
> On Wed, Oct 7, 2020 at 12:40 AM Bharath Rupireddy
>  wrote:
> > In parallel, we are not doing anything(due to the same reason
> > explained in above comment) to find whether there is a foreign
> > partition or not while deciding to go with parallel/non-parallel copy,
> > we are just throwing an error during the first tuple insertion into
> > the partition.
> >
> > errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD
> > OF triggers, or if the partition is foreign partition"),
> > errhint("Try COPY without PARALLEL option")));
> >
>
> I'm wondering whether code similar to the following can safely be used
> to detect a foreign partition:
>
> if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> {
>     int i;
>     PartitionDesc pd = RelationGetPartitionDesc(rel);
>     for (i = 0; i < pd->nparts; i++)
>     {
>         if (get_rel_relkind(pd->oids[i]) == RELKIND_FOREIGN_TABLE)
>         {
>             table_close(rel, NoLock);
>             return false;
>         }
>     }
> }
>

Actually, the addition of this kind of check is still not good enough.
Partitions can have their own constraints, triggers, column default
expressions etc., and a partition itself can be partitioned.
I've written code to recursively walk the partitions and do all the
various checks for parallel-insert-safety as before, but it's doing a
fair bit of work.
Any other ideas for dealing with this? It seems it can't be avoided if
you want to support partitioned tables and partitions.
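
For what it's worth, below is one possible shape of such a recursive
check, just as a sketch: the function name is made up, the locking
strategy is not thought through, and the additional trigger/constraint/
default-expression checks mentioned above are omitted.

/* Sketch only: returns false if any partition, at any level, is foreign. */
static bool
partitions_are_parallel_safe(Relation rel)
{
    PartitionDesc pd;
    int         i;

    if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
        return true;

    pd = RelationGetPartitionDesc(rel);
    for (i = 0; i < pd->nparts; i++)
    {
        char        relkind = get_rel_relkind(pd->oids[i]);

        if (relkind == RELKIND_FOREIGN_TABLE)
            return false;

        if (relkind == RELKIND_PARTITIONED_TABLE)
        {
            Relation    part = table_open(pd->oids[i], AccessShareLock);
            bool        safe = partitions_are_parallel_safe(part);

            table_close(part, AccessShareLock);
            if (!safe)
                return false;
        }
    }

    return true;
}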

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-08 Thread Bharath Rupireddy
On Thu, Oct 8, 2020 at 1:42 PM Greg Nancarrow  wrote:
>
> On Wed, Oct 7, 2020 at 7:25 PM Greg Nancarrow  wrote:
> >
> > On Wed, Oct 7, 2020 at 12:40 AM Bharath Rupireddy
> >  wrote:
> > > In parallel, we are not doing anything(due to the same reason
> > > explained in above comment) to find whether there is a foreign
> > > partition or not while deciding to go with parallel/non-parallel copy,
> > > we are just throwing an error during the first tuple insertion into
> > > the partition.
> > >
> > > errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD
> > > OF triggers, or if the partition is foreign partition"),
> > > errhint("Try COPY without PARALLEL option")));
> > >
> >
> > I'm wondering whether code similar to the following can safely be used
> > to detect a foreign partition:
> >
> > if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> > {
> >     int i;
> >     PartitionDesc pd = RelationGetPartitionDesc(rel);
> >     for (i = 0; i < pd->nparts; i++)
> >     {
> >         if (get_rel_relkind(pd->oids[i]) == RELKIND_FOREIGN_TABLE)
> >         {
> >             table_close(rel, NoLock);
> >             return false;
> >         }
> >     }
> > }
> >
>
> Actually, the addition of this kind of check is still not good enough.
> Partitions can have their own constraints, triggers, column default
> expressions etc. and a partition itself can be partitioned.
> I've written code to recursively walk the partitions and do all the
> various checks for parallel-insert-safety as before, but it's doing a
> fair bit of work.
> Any other idea of dealing with this? Seems it can't be avoided if you
> want to support partitioned tables and partitions.
>

IMHO, it's better not to do all of this recursive checking right now,
as it may complicate the code or restrict the performance gain.
Having said that, in the future we may have to do something about it.

Others may have better opinions on this point.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: dynamic result sets support in extended query protocol

2020-10-08 Thread Tatsuo Ishii
Are you proposing to bump up the protocol version (either major or
minor)?  I am asking because it seems you are going to introduce some
new message types.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> I want to progress work on stored procedures returning multiple result
> sets.  Examples of how this could work on the SQL side have previously
> been shown [0].  We also have ongoing work to make psql show multiple
> result sets [1].  This appears to work fine in the simple query
> protocol.  But the extended query protocol doesn't support multiple
> result sets at the moment [2].  This would be desirable to be able to
> use parameter binding, and also since one of the higher-level goals
> would be to support the use case of stored procedures returning
> multiple result sets via JDBC.
> 
> [0]:
> https://www.postgresql.org/message-id/flat/4580ff7b-d610-eaeb-e06f-4d686896b93b%402ndquadrant.com
> [1]: https://commitfest.postgresql.org/29/2096/
> [2]:
> https://www.postgresql.org/message-id/9507.1534370765%40sss.pgh.pa.us
> 
> (Terminology: I'm calling this project "dynamic result sets", which
> includes several concepts: 1) multiple result sets, 2) those result
> sets can have different structures, 3) the structure of the result
> sets is decided at run time, not declared in the schema/procedure
> definition/etc.)
> 
> One possibility I rejected was to invent a third query protocol beside
> the simple and extended one.  This wouldn't really match with the
> requirements of JDBC and similar APIs because the APIs for sending
> queries don't indicate whether dynamic result sets are expected or
> required, you only indicate that later by how you process the result
> sets.  So we really need to use the existing ways of sending off the
> queries.  Also, avoiding a third query protocol is probably desirable
> in general to avoid extra code and APIs.
> 
> So here is my sketch on how this functionality could be woven into the
> extended query protocol.  I'll go through how the existing protocol
> exchange works and then point out the additions that I have in mind.
> 
> These additions could be enabled by a _pq_ startup parameter sent by
> the client.  Alternatively, it might also work without that because
> the client would just reject protocol messages it doesn't understand,
> but that's probably less desirable behavior.
> 
> So here is how it goes:
> 
> C: Parse
> S: ParseComplete
> 
> At this point, the server would know whether the statement it has
> parsed can produce dynamic result sets.  For a stored procedure, this
> would be declared with the procedure definition, so when the CALL
> statement is parsed, this can be noticed.  I don't actually plan any
> other cases, but for the sake of discussion, perhaps some variant of
> EXPLAIN could also return multiple result sets, and that could also be
> detected from parsing the EXPLAIN invocation.
> 
> At this point a client would usually do
> 
> C: Describe (statement)
> S: ParameterDescription
> S: RowDescription
> 
> New would be that the server would now also respond with a new
> message, say,
> 
> S: DynamicResultInfo
> 
> that indicates that dynamic result sets will follow later.  The
> message would otherwise be empty.  (We could perhaps include the
> number of result sets, but this might not actually be useful, and
> perhaps it's better not to spent effort on counting things that don't
> need to be counted.)
> 
> (If we don't guard this by a _pq_ startup parameter from the client,
> an old client would now error out because of an unexpected protocol
> message.)
> 
> Now the normal bind and execute sequence follows:
> 
> C: Bind
> S: BindComplete
> (C: Describe (portal))
> (S: RowDescription)
> C: Execute
> S: ... (DataRows)
> S: CommandComplete
> 
> In the case of a CALL with output parameters, this "primary" result
> set contains one row with the output parameters (existing behavior).
> 
> Now, if the client has seen DynamicResultInfo earlier, it should now
> go into a new subsequence to get the remaining result sets, like this
> (naming obviously to be refined):
> 
> C: NextResult
> S: NextResultReady
> C: Describe (portal)
> S: RowDescription
> C: Execute
> 
> S: CommandComplete
> C: NextResult
> ...
> C: NextResult
> S: NoNextResult
> C: Sync
> S: ReadyForQuery
> 
> I think this would all have to use the unnamed portal, but perhaps
> there could be other uses with named portals.  Some details to be
> worked out.
> 
> One could perhaps also do without the DynamicResultInfo message and
> just put extra information into the CommandComplete message indicating
> "there are more result sets after this one".
> 
> (Following the model from the simple query protocol, CommandComplete
> really means one result set complete, not the whole top-level
> command. ReadyForQuery means the whole command is complete.  This is
> perhaps debatable, and interesting questions could also ari

Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-08 Thread Masahiko Sawada
On Thu, 8 Oct 2020 at 14:10, Amit Kapila  wrote:
>
> On Thu, Oct 8, 2020 at 7:46 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, 7 Oct 2020 at 17:52, Amit Kapila  wrote:
> > >
> > > On Wed, Oct 7, 2020 at 11:24 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Tue, 6 Oct 2020 at 17:56, Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Tue, Oct 6, 2020 at 9:34 AM Masahiko Sawada
> > > > >  wrote:
> > > > > >
> > > > > > Looking at pgstat_reset_replslot_counter() in the v8 patch, even if 
> > > > > > we
> > > > > > pass a physical slot name to pg_stat_reset_replication_slot() a
> > > > > > PgStat_MsgResetreplslotcounter is sent to the stats collector. I’m
> > > > > > okay with not raising an error but maybe we can have it not to send
> > > > > > the message in that case.
> > > > > >
> > > > >
> > > > > makes sense, so changed accordingly.
> > > > >
> > > >
> > > > Thank you for updating the patch!
> > > >
> > >
> > > Thanks, I will push the first one tomorrow unless I see more comments
> > > and test-case one later.
> >
> > I thought we could have a test case for the reset function, what do you 
> > think?
> >
>
> We can write one if we want, but there are a few things we need to do
> for that, like maybe a new function wait_for_spill_stats which will
> check if the counters have become zero. Then probably call a reset
> function, call a new wait function, and then again check the stats to
> ensure they are reset to 0.

Yes.

> We can't write any advanced test (meaning: reset the existing stats,
> perform some tests, and again check the stats) because the
> *slot_get_changes() functions can start from the previous WAL for which
> we have already covered the stats. We might write that if we can
> somehow track the WAL positions from the previous test. I am not sure
> if we want to go there.

Can we use pg_logical_slot_peek_changes() instead to decode the same
transactions multiple times?

Regards,

--
Masahiko Sawada    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Keisuke Kuroda
Hi Dilip,

> I could not see this issue even without the patch, it is taking less
> than 1s even without the patch.  See below results
>
> 2020-10-08 13:00:49 BEGIN 509
> 2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:...
> 2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30)
>
> Am I missing something?

Thanks for running the tests.
It is the TRUNCATE decoding that takes time.
INSERT decoding is fast, even before the patch is applied.

2020-10-02 14:55:48 BEGIN 508
2020-10-02 14:55:48 table nsp_001.part_0001: INSERT ...
2020-10-02 14:55:49 COMMIT 508 (at 2020-10-02 14:55:48.744019+09)

However, TRUNCATE decoding is slow
and takes 222s in my environment.

2020-10-02 14:55:50 BEGIN 509
2020-10-02 14:59:42 table nsp_001.tbl_001 ... nsp_001.part_1000:
TRUNCATE: (no-flags)
2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09)

The script waits 10 seconds after the INSERT finishes
before executing TRUNCATE, so please wait for it to run.

When TRUNCATE completes,
the walsender process should be at 100% CPU.

Best Regards,
-- 
Keisuke Kuroda
NTT Software Innovation Center
keisuke.kuroda.3...@gmail.com




Fix typos in reorderbuffer.c

2020-10-08 Thread Amit Kapila
@@ -1432,7 +1432,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  ReorderBufferCleanupTXN(rb, subtxn);
  }

- /* cleanup changes in the toplevel txn */
+ /* cleanup changes in the txn */
  dlist_foreach_modify(iter, &txn->changes)
  {
  ReorderBufferChange *change;
@@ -1533,7 +1533,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  ReorderBufferTruncateTXN(rb, subtxn);
  }

- /* cleanup changes in the toplevel txn */
+ /* cleanup changes in the txn */
  dlist_foreach_modify(iter, &txn->changes)
  {
  ReorderBufferChange *change;

Both the above functions are recursive and will clean the changes for
both the top-level transaction and subtransactions. So, I feel the
comments should be accordingly updated.

-- 
With Regards,
Amit Kapila.


v1-0001-Fix-typos-in-reorderbuffer.c.patch
Description: Binary data


Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Dilip Kumar
On Thu, Oct 8, 2020 at 2:05 PM Keisuke Kuroda
 wrote:
>
> Hi Dilip,
>
> > I could not see this issue even without the patch, it is taking less
> > than 1s even without the patch.  See below results
> >
> > 2020-10-08 13:00:49 BEGIN 509
> > 2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:...
> > 2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30)
> >
> > Am I missing something?
>
> Thanks for running the tests.
> It is the TRUNCATE decoding that takes time.
> INSERT decoding is fast, even before the patch is applied.
>
> 2020-10-02 14:55:48 BEGIN 508
> 2020-10-02 14:55:48 table nsp_001.part_0001: INSERT ...
> 2020-10-02 14:55:49 COMMIT 508 (at 2020-10-02 14:55:48.744019+09)
>
> However, TRUNCATE decode is slow
> and take 222s in my environment.
>
> 2020-10-02 14:55:50 BEGIN 509
> 2020-10-02 14:59:42 table nsp_001.tbl_001 ... ns p_001.part_1000:
> TRUNCATE: (no-flags)
> 2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09)
>
> This script will wait 10 seconds after INSERT exits
> before executing TRUNCATE, please wait for it to run.
>
> When TRUNCATE completes,
> the walsender process should be at 100% CPU.

Okay, thanks for the info, I will run again and see this.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: speed up unicode normalization quick check

2020-10-08 Thread John Naylor
On Thu, Oct 8, 2020 at 2:48 AM Michael Paquier  wrote:

> On Wed, Oct 07, 2020 at 03:18:44PM +0900, Michael Paquier wrote:
> I looked at this one again today, and applied it.  I looked at what
> MSVC compiler was able to do in terms of optimizationswith
> shift-and-add for multipliers, and it is by far not as good as gcc or
> clang, applying imul for basically all the primes we could use for the
> perfect hash generation.
>

Thanks for picking this up! As I recall, godbolt.org also showed MSVC
unable to do this optimization.


> > I have tested 0002 and 0003, that had better be merged together at the
> > end, and I can see performance improvements with MSVC and gcc similar
> > to what is being reported upthread, with 20~30% gains for simple
> > data sample using IS NFC/NFKC.  That's cool.
>
> For these two, I have merged both together and did some adjustments as
> per the attached.  Not many tweaks, mainly some more comments for the
> unicode header files as the number of structures generated gets
> higher.


Looks fine overall, but one minor nit: I'm curious why you made a separate
section in the pgindent exclusions. The style in that file seems to be one
comment per category.

--
John Naylor


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 1:55 PM Masahiko Sawada
 wrote:
>
> On Thu, 8 Oct 2020 at 14:10, Amit Kapila  wrote:
> >
> >
> > We can write if we want but there are few things we need to do for
> > that like maybe a new function like wait_for_spill_stats which will
> > check if the counters have become zero. Then probably call a reset
> > function, call a new wait function, and then again check stats to
> > ensure they are reset to 0.
>
> Yes.
>

I am not sure if it is worth it, but it is probably not a bad idea,
especially if we extend the existing tests based on your idea below.

> > We can't write any advanced test which means reset the existing stats
> > perform some tests and again check stats because *slot_get_changes()
> > function can start from the previous WAL for which we have covered the
> > stats. We might write that if we can somehow track the WAL positions
> > from the previous test. I am not sure if we want to go there.
>
> Can we use pg_logical_slot_peek_changes() instead to decode the same
> transactions multiple times?
>

I think this will do the trick. If we want to go there then I suggest
we have a separate regression test file in test_decoding named
decoding_stats, stats, or something like that. We can later add the
tests related to streaming stats in that file as well.

-- 
With Regards,
Amit Kapila.




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Dilip Kumar
On Thu, Oct 8, 2020 at 2:17 PM Dilip Kumar  wrote:
>
> On Thu, Oct 8, 2020 at 2:05 PM Keisuke Kuroda
>  wrote:
> >
> > Hi Dilip,
> >
> > > I could not see this issue even without the patch, it is taking less
> > > than 1s even without the patch.  See below results
> > >
> > > 2020-10-08 13:00:49 BEGIN 509
> > > 2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:...
> > > 2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30)
> > >
> > > Am I missing something?
> >
> > Thanks for running the tests.
> > It is the TRUNCATE decoding that takes time.
> > INSERT decoding is fast, even before the patch is applied.
> >
> > 2020-10-02 14:55:48 BEGIN 508
> > 2020-10-02 14:55:48 table nsp_001.part_0001: INSERT ...
> > 2020-10-02 14:55:49 COMMIT 508 (at 2020-10-02 14:55:48.744019+09)
> >
> > However, TRUNCATE decode is slow
> > and take 222s in my environment.
> >
> > 2020-10-02 14:55:50 BEGIN 509
> > 2020-10-02 14:59:42 table nsp_001.tbl_001 ... ns p_001.part_1000:
> > TRUNCATE: (no-flags)
> > 2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09)
> >
> > This script will wait 10 seconds after INSERT exits
> > before executing TRUNCATE, please wait for it to run.
> >
> > When TRUNCATE completes,
> > the walsender process should be at 100% CPU.
>
> Okay, thanks for the info, I will run again and see this.
>

Now I can see the truncate time reduced from ~5 mins to just 1 sec:

Before patch

2020-10-08 14:18:48 BEGIN 510
2020-10-08 14:23:02 COMMIT 510 (at 2020-10-08 14:18:48.88462+05:30)

truncate time: ~5mins

After patch :

2020-10-08 14:30:22 BEGIN 510
2020-10-08 14:30:22 COMMIT 510 (at 2020-10-08 14:30:22.766092+05:30)

truncate time: < 1s


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Simon Riggs
On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:

> > This script will wait 10 seconds after INSERT exits
> > before executing TRUNCATE, please wait for it to run.

Has this been tested with anything other than the one test case?

It would be good to know how the patch handles a transaction that
contains many aborted subtransactions that contain invals.

-- 
Simon Riggs    http://www.enterprisedb.com/




RE: Transactions involving multiple postgres foreign servers, take 2

2020-10-08 Thread tsunakawa.ta...@fujitsu.com
Sorry to be late in responding.  (My PC has been behaving strangely after
upgrading to Win10 2004.)

From: Masahiko Sawada 
> After more thoughts on Tsunakawa-san’s idea it seems to need the
> following conditions:
> 
> * At least postgres_fdw is viable to implement these APIs while
> guaranteeing not to happen any error.
> * A certain number of FDWs (or majority of FDWs) can do that in a
> similar way to postgres_fdw by using the guideline and probably
> postgres_fdw as a reference.
> 
> These are necessary for FDW implementors to implement APIs while
> following the guideline and for the core to trust them.
> 
> As far as postgres_fdw goes, what we need to do when committing a
> foreign transaction resolution is to get a connection from the
> connection cache or create and connect if not found, construct a SQL
> query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size
> buffer, send the query, and get the result. The possible place to
> raise an error is limited. In case of failures such as connection
> error FDW can return false to the core along with a flag indicating to
> ask the core retry. Then the core will retry to resolve foreign
> transactions after some sleep. OTOH if FDW sized up that there is no
> hope of resolving the foreign transaction, it also could return false
> to the core along with another flag indicating to remove the entry and
> not to retry. Also, the transaction resolution by FDW needs to be
> cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS().
> 
> Probably, as Tsunakawa-san also suggested, it’s not impossible to
> implement these APIs in postgres_fdw while guaranteeing not to happen
> any error, although not sure the code complexity. So I think the first
> condition may be true but not sure about the second assumption,
> particularly about the interruptible part.

Yeah, I expect the commit of the second phase should not be difficult for the 
FDW developer.
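
For concreteness, here is a rough sketch (using plain libpq) of the kind
of resolution step the quoted text describes.  The function name, the
retry out-parameter and the error handling are illustrative only, not the
API actually proposed by the patch, and proper quoting/escaping of the
GID is omitted.

#include <stdio.h>
#include <stdbool.h>
#include <libpq-fe.h>

/* Sketch only: issue COMMIT PREPARED on an existing connection. */
static bool
resolve_prepared_fdw_xact(PGconn *conn, const char *gid, bool *retry)
{
    char        sql[256];
    PGresult   *res;

    snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);

    res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        /* e.g. connection trouble: ask the core to retry later */
        *retry = true;
        PQclear(res);
        return false;
    }

    PQclear(res);
    return true;
}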

As for the cancellation during commit retry, I don't think we necessarily have 
to make the TM responsible for retrying the commits.  Many DBMSs have their own 
timeout functionality such as connection timeout, socket timeout, and statement 
timeout.  Users can set those parameters in the foreign server options based on 
how long the end user can wait.  That is, TM calls FDW's commit routine just 
once.

If the TM makes efforts to retry commits, the duration would be from a few 
seconds to 30 seconds.  Then, we can hold back the cancellation during that 
period.


> I thought we could support both ideas to get their pros; supporting
> Tsunakawa-san's idea and then my idea if necessary, and FDW can choose
> whether to ask the resolver process to perform 2nd phase of 2PC or
> not. But it's not a good idea in terms of complexity.

I don't feel the need for leaving the commit to the resolver during normal 
operation.


> It seems like if it failed to resolve, the backend would return an
> acknowledgment of COMMIT to the client and the resolver process
> resolves foreign prepared transactions in the background. So we can
> ensure that the distributed transaction is completed at the time when
> the client got an acknowledgment of COMMIT if 2nd phase of 2PC is
> successfully completed in the first attempts. OTOH, if it failed for
> whatever reason, there is no such guarantee. From an optimistic
> perspective, i.g., the failures are unlikely to happen, it will work
> well but IMO it’s not uncommon to fail to resolve foreign transactions
> due to network issue, especially in an unreliable network environment
> for example geo-distributed database. So I think it will end up
> requiring the client to check if preceding distributed transactions
> are completed or not in order to see the results of these
> transactions.

That issue exists with any method, doesn't it?


 Regards
Takayuki Tsunakawa



Re: [HACKERS] Custom compression methods

2020-10-08 Thread Dilip Kumar
On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar  wrote:
>
> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar  wrote:
> >
> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> >  wrote:
> > >
> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > > > wrote:
> > > >>
> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > > >> > wrote:
> > > >> >>
> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > > >> >> > wrote:
> > > >> >> >
> > > >> >> >Thanks, Tomas for your feedback.
> > > >> >> >
> > > >> >> >> 9) attcompression ...
> > > >> >> >>
> > > >> >> >> The main issue I see is what the patch does with attcompression. 
> > > >> >> >> Instead
> > > >> >> >> of just using it to store a the compression method, it's also 
> > > >> >> >> used to
> > > >> >> >> store the preserved compression methods. And using NameData to 
> > > >> >> >> store
> > > >> >> >> this seems wrong too - if we really want to store this info, the 
> > > >> >> >> correct
> > > >> >> >> way is either using text[] or inventing charvector or similar.
> > > >> >> >
> > > >> >> >The reason for using the NameData is to get it in the fixed part
> > > >> >> >of the data structure.
> > > >> >> >
> > > >> >>
> > > >> >> Why do we need that? It's possible to have varlena fields with 
> > > >> >> direct
> > > >> >> access (see pg_index.indkey for example).
> > > >> >
> > > >> >I see.  While making it NameData I was thinking whether we have an
> > > >> >option to direct access the varlena. Thanks for pointing me there. I
> > > >> >will change this.
> > > >> >
> > > >> > Adding NameData just to make
> > > >> >> it fixed-length means we're always adding 64B even if we just need a
> > > >> >> single byte, which means ~30% overhead for the 
> > > >> >> FormData_pg_attribute.
> > > >> >> That seems a bit unnecessary, and might be an issue with many 
> > > >> >> attributes
> > > >> >> (e.g. with many temp tables, etc.).
> > > >> >
> > > >> >You are right.  Even I did not like to keep 64B for this, so I will 
> > > >> >change it.
> > > >> >
> > > >> >>
> > > >> >> >> But to me this seems very much like a misuse of attcompression 
> > > >> >> >> to track
> > > >> >> >> dependencies on compression methods, necessary because we don't 
> > > >> >> >> have a
> > > >> >> >> separate catalog listing compression methods. If we had that, I 
> > > >> >> >> think we
> > > >> >> >> could simply add dependencies between attributes and that 
> > > >> >> >> catalog.
> > > >> >> >
> > > >> >> >Basically, up to this patch, we are having only built-in 
> > > >> >> >compression
> > > >> >> >methods and those can not be dropped so we don't need any 
> > > >> >> >dependency
> > > >> >> >at all.  We just want to know what is the current compression 
> > > >> >> >method
> > > >> >> >and what is the preserve compression methods supported for this
> > > >> >> >attribute.  Maybe we can do it better instead of using the NameData
> > > >> >> >but I don't think it makes sense to add a separate catalog?
> > > >> >> >
> > > >> >>
> > > >> >> Sure, I understand what the goal was - all I'm saying is that it 
> > > >> >> looks
> > > >> >> very much like a workaround needed because we don't have the 
> > > >> >> catalog.
> > > >> >>
> > > >> >> I don't quite understand how could we support custom compression 
> > > >> >> methods
> > > >> >> without listing them in some sort of catalog?
> > > >> >
> > > >> >Yeah for supporting custom compression we need some catalog.
> > > >> >
> > > >> >> >> Moreover, having the catalog would allow adding compression 
> > > >> >> >> methods
> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > > >> >> >> compression methods. Which seems like a strange limitation, 
> > > >> >> >> considering
> > > >> >> >> this thread is called "custom compression methods".
> > > >> >> >
> > > >> >> >I think I forgot to mention while submitting the previous patch 
> > > >> >> >that
> > > >> >> >the next patch I am planning to submit is, Support creating the 
> > > >> >> >custom
> > > >> >> >compression methods wherein we can use pg_am catalog to insert the 
> > > >> >> >new
> > > >> >> >compression method.  And for dependency handling, we can create an
> > > >> >> >attribute dependency on the pg_am row.  Basically, we will create 
> > > >> >> >the
> > > >> >> >attribute dependency on the current compression method AM as well 
> > > >> >> >as
> > > >> >> >on the preserved compression methods AM.   As part of this, we will
> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression 
> > > >> >> >field
> > > >> >> >will be converted to the oid_vector (first OID will be of the 
> > > >> >> >current
> > > >> >> >compression method, followed by the preserved compression method's
> > > >> >> >oids).
> > > >> >> >
> > > 

Re: Fix typos in reorderbuffer.c

2020-10-08 Thread Masahiko Sawada
On Thu, 8 Oct 2020 at 17:37, Amit Kapila  wrote:
>
> @@ -1432,7 +1432,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>   ReorderBufferCleanupTXN(rb, subtxn);
>   }
>
> - /* cleanup changes in the toplevel txn */
> + /* cleanup changes in the txn */
>   dlist_foreach_modify(iter, &txn->changes)
>   {
>   ReorderBufferChange *change;
> @@ -1533,7 +1533,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>   ReorderBufferTruncateTXN(rb, subtxn);
>   }
>
> - /* cleanup changes in the toplevel txn */
> + /* cleanup changes in the txn */
>   dlist_foreach_modify(iter, &txn->changes)
>   {
>   ReorderBufferChange *change;
>
> Both the above functions are recursive and will clean the changes for
> both the top-level transaction and subtransactions.

Right.

> So, I feel the
> comments should be accordingly updated.

+1 for this change.

Regards,

-- 
Masahiko Sawada    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-08 Thread k.jami...@fujitsu.com
On Thursday, October 8, 2020 3:38 PM, Tsunakawa-san wrote:

> Hi Kirk san,
Thank you for looking into my patches!

> (1)
> + * This returns an InvalidBlockNumber when smgr_cached_nblocks is not
> + * available and when not in recovery path.
> 
> + /*
> +  * We cannot believe the result from smgr_nblocks is always accurate
> +  * because lseek of buggy Linux kernels doesn't account for a recent
> +  * write.
> +  */
> + if (!InRecovery && result == InvalidBlockNumber)
> + return InvalidBlockNumber;
> +
> 
> These are unnecessary, because mdnblocks() never returns
> InvalidBlockNumber and consequently smgrnblocks() doesn't return
> InvalidBlockNumber.

Yes. Thanks for carefully looking into that. Removed.

> (2)
> +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *isCached)
> 
> I think it's better to make the argument name iscached so that camel case
> aligns with forknum, which is not forkNum.

This is a bit tricky because the surrounding code follows an inconsistent
coding style too, so I just followed the existing style as below and
retained the change.

extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber 
blocknum, char *buffer, bool skipFsync);

> (3)
> +  * This is caused by buggy Linux kernels that might not have
> accounted
> +  * the recent write.  If a fork's nblocks is invalid, exit loop.
> 
> "accounted for" is the right English?
> I think the second sentence should be described in terms of its meaning, not
> the program logic.  For example, something like "Give up the optimization if
> the block count of any fork cannot be trusted."

Fixed.

> Likewise, express the following part in semantics.
> 
> +  * Do explicit hashtable lookup if the total of nblocks of relation's
> forks
> +  * is not invalid and the nblocks to be invalidated is less than the

I revised it like below:
"Look up the buffer in the hashtable if the block size is known to 
 be accurate and the total blocks to be invalidated is below the
 full scan threshold.  Otherwise, give up the optimization."

> (4)
> + if (nForkBlocks[i] == InvalidBlockNumber)
> + {
> + nTotalBlocks = InvalidBlockNumber;
> + break;
> + }
> 
> Use isCached in if condition because smgrnblocks() doesn't return
> InvalidBlockNumber.

Fixed. if (!isCached)

> (5)
> + nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
> 
> should be
> 
> + nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);

Fixed.

> (6)
> + bufHdr->tag.blockNum >=
> firstDelBlock[j])
> + InvalidateBuffer(bufHdr);   /*
> releases spinlock */
> 
> The right side of >= should be cur_block.

Fixed.
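
For readers following the thread, here is a rough, hypothetical sketch of
the decision the review points above are about.  The bool *isCached
out-parameter of smgrnblocks() is the signature proposed by the 0002
patch (not what is in HEAD), and BUF_DROP_THRESHOLD is a placeholder name
for whatever full-scan threshold the patch settles on.

/*
 * Sketch only: return true when every fork's cached size can be trusted
 * and the total number of blocks to invalidate is below the threshold,
 * i.e. when the targeted (hash-table lookup) invalidation can be used.
 */
static bool
can_use_targeted_invalidation(SMgrRelation reln, ForkNumber *forkNum,
                              int nforks, BlockNumber *firstDelBlock,
                              BlockNumber *nForkBlocks)
{
    BlockNumber nBlocksToInvalidate = 0;
    int         i;

    for (i = 0; i < nforks; i++)
    {
        bool        isCached;

        /* isCached out-parameter as proposed by the 0002 patch */
        nForkBlocks[i] = smgrnblocks(reln, forkNum[i], &isCached);

        /* Give up the optimization if any fork's size cannot be trusted. */
        if (!isCached)
            return false;

        nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);
    }

    return nBlocksToInvalidate < BUF_DROP_THRESHOLD;
}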


Attached are the updated patches.
Thank you again for the reviews.

Regards,
Kirk Jamison


0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-08 Thread k.jami...@fujitsu.com
Hi, 
> Attached are the updated patches.

Sorry, there was an error in the 3rd patch, so attached is a rebased one.

Regards,
Kirk Jamison


0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch


Re: Asynchronous Append on postgres_fdw nodes.

2020-10-08 Thread Andrey V. Lepikhov

On 10/5/20 11:35 AM, Etsuro Fujita wrote:
Hi,
I found a small problem. If we have a mix of async and sync subplans,
we hit an assertion failure on a busy connection. Just for example:


PLAN

Nested Loop  (cost=100.00..174316.95 rows=975 width=8) (actual 
time=5.191..9.262 rows=9 loops=1)

   Join Filter: (frgn.a = l.a)
   Rows Removed by Join Filter: 8991
   ->  Append  (cost=0.00..257.20 rows=11890 width=4) (actual 
time=0.419..2.773 rows=1000 loops=1)

 Async subplans: 4
 ->  Async Foreign Scan on f_1 l_2  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.381..0.585 rows=211 loops=1)
 ->  Async Foreign Scan on f_2 l_3  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.005..0.206 rows=195 loops=1)
 ->  Async Foreign Scan on f_3 l_4  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.003..0.282 rows=187 loops=1)
 ->  Async Foreign Scan on f_4 l_5  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.003..0.316 rows=217 loops=1)
 ->  Seq Scan on l_0 l_1  (cost=0.00..2.90 rows=190 width=4) 
(actual time=0.017..0.057 rows=190 loops=1)
   ->  Materialize  (cost=100.00..170.94 rows=975 width=4) (actual 
time=0.001..0.002 rows=9 loops=1000)
 ->  Foreign Scan on frgn  (cost=100.00..166.06 rows=975 
width=4) (actual time=0.766..0.768 rows=9 loops=1)


See the reproduction script 'test1.sql' in the attachment. Here I force
reproduction of the problem by setting enable_hashjoin and
enable_mergejoin to off.


'asyncmix.patch' contains my solution to this problem.

--
regards,
Andrey Lepikhov
Postgres Professional


test1.sql
Description: application/sql
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 14824368cc..613d406982 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -455,7 +455,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 	  void *arg);
 static void create_cursor(ForeignScanState *node);
 static void request_more_data(ForeignScanState *node);
-static void fetch_received_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node, bool vacateconn);
 static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -1706,15 +1706,19 @@ postgresIterateForeignScan(ForeignScanState *node)
 		{
 			/*
 			 * finish the running query before sending the next command for
-			 * this node
+			 * this node.
+			 * When the plan contains both asynchronous subplans and non-async
+			 * subplans backend could request more data in async mode and want to
+			 * get data in sync mode by the same connection. Here it must wait
+			 * for async data before request another.
 			 */
-			if (!fsstate->s.commonstate->busy)
-vacate_connection((PgFdwState *)fsstate, false);
+			if (fsstate->s.commonstate->busy)
+vacate_connection(&fsstate->s, false);
 
 			request_more_data(node);
 
 			/* Fetch the result immediately. */
-			fetch_received_data(node);
+			fetch_received_data(node, false);
 		}
 		else if (!fsstate->s.commonstate->busy)
 		{
@@ -1749,7 +1753,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 			/* fetch the leader's data and enqueue it for the next request */
 			if (available)
 			{
-fetch_received_data(leader);
+fetch_received_data(leader, false);
 add_async_waiter(leader);
 			}
 		}
@@ -3729,7 +3733,7 @@ request_more_data(ForeignScanState *node)
  * Fetches received data and automatically send requests of the next waiter.
  */
 static void
-fetch_received_data(ForeignScanState *node)
+fetch_received_data(ForeignScanState *node, bool vacateconn)
 {
 	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
@@ -3817,7 +3821,8 @@ fetch_received_data(ForeignScanState *node)
 	waiter = move_to_next_waiter(node);
 
 	/* send the next request if any */
-	if (waiter)
+	if (waiter && (!vacateconn ||
+		GetPgFdwScanState(node)->s.conn != GetPgFdwScanState(waiter)->s.conn))
 		request_more_data(waiter);
 
 	MemoryContextSwitchTo(oldcontext);
@@ -3843,7 +3848,7 @@ vacate_connection(PgFdwState *fdwstate, bool clear_queue)
 	 * query
 	 */
 	leader = commonstate->leader;
-	fetch_received_data(leader);
+	fetch_received_data(leader, true);
 
 	/* let the first waiter be the next leader of this connection */
 	move_to_next_waiter(leader);


Wired if-statement in gen_partprune_steps_internal

2020-10-08 Thread Andy Fan
Hi:

  I found the following code in gen_partprune_steps_internal(), where
the if-statement looks to be always true since list_length(result) > 1.
I added an Assert(step_ids != NIL) and all the test cases passed.
If the if-statement is always true, shall we remove it to avoid confusion?


gen_partprune_steps_internal(GeneratePruningStepsContext *context,

    if (list_length(result) > 1)
    {
        List       *step_ids = NIL;

        foreach(lc, result)
        {
            PartitionPruneStep *step = lfirst(lc);

            step_ids = lappend_int(step_ids, step->step_id);
        }
        Assert(step_ids != NIL);
        if (step_ids != NIL)    // This should always be true.
        {
            PartitionPruneStep *step;

            step = gen_prune_step_combine(context, step_ids,
                                          PARTPRUNE_COMBINE_INTERSECT);
            result = lappend(result, step);
        }
    }


-- 
Best Regards
Andy Fan


Re: [Patch] ALTER SYSTEM READ ONLY

2020-10-08 Thread Amul Sul
On Wed, Oct 7, 2020 at 11:19 PM Robert Haas  wrote:
>
> On Wed, Sep 16, 2020 at 3:33 PM Robert Haas  wrote:
> > I don't mind a loop, but that one looks broken. We have to clear the
> > bit before we call the function that processes that type of barrier.
> > Otherwise, if we succeed in absorbing the barrier but a new instance
> > of the same barrier arrives meanwhile, we'll fail to realize that we
> > need to absorb the new one.
>
> Here's a new version of the patch for allowing errors in
> barrier-handling functions and/or rejection of barriers by those
> functions. I think this responds to all of the previous review
> comments from Andres. Also, here is an 0002 which is a handy bit of
> test code that I wrote. It's not for commit, but it is useful for
> finding bugs.
>
> In addition to improving 0001 based on the review comments, I also
> tried to write a better commit message for it, but it might still be
> possible to do better there. It's a bit hard to explain the idea in
> the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
> with an XID -- and possibly a bunch of sub-XIDs, and possibly while
> idle-in-transaction -- can elect to FATAL rather than absorbing the
> barrier. I suspect for other barrier types we might have certain
> (hopefully short) stretches of code where a barrier of a particular
> type can't be absorbed because we're in the middle of doing something
> that relies on the previous value of whatever state is protected by
> the barrier. Holding off interrupts in those stretches of code would
> prevent the barrier from being absorbed, but would also prevent query
> cancel, backend termination, and absorption of other barrier types, so
> it seems possible that just allowing the barrier-absorption function
> for a barrier of that type to just refuse the barrier until after the
> backend exits the critical section of code will work out better.
>
> Just for kicks, I tried running 'make installcheck-parallel' while
> emitting placeholder barriers every 0.05 s after altering the
> barrier-absorption function to always return false, just to see how
> ugly that was. In round figures, it made it take 24 s vs. 21 s, so
> it's actually not that bad. However, it all depends on how many times
> you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
> that the effect might be very non-uniform. That is, if you can get the
> code to be running a tight loop that does little real work but does
> CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
> barrier, it will probably suck. Therefore, I'm inclined to think that
> the fairly strong cautionary logic in the patch is reasonable, but
> perhaps it can be better worded somehow. Thoughts welcome.
>
> I have not rebased the remainder of the patch series over these two.
>
That I'll do.

On a quick look at the latest 0001 patch, the following hunk to reset leftover
flags seems to be unnecessary:

+ /*
+ * If some barrier types were not successfully absorbed, we will have
+ * to try again later.
+ */
+ if (!success)
+ {
+ ResetProcSignalBarrierBits(flags);
+ return;
+ }

When the ProcessBarrierPlaceholder() function returns false without an error,
that barrier flag gets reset within the while loop.  In the case where it
raises an error, the rest of the flags get reset in the catch block.
Correct me if I am missing something here.
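
For reference, a minimal sketch of the clear-before-absorb ordering being
discussed; AbsorbBarrier() is a hypothetical stand-in for the per-type
handler, and ResetProcSignalBarrierBits() is the function added by the
patch.

while (flags != 0)
{
    ProcSignalBarrierType type;
    uint32      bit;

    type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
    bit = ((uint32) 1) << (uint32) type;

    /* Clear the bit first, so a barrier arriving meanwhile is not lost. */
    flags &= ~bit;

    /* AbsorbBarrier (hypothetical name) may refuse the barrier ... */
    if (!AbsorbBarrier(type))
        ResetProcSignalBarrierBits(bit);    /* ... so remember to retry */
}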

Regards,
Amul




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-10-08 Thread Dmitry Dolgov
> On Thu, Oct 08, 2020 at 09:34:51AM +0800, Andy Fan wrote:
>
> > Other than that I wanted to ask what are the plans to proceed with this
> > patch? It's been a while since the question was raised in which format
> > to keep unique key expressions, and as far as I can see no detailed
> > suggestions or patch changes were proposed as a follow up. Obviously I
> > would love to see the first two preparation patches committed to avoid
> > dependencies between patches, and want to suggest an incremental
> > approach with simple format for start (what we have right now) with the
> > idea how to extend it in the future to cover more cases.
> >
>
> I think the hardest part of this series is commit 2,  it probably needs
> lots of
> dedicated time to review which would be the hardest part for the reviewers.
> I don't have a good suggestion, however.

Sure, and I would review the patch as well. But as far as I understand
the main issue is "how to store uniquekey expressions", and as long as
it is not decided, no additional review will move the patch forward I
guess.




Re: [HACKERS] Runtime Partition Pruning

2020-10-08 Thread Ashutosh Bapat
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan  wrote:
>
>
>
> On Wed, Oct 7, 2020 at 5:05 PM Andy Fan  wrote:
>>
>>
>>
>> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan  wrote:



 Now, in my experience, the current system for custom plans vs. generic
 plans doesn't approach the problem in this way at all, and in my
 experience that results in some pretty terrible behavior.  It will do
 things like form a custom plan every time because the estimated cost
 of the custom plan is lower than the estimated cost of the generic
 plan even though the two plans are structurally identical; only the
 estimates differ.  It will waste gobs of CPU cycles by replanning a
 primary key lookup 5 times just on the off chance that a lookup on the
 primary key index isn't the best option.  But this patch isn't going
 to fix any of that.  The best we can probably do is try to adjust the
 costing for Append paths in some way that reflects the costs and
 benefits of pruning.  I'm tentatively in favor of trying to do
 something modest in that area, but I don't have a detailed proposal.

>>>
>>> I just realized this issue recently and reported it at [1], then Amit 
>>> pointed
>>> me to this issue being discussed here, so I would like to continue this 
>>> topic
>>> here.
>>>
>>> I think we can split the issue into 2 issues.  One is the partition prune 
>>> in initial
>>> partition prune, which maybe happen in custom plan case only and caused
>>> the above issue.  The other one happens in the "Run-Time" partition prune,
>>> I admit that is an important issue to resolve as well, but looks harder.   
>>> So I
>>> think we can fix the first one at first.
>>>
>>> ... When we count for the cost of a
>>> generic plan, we can reduce the cost based on such information.
>>
>>
>> This way doesn't work since after the initial partition prune,  not only the
>> cost of the Append node should be reduced,  the cost of other plans should
>> be reduced as well [1]
>>
>> However I think if we can use partition prune information from a custom plan
>> at the cost_append_path stage,  it looks the issue can be fixed.  If so, the 
>> idea
>> is similar to David's idea in [2],  however Robert didn't agree with this[2].
>> Can anyone elaborate this objection?  for a partkey > $1 or BETWEEN cases,
>> some real results from the past are probably better than some hard-coded
>> assumptions IMO.
>
>
> I can understand Robert's idea now,  he intended to resolve both the
> "initial-partition-prune" case and "runtime partition prune" case while I 
> always think
> about the former case (Amit reminded me about that at the beginning, but I 
> just
> wake up right now..)
>
> With the Robert's idea,  I think we can do some hack at create_append_path,
> we can figure out the pruneinfo (it was done at create_append_plan now) at 
> that
> stage, and then did a fix_append_path with such pruneinfo.  We need to fix not
> only the cost of AppendPath,  but also rows of AppendPath, For example:
> SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> plan is:
> Nest Loop
>Nest Loop
>   t1
>   Append (p)
>t2
>
> Then the rows of Append (p) will be very important.
>
> For this idea,  how to estimate how many partitions/rows can be pruned is a 
> key
> part.  Robert said "I was thinking, rather, that if we know for example that
> we've doing pruning on partition_column = $1, then we know that only
> one partition will match.  That's probably a common case.  If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match.  partition_column BETWEEN $1 and $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.".   I think we can't say the 75% or 50% is a good
> number,  but the keypoint may be "partition_column = $1" is a common
> case.  And for the others case, we probably don't make it worse.
>
> I think we need to do something here, any thoughts? Personally I'm more
> like this idea now.

Yes, I think we have to do something along those lines. But I am
wondering why our stats machinery is failing to do that already. For
equality I think it's straightforward. But even for other operators
we should use the statistics from the partitioned table to estimate
the selectivity and then adjust the number of scanned partitions =
(number of partitions * fraction of rows scanned). That might be
better than 50% or 75%.
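
A rough sketch of that estimate, assuming partclauses holds the clauses
on the partition key (only to illustrate the idea, not a worked-out
proposal):

/* fraction of rows surviving the partition-key quals, from parent stats */
Selectivity sel = clauselist_selectivity(root, partclauses,
                                         rel->relid, JOIN_INNER, NULL);

/* at least one partition will be scanned */
double      est_scanned_parts = clamp_row_est(rel->nparts * sel);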

-- 
Best Wishes,
Ashutosh Bapat




Re: Asynchronous Append on postgres_fdw nodes.

2020-10-08 Thread Andrey Lepikhov

Hi,
I want to suggest one more improvement. Currently the
is_async_capable_path() routine allows only ForeignPath nodes as async
capable paths, but in some cases we can allow a SubqueryScanPath as async
capable too.


For example:
SELECT * FROM ((SELECT * FROM foreign_1)
UNION ALL
(SELECT a FROM foreign_2)) AS b;

is async capable, but:

SELECT * FROM ((SELECT * FROM foreign_1 LIMIT 10)
UNION ALL
(SELECT a FROM foreign_2 LIMIT 10)) AS b;

is not async capable.

The patch in attachment tries to improve this situation.

--
regards,
Andrey Lepikhov
Postgres Professional
From fa73b84e8c456c48ef4788304d2ed14f31365aac Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 8 Oct 2020 15:46:41 +0500
Subject: [PATCH] 2

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 3 ++-
 src/backend/optimizer/path/allpaths.c  | 4 
 src/backend/optimizer/plan/createplan.c| 2 +-
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index eca44a4f40..e5972574b6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2082,6 +2082,7 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
  Output: t1.c1, t2.c1
  Group Key: t1.c1, t2.c1
  ->  Append
+   Async subplans: 2 
->  Foreign Scan
  Output: t1.c1, t2.c1
  Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
@@ -2090,7 +2091,7 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
  Output: t1_1.c1, t2_1.c1
  Relations: (public.ft1 t1_1) INNER JOIN (public.ft2 t2_1)
  Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1"
-(20 rows)
+(21 rows)
 
 SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) UNION SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) AS t (t1c1, t2c1) GROUP BY t1c1 ORDER BY t1c1 OFFSET 100 LIMIT 10;
  t1c1 | avg  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 17e9a7a897..5822ba83e0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3991,6 +3991,10 @@ is_async_capable_path(Path *path)
 	fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
 	return true;
 			}
+			break;
+		case T_SubqueryScanPath:
+			if (is_async_capable_path(((SubqueryScanPath *) path)->subpath))
+return true;
 		default:
 			break;
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 3ae46ed6f1..efb1b0cb4e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1231,7 +1231,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
 		 * Classify as async-capable or not. If we have decided to run the
 		 * children in parallel, we cannot any one of them run asynchronously.
 		 * Planner thinks that all subnodes are executed in order if this
-		 * append is orderd. No subpaths cannot be run asynchronously in that
+		 * append is ordered. No subpaths cannot be run asynchronously in that
 		 * case.
 		 */
 		if (pathkeys == NIL &&
-- 
2.17.1



Re: speed up unicode normalization quick check

2020-10-08 Thread Michael Paquier
On Thu, Oct 08, 2020 at 04:52:18AM -0400, John Naylor wrote:
> Looks fine overall, but one minor nit: I'm curious why you made a separate
> section in the pgindent exclusions. The style in that file seems to be one
> comment per category.

Both parts indeed use PerfectHash.pm, but are generated by different
scripts so that looked better as separate items.
--
Michael


signature.asc
Description: PGP signature


Re: partition routing layering in nodeModifyTable.c

2020-10-08 Thread Amit Langote
On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas  wrote:
> On 07/10/2020 12:50, Amit Langote wrote:
> > On Tue, Oct 6, 2020 at 12:45 AM Heikki Linnakangas  wrote:
> >> It would be better to set it in make_modifytable(), just
> >> after calling PlanDirectModify().
> >
> > Actually, that's how it was done in earlier iterations but I think I
> > decided to move that into the FDW's functions due to some concern of
> > one of the other patches that depended on this patch.  Maybe it makes
> > sense to bring that back into make_modifytable() and worry about the
> > other patch later.
>
> On second thoughts, I take back my earlier comment. Setting it in
> make_modifytable() relies on the assumption that the subplan is a single
> ForeignScan node, on the target relation. The documentation for
> PlanDirectModify says:
>
> > To execute the direct modification on the remote server, this
> > function must rewrite the target subplan with a ForeignScan plan node
> > that executes the direct modification on the remote server.
>>
> So I guess that assumption is safe. But I'd like to have some wiggle
> room here. Wouldn't it be OK to have a Result node on top of the
> ForeignScan, for example? If it really must be a simple ForeignScan
> node, the PlanDirectModify API seems pretty strange.
>
> I'm not entirely sure what I would like to do with this now. I could
> live with either version, but I'm not totally happy with either. (I like
> your suggestion below)

Assuming you mean the idea of using RT index to access ResultRelInfos
in es_result_relations, we would still need to store the index in the
ForeignScan node, so the question of whether to do it in
make_modifytable() or in PlanDirectModify() must still be answered.

> Looking at this block in postgresBeginDirectModify:
>
> >   /*
> >* Identify which user to do the remote access as.  This should match 
> > what
> >* ExecCheckRTEPerms() does.
> >*/
> >   Assert(fsplan->resultRelIndex >= 0);
> >   dmstate->resultRelIndex = fsplan->resultRelIndex;
> >   rtindex = list_nth_int(resultRelations, fsplan->resultRelIndex);
> >   rte = exec_rt_fetch(rtindex, estate);
> >   userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
>
> That's a complicated way of finding out the target table's RTI. We
> should probably store the result RTI in the ForeignScan in the first place.
>
> >> Another idea is to merge "resultRelIndex" and a "range table index" into
> >> one value. Range table entries that are updated would have a
> >> ResultRelInfo, others would not. I'm not sure if that would end up being
> >> cleaner or messier than what we have now, but might be worth trying.
> >
> > I have thought about something like this before.  An idea I had is to
> > make es_result_relations array indexable by plain RT indexes, then we
> > don't need to maintain separate indexes that we do today for result
> > relations.
>
> That sounds like a good idea. es_result_relations is currently an array
> of ResultRelInfos, so that would leave a lot of unfilled structs in the
> array. But in one of your other threads, you proposed turning
> es_result_relations into an array of pointers anyway
> (https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=kphrdtfs...@mail.gmail.com).

Okay, I am reorganizing the patches around that idea and will post an
update soon.


--
Amit Langote
EDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-08 Thread Masahiko Sawada
On Thu, 8 Oct 2020 at 18:05, tsunakawa.ta...@fujitsu.com
 wrote:
>
> Sorry to be late to respond.  (My PC is behaving strangely after upgrading 
> Win10 2004)
>
> From: Masahiko Sawada 
> > After more thoughts on Tsunakawa-san’s idea it seems to need the
> > following conditions:
> >
> > * At least postgres_fdw is viable to implement these APIs while
> > guaranteeing not to happen any error.
> > * A certain number of FDWs (or majority of FDWs) can do that in a
> > similar way to postgres_fdw by using the guideline and probably
> > postgres_fdw as a reference.
> >
> > These are necessary for FDW implementors to implement APIs while
> > following the guideline and for the core to trust them.
> >
> > As far as postgres_fdw goes, what we need to do when committing a
> > foreign transaction resolution is to get a connection from the
> > connection cache or create and connect if not found, construct a SQL
> > query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size
> > buffer, send the query, and get the result. The possible place to
> > raise an error is limited. In case of failures such as connection
> > error FDW can return false to the core along with a flag indicating to
> > ask the core retry. Then the core will retry to resolve foreign
> > transactions after some sleep. OTOH if FDW sized up that there is no
> > hope of resolving the foreign transaction, it also could return false
> > to the core along with another flag indicating to remove the entry and
> > not to retry. Also, the transaction resolution by FDW needs to be
> > cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS().
> >
> > Probably, as Tsunakawa-san also suggested, it’s not impossible to
> > implement these APIs in postgres_fdw while guaranteeing not to happen
> > any error, although not sure the code complexity. So I think the first
> > condition may be true but not sure about the second assumption,
> > particularly about the interruptible part.
>
> Yeah, I expect the commit of the second phase should not be difficult for the 
> FDW developer.
>
> As for the cancellation during commit retry, I don't think we necessarily 
> have to make the TM responsible for retrying the commits.  Many DBMSs have 
> their own timeout functionality such as connection timeout, socket timeout, 
> and statement timeout.
> Users can set those parameters in the foreign server options based on how 
> long the end user can wait.  That is, TM calls FDW's commit routine just once.

What about temporary network failures? I think there are users who
don't want to give up resolving foreign transactions failed due to a
temporary network failure. Or even they might want to wait for
transaction completion until they send a cancel request. If we want to
call the commit routine only once and therefore want FDW to retry
connecting the foreign server within the call, it means we require all
FDW implementors to write a retry loop code that is interruptible and
ensures not to raise an error, which increases difficulty.
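
Just to illustrate that burden, a resolution routine in each FDW would end
up looking roughly like this (the helper names and the conn/fdwxact_id
variables are made up, only the shape matters): it has to loop, must never
raise an error, and still has to notice cancellation without being able to
use CHECK_FOR_INTERRUPTS():

    for (;;)
    {
        /* hypothetical: returns true once COMMIT/ROLLBACK PREPARED succeeded */
        if (TryResolveForeignPreparedXact(conn, fdwxact_id))
            return true;

        /* hypothetical: decide whether the failure looks transient */
        if (!failure_is_retryable(conn))
            return false;       /* ask the core to give up on this entry */

        /* cannot use CHECK_FOR_INTERRUPTS() here, so poll the flag directly */
        if (InterruptPending)
            return false;       /* ask the core to retry later */

        pg_usleep(1000L * 1000L);   /* back off before the next attempt */
    }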

Also, what if the user sets the statement timeout to 60 sec and they
want to cancel the waits after 5 sec by pressing ctl-C? You mentioned
that client libraries of other DBMSs don't have asynchronous execution
functionality. If the SQL execution function is not interruptible, the
user will end up waiting for 60 sec, which seems not good.

> If the TM makes efforts to retry commits, the duration would be from a few 
> seconds to 30 seconds.  Then, we can hold back the cancellation during that 
> period.
>
>
> > I thought we could support both ideas to get their pros; supporting
> > Tsunakawa-san's idea and then my idea if necessary, and FDW can choose
> > whether to ask the resolver process to perform 2nd phase of 2PC or
> > not. But it's not a good idea in terms of complexity.
>
> I don't feel the need for leaving the commit to the resolver during normal 
> operation.

I meant it's for FDWs that cannot guarantee that no error happens during
resolution.

>  seems like if failed to resolve, the backend would return an
> > acknowledgment of COMMIT to the client and the resolver process
> > resolves foreign prepared transactions in the background. So we can
> > ensure that the distributed transaction is completed at the time when
> > the client got an acknowledgment of COMMIT if 2nd phase of 2PC is
> > successfully completed in the first attempts. OTOH, if it failed for
> > whatever reason, there is no such guarantee. From an optimistic
> > perspective, i.g., the failures are unlikely to happen, it will work
> > well but IMO it’s not uncommon to fail to resolve foreign transactions
> > due to network issue, especially in an unreliable network environment
> > for example geo-distributed database. So I think it will end up
> > requiring the client to check if preceding distributed transactions
> > are completed or not in order to see the results of these
> > transactions.
>
> That issue exists with any method, doesn't it?

Yes, but if we don’t ret

Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 7:46 AM Masahiko Sawada
 wrote:
>
> On Wed, 7 Oct 2020 at 17:52, Amit Kapila  wrote:
> >
>
> > I think after we are done with this the next
> > step would be to finish the streaming stats work [1]. We probably need
> > to review and add the test case in that patch. If nobody else shows up
> > I will pick it up and complete it.
>
> +1
> I can review that patch.
>

I have rebased the stream stats patch and made minor modifications. I
haven't done a detailed review but one thing that I think is not
correct is:
@@ -3496,10 +3499,18 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  txn->snapshot_now = NULL;
  }

+
+ rb->streamCount += 1;
+ rb->streamBytes += txn->total_size;
+
+ /* Don't consider already streamed transaction. */
+ rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
  /* Process and send the changes to output plugin. */
  ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
  command_id, true);

I think we should update the stream stats after
ReorderBufferProcessTXN rather than before because any error in
ReorderBufferProcessTXN can lead to an unnecessary update of stats.
But OTOH, the txn flags and other data can be changed after
ReorderBufferProcessTXN, so we need to save them in a temporary
variable before calling the function.
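
Concretely, I am thinking of something like this (untested sketch based on
the hunk above):

 /* Remember the values that ReorderBufferProcessTXN may change. */
 bool   txn_is_streamed = rbtxn_is_streamed(txn);
 Size   txn_total_size = txn->total_size;

 /* Process and send the changes to output plugin. */
 ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
                         command_id, true);

 /* Only count the streaming once it actually succeeded. */
 rb->streamCount += 1;
 rb->streamBytes += txn_total_size;

 /* Don't consider an already streamed transaction. */
 rb->streamTxns += txn_is_streamed ? 0 : 1;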

-- 
With Regards,
Amit Kapila.


v2-0001-Track-streaming-stats.patch
Description: Binary data


Re: extensible options syntax for replication parser?

2020-10-08 Thread Robert Haas
On Wed, Jun 24, 2020 at 11:51 AM Robert Haas  wrote:
> Thanks for the review. v2 attached, hopefully fixing the compilation
> issue you mentioned.

Tushar Ahuja reported to me off-list that my basebackup refactoring
patch set was changing whether or not the following message appeared:

NOTICE:  WAL archiving is not enabled; you must ensure that all
required WAL segments are copied through other means to complete the
backup

That patch set includes this patch, and the reason for the behavior
difference turned out to be that I had gotten an if-test that is part
of this patch backwards. Here is v3, fixing that. It is a little
disappointing that this mistake didn't cause any existing regression
tests to fail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v3-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch
Description: Binary data


Re: Wrong example in the bloom documentation

2020-10-08 Thread Bruce Momjian
On Thu, Oct  8, 2020 at 06:34:32AM +, Daniel Westermann (DWE) wrote:
> Hi,
> 
> as this does not get any attention on the docs-list, once again here.
> Any thoughts on this?

I was hoping someone more experienced with this would comment, but
seeing none, I will apply it in a day or two to all supported versions? 
Have you tested this output back to 9.5?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread Bruce Momjian
On Thu, Oct  8, 2020 at 10:35:41AM +0900, Michael Paquier wrote:
> On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote:
> > Fix progname memory leak in ecpg client.
> > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo.
> 
> FWIW, I don't see much point in doing that.  For one, we have a
> more-or-less established rule that progname remains set until the
> application leaves, and there are much more places where we leak
> memory like that.  As one example, just see the various places where
> we use pg_strdup for option parsing.  At the end, it does not really
> matter as these are applications running for a short amount of time.

Agreed, but what does the TODO item mean then?

Fix small memory leaks in ecpg
Memory leaks in a short running application like ecpg are not really
a problem, but make debugging more complicated 

Should it be removed?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Probably typo in multixact.c

2020-10-08 Thread Bruce Momjian
On Thu, Oct  8, 2020 at 10:26:39AM +0900, Michael Paquier wrote:
> On Thu, Oct 08, 2020 at 01:15:35AM +, Hou, Zhijie wrote:
> > Hi
> > 
> > In multixact.c I found some comments like the following:
> > 
> > *   Similar to AtEOX_MultiXact but for COMMIT PREPARED
> > * Discard the local MultiXactId cache like in AtEOX_MultiXact
> > 
> > Since there's no function called "AtEOX_MultiXact" in the code,
> > I think the "AtEOX_MultiXact" may be a typo.
> > 
> > AtEOXact_MultiXact seems to be the right function here.
> 
> Yes, that looks like a simple typo to me as well.
> AtEOXact_MultiXact() shares portions of the logics in
> PostPrepare_MultiXact and multixact_twophase_postcommit.

FYI, this patch was applied.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread Michael Meskes
On Thu, 2020-10-08 at 12:27 -0400, Bruce Momjian wrote:
> On Thu, Oct  8, 2020 at 10:35:41AM +0900, Michael Paquier wrote:
> > On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote:
> > > Fix progname memory leak in ecpg client.
> > > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo.
> > 
> > FWIW, I don't see much point in doing that.  For one, we have a
> > more-or-less established rule that progname remains set until the
> > application leaves, and there are much more places where we leak
> > memory like that.  As one example, just see the various places
> > where
> > we use pg_strdup for option parsing.  At the end, it does not
> > really
> > matter as these are applications running for a short amount of
> > time.
> 
> Agreed, but what does the TODO item mean then?
> 
>   Fix small memory leaks in ecpg
>   Memory leaks in a short running application like ecpg are
> not really
>   a problem, but make debugging more complicated 
> 
> Should it be removed?

I'd say yes, let's remove it. Actually I wasn't even aware it's on
there. While I agree that it makes debugging of memory handling in ecpg
more difficult, I don't see much of a point in it.

Michael 
-- 
Michael Meskes
Michael at Fam-Meskes dot De
Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org





Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread Bruce Momjian
On Thu, Oct  8, 2020 at 06:58:14PM +0200, Michael Meskes wrote:
> On Thu, 2020-10-08 at 12:27 -0400, Bruce Momjian wrote:
> > On Thu, Oct  8, 2020 at 10:35:41AM +0900, Michael Paquier wrote:
> > > On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote:
> > > > Fix progname memory leak in ecpg client.
> > > > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo.
> > > 
> > > FWIW, I don't see much point in doing that.  For one, we have a
> > > more-or-less established rule that progname remains set until the
> > > application leaves, and there are much more places where we leak
> > > memory like that.  As one example, just see the various places
> > > where
> > > we use pg_strdup for option parsing.  At the end, it does not
> > > really
> > > matter as these are applications running for a short amount of
> > > time.
> > 
> > Agreed, but what does the TODO item mean then?
> > 
> > Fix small memory leaks in ecpg
> > Memory leaks in a short running application like ecpg are
> > not really
> > a problem, but make debugging more complicated 
> > 
> > Should it be removed?
> 
> I'd say yes, let's remove it. Actually I wasn't even aware it's on
> there. While I agree that it makes debugging of memory handling in ecpg
> more difficult, I don't see much of a point in it.

OK, TODO removed.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread John W Higgins
On Wed, Oct 7, 2020 at 6:35 PM Michael Paquier  wrote:

> On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote:
> > Fix progname memory leak in ecpg client.
> > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo.
>
> FWIW, I don't see much point in doing that.
>

I hope that everyone takes 10 seconds and realizes that this appears to be
this person's first submitted patch. I would think a little more respect
for the attempt to patch a minor issue would be afforded to such a person.
Seems technically sound and they are not trying to change the world with
their first attempt.

Maybe a reminder that the TODO list is not always spectacular and that a
double check with the list before doing something might be in order (in
fact adding that to the top of the TODO list might be a great option here
as well).

It's not going to win a Turing award - but I thought this project was a
little more friendly than what I've seen in this thread towards a first
time contributor.

John


Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread Bruce Momjian
On Thu, Oct  8, 2020 at 10:13:53AM -0700, John W Higgins wrote:
> 
> On Wed, Oct 7, 2020 at 6:35 PM Michael Paquier  wrote:
> 
> On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote:
> > Fix progname memory leak in ecpg client.
> > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo.
> 
> FWIW, I don't see much point in doing that.  
> 
> 
> I hope that everyone takes 10 seconds and realizes that this appears to be 
> this
> person's first submitted patch. I would think a little more respect for the
> attempt to patch a minor issue would be afforded to such a person. Seems
> technically sound and they are not trying to change the world with their first
> attempt. 
> 
> Maybe a reminder that the TODO list is not always spectacular and that a 
> double
> check with the list before doing something might be in order (in fact adding
> that to the top of the TODO list might be a great option here as well).
> 
> It's not going to win a Turing award - but I thought this project was a little
> more friendly than what I've seen in this thread towards a first time
> contributor.

You mean like the warning at the top of the TODO list?

https://wiki.postgresql.org/wiki/Todo#Development_Process

WARNING for Developers: Unfortunately this list does not contain all the
information necessary for someone to start coding a feature. Some of
these items might have become unnecessary since they were added ---
others might be desirable but the implementation might be unclear. When
selecting items listed below, be prepared to first discuss the value of
the feature. Do not assume that you can select one, code it and then
expect it to be committed. Always discuss design on Hackers list before
starting to code. The flow should be:

Desirability -> Design -> Implement -> Test -> Review -> Commit

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread Tom Lane
Michael Meskes  writes:
> On Thu, 2020-10-08 at 12:27 -0400, Bruce Momjian wrote:
>> Agreed, but what does the TODO item mean then?
>> 
>>  Fix small memory leaks in ecpg
>>  Memory leaks in a short running application like ecpg are not really
>>  a problem, but make debugging more complicated 
>> 
>> Should it be removed?

> I'd say yes, let's remove it. Actually I wasn't even aware it's on
> there. While I agree that it makes debugging of memory handling in ecpg
> more difficult, I don't see much of a point in it.

Agreed.  In theory there might be some point in removing leaks that happen
per-statement, so as to avoid unreasonable memory bloat when processing an
extremely long input file.  In practice nobody has complained about that,
and if somebody did I'd probably question the sanity of putting so much
code into one file.  (The C compiler would likely bloat even more while
processing the output...)  Moreover, given the way the ecpg grammar works,
it'd be really painful to avoid all such leaks, and it might well
introduce bugs not fix them.

In any case, if this TODO item is going to lead to ideas as dubious
as "let's free progname before exiting", it's not helpful.

regards, tom lane




Re: Wrong example in the bloom documentation

2020-10-08 Thread Daniel Westermann (DWE)
Hi Bruce,

>On Thu, Oct  8, 2020 at 06:34:32AM +, Daniel Westermann (DWE) wrote:
>> Hi,
>>
>> as this does not get any attention on the docs-list, once again here.
>> Any thoughts on this?

>I was hoping someone more experienced with this would comment, but
>seeing none, I will apply it in a day or two to all supported versions?
>Have you tested this output back to 9.5?

I hoped that as well. No, I tested down to 9.6 because the change happened in 
10.

Regards
Daniel


Re: Wrong example in the bloom documentation

2020-10-08 Thread Bruce Momjian
On Thu, Oct  8, 2020 at 06:12:47PM +, Daniel Westermann (DWE) wrote:
> Hi Bruce,
>  
> >On Thu, Oct  8, 2020 at 06:34:32AM +, Daniel Westermann (DWE) wrote:
> >> Hi,
> >>
> >> as this does not get any attention on the docs-list, once again here.
> >> Any thoughts on this?
> 
> >I was hoping someone more experienced with this would comment, but
> >seeing none, I will apply it in a day or two to all supported versions?
> >Have you tested this output back to 9.5?
> 
> I hoped that as well. No, I tested down to 9.6 because the change happened in
> 10.

OK, so the patch should be applied only to PG 10 and later?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Onder Kalaci
Hi,

I hit an assertion failure. When asserts are disabled, it works fine even with more
tables (>5000).

Steps to reproduce:

CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 
int, value_3 float, value_4 bigint);

250 relations work fine, see the query (too long to copy & paste here): 
https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41

--  when # relations >500, we hit the assertion (too long to copy & paste here):
See the query: 
https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L45


And, the backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff639fa2c2 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff63ab5bf1 libsystem_pthread.dylib`pthread_kill + 284
frame #2: 0x7fff639646a6 libsystem_c.dylib`abort + 127
frame #3: 0x000102180a02 
postgres`ExceptionalCondition(conditionName=, 
errorType=, fileName=, lineNumber=) at 
assert.c:67:2
frame #4: 0x000101ece9b2 
postgres`initial_cost_mergejoin(root=0x7ff0, 
workspace=0x7ffeedf5b528, jointype=JOIN_INNER, mergeclauses=, 
outer_path=0x00012ebf12d0, inner_path=0x4093d800, 
outersortkeys=0x, innersortkeys=0x00012ebf68e8, 
extra=0x7ffeedf5b6f8) at costsize.c:3043:2
frame #5: 0x000101eda01b 
postgres`try_mergejoin_path(root=0x000104a12618, 
joinrel=0x00012ebeede0, outer_path=0x00012ebf12d0, 
inner_path=0x0001283d00e8, pathkeys=0x00012ebf67e0, 
mergeclauses=0x00012ebf6890, outersortkeys=0x, 
innersortkeys=0x00012ebf68e8, jointype=JOIN_LEFT, extra=0x7ffeedf5b6f8, 
is_partial=) at joinpath.c:615:2
frame #6: 0x000101ed9426 
postgres`sort_inner_and_outer(root=0x000104a12618, 
joinrel=0x00012ebeede0, outerrel=, innerrel=, 
jointype=JOIN_LEFT, extra=0x7ffeedf5b6f8) at joinpath.c:1038:3
frame #7: 0x000101ed8f7a 
postgres`add_paths_to_joinrel(root=0x000104a12618, 
joinrel=0x00012ebeede0, outerrel=0x00012ebe7b48, 
innerrel=0x000127f146e0, jointype=, sjinfo=, 
restrictlist=0x00012ebf42b0) at joinpath.c:269:3
frame #8: 0x000101edbdc6 
postgres`populate_joinrel_with_paths(root=0x000104a12618, 
rel1=0x00012ebe7b48, rel2=0x000127f146e0, joinrel=0x00012ebeede0, 
sjinfo=0x00012809edc8, restrictlist=0x00012ebf42b0) at joinrels.c:824:4
frame #9: 0x000101edb57a 
postgres`make_join_rel(root=0x000104a12618, rel1=0x00012ebe7b48, 
rel2=0x000127f146e0) at joinrels.c:760:2
frame #10: 0x000101edb1ec 
postgres`make_rels_by_clause_joins(root=0x000104a12618, 
old_rel=0x00012ebe7b48, other_rels_list=, 
other_rels=) at joinrels.c:312:11
frame #11: 0x000101edada3 
postgres`join_search_one_level(root=0x000104a12618, level=2) at 
joinrels.c:123:4
frame #12: 0x000101ec7feb 
postgres`standard_join_search(root=0x000104a12618, levels_needed=8, 
initial_rels=0x00012ebf4078) at allpaths.c:3097:3
frame #13: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280a5618) at allpaths.c:2993:14
frame #14: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280ab320) at allpaths.c:2993:14
frame #15: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280b1028) at allpaths.c:2993:14
frame #16: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280b6d30) at allpaths.c:2993:14
frame #17: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280bca38) at allpaths.c:2993:14
frame #18: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280c2740) at allpaths.c:2993:14
frame #19: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280c8448) at allpaths.c:2993:14
frame #20: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280ce150) at allpaths.c:2993:14
frame #21: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280d3e58) at allpaths.c:2993:14
frame #22: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280d9b60) at allpaths.c:2993:14
frame #23: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280df868) at allpaths.c:2993:14
frame #24: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280e5570) at allpaths.c:2993:14
frame #25: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280eb278) at allpaths.c:2993:14

Expansion of our checks for connection-loss errors

2020-10-08 Thread Tom Lane
Over in the thread at [1], we've tentatively determined that the
reason buildfarm member lorikeet is currently failing is that its
network stack returns ECONNABORTED for (some?) connection failures,
whereas our code is only expecting ECONNRESET.  Fujii Masao therefore
proposes that we treat ECONNABORTED the same as ECONNRESET.  I think
this is a good idea, but after a bit of research I feel it does not
go far enough.  I find these POSIX-standard errnos that also seem
likely candidates to be returned for a hard loss of connection:

ECONNABORTED
EHOSTUNREACH
ENETDOWN
ENETUNREACH

All of these have been in POSIX since SUSv2, so it seems unlikely
that we need to #ifdef any of them.  (It is in any case pretty silly
that we have #ifdefs around a very small minority of our references
to ECONNRESET :-(.)

There are some other related errnos, such as ECONNREFUSED, that
don't seem like they'd be returned for a failure of a pre-existing
connection, so we don't need to include them in such tests.

Accordingly, I propose the attached patch (an expansion of
Fujii-san's) that causes us to test for all five errnos anyplace
we had been checking for ECONNRESET.  I felt that this was getting to
the point where we'd better centralize the knowledge of what to check,
so the patch does that, via an inline function and an admittedly hacky
macro.  I also upgraded some places such as strerror.c to have full
support for these symbols.

All of the machines I have (even as far back as HPUX 10.20) also
define ENETRESET and EHOSTDOWN.  However, those symbols do not appear
in SUSv2.  ENETRESET was added at some later point, but EHOSTDOWN is
still not in POSIX.  For the moment I've left these second-tier
symbols out of the patch, but there's a case for adding them.  I'm
not sure whether there'd be any point in trying to #ifdef them.

BTW, I took out the conditional defines of some of these errnos in
libpq's win32.h; AFAICS that's been dead code ever since we added
#define's for them to win32_port.h.  Am I missing something?

This seems like a bug fix to me, so I'm inclined to back-patch.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/E1kPc9v-0005L4-2l%40gemulon.postgresql.org

diff --git a/src/backend/port/win32/socket.c b/src/backend/port/win32/socket.c
index 6fbd1ed6fb..0f28c07ed1 100644
--- a/src/backend/port/win32/socket.c
+++ b/src/backend/port/win32/socket.c
@@ -123,10 +123,14 @@ TranslateSocketError(void)
 		case WSAEHOSTUNREACH:
 		case WSAEHOSTDOWN:
 		case WSAHOST_NOT_FOUND:
+			errno = EHOSTUNREACH;
+			break;
 		case WSAENETDOWN:
-		case WSAENETUNREACH:
 		case WSAENETRESET:
-			errno = EHOSTUNREACH;
+			errno = ENETDOWN;
+			break;
+		case WSAENETUNREACH:
+			errno = ENETUNREACH;
 			break;
 		case WSAENOTCONN:
 		case WSAESHUTDOWN:
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index d0b368530e..8937f223a8 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -712,9 +712,7 @@ errcode_for_socket_access(void)
 	{
 			/* Loss of connection */
 		case EPIPE:
-#ifdef ECONNRESET
-		case ECONNRESET:
-#endif
+		case ALL_CONNECTION_LOSS_ERRNOS:
 			edata->sqlerrcode = ERRCODE_CONNECTION_FAILURE;
 			break;
 
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index f0587f41e4..b8ab939179 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -1825,7 +1825,7 @@ piperead(int s, char *buf, int len)
 {
 	int			ret = recv(s, buf, len, 0);
 
-	if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
+	if (ret < 0 && errno_is_connection_loss(WSAGetLastError()))
 	{
 		/* EOF on the pipe! */
 		ret = 0;
diff --git a/src/include/port.h b/src/include/port.h
index 84bf2c363f..ffa21e782a 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -99,6 +99,30 @@ extern void pgfnames_cleanup(char **filenames);
 )
 #endif
 
+/* Test for all errnos that report loss of an established TCP connection */
+static inline bool
+errno_is_connection_loss(int e)
+{
+	if (e == ECONNRESET ||
+		e == ECONNABORTED ||
+		e == EHOSTUNREACH ||
+		e == ENETDOWN ||
+		e == ENETUNREACH)
+		return true;
+	return false;
+}
+
+/*
+ * To test for connection-loss errnos in a switch statement, write
+ * "case ALL_CONNECTION_LOSS_ERRNOS:".
+ */
+#define ALL_CONNECTION_LOSS_ERRNOS \
+	ECONNRESET: \
+	case ECONNABORTED: \
+	case EHOSTUNREACH: \
+	case ENETDOWN: \
+	case ENETUNREACH
+
 /* Portable locale initialization (in exec.c) */
 extern void set_pglocale_pgservice(const char *argv0, const char *app);
 
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 8b6576b23d..7ce8c87e8d 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -351,6 +351,10 @@ extern int	pgwin32_safestat(const char *path, struct stat *buf);
 #define EADDRNOTAVAIL WSAEADDRNOTAVAIL
 #undef EHOSTUNREACH
 #define EHOSTUNREACH WSAEHOSTUNREACH
+#undef ENETDOWN
+#define ENETDOWN WSAENETDO

Re: Wrong example in the bloom documentation

2020-10-08 Thread Tom Lane
"Daniel Westermann (DWE)"  writes:
>> I was hoping someone more experienced with this would comment, but
>> seeing none, I will apply it in a day or two to all supported versions?
>> Have you tested this output back to 9.5?

> I hoped that as well. No, I tested down to 9.6 because the change happened in 
> 10.

The patch assumes that parallel query is enabled, which is not true by
default before v10, so it should certainly not be applied before v10
(at least not without significant revisions).

regards, tom lane




Re: Wrong example in the bloom documentation

2020-10-08 Thread Bruce Momjian
On Thu, Oct  8, 2020 at 03:43:32PM -0400, Tom Lane wrote:
> "Daniel Westermann (DWE)"  writes:
> >> I was hoping someone more experienced with this would comment, but
> >> seeing none, I will apply it in a day or two to all supported versions?
> >> Have you tested this output back to 9.5?
> 
> > I hoped that as well. No, I tested down to 9.6 because the change happened 
> > in 10.
> 
> The patch assumes that parallel query is enabled, which is not true by
> default before v10, so it should certainly not be applied before v10
> (at least not without significant revisions).

I think we should just go for simple application cases for this.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Minor documentation error regarding streaming replication protocol

2020-10-08 Thread Bruce Momjian
On Sun, Sep 13, 2020 at 10:25:01PM +0200, Brar Piening wrote:
> While implementing streaming replication client functionality for Npgsql
> I stumbled upon a minor documentation error at
> https://www.postgresql.org/docs/current/protocol-replication.html
> 
> The "content" return value for the TIMELINE_HISTORY command is advertised
> as bytea while it is in fact raw ASCII bytes.
> 
> How did I find out?
> Since the value I get doesn't start with a "\x" and contains ascii text,
> although I've bytea_output set to hex, I first thought that the streaming
> replication protocol simply doesn't honor bytea_output, but then I
> realized that I also get unencoded tabs and newlines which wouldn't be
> possible if the value would be passed through byteaout.
> 
> This is certainly a minor problem since the timeline history file only
> contains generated strings that are ASCII-only, so just using the
> unencoded bytes is actually easier than decoding bytea.
> OTOH it did cost me a few hours (writing a bytea decoder and figuring
> out why it doesn't work by looking at varlena.c and proving the docs
> wrong) so I want to point this out here since it is possibly an
> unintended behavior or at least a documentation error.
> Also I'm wary of taking dependency on an undocumented implementation
> detail that could possibly change at any point.

I have looked at this.  It seems SendTimeLineHistory() is sending raw
bytes from the history file, with no encoding conversion, and
ReceiveXlogStream() is receiving it, again assuming it is just plain
text.  I am not sure we really have an SQL data type where we do this. 
BYTEA doesn't do encoding conversion, but does backslash procesing, and
TEXT does encoding conversion.

I suppose we either have to document this as BYTEA with no backslash
processing, or TEXT with no encoding conversion --- I think I prefer the
latter.

Attached is a patch to update the docs, and the data type passed by
SendTimeLineHistory().  Does this look safe to everyone?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index f5e3318106..3cc30a480b 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1877,11 +1877,11 @@ The commands accepted in replication mode are:
 
   
   
-   content (bytea)
+   content (text)
   
   
   
-   Contents of the timeline history file.
+   Contents of the timeline history file;  no encoding conversion is performed.
   
   
   
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7c9d1b67df..7db8975065 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -496,7 +496,7 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
 	pq_sendstring(&buf, "content"); /* col name */
 	pq_sendint32(&buf, 0);		/* table oid */
 	pq_sendint16(&buf, 0);		/* attnum */
-	pq_sendint32(&buf, BYTEAOID);	/* type oid */
+	pq_sendint32(&buf, TEXTOID);	/* type oid */
 	pq_sendint16(&buf, -1);		/* typlen */
 	pq_sendint32(&buf, 0);		/* typmod */
 	pq_sendint16(&buf, 0);		/* format code */


Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread Ranier Vilela
On Thu, Oct 8, 2020 at 10:13:53AM -0700, John W Higgins wrote:

>It's not going to win a Turing award - but I thought this project was a
>little more friendly than what I've seen in this thread towards a first
>time contributor.

Instead, it is unfriendly.

It takes a lot of motivation to "try" to submit a patch.

Good luck, Maksim Kita.

Thanks for the support John

regards,

Ranier Vilela


Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-08 Thread Thomas Munro
On Tue, Oct 6, 2020 at 10:38 PM Greg Nancarrow  wrote:
+if (estate->es_plannedstmt->commandType == CMD_INSERT)
...
+if ((XactReadOnly || (IsInParallelMode() &&
queryDesc->plannedstmt->commandType != CMD_INSERT)) &&
...
+isParallelInsertLeader = nodeModifyTableState->operation == CMD_INSERT;
...

One thing I noticed is that you have logic, variable names and
assertions all over the tree that assume that we can only do parallel
*inserts*.  I agree 100% with your plan to make Parallel Insert work
first, it is an excellent goal and if we get it in it'll be a headline
feature of PG14 (along with COPY etc).  That said, I wonder if it
would make sense to use more general naming (isParallelModifyLeader?),
be more liberal where you really mean "is it DML", and find a way to
centralise the logic about which DML commands types are currently
allowed (ie insert only for now) for assertions and error checks etc,
so that in future we don't have to go around and change all these
places and rename things again and again.
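
For example, the various checks and assertions could funnel through one
small helper (name here just for illustration), so that enabling UPDATE and
DELETE later becomes a one-line change:

/*
 * Central place to decide which DML command types may use parallel
 * modify.  Only INSERT is allowed for now.
 */
static inline bool
IsModifySupportedInParallelMode(CmdType operation)
{
    return operation == CMD_INSERT;
}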

While contemplating that, I couldn't resist taking a swing at the main
(?) show stopper for Parallel Update and Parallel Delete, judging by
various clues left in code comments by Robert: combo command IDs
created by other processes.  Here's a rapid prototype to make that
work (though perhaps not as efficiently as we'd want, not sure).  With
that in place, I wonder what else we'd need to extend your patch to
cover all three operations... it can't be much!  Of course I don't
want to derail your work on Parallel Insert, I'm just providing some
motivation for my comments on the (IMHO) shortsightedness of some of
the coding.

PS Why not use git format-patch to create patches?
From ad2b5e07a09603b09859dfcbde6addd51096cbb0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 9 Oct 2020 00:27:07 +1300
Subject: [PATCH] Coordinate combo command IDs with parallel workers.

Previously, we would serialize the leader's combo command IDs and
restore a read-only copy of them in worker processes, not allowing
updates.  Instead, migrate them into shared memory, in preparation for
parallel update/delete queries where new combo command IDs might need to
be created in any process and visible to others.

XXX This design causes every backend to maintain its own deduplication
hash table, and requires a shared lock to look up any combocid.  Both
policies could be reconsidered, basically a memory size vs locking
tradeoff.  Need some experience/profiling of real work to see how much
any of this really matters.

Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.gmail.com
---
 src/backend/access/common/session.c|  28 +-
 src/backend/access/heap/heapam.c   |  10 -
 src/backend/access/transam/README.parallel |   6 -
 src/backend/access/transam/parallel.c  |  14 -
 src/backend/storage/lmgr/lwlock.c  |   2 +
 src/backend/utils/time/combocid.c  | 290 -
 src/include/access/session.h   |  20 +-
 src/include/storage/lwlock.h   |   1 +
 src/include/utils/combocid.h   |   8 +-
 9 files changed, 275 insertions(+), 104 deletions(-)

diff --git a/src/backend/access/common/session.c b/src/backend/access/common/session.c
index 0ec61d48a2..7e1bffb680 100644
--- a/src/backend/access/common/session.c
+++ b/src/backend/access/common/session.c
@@ -23,6 +23,7 @@
 #include "access/session.h"
 #include "storage/lwlock.h"
 #include "storage/shm_toc.h"
+#include "utils/combocid.h"
 #include "utils/memutils.h"
 #include "utils/typcache.h"
 
@@ -43,6 +44,7 @@
  */
 #define SESSION_KEY_DSA		UINT64CONST(0x0001)
 #define SESSION_KEY_RECORD_TYPMOD_REGISTRY	UINT64CONST(0x0002)
+#define SESSION_KEY_FIXED	UINT64CONST(0x0003)
 
 /* This backend's current session. */
 Session*CurrentSession = NULL;
@@ -74,8 +76,10 @@ GetSessionDsmHandle(void)
 	dsm_segment *seg;
 	size_t		typmod_registry_size;
 	size_t		size;
+	void	   *fixed_space;
 	void	   *dsa_space;
 	void	   *typmod_registry_space;
+	SessionFixed *fixed;
 	dsa_area   *dsa;
 	MemoryContext old_context;
 
@@ -91,6 +95,10 @@ GetSessionDsmHandle(void)
 	old_context = MemoryContextSwitchTo(TopMemoryContext);
 	shm_toc_initialize_estimator(&estimator);
 
+	/* Estimate size for the fixed-sized per-session state. */
+	shm_toc_estimate_keys(&estimator, 1);
+	shm_toc_estimate_chunk(&estimator, sizeof(SessionFixed));
+
 	/* Estimate space for the per-session DSA area. */
 	shm_toc_estimate_keys(&estimator, 1);
 	shm_toc_estimate_chunk(&estimator, SESSION_DSA_SIZE);
@@ -113,6 +121,14 @@ GetSessionDsmHandle(void)
 		 dsm_segment_address(seg),
 		 size);
 
+	/* Create the simple fixed-sized session state. */
+	fixed_space = shm_toc_allocate(toc, sizeof(SessionFixed));
+	fixed = (SessionFixed *) fixed_space;
+	memset(fixed, 0, sizeof(*fixed));
+	LWLockInitialize(&fixed->shared_combocid_lock, LWTRANCHE_SHARED_COMBO

Re: [HACKERS] Custom compression methods

2020-10-08 Thread Tomas Vondra

On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote:

On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar  wrote:


On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar  wrote:
>
> On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
>  wrote:
> >
> > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > > wrote:
> > >>
> > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > >> > wrote:
> > >> >>
> > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > >> >> > wrote:
> > >> >> >
> > >> >> >Thanks, Tomas for your feedback.
> > >> >> >
> > >> >> >> 9) attcompression ...
> > >> >> >>
> > >> >> >> The main issue I see is what the patch does with attcompression. 
Instead
> > >> >> >> of just using it to store a the compression method, it's also used 
to
> > >> >> >> store the preserved compression methods. And using NameData to 
store
> > >> >> >> this seems wrong too - if we really want to store this info, the 
correct
> > >> >> >> way is either using text[] or inventing charvector or similar.
> > >> >> >
> > >> >> >The reason for using the NameData is the get it in the fixed part of
> > >> >> >the data structure.
> > >> >> >
> > >> >>
> > >> >> Why do we need that? It's possible to have varlena fields with direct
> > >> >> access (see pg_index.indkey for example).
> > >> >
> > >> >I see.  While making it NameData I was thinking whether we have an
> > >> >option to direct access the varlena. Thanks for pointing me there. I
> > >> >will change this.
> > >> >
> > >> > Adding NameData just to make
> > >> >> it fixed-length means we're always adding 64B even if we just need a
> > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > >> >> That seems a bit unnecessary, and might be an issue with many 
attributes
> > >> >> (e.g. with many temp tables, etc.).
> > >> >
> > >> >You are right.  Even I did not like to keep 64B for this, so I will 
change it.
> > >> >
> > >> >>
> > >> >> >> But to me this seems very much like a misuse of attcompression to 
track
> > >> >> >> dependencies on compression methods, necessary because we don't 
have a
> > >> >> >> separate catalog listing compression methods. If we had that, I 
think we
> > >> >> >> could simply add dependencies between attributes and that catalog.
> > >> >> >
> > >> >> >Basically, up to this patch, we are having only built-in compression
> > >> >> >methods and those can not be dropped so we don't need any dependency
> > >> >> >at all.  We just want to know what is the current compression method
> > >> >> >and what is the preserve compression methods supported for this
> > >> >> >attribute.  Maybe we can do it better instead of using the NameData
> > >> >> >but I don't think it makes sense to add a separate catalog?
> > >> >> >
> > >> >>
> > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> > >> >> very much like a workaround needed because we don't have the catalog.
> > >> >>
> > >> >> I don't quite understand how could we support custom compression 
methods
> > >> >> without listing them in some sort of catalog?
> > >> >
> > >> >Yeah for supporting custom compression we need some catalog.
> > >> >
> > >> >> >> Moreover, having the catalog would allow adding compression methods
> > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > >> >> >> compression methods. Which seems like a strange limitation, 
considering
> > >> >> >> this thread is called "custom compression methods".
> > >> >> >
> > >> >> >I think I forgot to mention while submitting the previous patch that
> > >> >> >the next patch I am planning to submit is, Support creating the 
custom
> > >> >> >compression methods wherein we can use pg_am catalog to insert the 
new
> > >> >> >compression method.  And for dependency handling, we can create an
> > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> > >> >> >attribute dependency on the current compression method AM as well as
> > >> >> >on the preserved compression methods AM.   As part of this, we will
> > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> > >> >> >will be converted to the oid_vector (first OID will be of the current
> > >> >> >compression method, followed by the preserved compression method's
> > >> >> >oids).
> > >> >> >
> > >> >>
> > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods 
don't
> > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> > >> >>
> > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> > >> >> then add the catalog later - I think we should start with the catalog
> > >> >> right away. The advantage is that if we end up committing only some of
> > >> >> the patches in this cycle, we already have all the infrastructure etc

Re: POC: postgres_fdw insert batching

2020-10-08 Thread Tomas Vondra

On Thu, Oct 08, 2020 at 02:40:10AM +, tsunakawa.ta...@fujitsu.com wrote:

Hello Tomas san,


Thank you for picking up this.  I'm interested in this topic, too.  (As an 
aside, we'd like to submit a bulk insert patch for ECPG in the near future.)

As others have mentioned, Andrey-san's fast COPY to foreign partitions is also
promising.  But I think your bulk INSERT is a separate feature and offers something
COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE
TABLE AS SELECT.

Is there anything that makes you worry and stops development?  Could I give it 
a try to implement this (I'm not sure I can, sorry.  I'm worried if we can 
change the executor's call chain easily.)



It's primarily a matter of having too much other stuff on my plate, thus
not having time to work on this feature. I was not too worried about any
particular issue, but I wanted some feedback before spending more time
on extending the API.

I'm not sure when I'll have time to work on this again, so if you are
interested and willing to work on it, please go ahead. I'll gladly do
reviews and help you with it.




1) Extend the FDW API?


Yes, I think, because FDWs for other DBMSs will benefit from this.  (But it's 
questionable whether we want users to transfer data in Postgres database to 
other DBMSs...)



I think transferring data to other databases is fine - interoperability
is a big advantage for users, I don't see it as something threatening
the PostgreSQL project. I doubt this would make it more likely for users
to migrate from PostgreSQL - there are many ways to do that already.



MySQL and SQL Server have the same bulk insert syntax as Postgres, i.e., INSERT 
INTO table VALUES(record1), (record2), ...  Oracle doesn't have this syntax, 
but it can use CTE as follows:

 INSERT INTO table
 WITH t AS (
   SELECT record1 FROM DUAL UNION ALL
   SELECT record2 FROM DUAL UNION ALL
   ...
 )
 SELECT * FROM t;

And many DBMSs should have CTAS, INSERT SELECT, and INSERT SELECT record1 UNION 
ALL SELECT record2 ...



True. In some cases INSERT may be replaced by COPY, but it has various
other features too.


The API would simply be:

TupleTableSlot **
ExecForeignMultiInsert(EState *estate,
                       ResultRelInfo *rinfo,
                       TupleTableSlot **slot,
                       TupleTableSlot **planSlot,
                       int numSlots);




+1, seems quite reasonable
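
To make sure we are talking about the same thing, here is a very rough
sketch of what the postgres_fdw side could look like (the deparse/execute
helpers are hypothetical and error handling is omitted):

static TupleTableSlot **
postgresExecForeignMultiInsert(EState *estate,
                               ResultRelInfo *rinfo,
                               TupleTableSlot **slot,
                               TupleTableSlot **planSlot,
                               int numSlots)
{
    PgFdwModifyState *fmstate = (PgFdwModifyState *) rinfo->ri_FdwState;
    StringInfoData    sql;

    initStringInfo(&sql);

    /* hypothetical: builds "INSERT INTO tab VALUES ($1, ...), ($2, ...), ..." */
    deparseMultiInsertSql(&sql, fmstate, numSlots);

    /* hypothetical: binds the parameters from all slots, one remote INSERT */
    execute_foreign_multi_insert(fmstate, sql.data, slot, numSlots);

    return slot;
}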


2) What about the insert results?


I'm wondering if we can report success or failure of each inserted row, because 
the remote INSERT will fail entirely.  Other FDWs may be able to do it, so the 
API can be like above.



Yeah. I think handling complete failure should not be very difficult,
but there are cases that worry me more. For example, what if there's a
before trigger (on the remote db) that "skips" inserting some of the
rows by returning NULL?


For the same reason, support for RETURNING clause will vary from DBMS to DBMS.



Yeah. I wonder if the FDW needs to indicate which features are supported
by the ExecForeignMultiInsert, e.g. by adding a function that decides
whether batch insert is supported (it might also do that internally by
calling ExecForeignInsert, of course).
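
Something as simple as one more optional callback in FdwRoutine might be
enough (hypothetical name, shown only to illustrate the idea):

/* returns true if the FDW can batch inserts into this result relation */
typedef bool (*ExecForeignBatchInsertSupported_function) (ResultRelInfo *rinfo);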




3) What about the other DML operations (DELETE/UPDATE)?


I don't think they are necessary for the time being.  If we want them, they 
will be implemented using the libpq batch/pipelining as Andres-san said.



I agree.




3) Should we do batching for COPY insteads?


I'm thinking of issuing INSERT with multiple records as your patch does, 
because:

* When the user executed INSERT statements, it would look strange to the user 
if the remote SQL is displayed as COPY.

* COPY doesn't invoke rules unlike INSERT.  (I don't think the rule is a 
feature what users care about, though.)  Also, I'm a bit concerned that there 
might be, or will be, other differences between INSERT and COPY.



I agree.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: speed up unicode normalization quick check

2020-10-08 Thread John Naylor
On Thu, Oct 8, 2020 at 8:29 AM Michael Paquier  wrote:

> On Thu, Oct 08, 2020 at 04:52:18AM -0400, John Naylor wrote:
> > Looks fine overall, but one minor nit: I'm curious why you made a
> separate
> > section in the pgindent exclusions. The style in that file seems to be
> one
> > comment per category.
>
> Both parts indeed use PerfectHash.pm, but are generated by different
> scripts so that looked better as separate items.


Okay, thanks.

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 08:16, Onder Kalaci  wrote:
> I hit an assertion failure. When asserts disabled, it works fine even with 
> more tables  (>5000).
>
> Steps to reproduce:
> CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 
> int, value_3 float, value_4 bigint);
> 250 relations work fine, see the query (too long to copy & paste here): 
> https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41

I had a quick look at this and I can recreate it using the following
(using psql)

select 'explain select count(*) from users_table ' || string_Agg('LEFT
JOIN users_table u'|| x::text || ' USING (user_id)',' ') from
generate_Series(1,379)x;
\gexec

That triggers the assert due to the Assert(outer_skip_rows <=
outer_rows); failing in initial_cost_mergejoin().

The reason it fails is that outer_path_rows has become infinity due to
calc_joinrel_size_estimate continually multiplying in the join
selectivity of 0.05 (due to our 200 default num distinct from lack of
any stats) which after a number of iterations causes the number to
become very large.

Instead of running 379 joins from above, try with 378 and you get:

 Aggregate  (cost=NaN..NaN rows=1 width=8)
   ->  Nested Loop Left Join  (cost=33329.16..NaN rows=Infinity width=0)
 Join Filter: (users_table.user_id = u378.user_id)
 ->  Merge Left Join  (cost=33329.16.. width=4)
   Merge Cond: (users_table.user_id = u377.user_id)
   ->  Merge Left Join  (cost=33240.99.. width=4)

Changing the code in initial_cost_mergejoin() to add:

if (outer_path_rows <= 0 || isnan(outer_path_rows))
outer_path_rows = 1;
+else if (isinf(outer_path_rows))
+outer_path_rows = DBL_MAX;

does seem to fix the problem, but that's certainly not the right fix.

Perhaps the right fix is to modify clamp_row_est() with:

@@ -193,7 +194,9 @@ clamp_row_est(double nrows)
 * better and to avoid possible divide-by-zero when interpolating costs.
 * Make it an integer, too.
 */
-   if (nrows <= 1.0)
+   if (isinf(nrows))
+   nrows = rint(DBL_MAX);
+   else if (nrows <= 1.0)
nrows = 1.0;
else
nrows = rint(nrows);

but the row estimates are getting pretty insane well before then.
DBL_MAX is 226 orders of magnitude more than the estimated number of
atoms in the observable universe, so it seems pretty unreasonable that
someone might figure out a way to store that many tuples on a disk any
time soon.

Perhaps DBL_MAX is way too big a number to clamp at. I'm just not sure
what we should reduce it to so that it is reasonable.

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Tom Lane
David Rowley  writes:
> The reason it fails is that outer_path_rows has become infinity due to
> calc_joinrel_size_estimate continually multiplying in the join
> selectivity of 0.05 (due to our 200 default num distinct from lack of
> any stats) which after a number of iterations causes the number to
> become very large.

0.005, but yeah.  We're estimating that each additional join inflates
the output size by about 6x (1270 * 0.005), and after a few hundred
of those, it'll overflow.
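
To put rough numbers on that (an approximation only, based on the 1270-row
base estimate visible in the plan fragment above):

    $\mathrm{rows}(N) \approx 1270 \times (1270 \times 0.005)^{N} = 1270 \times 6.35^{N}$

which exceeds DBL_MAX ($\approx 1.8 \times 10^{308}$) once N gets to roughly
380, matching the scale at which the reported plans blow up.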

> Perhaps the right fix is to modify clamp_row_est() with:

I thought of that too, but as you say, if the rowcount has overflowed a
double then we've got way worse problems.  It'd make more sense to try
to keep the count to a saner value in the first place.  

In the end, (a) this is an Assert, so not a problem for production
systems, and (b) it's going to take you longer than you want to
wait to join 500+ tables, anyhow, unless maybe they're empty.
I'm kind of disinclined to do anything in the way of a band-aid fix.

If somebody has an idea for a different way of estimating the join
size with no stats, we could talk about that.  I notice though that
the only way a plan of this sort isn't going to blow up at execution
is if the join multiplication factor is at most 1, ie the join
key is unique.  But guess what, we already know what to do in that
case.  Adding a unique or pkey constraint to users_table.user_id
causes the plan to collapse entirely (if they're left joins) or
at least still produce a small rowcount estimate (if plain joins).

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 12:16, Tom Lane  wrote:
>
> > Perhaps the right fix is to modify clamp_row_est() with:
>
> I thought of that too, but as you say, if the rowcount has overflowed a
> double then we've got way worse problems.  It'd make more sense to try
> to keep the count to a saner value in the first place.

I wonder if there was something more logical we could do to maintain
sane estimates too, but someone could surely still cause it to blow up
by writing a long series of clause-less joins. We can't really get
away from the fact that we must estimate those as inner_rows *
outer_rows.

I admit it's annoying to add cycles to clamp_row_est() for such insane cases.

David




Re: Minor documentation error regarding streaming replication protocol

2020-10-08 Thread Michael Paquier
On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote:
> I have looked at this.  It seems SendTimeLineHistory() is sending raw
> bytes from the history file, with no encoding conversion, and
> ReceiveXlogStream() is receiving it, again assuming it is just plain
> text.  I am not sure we really have an SQL data type where we do this. 
> BYTEA doesn't do encoding conversion, but does backslash procesing, and
> TEXT does encoding conversion.
> 
> I suppose we either have to document this as BYTEA with no backslash
> processing, or TEXT with no encoding conversion --- I think I prefer the
> later.

As StartupXLOG() shows, the timeline history file can include, as the
reason, the recovery target name, which may not consist only of ASCII
characters since that's the value the user specified in
pg_create_restore_point, so bytea is correct, no?
--
Michael




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Tom Lane
David Rowley  writes:
> I admit it's annoying to add cycles to clamp_row_est() for such insane cases.

I poked at this a bit more closely, and noted that the actual problem is
that when we do this:

outer_skip_rows = rint(outer_path_rows * outerstartsel);

we have outer_path_rows = inf, outerstartsel = 0, and of course inf times
zero is NaN.  So we end up asserting "NaN <= Inf", not "Inf <= Inf"
(which wouldn't have caused a problem).
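
For illustration, a tiny standalone program showing exactly that behavior
(nothing PostgreSQL-specific, just IEEE-754 arithmetic):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  inf = INFINITY;
    double  skip = inf * 0.0;   /* infinity times zero yields NaN */

    printf("inf * 0.0  = %g\n", skip);          /* nan */
    printf("nan <= inf : %d\n", skip <= inf);   /* 0: comparisons with NaN are false */
    printf("inf <= inf : %d\n", inf <= inf);    /* 1 */
    return 0;
}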

If we did want to do something here, I'd consider something like

if (isnan(outer_skip_rows))
outer_skip_rows = 0;
if (isnan(inner_skip_rows))
inner_skip_rows = 0;

(We shouldn't need that for outer_rows/inner_rows, since the endsel
values can't be 0.)  Messing with clamp_row_est would be a much more
indirect way of fixing it, as well as having more widespread effects.

In the end though, I'm still not terribly excited about this.

regards, tom lane




Re: [HACKERS] logical decoding of two-phase transactions

2020-10-08 Thread Peter Smith
On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila  wrote:
> > COMMENT
> > Line 177
> > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * 
> > prepare_data)
> >
> > prepare_data->prepare_type = flags;
> > This code may be OK but it does seem a bit of an abuse of the flags.
> >
> > e.g. Are they flags or are they really enum values?
> > e.g. And if they are effectively enums (it appears they are) then it
> > seemed inconsistent that |= was used when they were previously
> > assigned.
> >
> > ;
>
> I don't understand this point. As far as I can see at the time of
> write (logicalrep_write_prepare()), the patch has used |=, and at the
> time of reading (logicalrep_read_prepare()) it has used assignment
> which seems correct from the code perspective. Do you have a better
> proposal?

OK. I will explain my thinking when I wrote that review comment.

I agree all is "correct" from a code perspective.

But IMO using bit arithmetic implies that different combinations are
also possible, whereas in the current code they are not.
So the code is kind of having a bet each way - sometimes treating "flags"
as bit flags and sometimes as enums.

e.g. If these flags are not really bit flags at all then the
logicalrep_write_prepare() code might just as well be written as
below:

BEFORE
if (rbtxn_commit_prepared(txn))
flags |= LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags |= LOGICALREP_IS_PREPARE;

/* Make sure exactly one of the expected flags is set. */
if (!PrepareFlagsAreValid(flags))
elog(ERROR, "unrecognized flags %u in prepare message", flags);


AFTER
if (rbtxn_commit_prepared(txn))
flags = LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags = LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags = LOGICALREP_IS_PREPARE;

~

OTOH, if you really do want to anticipate having future flag bit
combinations then maybe the PrepareFlagsAreValid() macro ought to
be tweaked accordingly, and the logicalrep_read_prepare() code maybe
should look more like below:

BEFORE
/* set the action (reuse the constants used for the flags) */
prepare_data->prepare_type = flags;

AFTER
/* set the action (reuse the constants used for the flags) */
prepare_data->prepare_type =
flags & LOGICALREP_IS_COMMIT_PREPARED ? LOGICALREP_IS_COMMIT_PREPARED :
flags & LOGICALREP_IS_ROLLBACK_PREPARED ? LOGICALREP_IS_ROLLBACK_PREPARED :
LOGICALREP_IS_PREPARE;
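
(As an aside, the "exactly one of the expected flags is set" check that
PrepareFlagsAreValid() does in the write path could be as simple as the
sketch below; the constant values and the macro body are only my guesses
here, since the actual definitions are not shown above.)

/* Constant values are assumptions for illustration only. */
#define LOGICALREP_IS_PREPARE            0x01
#define LOGICALREP_IS_COMMIT_PREPARED    0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED  0x04

/* "Valid" simply means exactly one of the three known values. */
#define PrepareFlagsAreValid(flags) \
    ((flags) == LOGICALREP_IS_PREPARE || \
     (flags) == LOGICALREP_IS_COMMIT_PREPARED || \
     (flags) == LOGICALREP_IS_ROLLBACK_PREPARED)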

Kind Regards.
Peter Smith
Fujitsu Australia




Re: [PATCH] ecpg: fix progname memory leak

2020-10-08 Thread Michael Paquier
On Thu, Oct 08, 2020 at 01:17:25PM -0400, Tom Lane wrote:
> Agreed.  In theory there might be some point in removing leaks that happen
> per-statement, so as to avoid unreasonable memory bloat when processing an
> extremely long input file.  In practice nobody has complained about that,
> and if somebody did I'd probably question the sanity of putting so much
> code into one file.  (The C compiler would likely bloat even more while
> processing the output...)  Moreover, given the way the ecpg grammar works,
> it'd be really painful to avoid all such leaks, and it might well
> introduce bugs not fix them.

I got to wonder about that, and I don't quite see how to make all the
memory handling clean without having at least a concept of
per-statement memory context to make any cleanup easy once a statement
is done.  That's a lot of infrastructure for a case nobody really
complained about, indeed..

> In any case, if this TODO item is going to lead to ideas as dubious
> as "let's free progname before exiting", it's not helpful.

+1.  Agreed to just remove it.
--
Michael




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 12:59, Tom Lane  wrote:
> If we did want to do something here, I'd consider something like
>
> if (isnan(outer_skip_rows))
> outer_skip_rows = 0;
> if (isnan(inner_skip_rows))
> inner_skip_rows = 0;

Are you worried about the costs above the join that triggers that
coming out as NaN with that fix? It appears that's the case. Cost
comparisons of paths like that are not going to do anything sane.

I guess whether or not that matters depends on if we expect any real
queries to hit this, or if we just want to stop the Assert failure.

... 500 joins. I'm willing to listen to an explanation of the use case, but
in the absence of that explanation, I'd be leaning towards "you're doing
it wrong".  If that turns out to be true, then perhaps your proposed
fix is okay.

David




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-08 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> > (6)
> > +   bufHdr->tag.blockNum >=
> > firstDelBlock[j])
> > +   InvalidateBuffer(bufHdr);   /*
> > releases spinlock */
> >
> > The right side of >= should be cur_block.
> 
> Fixed.

>= should be =, shouldn't it?

Please measure and let us see just the recovery performance again because the 
critical part of the patch was modified.  If the performance is as good as the 
previous one, and there's no review interaction with others in progress, I'll 
mark the patch as ready for committer in a few days.


 Regards
Takayuki Tsunakawa



Remove some unnecessary if-condition

2020-10-08 Thread Hou, Zhijie
Hi

I found some likely-unnecessary if-conditions in the code.

1. Some checks in else branches seem unnecessary.

In (/src/backend/replication/logical/reorderbuffer.c) 
① @@ -4068,7 +4068,7 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, 
ReorderBufferTXN *txn,
> bool   found;
> if (!found)
> {
>...
> }
> else if (found && chunk_seq != ent->last_chunk_seq + 1)
>...

The check of "found" in the else-if branch seems unnecessary.

② (/src/backend/utils/init/postinit.c)
@@ -924,11 +924,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char 
*username,

> bool  bootstrap = IsBootstrapProcessingMode();
> if (bootstrap)
> {
>...
> }
> else if(...)
> {...}
> else
> {
>if (!bootstrap)
>{
>...
>}
> }

The check of "bootstrap" in the else branch seems unnecessary.


2. In (/src/interfaces/ecpg/compatlib/informix.c)
@@ -944,7 +944,7 @@ rupshift(char *str)

> for (len--; str[len] && str[len] == ' '; len--);

The first "str[len]" check seems unnecessary, since "str[len] == ' '" will check it 
as well.
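
A quick standalone sanity check of that equivalence (not the ecpg code
itself, just the boolean identity it relies on):

#include <assert.h>

int
main(void)
{
    /*
     * For every possible character value c, (c && c == ' ') has the same
     * truth value as (c == ' '), because '\0' never compares equal to ' '.
     */
    for (int c = 0; c < 256; c++)
        assert(((c != 0) && (c == ' ')) == (c == ' '));
    return 0;
}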

Do you think we should remove these if-conditions as a code cleanup?

Best regards,
houzj




0001-Remove-some-unnecessary-path.patch
Description: 0001-Remove-some-unnecessary-path.patch


Re: Expansion of our checks for connection-loss errors

2020-10-08 Thread Kyotaro Horiguchi
At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane  wrote in 
> Over in the thread at [1], we've tentatively determined that the
> reason buildfarm member lorikeet is currently failing is that its
> network stack returns ECONNABORTED for (some?) connection failures,
> whereas our code is only expecting ECONNRESET.  Fujii Masao therefore
> proposes that we treat ECONNABORTED the same as ECONNRESET.  I think
> this is a good idea, but after a bit of research I feel it does not
> go far enough.  I find these POSIX-standard errnos that also seem
> likely candidates to be returned for a hard loss of connection:
> 
>   ECONNABORTED
>   EHOSTUNREACH
>   ENETDOWN
>   ENETUNREACH
> 
> All of these have been in POSIX since SUSv2, so it seems unlikely
> that we need to #ifdef any of them.  (It is in any case pretty silly
> that we have #ifdefs around a very small minority of our references
> to ECONNRESET :-(.)
> 
> There are some other related errnos, such as ECONNREFUSED, that
> don't seem like they'd be returned for a failure of a pre-existing
> connection, so we don't need to include them in such tests.
> 
> Accordingly, I propose the attached patch (an expansion of
> Fujii-san's) that causes us to test for all five errnos anyplace
> we had been checking for ECONNRESET.  I felt that this was getting to
> the point where we'd better centralize the knowledge of what to check,
> so the patch does that, via an inline function and an admittedly hacky
> macro.  I also upgraded some places such as strerror.c to have full
> support for these symbols.
> 
> All of the machines I have (even as far back as HPUX 10.20) also
> define ENETRESET and EHOSTDOWN.  However, those symbols do not appear
> in SUSv2.  ENETRESET was added at some later point, but EHOSTDOWN is
> still not in POSIX.  For the moment I've left these second-tier
> symbols out of the patch, but there's a case for adding them.  I'm
> not sure whether there'd be any point in trying to #ifdef them.
> 
> BTW, I took out the conditional defines of some of these errnos in
> libpq's win32.h; AFAICS that's been dead code ever since we added
> #define's for them to win32_port.h.  Am I missing something?
> 
> This seems like a bug fix to me, so I'm inclined to back-patch.
> 
>   regards, tom lane
> 
> [1] 
> https://www.postgresql.org/message-id/flat/E1kPc9v-0005L4-2l%40gemulon.postgresql.org

+1 for the direction.

In terms of connection errors, connect(2) and bind(2) can return
EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I
recently saw pgbench getting EADDRNOTAVAIL. (They have mappings from
the respective WSA errors in TranslateSocketError().)

I'm not sure how we should treat EMFILE/ENFILE/ENOBUFS/ENOMEM from
accept(2). (select(2) can return ENOMEM.)

I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to
avoid duplicating the definition of the errno list.
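
That is, something along these lines (just a sketch of the shape I mean;
the names errno_is_connection_loss and ALL_CONNECTION_LOSS_ERRNOS are from
the proposed patch, but the bodies below are my assumption, not the actual
patch text):

#include <errno.h>
#include <stdbool.h>

/* Single list of "connection lost" errnos, usable inside a switch. */
#define ALL_CONNECTION_LOSS_ERRNOS \
    ECONNRESET: \
    case ECONNABORTED: \
    case EHOSTUNREACH: \
    case ENETDOWN: \
    case ENETUNREACH

static inline bool
errno_is_connection_loss(int e)
{
    switch (e)
    {
        case ALL_CONNECTION_LOSS_ERRNOS:
            return true;
        default:
            return false;
    }
}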

-   if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
+   if (ret < 0 && errno_is_connection_loss(WSAGetLastError()))

Don't we need to use TranslateSocketError() before?

+   /* We might get ECONNRESET etc here if using TCP and backend 
died */
+   if (errno_is_connection_loss(SOCK_ERRNO))

Perhaps I'm confused, but SOCK_ERRNO doesn't seem portable between
Windows and Linux.

=
/*
 * These macros are needed to let error-handling code be portable between
 * Unix and Windows.  (ugh)
 */
#ifdef WIN32
#define SOCK_ERRNO (WSAGetLastError())
#define SOCK_STRERROR winsock_strerror
#define SOCK_ERRNO_SET(e) WSASetLastError(e)
#else
#define SOCK_ERRNO errno
#define SOCK_STRERROR strerror_r
#define SOCK_ERRNO_SET(e) (errno = (e))
#endif
=

AFAICS SOCK_ERRNO is intended to be used idiomatically as:

> SOCK_STRERROR(SOCK_ERRNO, ...)

The WSAE values from WSAGetLastError() and the E values in errno are not
compatible and need translation by TranslateSocketError()?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Expansion of our checks for connection-loss errors

2020-10-08 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane  wrote in 
>> Accordingly, I propose the attached patch (an expansion of
>> Fujii-san's) that causes us to test for all five errnos anyplace
>> we had been checking for ECONNRESET.

> +1 for the direction.

> In terms of connection errors, connect(2) and bind(2) can return
> EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I
> recently saw pgbench getting EADDRNOTAVAIL. (They have mappings from
> the respective WSA errors in TranslateSocketError().)

I do not think we have any issues with connection-time errors;
or at least, if we do, the spots being touched here certainly
shouldn't need to worry about them.  These places are dealing
with already-established connections.

> I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to
> avoid duplicating the definition of the errno list.

Hmm, might be worth doing, but I'm not sure.  I am worried about
whether compilers will generate equally good code that way.

> - if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
> + if (ret < 0 && errno_is_connection_loss(WSAGetLastError()))

> Don't we need to use TranslateSocketError() before?

Oh, I missed that.  But:

> Perhaps I'm confused, but SOCK_ERRNO doesn't seem portable between
> Windows and Linux.

In that case, nothing would have worked on Windows for the last
ten years, so you're mistaken.  I think the actual explanation
why this works, and why that test in parallel.c probably still
works even with my mistake, is that win32_port.h makes sure that
our values of ECONNRESET etc match WSAECONNRESET etc.

IOW, we'd not actually need TranslateSocketError at all, except
that it maps some not-similarly-named error codes for conditions
that don't exist in Unix into ones that do.  We probably do want
TranslateSocketError in this parallel.c test so that anything that
it maps to one of the errno_is_connection_loss codes will be
recognized; but the basic cases would work anyway, unless I
misunderstand this stuff entirely.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Tom Lane
David Rowley  writes:
> Are you worried about the costs above the join that triggers that
> coming out as NaN with that fix? It appears that's the case.

[ pokes at that... ]  Yeah, it looks like nestloop cost estimation
also has some issues with inf-times-zero producing NaN; it's just
not asserting about it.

I notice there are some other ad-hoc isnan() checks scattered
about costsize.c, too.  Maybe we should indeed consider fixing
clamp_row_estimate to get rid of inf (and nan too, I suppose)
so that we'd not need those.  I don't recall the exact cases
that made us introduce those checks, but they were for cases
a lot more easily reachable than this one, I believe.
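
Concretely, perhaps something like the sketch below (illustration only;
what finite ceiling to clamp infinity to is exactly the open question
David raised, and DBL_MAX is just a placeholder here):

#include <float.h>
#include <math.h>

static double
clamp_row_est_sketch(double nrows)
{
    if (isnan(nrows))
        nrows = 1.0;            /* defensive, shouldn't normally happen */
    else if (isinf(nrows))
        nrows = DBL_MAX;        /* or some smaller, saner ceiling */
    else if (nrows <= 1.0)
        nrows = 1.0;
    else
        nrows = rint(nrows);
    return nrows;
}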

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-08 Thread Kyotaro Horiguchi
At Fri, 9 Oct 2020 00:41:24 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Jamison, Kirk/ジャミソン カーク 
> > > (6)
> > > + bufHdr->tag.blockNum >=
> > > firstDelBlock[j])
> > > + InvalidateBuffer(bufHdr);   /*
> > > releases spinlock */
> > >
> > > The right side of >= should be cur_block.
> > 
> > Fixed.
> 
> >= should be =, shouldn't it?
> 
> Please measure and let us see just the recovery performance again because the 
> critical part of the patch was modified.  If the performance is as good as the 
> previous one, and there's no review interaction with others in progress, I'll 
> mark the patch as ready for committer in a few days.

The performance is expected to be maintained since smgrnblocks() is called
in a non-hot code path, and it is actually called at most four times
per buffer drop in this patch. But it's better to make sure.

I have some comments on the latest patch.

@@ -445,6 +445,7 @@ BlockNumber
 visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 {
BlockNumber newnblocks;
+   boolcached;

All the variables added by 0002 are useless because none of the
caller sites are interested in the value. smgrnblocks should
accept NULL for isCached. (I agree with Tsunakawa-san that the
camel-case name is not common there.)
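
In other words, the usual optional-out-parameter pattern, roughly like
this (a generic illustration, not the actual smgrnblocks() code):

#include <stdbool.h>
#include <stdio.h>

static int
get_value(bool *accurate)
{
    /* Only fill in the extra information when the caller asked for it. */
    if (accurate)
        *accurate = true;
    return 42;
}

int
main(void)
{
    bool    accurate;

    printf("%d\n", get_value(&accurate));   /* caller cares about accuracy */
    printf("%d\n", get_value(NULL));        /* caller doesn't */
    return 0;
}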

+   nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached);
+
+   if (!isCached)

"is cached" is not the property that code is interested in. No other callers to 
smgrnblocks are interested in that property. The need for caching is purely 
internal to smgrnblocks().

On the other hand, we are going to utilize the property of "accuracy"
that is a byproduct of reducing fseek calls, and, again, we are not
interested in how it is achieved.

So I suggest that the name should be "accurate" or something that does
not suggest the mechanism used under the hood.

+   if (nTotalBlocks != InvalidBlockNumber &&
+   nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

I don't think nTotalBlocks is useful. What we need here is only the total
blocks for every fork (nForkBlocks[]) and the total number of buffers
to be invalidated for all forks (nBlocksToInvalidate).


> > > The right side of >= should be cur_block.
> > 
> > Fixed.
> 
> >= should be =, shouldn't it?

It's just out of paranoia. What we are going to invalidate is blocks
whose blockNum is >= curBlock. Although actually there's no chance of
any other process having replaced the buffer with another page (with
a lower block number) of the same relation after BufTableLookup(), that
condition makes sure that no blocks that should be invalidated are left
behind.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-08 Thread Kyotaro Horiguchi
Oops! Sorry for the mistake.

At Fri, 09 Oct 2020 11:12:01 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 9 Oct 2020 00:41:24 +, "tsunakawa.ta...@fujitsu.com" 
>  wrote in 
> > From: Jamison, Kirk/ジャミソン カーク 
> > > > (6)
> > > > +   bufHdr->tag.blockNum >=
> > > > firstDelBlock[j])
> > > > +   InvalidateBuffer(bufHdr);   
> > > > /*
> > > > releases spinlock */
> > > >
> > > > The right side of >= should be cur_block.
> > > 
> > > Fixed.
> > 
> > >= should be =, shouldn't it?
> > 
> > Please measure and let us see just the recovery performance again because 
> > the critical part of the patch was modified.  If the performance is as good as 
> > the previous one, and there's no review interaction with others in 
> > progress, I'll mark the patch as ready for committer in a few days.
> 
> The performance is expected to be maintained since smgrnblocks() is called
> in a non-hot code path, and it is actually called at most four times
> per buffer drop in this patch. But it's better to make sure.
> 
> I have some comments on the latest patch.
> 
> @@ -445,6 +445,7 @@ BlockNumber
>  visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
>  {
>   BlockNumber newnblocks;
> + boolcached;
> 
> All the variables added by 0002 are useless because none of the
> caller sites are interested in the value. smgrnblocks should
> accept NULL for isCached. (I agree with Tsunakawa-san that the
> camel-case name is not common there.)
> 
> + nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached);
> +
> + if (!isCached)
> 
> "is cached" is not the property that code is interested in. No other callers 
> to smgrnblocks are interested in that property. The need for caching is 
> purely internal to smgrnblocks().
> 
> On the other hand, we are going to utilize the property of "accuracy"
> that is a byproduct of reducing fseek calls, and, again, we are not
> interested in how it is achieved.
> 
> So I suggest that the name should be "accurate" or something that does
> not suggest the mechanism used under the hood.
> 
> + if (nTotalBlocks != InvalidBlockNumber &&
> + nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> 
> I don't think nTotalBlocks is useful. What we need here is only the total
> blocks for every fork (nForkBlocks[]) and the total number of buffers
> to be invalidated for all forks (nBlocksToInvalidate).
> 
> 
> > > > The right side of >= should be cur_block.
> > > 
> > > Fixed.
> > 
> > >= should be =, shouldn't it?
> 
> It's just out of paranoia. What we are going to invalidate is blocks
> whose blockNum is >= curBlock. Although actually there's no chance of

Sorry. What we are going to invalidate is blocks whose blockNum >=
firstDelBlock[i]. So what I wanted to suggest was that the condition should
be

+   if (RelFileNodeEquals(bufHdr->tag.rnode, 
rnode.node) &&
+   bufHdr->tag.forkNum == forkNum[j] &&
+   bufHdr->tag.blockNum >= 
firstDelBlock[j])

> any other process having replaced the buffer with another page (with
> a lower block number) of the same relation after BufTableLookup(), that
> condition makes sure that no blocks that should be invalidated are left
> behind.

And I forgot to mention the patch names. I think many of us name the
patches using the -v option of git format-patch and assign the version to
a patch set, so the version number of all files that are posted at
once is the same.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Transactions involving multiple postgres foreign servers, take 2

2020-10-08 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> What about temporary network failures? I think there are users who
> don't want to give up resolving foreign transactions failed due to a
> temporary network failure. Or even they might want to wait for
> transaction completion until they send a cancel request. If we want to
> call the commit routine only once and therefore want FDW to retry
> connecting the foreign server within the call, it means we require all
> FDW implementors to write a retry loop code that is interruptible and
> ensures not to raise an error, which increases difficulty.
>
> Yes, but if we don’t retry to resolve foreign transactions at all on
> an unreliable network environment, the user might end up requiring
> every transaction to check the status of foreign transactions of the
> previous distributed transaction before starts. If we allow to do
> retry, I guess we ease that somewhat.

OK.  As I said, I'm not against trying to cope with temporary network failure.  
I just don't think it's mandatory.  If the network failure is really temporary 
and thus recovers soon, then the resolver will be able to commit the 
transaction soon, too.

Then, we can have a commit retry timeout or retry count like the following 
WebLogic manual says.  (I couldn't quickly find the English manual, so below is 
in Japanese.  I quoted some text that got through machine translation, which 
appears a bit strange.)

https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
--
Abandon timeout
Specifies the maximum time (in seconds) that the transaction manager attempts 
to complete the second phase of a two-phase commit transaction.

In the second phase of a two-phase commit transaction, the transaction manager 
attempts to complete the transaction until all resource managers indicate that 
the transaction is complete. After the abort transaction timer expires, no 
attempt is made to resolve the transaction. If the transaction enters a ready 
state before it is destroyed, the transaction manager rolls back the 
transaction and releases the held lock on behalf of the destroyed transaction.
--



> Also, what if the user sets the statement timeout to 60 sec and they
> want to cancel the waits after 5 sec by pressing ctl-C? You mentioned
> that client libraries of other DBMSs don't have asynchronous execution
> functionality. If the SQL execution function is not interruptible, the
> user will end up waiting for 60 sec, which seems not good.

FDW functions can be uninterruptible in general, can't they?  We experienced 
that odbc_fdw didn't allow cancellation of SQL execution.


 Regards
Takayuki Tsunakawa



Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-08 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 8:41 AM Thomas Munro  wrote:
> One thing I noticed is that you have logic, variable names and
> assertions all over the tree that assume that we can only do parallel
> *inserts*.  I agree 100% with your plan to make Parallel Insert work
> first, it is an excellent goal and if we get it in it'll be a headline
> feature of PG14 (along with COPY etc).  That said, I wonder if it
> would make sense to use more general naming (isParallelModifyLeader?),
> be more liberal where you really mean "is it DML", and find a way to
> centralise the logic about which DML commands types are currently
> allowed (ie insert only for now) for assertions and error checks etc,
> so that in future we don't have to go around and change all these
> places and rename things again and again.
>

Fair points.
I agree, it would make more sense to generalise the naming and
centralise the DML-command-type checks, rather than everything being
insert-specific.
It was getting a bit ugly. I'll work on that.

> While contemplating that, I couldn't resist taking a swing at the main
> (?) show stopper for Parallel Update and Parallel Delete, judging by
> various clues left in code comments by Robert: combo command IDs
> created by other processes.  Here's a rapid prototype to make that
> work (though perhaps not as efficiently as we'd want, not sure).  With
> that in place, I wonder what else we'd need to extend your patch to
> cover all three operations... it can't be much!  Of course I don't
> want to derail your work on Parallel Insert, I'm just providing some
> motivation for my comments on the (IMHO) shortsightedness of some of
> the coding.
>

Thanks for your prototype code for coordination of combo command IDs
with the workers.
It does give me the incentive to look beyond that issue and see
whether parallel Update and parallel Delete are indeed possible. I'll
be sure to give it a go!

> PS Why not use git format-patch to create patches?

Guess I was being a bit lazy - will use git format-patch in future.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Expansion of our checks for connection-loss errors

2020-10-08 Thread Kyotaro Horiguchi
At Thu, 08 Oct 2020 21:41:55 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane  wrote in 
> >> Accordingly, I propose the attached patch (an expansion of
> >> Fujii-san's) that causes us to test for all five errnos anyplace
> >> we had been checking for ECONNRESET.
> 
> > +1 for the direction.
> 
> > In terms of connection errors, connect(2) and bind(2) can return
> > EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I
> > recently saw pgbench getting EADDRNOTAVAIL. (They have mappings from
> > the respective WSA errors in TranslateSocketError().)
> 
> I do not think we have any issues with connection-time errors;
> or at least, if we do, the spots being touched here certainly
> shouldn't need to worry about them.  These places are dealing
> with already-established connections.

errcode_for_socket_access() is called for connect, bind and listen, but
I understand we don't consider that case since we don't have an actual
issue related to those functions.

> > I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to
> > avoid duplicating the definition of the errno list.
> 
> Hmm, might be worth doing, but I'm not sure.  I am worried about
> whether compilers will generate equally good code that way.

The two are placed side-by-side so either will do for me.

> > -   if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
> > +   if (ret < 0 && errno_is_connection_loss(WSAGetLastError()))
> 
> > Don't we need to use TranslateSocketError() before?
> 
> Oh, I missed that.  But:
> 
> > Perhaps I'm confused, but SOCK_ERRNO doesn't seem portable between
> > Windows and Linux.
> 
> In that case, nothing would have worked on Windows for the last
> ten years, so you're mistaken.  I think the actual explanation
> why this works, and why that test in parallel.c probably still
> works even with my mistake, is that win32_port.h makes sure that
> our values of ECONNRESET etc match WSAECONNRESET etc.

Mm. Sure.

> IOW, we'd not actually need TranslateSocketError at all, except
> that it maps some not-similarly-named error codes for conditions
> that don't exist in Unix into ones that do.  We probably do want
> TranslateSocketError in this parallel.c test so that anything that
> it maps to one of the errno_is_connection_loss codes will be
> recognized; but the basic cases would work anyway, unless I
> misunderstand this stuff entirely.

Yeah, that seems to work.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: POC: postgres_fdw insert batching

2020-10-08 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> I'm not sure when I'll have time to work on this again, so if you are
> interested and willing to work on it, please go ahead. I'll gladly do
> reviews and help you with it.

Thank you very much.


> I think transferring data to other databases is fine - interoperability
> is a big advantage for users, I don't see it as something threatening
> the PostgreSQL project. I doubt this would make it more likely for users
> to migrate from PostgreSQL - there are many ways to do that already.

Definitely true.  Users may want to use INSERT SELECT to do some data 
transformation in their OLTP database and load it into a non-Postgres data 
warehouse.


> Yeah. I think handling complete failure should not be very difficult,
> but there are cases that worry me more. For example, what if there's a
> before trigger (on the remote db) that "skips" inserting some of the
> rows by returning NULL?

> Yeah. I wonder if the FDW needs to indicate which features are supported
> by the ExecForeignMultiInsert, e.g. by adding a function that decides
> whether batch insert is supported (it might also do that internally by
> calling ExecForeignInsert, of course).

Thanks for your advice.  I'll try to address them.


 Regards
Takayuki Tsunakawa





Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs  wrote:
>
> On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:
>
> > > This script will wait 10 seconds after INSERT exits
> > > before executing TRUNCATE, please wait for it to run.
>
> Has this been tested with anything other than the one test case?
>
> It would be good to know how the patch handles a transaction that
> contains many aborted subtransactions that contain invals.
>

Are you thinking from the angle of performance or functionality? I
don't see how this patch can impact either of those. Basically, it
will not execute any more invalidations than it does without
the patch for aborted subtransactions. Can you please explain your
concern in a bit more detail?

Having said that, I think it would be a good idea to test the scenario
you mentioned to ensure that we have not broken anything unknowingly.


-- 
With Regards,
Amit Kapila.




Re: Fix typos in reorderbuffer.c

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 2:40 PM Masahiko Sawada
 wrote:
>
> On Thu, 8 Oct 2020 at 17:37, Amit Kapila  wrote:
> >
>
> > So, I feel the
> > comments should be accordingly updated.
>
> +1 for this change.
>

Thanks, I have pushed this and along with it pushed a typo-fix in logical.c.

-- 
With Regards,
Amit Kapila.




Re: Minor documentation error regarding streaming replication protocol

2020-10-08 Thread Bruce Momjian
On Fri, Oct  9, 2020 at 08:52:50AM +0900, Michael Paquier wrote:
> On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote:
> > I have looked at this.  It seems SendTimeLineHistory() is sending raw
> > bytes from the history file, with no encoding conversion, and
> > ReceiveXlogStream() is receiving it, again assuming it is just plain
> > text.  I am not sure we really have an SQL data type where we do this. 
> > BYTEA doesn't do encoding conversion, but does backslash procesing, and
> > TEXT does encoding conversion.
> > 
> > I suppose we either have to document this as BYTEA with no backslash
> > processing, or TEXT with no encoding conversion --- I think I prefer the
> > later.
> 
> As StartupXLOG() tells, The timeline history file can include as
> reason the recovery target name which may not be made just of ASCII
> characters as that's the value specified in pg_create_restore_point by
> the user, so bytea is correct, no?

Good point.  The reporter was assuming the data would come to the client
in the bytea output format specified by the GUC, e.g. \x..., but that
doesn't happen either.  As I said before, it is more raw bytes, but we
don't have an SQL data type for that.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





RE: extension patch of CREATE OR REPLACE TRIGGER

2020-10-08 Thread osumi.takami...@fujitsu.com
Hello


> > * A lesser point is that I think you're overcomplicating the code by
> > applying heap_modify_tuple.  You might as well just build the new
> > tuple normally in all cases, and then apply either CatalogTupleInsert or
> CatalogTupleUpdate.
> >
> > * Also, the search for an existing trigger tuple is being done the hard way.
> > You're using an index on (tgrelid, tgname), so you could include the
> > name in the index key and expect that there's at most one match.
> While waiting for a new reply, I'll do those 2 refactorings.
I'm done with those refactorings. Please have a look at the changes
in the latest patch.

> > * I don't think that you've fully thought through the implications of
> > replacing a trigger for a table that the current transaction has
> > already modified.  Is it really sufficient, or even useful, to do
> > this:
> >
> > +/*
> > + * If this trigger has pending events, throw an error.
> > + */
> > +if (trigger_deferrable &&
> > + AfterTriggerPendingOnRel(RelationGetRelid(rel)))
> >
> > As an example, if we change a BEFORE trigger to an AFTER trigger,
> > that's not going to affect the fact that we *already* fired that
> > trigger.  Maybe this is okay and we just need to document it, but I'm not
> convinced.
> >
> > * BTW, I don't think a trigger necessarily has to be deferrable in
> > order to have pending AFTER events.  The existing use of
> > AfterTriggerPendingOnRel certainly doesn't assume that.  But really, I
> > think we probably ought to be applying CheckTableNotInUse which'd
> > include that test.  (Another way in which there's fuzzy thinking here
> > is that AfterTriggerPendingOnRel isn't specific to *this*
> > trigger.)
> Hmm, actually, when I just put a call to CheckTableNotInUse() in
> CreateTrigger(), it throws an error like "cannot CREATE OR REPLACE TRIGGER
> because it is being used by active queries in this session".
> This breaks the protection for the internal cases above and
> contradicts already-passing test cases.
> I thought of adding a condition of in_partition==false to the call to
> CheckTableNotInUse().
> But this doesn't work in a corner case where an internally generated
> child trigger is pending and we don't want to allow the replacement of
> this kind of trigger.
> Do you have any good idea to achieve both points at the same time?
I'm still waiting for a comment on this point!


Regards,
Takamichi Osumi


CREATE_OR_REPLACE_TRIGGER_v13.patch
Description: CREATE_OR_REPLACE_TRIGGER_v13.patch


Re: [HACKERS] logical decoding of two-phase transactions

2020-10-08 Thread Amit Kapila
On Fri, Oct 9, 2020 at 5:45 AM Peter Smith  wrote:
>
> On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila  wrote:
> > > COMMENT
> > > Line 177
> > > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * 
> > > prepare_data)
> > >
> > > prepare_data->prepare_type = flags;
> > > This code may be OK but it does seem a bit of an abuse of the flags.
> > >
> > > e.g. Are they flags or are they really enum values?
> > > e.g. And if they are effectively enums (it appears they are) then it
> > > seemed inconsistent that |= was used when they were previously
> > > assigned.
> > >
> > > ;
> >
> > I don't understand this point. As far as I can see at the time of
> > write (logicalrep_write_prepare()), the patch has used |=, and at the
> > time of reading (logicalrep_read_prepare()) it has used assignment
> > which seems correct from the code perspective. Do you have a better
> > proposal?
>
> OK. I will explain my thinking when I wrote that review comment.
>
> I agree all is "correct" from a code perspective.
>
> But IMO using bit arithmetic implies that different combinations are
> also possible, whereas in the current code they are not.
> So the code is kind of having a bet each way - sometimes treating "flags"
> as bit flags and sometimes as enums.
>
> e.g. If these flags are not really bit flags at all then the
> logicalrep_write_prepare() code might just as well be written as
> below:
>
> BEFORE
> if (rbtxn_commit_prepared(txn))
> flags |= LOGICALREP_IS_COMMIT_PREPARED;
> else if (rbtxn_rollback_prepared(txn))
> flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
> else
> flags |= LOGICALREP_IS_PREPARE;
>
> /* Make sure exactly one of the expected flags is set. */
> if (!PrepareFlagsAreValid(flags))
> elog(ERROR, "unrecognized flags %u in prepare message", flags);
>
>
> AFTER
> if (rbtxn_commit_prepared(txn))
> flags = LOGICALREP_IS_COMMIT_PREPARED;
> else if (rbtxn_rollback_prepared(txn))
> flags = LOGICALREP_IS_ROLLBACK_PREPARED;
> else
> flags = LOGICALREP_IS_PREPARE;
>
> ~
>
> OTOH, if you really do want to anticipate having future flag bit
> combinations
>

I don't anticipate more combinations; rather, I am not yet sure whether
we want to distinguish these operations with flags or have separate
messages for each of these operations. I think for now we can go with
your proposal above.

-- 
With Regards,
Amit Kapila.




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 15:06, Tom Lane  wrote:
> I notice there are some other ad-hoc isnan() checks scattered
> about costsize.c, too.  Maybe we should indeed consider fixing
> clamp_row_estimate to get rid of inf (and nan too, I suppose)
> so that we'd not need those.  I don't recall the exact cases
> that made us introduce those checks, but they were for cases
> a lot more easily reachable than this one, I believe.

Is there actually a case where nrows could be NaN?  If not, then it
seems like a wasted check.  Wouldn't it take one of the input
rels having an Inf row estimate (which won't
happen after changing clamp_row_estimate()), or the selectivity
estimate being NaN?

David




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Dilip Kumar
On Fri, Oct 9, 2020 at 8:40 AM Amit Kapila  wrote:
>
> On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs  wrote:
> >
> > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:
> >
> > > > This script will wait 10 seconds after INSERT exits
> > > > before executing TRUNCATE, please wait for it to run.
> >
> > Has this been tested with anything other than the one test case?
> >
> > It would be good to know how the patch handles a transaction that
> > contains many aborted subtransactions that contain invals.
> >
>
> Are you thinking from the angle of performance or functionality? I
> don't see how this patch can impact either of those. Basically, it
> will not execute any more invalidations than it does without
> the patch for aborted subtransactions. Can you please explain your
> concern in a bit more detail?
>
> Having said that, I think it would be a good idea to test the scenario
> you mentioned to ensure that we have not broken anything unknowingly.

Yeah, even I feel that nothing should impact in this area because on
abort we are anyway executing all the invalidations and we will
continue to do so with the patch as well.  I will test this scenario
to ensure nothing is broken.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Amit Kapila
On Fri, Oct 9, 2020 at 10:06 AM Dilip Kumar  wrote:
>
> On Fri, Oct 9, 2020 at 8:40 AM Amit Kapila  wrote:
> >
> > On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs  wrote:
> > >
> > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:
> > >
> > > > > This script will wait 10 seconds after INSERT exits
> > > > > before executing TRUNCATE, please wait for it to run.
> > >
> > > Has this been tested with anything other than the one test case?
> > >
> > > It would be good to know how the patch handles a transaction that
> > > contains many aborted subtransactions that contain invals.
> > >
> >
> > Are you thinking from the angle of performance or functionality? I
> > don't see how this patch can impact either of those. Basically, it
> > will not execute any more invalidations than it does without
> > the patch for aborted subtransactions. Can you please explain your
> > concern in a bit more detail?
> >
> > Having said that, I think it would be a good idea to test the scenario
> > you mentioned to ensure that we have not broken anything unknowingly.
>
> Yeah, even I feel that nothing should impact in this area because on
> abort we are anyway executing all the invalidations and we will
> continue to do so with the patch as well.
>

True, but I think we execute the invalidations only for the streaming
case; otherwise, we neither need to execute invalidations nor do we
do so for the abort case.

>  I will test this scenario
> to ensure nothing is broken.
>

If I have not missed anything then probably you need to prepare a
scenario where we need to do streaming. I am fine with the testing of
the non-streaming case as well if you want to ensure that we have not
broken that case by starting to execute invalidations.


-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> >
> >
> > I am convinced by the reason given by Kyotaro-San in that another
> > thread [1] and performance data shown by Peter that this can't be an
> > independent improvement and rather in some cases it can do harm. Now,
> > if you need it for a parallel-copy path then we can change it
> > specifically to the parallel-copy code path but I don't understand
> > your reason completely.
> >
>
> Whenever we need data to be populated, we get a new data block and
> pass it to CopyGetData to populate it. In the case of file copy, the
> server will completely fill the data block; as long as data is
> available, a partial data block is never returned except at EOF or
> when no data is available. But in the case of STDIN copy, even though
> there is 8K of space available in the data block and 8K of data
> available on STDIN, CopyGetData will return as soon as the libpq
> buffer holds more data than minread. We pass a new data block every
> time to load data, so each time we pass an 8K data block, CopyGetData
> loads only a few bytes into the new data block and returns. I wanted
> to keep the same data population logic for both file copy and STDIN
> copy, i.e. fill full 8K data blocks and then the populated data can be
> required. There is an alternative solution: I can have some special
> handling in the case of STDIN wherein the existing data block can be
> passed along with the index from where the data should be copied.
> Thoughts?
>

What you are proposing as an alternative solution, isn't that what we
are doing without the patch? IIUC, you require this because of your
corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
right? If so, what is the difficulty in making it behave similarly to
the non-parallel case?

-- 
With Regards,
Amit Kapila.




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-08 Thread Dilip Kumar
On Fri, Oct 9, 2020 at 10:29 AM Amit Kapila  wrote:
>
> On Fri, Oct 9, 2020 at 10:06 AM Dilip Kumar  wrote:
> >
> > On Fri, Oct 9, 2020 at 8:40 AM Amit Kapila  wrote:
> > >
> > > On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs  wrote:
> > > >
> > > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:
> > > >
> > > > > > This script will wait 10 seconds after INSERT exits
> > > > > > before executing TRUNCATE, please wait for it to run.
> > > >
> > > > Has this been tested with anything other than the one test case?
> > > >
> > > > It would be good to know how the patch handles a transaction that
> > > > contains many aborted subtransactions that contain invals.
> > > >
> > >
> > > Are you thinking from the angle of performance or functionality? I
> > > don't see how this patch can impact either of those. Basically, it
> > > will not execute any more invalidations than it does without
> > > the patch for aborted subtransactions. Can you please explain your
> > > concern in a bit more detail?
> > >
> > > Having said that, I think it would be a good idea to test the scenario
> > > you mentioned to ensure that we have not broken anything unknowingly.
> >
> > Yeah, even I feel that nothing should impact in this area because on
> > abort we are anyway executing all the invalidations and we will
> > continue to do so with the patch as well.
> >
>
> True, but I think we execute the invalidations only for the streaming
> case; otherwise, we neither need to execute invalidations nor do we
> do so for the abort case.

Right

> >  I will test this scenario
> > to ensure nothing is broken.
> >
>
> If I have not missed anything then probably you need to prepare a
> scenario where we need to do streaming.

Yes, the test should be something like
BEGIN;
SAVEPOINT s1;
-- DDL
-- large operation
ROLLBACK TO SAVEPOINT s1;
-- DDL
-- large operation
ROLLBACK TO SAVEPOINT s1;
...

So ideally, every time it tries to stream the subtransaction, a
concurrent abort should be detected and it will execute all the
invalidations.

> I am fine with the testing of
> the non-streaming case as well if you want to ensure that we have not
> broken that case by starting to execute invalidations.

Okay

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 12:14 AM vignesh C  wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila  wrote:
> > > > + */
> > > > +typedef struct ParallelCopyLineBoundary
> > > >
> > > > Are we doing all this state management to avoid using locks while
> > > > processing lines?  If so, I think we can use either spinlock or LWLock
> > > > to keep the main patch simple and then provide a later patch to make
> > > > it lock-less.  This will allow us to first focus on the main design of
> > > > the patch rather than trying to make this datastructure processing
> > > > lock-less in the best possible way.
> > > >
> > >
> > > The steps will be more or less the same if we use a spinlock too. Steps 1, 
> > > 3 & 4 will be common; we just have to use lock & unlock instead of steps 2 
> > > & 5. I feel we can retain the current implementation.
> > >
> >
> > I'll study this in detail and let you know my opinion on the same but
> > in the meantime, I don't follow one part of this comment: "If they
> > don't follow this order the worker might process wrong line_size and
> > leader might populate the information which worker has not yet
> > processed or in the process of processing."
> >
> > Do you want to say that leader might overwrite some information which
> > worker hasn't read yet? If so, it is not clear from the comment.
> > Another minor point about this comment:
> >
>
> Here the leader and worker must follow these steps to avoid any corruption
> or hang issue. I changed it to:
>  * The leader & worker processes access the shared line information by following
>  * the steps below to avoid any data corruption or hang:
>

Actually, I wanted more along the lines of why such corruption or hang can
happen. It might help reviewers understand why you have followed
such a sequence.

> >
> > How did you ensure that this is fixed? Have you tested it, if so
> > please share the test? I see a basic problem with your fix.
> >
> > + /* Report WAL/buffer usage during parallel execution */
> > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > +   &walusage[ParallelWorkerNumber]);
> >
> > You need to call InstrStartParallelQuery() before the actual operation
> > starts, without that stats won't be accurate? Also, after calling
> > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > collected from workers which neither you have done nor is possible
> > with the current code in your patch because you haven't made any
> > provision to capture them in BeginParallelCopy.
> >
> > I suggest you look into lazy_parallel_vacuum_indexes() and
> > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > are accumulated. Also, please test this functionality using
> > pg_stat_statements.
> >
>
> Made changes accordingly.
> I have verified it using:
> postgres=# select * from pg_stat_statements where query like '%copy%';
> -[ RECORD 1 ]-------+------------------------------------------------------------------------------------------------------
> userid              | 10
> dbid                | 13743
> queryid             | -6947756673093447609
> query               | copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')
> plans               | 0
> total_plan_time     | 0
> min_plan_time       | 0
> max_plan_time       | 0
> mean_plan_time      | 0
> stddev_plan_time    | 0
> calls               | 1
> total_exec_time     | 265.195105
> min_exec_time       | 265.195105
> max_exec_time       | 265.195105
> mean_exec_time      | 265.195105
> stddev_exec_time    | 0
> rows                | 175000
> shared_blks_hit     | 1916
> shared_blks_read    | 0
> shared_blks_dirtied | 946
> shared_blks_written | 946
> local_blks_hit      | 0
> local_blks_read     | 0
> local_blks_dirtied  | 0
> local_blks_written  | 0
> temp_blks_read      | 0
> temp_blks_written   | 0
> blk_read_time       | 0
> blk_write_time      | 0
> wal_records         | 1116
> wal_fpi             | 0
> wal_bytes           | 3587203
> -[ RECORD 2 ]-------+------------------------------------------------------------------------------------------------------
> userid              | 10
> dbid                | 13743
> queryid             | 8570215596364326047
> query               | copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', paral
Re: Wrong example in the bloom documentation

2020-10-08 Thread Daniel Westermann (DWE)
Hi Bruce, Tom,

On Thu, Oct  8, 2020 at 03:43:32PM -0400, Tom Lane wrote:
>> "Daniel Westermann (DWE)"  writes:
>> >> I was hoping someone more experienced with this would comment, but
>> >> seeing none, I will apply it in a day or two to all supported versions?
>> >> Have you tested this output back to 9.5?
>> 
>> > I hoped that as well. No, I tested down to 9.6 because the change happened 
>> > in 10.
>> 
>> The patch assumes that parallel query is enabled, which is not true by
>> default before v10, so it should certainly not be applied before v10
>> (at least not without significant revisions).

Yes, the behavior change was in 10. Before 10 the example is fine; I would not 
apply that to any prior version, otherwise the whole example would need to be 
rewritten.

Regards
Daniel





Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-08 Thread Kyotaro Horiguchi
At Fri, 9 Oct 2020 02:33:37 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Masahiko Sawada 
> > What about temporary network failures? I think there are users who
> > don't want to give up resolving foreign transactions failed due to a
> > temporary network failure. Or even they might want to wait for
> > transaction completion until they send a cancel request. If we want to
> > call the commit routine only once and therefore want FDW to retry
> > connecting the foreign server within the call, it means we require all
> > FDW implementors to write a retry loop code that is interruptible and
> > ensures not to raise an error, which increases difficulty.
> >
> > Yes, but if we don’t retry to resolve foreign transactions at all on
> > an unreliable network environment, the user might end up requiring
> > every transaction to check the status of foreign transactions of the
> > previous distributed transaction before starts. If we allow to do
> > retry, I guess we ease that somewhat.
> 
> OK.  As I said, I'm not against trying to cope with temporary network 
> failure.  I just don't think it's mandatory.  If the network failure is 
> really temporary and thus recovers soon, then the resolver will be able to 
> commit the transaction soon, too.

I must be missing something, though...

I don't understand why we hate ERRORs from the fdw-2pc-commit routine so
much. I think remote commits should be performed before the local commit
passes the point of no return, and v26-0002 actually places
AtEOXact_FdwXact() before the critical section.

(FWIW, I think remote commits should be performed by backends, not by
another process, because backends should wait for all remote commits
to finish anyway and it is simpler. If we want to run multiple remote
commits in parallel, we could do that by adding some async-waiting interface.)

> Then, we can have a commit retry timeout or retry count like the following 
> WebLogic manual says.  (I couldn't quickly find the English manual, so below 
> is in Japanese.  I quoted some text that got through machine translation, 
> which appears a bit strange.)
> 
> https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
> --
> Abandon timeout
> Specifies the maximum time (in seconds) that the transaction manager attempts 
> to complete the second phase of a two-phase commit transaction.
> 
> In the second phase of a two-phase commit transaction, the transaction 
> manager attempts to complete the transaction until all resource managers 
> indicate that the transaction is complete. After the abort transaction timer 
> expires, no attempt is made to resolve the transaction. If the transaction 
> enters a ready state before it is destroyed, the transaction manager rolls 
> back the transaction and releases the held lock on behalf of the destroyed 
> transaction.
> --

That's not a retry timeout but a timeout on the total time of all
second-phase commits.  But I think it would be sufficient.  Even if an
FDW could retry its 2pc-commit, that is a matter for that FDW, and the
core has nothing to do with it.

> > Also, what if the user sets the statement timeout to 60 sec and they
> > want to cancel the waits after 5 sec by pressing ctl-C? You mentioned
> > that client libraries of other DBMSs don't have asynchronous execution
> > functionality. If the SQL execution function is not interruptible, the
> > user will end up waiting for 60 sec, which seems not good.

I think the fdw-2pc-commit routine can safely be interruptible as long as we
run the remote commits before entering the critical section of the local commit.
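
To illustrate, below is a minimal sketch of that idea -- not the patch's actual
code; FdwXactEntry and CommitForeignTransaction() are hypothetical stand-ins for
whatever API the real patch provides.  Since no critical section has been entered
yet, CHECK_FOR_INTERRUPTS() may throw and simply abort the local transaction too:

/*
 * Minimal sketch only: commit all remote transactions from the backend
 * itself, before the local commit reaches its point of no return.
 */
static void
CommitRemoteTransactions(List *fdwxacts)
{
    ListCell   *lc;

    foreach(lc, fdwxacts)
    {
        FdwXactEntry *fdwxact = (FdwXactEntry *) lfirst(lc);

        for (;;)
        {
            /* Still outside any critical section, so this may throw. */
            CHECK_FOR_INTERRUPTS();

            if (CommitForeignTransaction(fdwxact))
                break;          /* committed on the remote side */

            /*
             * Transient failure: sleep briefly and retry.  A real
             * implementation would rather wait on a latch.
             */
            pg_usleep(1000L * 1000L);
        }
    }
}

Only after this loop returns would the local commit proceed past its point
of no return.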

> FDW functions can be uninterruptible in general, aren't they?  We experienced 
> that odbc_fdw didn't allow cancellation of SQL execution.

At least postgres_fdw is interruptible while waiting for the remote.

create view lt as select 1 as slp from (select pg_sleep(10)) t;
create foreign table ft(slp int) server sv1 options (table_name 'lt');
select * from ft;
^CCancel request sent
ERROR:  canceling statement due to user request

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


RE: [PoC] Non-volatile WAL buffer

2020-10-08 Thread Deng, Gang
Hi Takashi,

There are some differences between our HW/SW configurations and test steps. I 
attached the postgresql.conf I used for your reference. I would like to try the 
postgresql.conf and steps you provided in the coming days to see if I can find 
the cause.

I also ran pgbench and the postgres server on the same machine but on different 
NUMA nodes, and ensured the server process and PMEM were on the same NUMA node. I 
used similar steps to yours from step 1 to 9, but with some differences in the 
later steps; the major ones are:

In step 10), I created a database and a table for the test:
#create database:
psql -c "create database insert_bench;"
#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text 
default 
'75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"

In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to 
warm up.
In step 16), I ran pgbench using the command: pgbench -M prepared -n -r -P 10 -f 
./test.sql -T 600 -c _ -j _ insert_bench. (test.sql can be found in the attachment.)

For HW/SW conf, the major differences are:
CPU: I used Xeon 8268 (24c@2.9Ghz, HT enabled)
OS Distro: CentOS 8.2.2004 
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1

Best regards
Gang

-Original Message-
From: Takashi Menjo  
Sent: Tuesday, October 6, 2020 4:49 PM
To: Deng, Gang 
Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' 
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Gang,

I have tried to, but cannot yet reproduce the performance degradation you reported 
when inserting 328-byte records. So I think our conditions must differ, such as the 
steps to reproduce, postgresql.conf, installation setup, and so on.

My results and conditions are as follows. May I have your conditions in more 
detail? Note that I refer to your "Storage over App Direct" as my "Original 
(PMEM)" and your "NVWAL patch" as my "Non-volatile WAL buffer."

Best regards,
Takashi


# Results
See the attached figure. In short, Non-volatile WAL buffer got better 
performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single machine but on two 
separate NUMA nodes. The PMEM and PCIe SSD for the server process are on the 
server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M 
dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo 
mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q 
-F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile 
WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl 
-l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change # characters of "filler" column of "pgbench_history" table to 300 
(ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- 
pg_ctl -l pg.log start)
15) Run pg_prewarm for all the four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r 
-M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c, j), then took the median "tps = 
__ (including connections establishing)" of the three runs as the throughput and 
the "latency average = __ ms" of that run as the average latency.

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by 
BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels 
per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 
sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)

Re: Parallel copy

2020-10-08 Thread Amit Kapila
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow  wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>

Your suggestion makes sense to me if the assumption about the string
length is correct. If we can't ensure that, then we probably need to
use a four-byte uint32 to store the length.
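
If that assumption cannot be guaranteed, something along the following lines
(just a sketch to illustrate the idea, not the patch's actual code) sidesteps
the endianness trap as well, by keeping the length in a four-byte uint32 and
copying the whole variable rather than a slice of it:

static void
SerializeString(char *destptr, const char *srcPtr, uint32 *copiedsize)
{
    uint32      len = srcPtr ? (uint32) strlen(srcPtr) + 1 : 0;

    /* source and destination fields have the same width: no byte-picking */
    memcpy(destptr, &len, sizeof(uint32));
    *copiedsize += sizeof(uint32);
    if (len > 0)
    {
        memcpy(destptr + sizeof(uint32), srcPtr, len);  /* bytes incl. NUL */
        *copiedsize += len;
    }
}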

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I am not sure if this is required for correctness. AFAIU, we
store/estimate multiple parameters in the same way in other places;
see EstimateParamListSpace and SerializeParamList. Do you have
something else in mind?
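
For reference, the pattern there looks roughly like the sketch below (simplified,
not code copied from those functions): the per-item sizes are simply summed, and
alignment happens only once for the whole chunk, inside shm_toc_estimate_chunk():

/* Simplified sketch of the EstimateParamListSpace()-style estimation. */
static Size
EstimateStringListSpace(List *strings)
{
    Size        sz = 0;
    ListCell   *lc;

    foreach(lc, strings)
    {
        const char *s = (const char *) lfirst(lc);

        sz = add_size(sz, sizeof(uint32));      /* length word */
        if (s != NULL)
            sz = add_size(sz, strlen(s) + 1);   /* string bytes incl. NUL */
    }

    return sz;      /* caller passes this to shm_toc_estimate_chunk() once */
}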

While looking at the latest code, I observed the below issue in patch
v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+ &rangeTableStr, &attnameListStr,
+ &notnullListStr, &nullListStr,
+ &convertListStr);

Here, do we need to separately estimate the size of
SerializedParallelCopyState when it is also done in
EstimateCstateSize?

-- 
With Regards,
Amit Kapila.




Re: range_agg

2020-10-08 Thread Pavel Stehule
On Thu, Sep 24, 2020 at 2:05 AM Paul A Jungwirth <
p...@illuminatedcomputing.com> wrote:

> On Sun, Aug 16, 2020 at 12:55 PM Paul A Jungwirth
>  wrote:
> > This is rebased on the current master, including some changes to doc
> > tables and pg_upgrade handling of type oids.
>
> Here is a rebased version of this patch, including a bunch of cleanup
> from Alvaro. (Thanks Alvaro!)
>

I tested this patch and it looks good; I have no objections.

1. there are no new warnings
2. make check-world passed
3. the docs build without problems
4. the documentation is sufficient, and so are the regression tests
5. there were no objections to this feature in the discussion, and I think
it is an interesting and useful feature - a good addition to arrays

Regards

Pavel




> Paul
>