Re: SELECT INTO deprecation

2020-12-03 Thread Thomas Kellerer
Stephen Frost wrote on 02.12.2020 at 18:58:
> We should either remove it, or remove the comments that it's deprecated,
> not try to make it more deprecated or try to somehow increase the
> recommendation to not use it.

(I am writing from a "user only" perspective, not a developer)

I don't see any warning about the syntax being "deprecated" in the current 
manual.

There is only a note that says that CTAS is "recommended" instead of SELECT INTO,
but for me that's something entirely different from "deprecating" it.

I personally have nothing against removing it, but I still see it used
a lot in questions on various online forums, and I would think that
a lot of people would be very unpleasantly surprised if a feature
gets removed without any warning (the current "recommendation" does not
constitute a deprecation or even removal warning for most people I guess)

I would vote for a clear deprecation message as suggested by Peter, but I would
add "and will be removed in a future version" to it.

Not sure if maybe even back-patching that warning would make sense as well, so
that also users of older versions get to see that warning.

Then target 15 or 16 as the release for removal, but not 14
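
For reference, a warning of the kind discussed here could presumably be
emitted where the parser converts SELECT INTO into CREATE TABLE AS; a
minimal sketch (the hook point and the wording are assumptions, not part
of any posted patch):

/* Hypothetical sketch: warn when SELECT ... INTO is rewritten to CTAS.
 * transformOptionalSelectInto() in src/backend/parser/analyze.c is assumed
 * to be the conversion point; the message text is illustrative only. */
if (stmt->intoClause != NULL)
    ereport(WARNING,
            (errcode(ERRCODE_WARNING_DEPRECATED_FEATURE),
             errmsg("SELECT INTO is deprecated and will be removed in a future version"),
             errhint("Use CREATE TABLE AS instead.")));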

Thomas




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-12-03 Thread Dilip Kumar
On Mon, Nov 30, 2020 at 10:49 AM Bharath Rupireddy
 wrote:
>
> Hi,
>
> Currently, required logic for multi inserts (such as buffer slots allocation, 
> flushing, tuple size calculation to decide when to flush, cleanup and so on) 
> is being handled outside of the existing tableam APIs. And there are a good 
> number of cases where multi inserts can be used, such as for existing COPY or 
> for CTAS, CREATE/REFRESH MATERIALIZED VIEW [proposed in this thread], and 
> INSERT INTO SELECTs [here] which are currently under discussion. Handling the 
> same multi inserts logic in many places is error prone and duplicates most of 
> the code. To avoid this, proposing here are generic tableam APIs, that can be 
> used in all the cases and which also gives the flexibility to tableam 
> developers in implementing multi inserts logic dependent on the underlying 
> storage engine[1].
>
> I would like to seek thoughts/opinions on the proposed new APIs. Once 
> reviewed, I will start implementing them.

IMHO, if we think that something is really specific to the tableam then
it makes sense to move it there.  But doing it just to avoid duplicating
the code might not be the best idea.  Instead, you can write some
common functions and we can call them from different places.  So if
something is very much common and will not vary based on the storage
type, we can keep it outside the tableam interface; however, we can
move it into some common functions to avoid duplication.
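
To illustrate the kind of shared helper meant here, a small buffering
layer that COPY, CTAS and REFRESH MATERIALIZED VIEW could all call, while
still going through the existing tableam table_multi_insert() underneath,
might look roughly like this (all names are illustrative, not from any
posted patch):

typedef struct MultiInsertBuffer
{
    Relation         rel;
    CommandId        cid;
    int              ti_options;
    BulkInsertState  bistate;
    int              nused;           /* slots currently buffered */
    Size             bufferedBytes;   /* approximate size of buffered tuples */
    TupleTableSlot  *slots[MAX_BUFFERED_SLOTS];  /* MAX_BUFFERED_SLOTS is illustrative */
} MultiInsertBuffer;

/* Flush whatever is buffered through the existing tableam API. */
static void
MultiInsertBufferFlush(MultiInsertBuffer *buf)
{
    if (buf->nused == 0)
        return;
    table_multi_insert(buf->rel, buf->slots, buf->nused,
                       buf->cid, buf->ti_options, buf->bistate);
    buf->nused = 0;
    buf->bufferedBytes = 0;
}

Whether the flush policy (when to call such a function) belongs in common
code like this or behind a tableam callback is exactly the question being
debated in this thread.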


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Kyotaro Horiguchi
At Thu, 3 Dec 2020 16:41:45 +0900, Amit Langote  wrote 
in 
> On Thu, Dec 3, 2020 at 2:29 PM Kyotaro Horiguchi
>  wrote:
> > At Thu, 3 Dec 2020 12:27:53 +0900, Amit Langote  
> > wrote in
> > > On Thu, Dec 3, 2020 at 10:15 AM Kyotaro Horiguchi
> > >  wrote:
> > > For the queries on the referencing side ("check" side),
> > > type/collation/attribute name determined using the above are going to
> > > be the same for all partitions in a given tree irrespective of the
> > > attribute number, because they're logically the same column.  On the
> >
> > Yes, I know that, which is what I meant by "practically" or
> > "actually", but it is not explicitly defined AFAICS.
> 
> Well, I think it's great that we don't have to worry *in this part of
> the code* about partition's fk_attnums not being congruent with the
> root parent's, because ensuring that is the responsibility of the
> other parts of the system such as DDL.  If we have any problems in
> this area, they should be dealt with by ensuring that there are no
> bugs in those other parts.

Agreed.

> > Thus that would no longer be an issue if we explicitly define that
> > "When conparentid stores a valid value, each element of fk_attnums
> > points to logically the same attribute as in the RI_ConstraintInfo for
> > the parent constraint."  Or I'd be happy if we had such a comment
> > there instead.
> 
> I saw a comment in Kuroda-san's v2 patch that is perhaps meant to
> address this point, but the placement needs to be reconsidered:

Ah, yes, that comes from my proposal.

> @@ -366,6 +368,14 @@ RI_FKey_check(TriggerData *trigdata)
> querysep = "WHERE";
> for (int i = 0; i < riinfo->nkeys; i++)
> {
> +
> +   /*
> +   * We share the same plan among all relations in a partition
> +   * hierarchy.  The plan is guaranteed to be compatible since all of
> +   * the member relations are guaranteed to have the equivalent set
> +   * of foreign keys in fk_attnums[].
> +   */
> +
> Oid pk_type = RIAttType(pk_rel, riinfo->pk_attnums[i]);
> Oid fk_type = RIAttType(fk_rel, riinfo->fk_attnums[i]);
> 
> A more appropriate place for this kind of comment would be where
> fk_attnums is defined or in ri_BuildQueryKey() that is shared by
> different RI query issuing functions.

Yeah, I wanted a more appropriate place for the comment.  That place
seems reasonable.
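
For illustration, if the comment (and the shared plan key) lived in
ri_BuildQueryKey(), it might look roughly like this; constraint_root_id is
the field from the patch under discussion (presumably equal to the
constraint's own OID when it has no parent), and the wording is only a
suggestion:

/*
 * Build the query key used to look up a cached SPI plan.  We share the
 * same plan among all relations in a partition hierarchy: keying on the
 * root constraint's OID is safe because every member constraint is
 * guaranteed to have an fk_attnums[] that maps to the equivalent columns
 * of the root parent.
 */
static void
ri_BuildQueryKey(RI_QueryKey *key, const RI_ConstraintInfo *riinfo,
                 int32 constr_queryno)
{
    key->constr_id = riinfo->constraint_root_id;
    key->constr_queryno = constr_queryno;
}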

> > > referenced side ("restrict", "cascade", "set" side), as you already
> > > mentioned, fk_attnums refers to the top parent table of the
> > > referencing side, so no possibility of them being different in the
> > > various referenced partitions' RI_ConstraintInfos.
> >
> > Right. (I'm not sure I have mentioned that here, though :p)
> 
> Maybe I misread but I think you did in your email dated Dec 1 where you said:
> 
> "After an off-list discussion, we confirmed that even in that case the
> patch works as is because fk_attnum (or contuple.conkey) always stores
> key attnums compatible to the topmost parent when conparent has a
> valid value (assuming the current usage of fk_attnum), but I still
> feel uneasy to rely on that unclear behavior."

fk_attnums *doesn't* refer to the top parent table of the referencing
side.  It refers to attributes of the partition that are compatible with
the same element of fk_attnums of the topmost parent.  Maybe I'm
misreading.


> > > On the topic of how we'd be able to share even the RI_ConstraintInfos
> > > among partitions, that would indeed look a bit more elaborate than the
> > > patch we have right now.
> >
> > Maybe just letting the hash entry for the child riinfo point to the
> > parent riinfo if all members (other than constraint_id, of course)
> > share exactly the same values.  No need to count references since
> > we aren't going to remove riinfos.
> 
> Ah, something maybe worth trying.  Although the memory we'd save by
> sharing the RI_ConstraintInfos would not add that much to the savings
> we're having by sharing the plan, because it's the plans that are a
> memory hog AFAIK.

I agree that plans are rather large, but the sharable part of the
RI_ConstraintInfos is 536 bytes; I'm not sure it is small enough
compared to the plans.  But that has a somewhat large footprint... (See
the attached)

> > > > About your patch, it calculates the root constrid at the time an
> > > > riinfo is created, but when the root-partition is further attached to
> > > > another partitioned-table after the riinfo creation,
> > > > constraint_root_id gets stale.  Of course that doesn't matter
> > > > practically, though.
> > >
> > > Maybe we could also store the hash value of the root constraint OID as
> > > rootHashValue and check for that one too in
> > > InvalidateConstraintCacheCallBack().  That would take care of this
> > > unless I'm missing something.
> >
> > Seems to be sound.
> 
> Okay, thanks.
> 
> I have attached a patch in which I've tried to merge the ideas from
> both my patch and

Re: pg_stat_statements oddity with track = all

2020-12-03 Thread legrand legrand
Hi Julien,

> The extra field I've proposed would increase the number of records, as it
> needs to be a part of the key.

To get an increase in the number of records, the same statement would
have to appear at top level AND at nested level. This seems a corner case
with a very low (negligible) occurrence rate. Did I miss something?
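
For context, the key change Julien is describing would roughly be the
following (the new field name is illustrative; the existing fields are
those of pg_stat_statements' hash key):

typedef struct pgssHashKey
{
    Oid         userid;      /* user OID */
    Oid         dbid;        /* database OID */
    uint64      queryid;     /* query identifier */
    bool        toplevel;    /* executed at top level? (the proposed addition) */
} pgssHashKey;

An entry gets duplicated only when the exact same queryid is executed both
directly and from inside another statement, which is the corner case whose
frequency is being debated here.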

Regards
PAscal 



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Kyotaro Horiguchi
At Thu, 03 Dec 2020 17:13:16 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
me> I agree that plans are rather large, but the sharable part of the
me> RI_ConstraintInfos is 536 bytes; I'm not sure it is small enough
me> compared to the plans.  But that has a somewhat large footprint... (See
me> the attached)

0001 contains a bug about query_key, and get_ri_constraint_root (from
your patch) is not needed there; but the core part is 0002, so please
ignore them.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_statements oddity with track = all

2020-12-03 Thread Sergei Kornilov
Hello

> To get an increase in the number of records, the same statement would
> have to appear at top level AND at nested level. This seems a corner case
> with a very low (negligible) occurrence rate.

+1
I think splitting fields into plans_toplevel / plans_nested will be less
convenient, and more code with a higher chance of copy-paste errors.

regards, Sergei




Re: pg_stat_statements oddity with track = all

2020-12-03 Thread Julien Rouhaud
On Wed, Dec 02, 2020 at 05:13:56PM +0300, Sergei Kornilov wrote:
> Hello
> 
> > - add a parent_statement_id column that would be NULL for top level queries
> 
> Will generate too many entries... Every FK for each different delete/insert, 
> for example.
> But very useful for databases with a lot of stored procedures to find where 
> this query is called. Maybe a new mode track = tree? Use NULL to indicate a 
> top-level query (same as with track=tree) and some constant for any nested 
> queries when track = all.

Maybe pg_stat_statements isn't the best tool for that use case.  For the record,
the profiler in plpgsql_check can now track the queryid for each statement inside
a function, so you can match pg_stat_statements entries.  That's clearly not
perfect, as dynamic queries could generate different queryids, but that's a
start.

> Also, currently a top statement will account buffers usage for underlying 
> statements?

I think so.




Re: pg_stat_statements oddity with track = all

2020-12-03 Thread Julien Rouhaud
On Thu, Dec 03, 2020 at 11:40:22AM +0300, Sergei Kornilov wrote:
> Hello
> 
> > To get an increase in the number of records, the same statement would
> > have to appear at top level AND at nested level. This seems a corner case
> > with a very low (negligible) occurrence rate.
> 
> +1
> I think splitting fields into plans_toplevel / plans_nested will be less
> convenient, and more code with a higher chance of copy-paste errors.

As I mentioned in a previous message, I really have no idea if that would be a
corner case or not.  For instance, with native partitioning, the odds of having
many different queries executed both at top level and as nested statements may
be quite a bit higher.




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-12-03 Thread Bharath Rupireddy
On Thu, Dec 3, 2020 at 1:38 PM Dilip Kumar  wrote:
>
> On Mon, Nov 30, 2020 at 10:49 AM Bharath Rupireddy
>  wrote:
> >
> > Currently, required logic for multi inserts (such as buffer slots 
> > allocation, flushing, tuple size calculation to decide when to flush, 
> > cleanup and so on) is being handled outside of the existing tableam APIs. 
> > And there are a good number of cases where multi inserts can be used, such 
> > as for existing COPY or for CTAS, CREATE/REFRESH MATERIALIZED VIEW 
> > [proposed in this thread], and INSERT INTO SELECTs [here] which are 
> > currently under discussion. Handling the same multi inserts logic in many 
> > places is error prone and duplicates most of the code. To avoid this, 
> > proposing here are generic tableam APIs, that can be used in all the cases 
> > and which also gives the flexibility to tableam developers in implementing 
> > multi inserts logic dependent on the underlying storage engine[1].
> >
> > I would like to seek thoughts/opinions on the proposed new APIs. Once 
> > reviewed, I will start implementing them.
>
> IMHO, if we think that something is really specific to the tableam then
> it makes sense to move it there.  But doing it just to avoid duplicating
> the code might not be the best idea.  Instead, you can write some
> common functions and we can call them from different places.  So if
> something is very much common and will not vary based on the storage
> type, we can keep it outside the tableam interface; however, we can
> move it into some common functions to avoid duplication.
>

Thanks for the response. The main design goal of the new APIs is to give
tableam developers the flexibility to implement multi insert logic
dependent on the underlying storage engine. Currently, for all the
underlying storage engines, we follow the same multi insert logic (such
as when and how to flush the buffered tuples, and how to calculate tuple
sizes), and this logic doesn't take the underlying storage engine's
capabilities into account. Please have a look at [1], where this point
was brought up by @Luc Vlaming. The subsequent discussion reached some
level of agreement on the proposed APIs.

I want to clarify that avoiding duplicate multi insert code (for COPY,
CTAS, CREATE/REFRESH MAT VIEW and INSERT SELECTs) is a byproduct (not a
main design goal) if we implement the new APIs for the heap AM. I'm
sorry for presenting avoiding duplicate code as the goal earlier.

I also want to mention that @Andres Freund envisioned similar kinds of
APIs in [2].

I have tried to keep the API as generic as possible; please have a look at
the new structure and APIs [3].

Thoughts?

[1] - 
https://www.postgresql.org/message-id/ca3dd08f-4ce0-01df-ba30-e9981bb0d54e%40swarm64.com
[2] - 
https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/CALj2ACV8_O651C2zUqrVSRFDJkp8%3DTMwSdG9%2BmDGL%2BvF6CD%2BAQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Improper use about DatumGetInt32

2020-12-03 Thread Peter Eisentraut

On 2020-11-30 16:32, Alvaro Herrera wrote:

On 2020-Nov-30, Peter Eisentraut wrote:


Patch updated this way.  I agree it's better that way.


Thanks, LGTM.


For a change like this, do we need to change the C symbol names, so that 
there is no misbehavior if the shared library is not updated at the same 
time as the extension is upgraded in SQL?
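
One way to handle such skew (an illustration, with hypothetical names) is
to introduce a new C symbol for the changed behavior and keep the old one
as a thin forwarding wrapper: the upgraded SQL definition references the
new symbol, so it fails cleanly against an old .so, while a
not-yet-upgraded SQL definition still resolves against the new .so.

PG_FUNCTION_INFO_V1(myfunc_v2);

Datum
myfunc_v2(PG_FUNCTION_ARGS)
{
    int64   arg = PG_GETARG_INT64(0);   /* argument now fetched with the correct macro */

    PG_RETURN_INT64(arg + 1);
}

/* Old symbol kept only to tolerate .so/SQL version skew; simply forwards. */
PG_FUNCTION_INFO_V1(myfunc);

Datum
myfunc(PG_FUNCTION_ARGS)
{
    return myfunc_v2(fcinfo);
}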





Single transaction in the tablesync worker?

2020-12-03 Thread Amit Kapila
The tablesync worker in logical replication performs the table data
sync in a single transaction which means it will copy the initial data
and then catch up with apply worker in the same transaction. There is
a comment in LogicalRepSyncTableStart ("We want to do the table data
sync in a single transaction.") saying so but I can't find the
concrete theory behind the same. Is there any fundamental problem if
we commit the transaction after initial copy and slot creation in
LogicalRepSyncTableStart and then allow the apply of transactions as
it happens in apply worker? I have tried doing so in the attached (a
quick prototype to test) and didn't find any problems with regression
tests. I have tried a few manual tests as well to see if it works and
didn't find any problem. Now, it is quite possible that it is
mandatory to do the way we are doing currently, or maybe something
else is required to remove this requirement but I think we can do
better with respect to comments in this area.
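
In other words, the flow being proposed for LogicalRepSyncTableStart()
would roughly be the following (pseudocode paraphrasing the description
above, not the attached prototype):

/*
 * 1. Create the replication slot and take its initial snapshot.
 * 2. Copy the existing table data under that snapshot.
 * 3. Commit, instead of keeping the transaction open.   <-- the change
 * 4. Apply subsequent changes transaction by transaction, as the apply
 *    worker does, until the sync point (the LSN the apply worker has
 *    already reached) is passed.
 */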

The reason why I am looking into this area is to support the logical
decoding of prepared transactions. See the problem [1] reported by
Peter Smith. Basically, when we stream prepared transactions in the
tablesync worker, it will simply commit the same due to the
requirement of maintaining a single transaction for the entire
duration of copy and streaming of transactions. Now, we can fix that
problem by disabling the decoding of prepared xacts in tablesync
worker. But that gives rise to a different kind of problem: the
prepare will not be sent by the publisher, but a later commit might
move the LSN forward far enough to let the tablesync worker catch up to
the apply worker. So the prepared transaction will then be skipped by
both the tablesync and apply workers.

I think apart from unblocking the development of 'logical decoding of
prepared xacts', it will make the code consistent between apply and
tablesync worker and reduce the chances of future bugs in this area.
Basically, it will reduce the checks related to am_tablesync_worker()
at various places in the code.

I see that this code is added as part of commit
7c4f52409a8c7d85ed169bbbc1f6092274d03920 (Logical replication support
for initial data copy).

Thoughts?

[1] - 
https://www.postgresql.org/message-id/cahut+puemk4so8ogzxc_ftzpkga8uc-y5qi-krqhsy_p0i3...@mail.gmail.com

-- 
With Regards,
Amit Kapila.


v1-0001-Allow-more-than-one-transaction-in-tablesync-work.patch
Description: Binary data


REINDEX backend filtering

2020-12-03 Thread Julien Rouhaud
Hello,

Now that we have the infrastructure to track indexes that might be corrupted
due to changes in collation libraries, I think it would be a good idea to offer
an easy way for users to reindex all indexes that might be corrupted.

I'm attaching a POC patch as a discussion basis.  It implements a new
"COLLATION" option to reindex, with "not_current" being the only accepted
value.  Note that I didn't spend too much effort on the grammar part yet.

So for instance you can do:

REINDEX (COLLATION 'not_current') DATABASE mydb;

The filter is also implemented so that you can combine multiple filters, so
it would be easy to add more filtering, for instance:

REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb;

to only rebuild indexes depending on outdated libc collations, or

REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb;

to only rebuild indexes depending on a specific version of libc.
From 5acf42e15c0dc8b185547ff9cb9371a86a057ec9 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Thu, 3 Dec 2020 15:54:42 +0800
Subject: [PATCH v1] Add a new COLLATION option to REINDEX.

---
 doc/src/sgml/ref/reindex.sgml  | 13 +
 src/backend/catalog/index.c| 59 +-
 src/backend/commands/indexcmds.c   | 12 +++--
 src/backend/utils/cache/relcache.c | 43 
 src/include/catalog/index.h|  6 ++-
 src/include/utils/relcache.h   |  1 +
 src/test/regress/expected/create_index.out | 10 
 src/test/regress/sql/create_index.sql  | 10 
 8 files changed, 149 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6e1cf06713..eb8da9c070 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -25,6 +25,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 where option can be one 
of:
 
+COLLATION [ text ]
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
 
@@ -168,6 +169,18 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+COLLATION
+
+ 
+  This option can be used to filter the list of indexes to rebuild.  The
+  only allowed value is 'not_current', which will only
+  process indexes that depend on a collation version different than the
+  current one.
+ 
+
+   
+

 CONCURRENTLY
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 731610c701..7d941f40af 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -99,6 +99,12 @@ typedef struct
Oid pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+typedef struct
+{
+   Oid relid;  /* target index oid */
+   bool deprecated;/* depends on at least one deprecated collation? 
*/
+} IndexHasDeprecatedColl;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -1349,6 +1355,57 @@ index_check_collation_versions(Oid relid)
list_free(context.warned_colls);
 }
 
+/*
+ * Detect if an index depends on at least one deprecated collation.
+ * This is a callback for visitDependenciesOf().
+ */
+static bool
+do_check_index_has_deprecated_collation(const ObjectAddress *otherObject,
+   
const char *version,
+   
char **new_version,
+   
void *data)
+{
+   IndexHasDeprecatedColl *context = data;
+   char *current_version;
+
+   /* We only care about dependencies on collations. */
+   if (otherObject->classId != CollationRelationId)
+   return false;
+
+   /* Fast exit if we already found a deprecated collation version. */
+   if (context->deprecated)
+   return false;
+
+   /* Ask the provider for the current version.  Give up if unsupported. */
+   current_version = get_collation_version_for_oid(otherObject->objectId);
+   if (!current_version)
+   return false;
+
+   if (!version || strcmp(version, current_version) != 0)
+   context->deprecated = true;
+
+   return false;
+}
+
+bool
+index_has_deprecated_collation(Oid relid)
+{
+   ObjectAddress object;
+   IndexHasDeprecatedColl context;
+
+   object.classId = RelationRelationId;
+   object.objectId = relid;
+   object.objectSubId = 0;
+
+   context.relid = relid;
+   context.deprecated = false;
+
+   visitDependenciesOf(&object, &do_check_index_has_deprecated_collation,
+   &context);
+
+   return context.deprecated;
+}
+
 /*
  * Update the version for collations.  A callback for visitDependenciesOf().
  */
@@ -3886,7 +3943,7 @@ reindex_relation(Oid re

Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Krunal Bauskar
Any updates or further inputs on this.

On Wed, 2 Dec 2020 at 09:27, Krunal Bauskar  wrote:

>
>
> On Tue, 1 Dec 2020 at 22:19, Tom Lane  wrote:
>
>> Alexander Korotkov  writes:
>> > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar 
>> wrote:
>> >> I would request you guys to re-think it from this perspective to help
>> ensure that PGSQL can scale well on ARM.
>> >> s_lock becomes a top-most function and LSE is not a universal solution
>> but CAS surely helps ease the main bottleneck.
>>
>> > CAS patch isn't proven to be a universal solution as well.  We have
>> > tested the patch on just a few processors, and Tom has seen the
>> > regression [1].  The benchmark used by Tom was artificial, but the
>> > results may be relevant for some real-life workload.
>>
>> Yeah.  I think that the main conclusion from what we've seen here is
>> that on smaller machines like M1, a standard pgbench benchmark just
>> isn't capable of driving PG into serious spinlock contention.  (That
>> reflects very well on the work various people have done over the years
>> to get rid of spinlock contention, because ten or so years ago it was
>> a huge problem on this size of machine.  But evidently, not any more.)
>> Per the results others have posted, nowadays you need dozens of cores
>> and hundreds of client threads to measure any such issue with pgbench.
>>
>> So that is why I experimented with a special test that does nothing
>> except pound on one spinlock.  Sure it's artificial, but if you want
>> to see the effects of different spinlock implementations then it's
>> just too hard to get any results with pgbench's regular scripts.
>>
>> And that's why it disturbs me that the CAS-spinlock patch showed up
>> worse in that environment.  The fact that it's not visible in the
>> regular pgbench test just means that the effect is too small to
>> measure in that test.  But in a test where we *can* measure an effect,
>> it's not looking good.
>>
>> It would be interesting to see some results from the same test I did
>> on other processors.  I suspect the results would look a lot different
>> from mine ... but we won't know unless someone does it.  Or, if someone
>> wants to propose some other test case, let's have a look.
>>
>> > I'm expressing just my personal opinion, other committers can have
>> > different opinions.  I don't particularly think this topic is
>> > necessarily a non-starter.  But I do think that given ambiguity we've
>> > observed in the benchmark, much more research is needed to push this
>> > topic forward.
>>
>> Yeah.  I'm not here to say "do nothing".  But I think we need results
>> from more machines and more test cases to convince ourselves whether
>> there's a consistent, worthwhile win from any specific patch.
>>
>
> I think there is *an ambiguity with LSE, and that has been the source of
> some confusion*, so let's make another attempt to understand all the
> observations and then define the next steps.
>
> -
>
>
> *1. CAS patch (applied on the baseline)*
>    - Kunpeng: 10-45% improvement observed [1]
>    - Graviton2: 30-50% improvement observed [2]
>    - M1: only select results are available; CAS continues to maintain a
>      marginal gain, but it is not significant [3]
>      [in line with what we observed with Kunpeng and Graviton2 for select
>      results too].
>
>
> *2. Let's ignore CAS for a sec and just think of LSE independently*
>    - Kunpeng: regression observed
>    - Graviton2: gain observed
>    - M1: regression observed
>      [while LSE probably is the default, explicitly enabling it with +lse
>      causes regression on the head itself [4].
>      client=2/4: 1816/714 vs 892/610]
>
>    There is enough reason not to immediately consider enabling LSE given
>    it is unable to perform consistently on all hardware.
> -
>
> With those 2 aspects clear let's evaluate what options we have in hand
>
>
> *1. Enable CAS approach*
>    *- What we gain:* pgsql scales on Kunpeng/Graviton2
>      (M1 awaiting read-write results but may marginally scale [[5]: "but
>      the patched numbers are only about a few percent better"])
>    *- What we lose:* nothing for now.
>
> *2. LSE:*
>    *- What we gain:* scaled workload with Graviton2
>    *- What we lose:* regression on M1 and Kunpeng.
>
> Let's think of both approaches independently.
>
> - Enabling CAS would help us scale on all hardware (Kunpeng/Graviton2/M1)
> - Enabling LSE would help us scale only on some but regress on others.
>   [LSE could be considered in the future once it stabilizes and all
> hardware adapts to it]
>
> ---
>
> *Let me know what do you think about this analysis and any specific
> direction that we should consider to help move forward.*
>
> ---
>
> Links:
> [1]:
> https://www.postgresql.org/message-id/attach
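
For readers following along, the difference being benchmarked comes down
to how the spinlock TAS primitive is implemented.  An illustrative sketch
on AArch64 using GCC builtins (not the posted patch):

/* Existing style: an unconditional atomic swap; it writes the cache line
 * even when the lock is already held. */
static inline int
tas_swap(volatile int *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

/* CAS style: the store is only attempted when the lock looks free, which
 * reduces cache-line ping-pong under heavy contention. */
static inline int
tas_cas(volatile int *lock)
{
    int expected = 0;

    return !__atomic_compare_exchange_n(lock, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

Both return zero when the lock was acquired, matching the TAS() convention
in s_lock.h.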

Remove unnecessary grammar symbols

2020-12-03 Thread Peter Eisentraut
While doing the proverbial other things, I noticed that the grammar 
symbols publication_name_list and publication_name_item are pretty 
useless.  We already use name_list/name to refer to publications in most 
places, so getting rid of these makes things more consistent.


These appear to have been introduced by the original logical replication 
patch, so there probably wasn't that much scrutiny on this detail then.
From 311f3c7e40b47ace182f1ad5e39a6a12ae80d23c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 3 Dec 2020 10:40:57 +0100
Subject: [PATCH] Remove unnecessary grammar symbols

Instead of publication_name_list, we can use name_list.  We already
refer to publications everywhere else by the 'name' or 'name_list'
symbols, so this only improves consistency.
---
 src/backend/parser/gram.y | 20 ++--
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 61f0236041..5d343f3e0f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -414,7 +414,6 @@ static Node *makeRecursiveViewSelect(char *relname, List 
*aliases, Node *query);
relation_expr_list dostmt_opt_list
transform_element_list transform_type_list
TriggerTransitions TriggerReferencing
-   publication_name_list
vacuum_relation_list opt_vacuum_relation_list
drop_option_list
 
@@ -422,7 +421,6 @@ static Node *makeRecursiveViewSelect(char *relname, List 
*aliases, Node *query);
 %typegroup_by_item empty_grouping_set rollup_clause cube_clause
 %typegrouping_sets_clause
 %typeopt_publication_for_tables publication_for_tables
-%type   publication_name_item
 
 %typeopt_fdw_options fdw_options
 %type  fdw_option
@@ -9512,7 +9510,7 @@ AlterPublicationStmt:
  */
 
 CreateSubscriptionStmt:
-   CREATE SUBSCRIPTION name CONNECTION Sconst PUBLICATION 
publication_name_list opt_definition
+   CREATE SUBSCRIPTION name CONNECTION Sconst PUBLICATION 
name_list opt_definition
{
CreateSubscriptionStmt *n =

makeNode(CreateSubscriptionStmt);
@@ -9524,20 +9522,6 @@ CreateSubscriptionStmt:
}
;
 
-publication_name_list:
-   publication_name_item
-   {
-   $$ = list_make1($1);
-   }
-   | publication_name_list ',' publication_name_item
-   {
-   $$ = lappend($1, $3);
-   }
-   ;
-
-publication_name_item:
-   ColLabel{ $$ = makeString($1); 
};
-
 /*
  *
  * ALTER SUBSCRIPTION name ...
@@ -9572,7 +9556,7 @@ AlterSubscriptionStmt:
n->options = $6;
$$ = (Node *)n;
}
-   | ALTER SUBSCRIPTION name SET PUBLICATION 
publication_name_list opt_definition
+   | ALTER SUBSCRIPTION name SET PUBLICATION name_list 
opt_definition
{
AlterSubscriptionStmt *n =
makeNode(AlterSubscriptionStmt);
-- 
2.29.2



Re: Commitfest 2020-11 is closed

2020-12-03 Thread Anastasia Lubennikova

On 02.12.2020 23:59, Tom Lane wrote:

Anastasia Lubennikova  writes:

Commitfest 2020-11 is officially closed now.
Many thanks to everyone who participated by posting patches, reviewing
them, committing and sharing ideas in discussions!

Thanks for all the hard work!


Today, me and Georgios will move the remaining items to the next CF or
return them with feedback. We're planning to leave Ready For Committer
till the end of the week, to make them more visible and let them get the
attention they deserve.

This is actually a bit problematic, because now the cfbot is ignoring
those patches (or if it's not, I don't know where it's displaying the
results).  Please go ahead and move the remaining open patches, or
else re-open the CF if that's possible.

regards, tom lane


Oh, I wasn't aware of that. Thank you for the reminder.

Now all patches are moved to the next CF.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Remove incorrect assertion in reorderbuffer.c.

2020-12-03 Thread Amit Kapila
We start recording changes in ReorderBufferTXN even before we reach the
SNAPBUILD_CONSISTENT state, so that if the commit is encountered after
reaching that state we are able to send the changes of the entire
transaction. Now, while recording changes, if the reorder buffer memory
has exceeded logical_decoding_work_mem then we can start streaming,
provided it is allowed and we haven't yet streamed that data. However,
we must not allow streaming to start unless the snapshot has reached the
SNAPBUILD_CONSISTENT state.
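
Schematically, the condition being described is something like the
following (the composition is illustrative; the snapshot-builder check is
the part being added):

/* Only stream an oversized transaction once a consistent snapshot exists;
 * changes recorded before SNAPBUILD_CONSISTENT may be incomplete.
 * "builder" is the SnapBuild from the logical decoding context, and
 * streaming_allowed stands in for the existing streaming checks. */
if (rb->size >= logical_decoding_work_mem * 1024L &&
    streaming_allowed &&
    SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT)
    ReorderBufferStreamTXN(rb, txn);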

I have also improved the comments atop ReorderBufferResetTXN to
mention the case when we need to continue streaming after getting an
error.

Attached patch for the above changes.

Thoughts?


v1-0001-Remove-incorrect-assertion-in-reorderbuffer.c.patch
Description: Binary data


Re: autovac issue with large number of tables

2020-12-03 Thread Fujii Masao



On 2020/12/03 11:46, Kasahara Tatsuhito wrote:

On Wed, Dec 2, 2020 at 7:11 PM Masahiko Sawada  wrote:


On Wed, Dec 2, 2020 at 3:33 PM Fujii Masao  wrote:




On 2020/12/02 12:53, Masahiko Sawada wrote:

On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada  wrote:


On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao  wrote:




On 2020/12/01 16:23, Masahiko Sawada wrote:

On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito
 wrote:


Hi,

On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao  wrote:




On 2020/11/30 10:43, Masahiko Sawada wrote:

On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito
 wrote:


Hi, Thanks for you comments.

On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao  wrote:




On 2020/11/27 18:38, Kasahara Tatsuhito wrote:

Hi,

On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao  wrote:




On 2020/11/26 10:41, Kasahara Tatsuhito wrote:

On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  wrote:


On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  wrote:


On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
 wrote:

I wonder if we could have table_recheck_autovac do two probes of the stats
data.  First probe the existing stats data, and if it shows the table to
be already vacuumed, return immediately.  If not, *then* force a stats
re-read, and check a second time.

Does the above mean that the second and subsequent table_recheck_autovac()
will be improved to first check using the previous refreshed statistics?
I think that certainly works.

If that's correct, I'll try to create a patch for the PoC


I still don't know how to reproduce Jim's troubles, but I was able to reproduce
what was probably a very similar problem.

This problem seems to be more likely to occur in cases where you have
a large number of tables,
i.e., a large amount of stats, and many small tables need VACUUM at
the same time.

So I followed Tom's advice and created a patch for the PoC.
This patch will enable a flag in the table_recheck_autovac function to use
the existing stats next time if VACUUM (or ANALYZE) has already been done
by another worker on the check after the stats have been updated.
If the tables continue to require VACUUM after the refresh, then a refresh
will be required instead of using the existing statistics.
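
In pseudocode, the two-probe idea looks roughly like this (the helper
names are illustrative, not those of the PoC patch):

/* First probe: the stats snapshot we already have. */
if (!relation_needs_vacanalyze_cached(relid))
    return;                 /* another worker already took care of it */

/* Still looks pending: force a stats re-read and check once more. */
if (relation_needs_vacanalyze_fresh(relid))
    vacuum_the_table(relid);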

I did simple test with HEAD and HEAD + this PoC patch.
The tests were conducted in two cases.
(I changed few configurations. see attached scripts)

1. Normal VACUUM case
- SET autovacuum = off
- CREATE tables with 100 rows
- DELETE 90 rows for each tables
- SET autovacuum = on and restart PostgreSQL
- Measure the time it takes for all tables to be VACUUMed

2. Anti-wraparound VACUUM case
- CREATE blank tables
- SELECT from all of these tables (to generate stats)
- SET autovacuum_freeze_max_age to low values and restart PostgreSQL
- Consume a lot of XIDs by using txid_current()
- Measure the time it takes for all tables to be VACUUMed

For each test case, the following results were obtained by changing
autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10.
Also changing the number of tables to 1000, 5000, 10000 and 20000.

Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
but I think it's enough to ask for a trend.

===
[1.Normal VACUUM case]
   tables:1000
autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec

   tables:5000
autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec

   tables:10000
autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec

   tables:20000
autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
autov

Re: SELECT INTO deprecation

2020-12-03 Thread Peter Eisentraut

On 2020-12-02 18:58, Stephen Frost wrote:

I also found some gratuitous uses of SELECT INTO in various tests and
documentation (not ecpg or plpgsql of course).  Here is a patch to adjust
those to CREATE TABLE AS.

If we aren't actually removing SELECT INTO then I don't know that it
makes sense to just stop testing it.


The point here was, there is still code that actually tests SELECT INTO 
specifically.  But unrelated test code that just wants to set up a quick 
table with some rows in it ought to use the preferred syntax for doing so.





Re: SELECT INTO deprecation

2020-12-03 Thread Peter Eisentraut

On 2020-12-03 00:54, Michael Paquier wrote:

I got to wonder about the impact when migrating applications
though.  SELECT INTO has a different meaning in Oracle, but SQL server
creates a new table like Postgres.


Interesting.  This appears to be the case.  SQL Server uses SELECT INTO 
to create a table, and does not appear to have CREATE TABLE AS.


So maybe we should keep it, but adjust the documentation to point out 
this use case.


[some snarky comment about AWS Babelfish here ... ;-) ]




Re: Add session statistics to pg_stat_database

2020-12-03 Thread Laurenz Albe
On Tue, 2020-12-01 at 17:32 +0100, Magnus Hagander wrote:
> > I have changed "connections" to "sessions" and renamed the new
> > column "connections" to "session_count".
> > 
> > I think that most people will understand a session as started after a 
> > successful
> > connection.
> 
> Yeah, I agree, and as long as it's consistent we don't need more explanations 
> than that.
> 
> Further in the views, it's a bit strange to have session_count and 
> aborted_session, but I'm not
>  sure what to suggest. "aborted_session_count" seems too long. Maybe just 
> "sessions" instead
>  of "session_count" -- no other counters actually have the "_count" suffix.

"sessions" is fine, I think; I changed the name.

> > > I wonder if there would also be a way to count "sessions that crashed" as 
> > > well.
> > > That is,the ones that failed in a way that caused the postmaster to 
> > > restart the system.
> >
> > Sure, a crash count would be useful.  I don't know if it is easy for the 
> > stats collector
> > to tell the difference between a start after a backend crash and - say - 
> > starting from
> > a base backup.
> > 
> > I think that that would be material for another patch, and I don't think it 
> > should go
> > to "pg_stat_database", because a) it might be hard to tell to which 
> > database the crashed
> > backend was attached, b) it might be a background process that doesn't 
> > belong to a database
> > and c) if the crash were caused by - say - corruption in a shared catalog, 
> > it would be
> > misleading
> 
> I'm not sure it is outside the scope of this patch, because I think it might 
> be easier to
>  do than I (and I think you) first thought. We don't need to track which 
> database crashed --
>  if we track all *other* ways a database exits, then crashes are all that 
> remains.
> 
> So in fact, we *almost* have all the data we need already. We have the number 
> of sessions
>  started. We have the number of sessions "aborted". if we also had the number 
> of sessions
>  that were closed normally, then whatever is "left" would be the number of 
> sessions crashed.
>  And we do already, in your patch, send the message in the case of both 
> aborted and
>  non-aborted sessions. So we just need to keep track of both in the statsfile
>  (which we don't now), and we'd more or less have it, wouldn't we?

There is one problem with that: the statistics collector is not guaranteed to 
get all
messages, right?  If a disconnection statistics UDP datagram doesn't reach the 
statistics
collector, that connection
would end up being reported as crashed.
That would alarm people unnecessarily and make the crash statistics misleading.

> However, some thinking around that also leads me to another question which is 
> very much
>  in scope for this patch regardless, which is what about shutdown and admin 
> termination.
>  Right now, when you do a "pg_ctl stop" on the database, all sessions count 
> as aborted.
>  Same thing for a pg_terminate_backend(). I wonder if this is also a case 
> that would be
>  useful to track as a separate thing? One could argue that the docs in your 
> patch say
>  aborted means "terminated by something else than a regular client 
> disconnection".
> But that's true for a "shutdown", but not for a crash, so whichever way we go 
> with crashes
>  it's slightly incorrect.

> But thinking from a usability perspective, wouldn't what we want more be 
> something
>  like , , 
> ,
>  ?
> 
> What do you think of adapting it to that?
> 
> Basically, that would change pgStatSessionDisconnectedNormally into instead 
> being an
>  enum of reasons, which could be normal disconnect, abnormal disconnect and 
> admin.
>  And we'd track all those three as separate numbers in the stats file, 
> meaning we could
>  then calculate the crash by subtracting all three from the total number of 
> sessions?

I think at least "closed by admin" might be interesting; I'll have a look.
I don't think we have to specifically count "closed by normal disconnect", 
because
that should be the rule and could be more or less deduced from the other numbers
(with the uncertainty mentioned above).
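
For illustration, the "enum of reasons" idea could be as simple as the
following (names are hypothetical; the current patch has the boolean
pgStatSessionDisconnectedNormally instead):

typedef enum SessionEndType
{
    SESSION_END_NORMAL,     /* regular client disconnect */
    SESSION_END_ABNORMAL,   /* e.g. FATAL error or connection loss */
    SESSION_END_KILLED      /* terminated by the administrator or shutdown */
} SessionEndType;

With all three counted separately in the stats file, crashed sessions
would fall out as the total minus those three, as Magnus suggests,
subject to the UDP-loss caveat above.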

> (Let me know if you think the idea could work and would prefer it if I worked 
> up a
>  complete suggestion based on it rather than just spitting ideas)

Thanks for the offer, and I'll get back to it if I get stuck.
But I'm ready to do the grunt work, so that you can spend your precious
committer cycles elsewhere :^)

I'll have a go at "closed by admin", meanwhile here is patch v7 with the 
renaming
"session_count -> sessions".

Yours,
Laurenz Albe
From 8feed416f91a5de9011616c1545156b9c8f28943 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Fri, 20 Nov 2020 15:11:57 +0100
Subject: [PATCH] Add session statistics to pg_stat_database

If "track_counts" is active, track the following per database:
- total number of connections
- number of sessions that ended other than with a client disconnect
- total time spent in database sessions
-

Re: Commitfest 2020-11 is closed

2020-12-03 Thread Peter Eisentraut

On 2020-12-02 23:13, Thomas Munro wrote:

I'm experimenting with Github's built in CI.  All other ideas welcome.


You can run Linux builds on AppVeyor, too.




Re: [PATCH] Covering SPGiST index

2020-12-03 Thread Pavel Borisov
I've noticed a CI error due to the fact that MSVC doesn't allow arrays of
flexible-size arrays, and made a fix for the issue.
I also did some minor refinement in tuple creation.
PFA v12 of the patch.
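
As a reading aid, the leaf-tuple layout discussed in the quoted exchange
below is, schematically (the real definitions live in spgist_private.h):

/*
 *   +-----------------------+  SpGistLeafTuple header
 *   | (optional) nulls mask |  present only when the index has INCLUDE columns
 *   +-----------------------+
 *   | key attribute         |
 *   +-----------------------+
 *   | INCLUDE attributes    |  filled using the heap_fill_tuple() machinery
 *   +-----------------------+
 */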

On Thu, 26 Nov 2020 at 21:48, Pavel Borisov wrote:

>> > The way that seems acceptable to me is to add (optional) nulls mask into
>> > the end of existing style SpGistLeafTuple header and use indextuple
>> > routines to attach attributes after it. In this case, we can reduce the
>> > amount of code at the cost of adding one extra MAXALIGN size to the
>> > overall tuple size on 32-bit arch, as now the tuple header size of 12
>> > bytes already fits 3 MAXALIGNs (on 64-bit the header now is shorter than
>> > 2 maxaligns (12 bytes of 16) and the nulls mask will be free of cost).
>> > If you mean this I'll try to make the changes soon. What do you think
>> > of it?
>>
>> Yeah, that was pretty much the same conclusion I came to.  For
>> INDEX_MAX_KEYS values up to 32, the nulls bitmap will fit into what's
>> now padding space on 64-bit machines.  For backwards compatibility,
>> we'd have to be careful that the code knows there's no nulls bitmap in
>> an index without included columns, so I'm not sure how messy that will
>> be.  But it's worth trying that way to see how it comes out.
>>
>
> I made a refactoring of the patch code according to the discussion:
> 1. Changed a leaf tuple format to: header - (optional) bitmask - key value
> - (optional) INCLUDE values
> 2. Re-use existing code of heap_fill_tuple() to fill data part of a leaf
> tuple
> 3. Split index_deform_tuple() into two portions: (a) a bigger 'inner' one
> - index_deform_anyheader_tuple() - to make processing of index-like tuples
> (now IndexTuple and SpGistLeafTuple) work independently of the type of tuple
> header. (b) a small 'outer' index_deform_tuple() and spgDeformLeafTuple()
> to make all header-specific processing and then call the inner (a)
> 4. Inserted a tuple descriptor into the SpGistCache chunk of memory. So
> cleaning the cached chunk will also invalidate the tuple descriptor and not
> make it dangling or leaked. This also allows not to build it every time
> unless the cache is invalidated.
> 5. Corrected amroutine->amcaninclude according to new upstream fix.
> 6. Returned big chunks that were shifted in spgist_private.h to their
> initial places where possible and made other cosmetic changes to improve
> the patch.
>
> PFA v.11 of the patch.
> Do you think the proposed changes are in the right direction?
>
> Thank you!
> --
> Best regards,
> Pavel Borisov
>
> Postgres Professional: http://postgrespro.com 
>


-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v12-0001-Covering-SP-GiST-index-support-for-INCLUDE-colum.patch
Description: Binary data


Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Amit Langote
On Thu, Dec 3, 2020 at 5:13 PM Kyotaro Horiguchi
 wrote:
> At Thu, 3 Dec 2020 16:41:45 +0900, Amit Langote  
> wrote in
> > Maybe I misread but I think you did in your email dated Dec 1 where you 
> > said:
> >
> > "After an off-list discussion, we confirmed that even in that case the
> > patch works as is because fk_attnum (or contuple.conkey) always stores
> > key attnums compatible to the topmost parent when conparent has a
> > valid value (assuming the current usage of fk_attnum), but I still
> > feel uneasy to rely on that unclear behavior."
>
> fk_attnums *doesn't* refer to the top parent table of the referencing
> side.  It refers to attributes of the partition that are compatible with
> the same element of fk_attnums of the topmost parent.  Maybe I'm
> misreading.

Yeah, no, I am confused.  Reading what I wrote, it seems I implied that
the referenced (PK relation's) partitions have RI_ConstraintInfo which
makes no sense, although there indeed is one pg_constraint entry that
is defined on the FK root table for every PK partition with its OID as
confrelid, which is in addition to an entry containing the root PK
table's OID as confrelid.  I confused those PK-partition-referencing
entries as belonging to the partitions themselves.  Although in my
defence, all of those entries' conkey contains the FK root table's
attributes, so at least that much holds. :)

> > > > On the topic of how we'd be able to share even the RI_ConstraintInfos
> > > > among partitions, that would indeed look a bit more elaborate than the
> > > > patch we have right now.
> > >
> > > Maybe just letting the hash entry for the child riinfo point to the
> > > parent riinfo if all members (other than constraint_id, of course)
> > > share exactly the same values.  No need to count references since
> > > we aren't going to remove riinfos.
> >
> > Ah, something maybe worth trying.  Although the memory we'd save by
> > sharing the RI_ConstraintInfos would not add that much to the savings
> > we're having by sharing the plan, because it's the plans that are a
> > memory hog AFAIK.
>
> I agree that plans are rather large, but the sharable part of the
> RI_ConstraintInfos is 536 bytes; I'm not sure it is small enough
> compared to the plans.  But that has a somewhat large footprint... (See
> the attached)

Thanks for the patch.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Amit Langote
On Tue, Dec 1, 2020 at 12:25 PM Corey Huinker  wrote:
> On Mon, Nov 30, 2020 at 9:48 PM Tom Lane  wrote:
>> Corey Huinker  writes:
>> > Given that we're already looking at these checks, I was wondering if this
>> > might be the time to consider implementing these checks by directly
>> > scanning the constraint index.
>>
>> Yeah, maybe.  Certainly ri_triggers is putting a huge amount of effort
>> into working around the SPI/parser/planner layer, to not a lot of gain.
>>
>> However, it's not clear to me that that line of thought will work well
>> for the statement-level-trigger approach.  In that case you might be
>> dealing with enough tuples to make a different plan advisable.
>
> Bypassing SPI would probably mean that we stay with row level triggers, and 
> the cached query plan would go away, perhaps replaced by an 
> already-looked-up-this-tuple hash sorta like what the cached nested loops 
> effort is doing.
>
> I've been meaning to give this a try when I got some spare time. This may 
> inspire me to try again.

+1 for this line of work.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW

2020-12-03 Thread Bharath Rupireddy
On Tue, Dec 1, 2020 at 5:34 PM Bharath Rupireddy
 wrote:
>
> Hi,
>
> I think we can pass CURSOR_OPT_PARALLEL_OK to pg_plan_query() for
> refresh mat view so that parallelism can be considered for the SELECT
> part of the previously created mat view. The refresh mat view queries
> can be faster in cases where SELECT is parallelized.
>
> Attaching a small patch. Thoughts?
>

Added this to commitfest, in case it is useful -
https://commitfest.postgresql.org/31/2856/
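
For reference, the change is roughly the following one-liner (presumably
in refresh_matview_datafill() in src/backend/commands/matview.c, where
the cursorOptions argument has so far been 0):

/* Allow the planner to consider parallel plans for the matview's SELECT. */
plan = pg_plan_query(query, queryString, CURSOR_OPT_PARALLEL_OK, NULL);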

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Additional improvements to extended statistics

2020-12-03 Thread Dean Rasheed
On Wed, 2 Dec 2020 at 16:34, Tomas Vondra  wrote:
>
> On 12/2/20 4:51 PM, Dean Rasheed wrote:
> >
> > Barring any further comments, I'll push this sometime soon.
>
> +1
>

Pushed.

Regards,
Dean




Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Alvaro Herrera
Hello

I haven't followed this thread's latest posts, but I'm unclear on the
lifetime of the new struct that's being allocated in TopMemoryContext.
At what point are those structs freed?

Also, the comment that was in RI_ConstraintInfo now appears in
RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
undocumented.  What is the relationship between those two structs?  I
see that they have pointers to each other, but I think the relationship
should be documented more clearly.

Thanks!





Re: Autovacuum on partitioned table (autoanalyze)

2020-12-03 Thread Alvaro Herrera
Hello Yuzuko,

On 2020-Dec-02, yuzuko wrote:

> The problem Horiguchi-san mentioned is as follows:
> [explanation]

Hmm, I see.  So the problem is that if some ancestor is analyzed first,
then an analyze of one of its partitions will cause a redundant analyze of
the ancestor, because the number of tuples that is propagated from the
partition represents a set that had already been included in the
ancestor's analysis.

If the problem was just that, then I think it would be very simple to
solve: just make sure to sort the tables to vacuum so that all leaves
are vacuumed first, and then all ancestors, sorted from the bottom up.
Problem solved.

But I'm not sure that that's the whole story, for two reasons: one, two
workers can run simultaneously, where one analyzes the partition and the
other analyzes the ancestor.  Then the order is not guaranteed (and
each process will get no effect from remembering whether it did that one
or not).  Second, manual analyzes can occur in any order.

Maybe it's more useful to think about this in terms of remembering that
partition P had changed_tuples set to N when we analyzed ancestor A.
Then, when we analyze partition P, we send the message listing A as
ancestor; on receipt of that message, we see M+N changed tuples in P,
but we know that we had already seen N, so we only record M.

I'm not sure how to implement this idea however, since on analyze of
ancestor A we don't have the list of partitions, so we can't know the N
for each partition.





Re: Commitfest 2020-11 is closed

2020-12-03 Thread Andrew Dunstan


On 12/3/20 4:54 AM, Anastasia Lubennikova wrote:
> On 02.12.2020 23:59, Tom Lane wrote:
>> Anastasia Lubennikova  writes:
>>> Commitfest 2020-11 is officially closed now.
>>> Many thanks to everyone who participated by posting patches, reviewing
>>> them, committing and sharing ideas in discussions!
>> Thanks for all the hard work!
>>
>>> Today, me and Georgios will move the remaining items to the next CF or
>>> return them with feedback. We're planning to leave Ready For Committer
>>> till the end of the week, to make them more visible and let them get
>>> the
>>> attention they deserve.
>> This is actually a bit problematic, because now the cfbot is ignoring
>> those patches (or if it's not, I don't know where it's displaying the
>> results).  Please go ahead and move the remaining open patches, or
>> else re-open the CF if that's possible.
>>
>>     regards, tom lane
>
> Oh, I wasn't aware of that. Thank you for the reminder.
>
> Now all patches are moved to the next CF.
>

Maybe this needs to be added to the instructions for CF managers so the
workflow is clear.


cheers


andrew






Re: Single transaction in the tablesync worker?

2020-12-03 Thread Ashutosh Bapat
On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila  wrote:
>
> The tablesync worker in logical replication performs the table data
> sync in a single transaction which means it will copy the initial data
> and then catch up with apply worker in the same transaction. There is
> a comment in LogicalRepSyncTableStart ("We want to do the table data
> sync in a single transaction.") saying so but I can't find the
> concrete theory behind the same. Is there any fundamental problem if
> we commit the transaction after initial copy and slot creation in
> LogicalRepSyncTableStart and then allow the apply of transactions as
> it happens in apply worker? I have tried doing so in the attached (a
> quick prototype to test) and didn't find any problems with regression
> tests. I have tried a few manual tests as well to see if it works and
> didn't find any problem. Now, it is quite possible that it is
> mandatory to do the way we are doing currently, or maybe something
> else is required to remove this requirement but I think we can do
> better with respect to comments in this area.

> If we commit the initial copy, the data up to the initial copy's
snapshot will be visible downstream. If we apply the changes by
committing changes per transaction, the data visible to the other
transactions will differ as the apply progresses. You haven't
clarified whether we will respect the transaction boundaries in the
apply log or not. I assume we will. Whereas if we apply all the
changes in one go, other transactions either see the data before
resync or after it without any intermediate states. That will not
violate consistency, I think.

That's all I can think of as the reason behind doing a whole resync as
a single transaction.

-- 
Best Wishes,
Ashutosh Bapat




Re: pg_ctl.exe file deleted automatically

2020-12-03 Thread Craig Ringer
On Thu, 3 Dec 2020, 13:06 Joel Mariadasan (jomariad), 
wrote:

> Hi,
>
>
>
> We are using Windows 2019 server.
>
>
>
> Sometimes we see the *pg_ctl.exe* getting automatically deleted.
>
> Due to this, while starting up Postgres windows service we are getting the
> error.
>
> “Error 2: The system cannot find the file specified”
>
>
>
> Can you let us know the potential causes for this pg_ctl.exe file deletion?
>
>
>

PostgreSQL will never delete pg_ctl.exe

Check your antivirus software logs for false positives.

This mailing list is for software development on postgres so I suggest
following up on pgsql-general.

> Joel
>


Re: Single transaction in the tablesync worker?

2020-12-03 Thread Amit Kapila
On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
 wrote:
>
> On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila  wrote:
> >
> > The tablesync worker in logical replication performs the table data
> > sync in a single transaction which means it will copy the initial data
> > and then catch up with apply worker in the same transaction. There is
> > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > sync in a single transaction.") saying so but I can't find the
> > concrete theory behind the same. Is there any fundamental problem if
> > we commit the transaction after initial copy and slot creation in
> > LogicalRepSyncTableStart and then allow the apply of transactions as
> > it happens in apply worker? I have tried doing so in the attached (a
> > quick prototype to test) and didn't find any problems with regression
> > tests. I have tried a few manual tests as well to see if it works and
> > didn't find any problem. Now, it is quite possible that it is
> > mandatory to do the way we are doing currently, or maybe something
> > else is required to remove this requirement but I think we can do
> > better with respect to comments in this area.
>
> If we commit the initial copy, the data upto the initial copy's
> snapshot will be visible downstream. If we apply the changes by
> committing changes per transaction, the data visible to the other
> transactions will differ as the apply progresses.
>

It is not clear what you mean by the above.  The way you have written it,
it appears you are saying that instead of copying the initial data, I am
proposing to copy it transaction-by-transaction. But that is not the case.
I am saying: copy the initial data using the REPEATABLE READ isolation
level as we do now, commit it, and then process transaction-by-transaction
until we reach the sync point (the point up to which the apply worker has
already received the data).

> You haven't
> clarified whether we will respect the transaction boundaries in the
> apply log or not. I assume we will.
>

It will be transaction-by-transaction.

> Whereas if we apply all the
> changes in one go, other transactions either see the data before
> resync or after it without any intermediate states.
>

What is the problem even if the user is able to see the data after the
initial copy?

> That will not
> violate consistency, I think.
>

I am not sure how consistency will be broken.

> That's all I can think of as the reason behind doing a whole resync as
> a single transaction.
>

Thanks for sharing your thoughts.

-- 
With Regards,
Amit Kapila.




Re: Github Actions (CI)

2020-12-03 Thread Josef Šimánek
čt 3. 12. 2020 v 7:34 odesílatel Thomas Munro  napsal:
>
> Hi hackers,
>
> I'm looking for more horsepower for testing commitfest entries
> automatically, and today I tried out $SUBJECT.  The attached is a
> rudimentary first attempt, for show-and-tell.  If you have a Github
> account, you just have to push it to a branch there and look at the
> Actions tab on the web page for the results.  Does anyone else have
> .github files and want to share, to see if we can combine efforts
> here?
>
> The reason for creating three separate "workflows" for Linux, Windows
> and macOS rather than three separate "jobs" inside one workflow is so
> that cfbot.cputube.org could potentially get separate pass/fail
> results for each OS out of the API rather than one combined result.  I
> rather like that feature of cfbot's results.  (I could be wrong about
> needing to do that, this is the first time I've ever looked at this
> stuff.)
>
> The Windows test actually fails right now, exactly as reported by
> Ranier[1].  It is a release build on a recent MSVC, so I guess that is
> expected and off-topic for this thread.  But generally,
> .github/workflows/ci-windows.yml is the weakest part of this.  It'd be
> great to get a debug/assertion build, show backtraces when it crashes,
> run more of the tests, etc etc, but I don't know nearly enough about
> Windows to do that myself.  Another thing is that it uses Choco for
> flex and bison; it'd be better to find those on the image, if
> possible.  Also, for all 3 OSes, it's not currently attempting to
> cache build results or anything like that.

Any chance you could also share links to failing/passing test builds?

> I'm a bit sad that GH doesn't have FreeBSD build runners.  Those are
> now popping up on other CIs, but I'm not sure if their free/open
> source tiers have enough resources for cfbot.
>
> [1] 
> https://www.postgresql.org/message-id/flat/CAEudQArhn8bH836OB%2B3SboiaeEcgOtrJS58Bki4%3D5yeVqToxgw%40mail.gmail.com




Re: Corner-case bug in pg_rewind

2020-12-03 Thread Heikki Linnakangas

On 02/12/2020 15:26, Ian Barwick wrote:

On 02/12/2020 20:13, Heikki Linnakangas wrote:

Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.


Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.


Ok, pushed and backpatched this now.

Thanks!

- Heikki




Re: Corner-case bug in pg_rewind

2020-12-03 Thread Pavel Borisov
>
> Ok, pushed and backpatched this now.
>
Very nice!
Thanks to you all!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Github Actions (CI)

2020-12-03 Thread Andrew Dunstan


On 12/3/20 1:33 AM, Thomas Munro wrote:
> Hi hackers,
>
> I'm looking for more horsepower for testing commitfest entries
> automatically, and today I tried out $SUBJECT.  The attached is a
> rudimentary first attempt, for show-and-tell.  If you have a Github
> account, you just have to push it to a branch there and look at the
> Actions tab on the web page for the results.  
>

Awesome. That's a pretty big bang for the buck.


cheers


andrew





Re: Remove unnecessary grammar symbols

2020-12-03 Thread Tom Lane
Peter Eisentraut  writes:
> While doing the proverbial other things, I noticed that the grammar 
> symbols publication_name_list and publication_name_item are pretty 
> useless.  We already use name_list/name to refer to publications in most 
> places, so getting rid of these makes things more consistent.

+1.  Strictly speaking, this reduces the set of keywords that you
can use as names here (since name is ColId, versus ColLabel in
publication_name_item).  However, given the inconsistency with
other commands, I don't see it as an advantage to be more forgiving
in just one place.  We might have problems preserving the laxer
definition anyway, if the syntaxes of these commands ever get
any more complicated.

regards, tom lane




Re: Corner-case bug in pg_rewind

2020-12-03 Thread Heikki Linnakangas

On 03/12/2020 16:49, Pavel Borisov wrote:

Ok, pushed and backpatched this now.

Very nice!
Thanks to you all!


Thanks for the review, Pavel! I just realized that I forgot to credit 
you in the commit message. I'm sorry.


- Heikki




Re: SELECT INTO deprecation

2020-12-03 Thread Tom Lane
Peter Eisentraut  writes:
> Interesting.  This appears to be the case.  SQL Server uses SELECT INTO 
> to create a table, and does not appear to have CREATE TABLE AS.
> So maybe we should keep it, but adjust the documentation to point out 
> this use case.

That argument makes sense, but only if our version is a drop-in
replacement for SQL Server's version: if people have to adjust their
commands anyway in corner cases, we're not doing them any big favor.
So: are the syntax and semantics really a match?  Do we have feature
parity?

As I recall, a whole lot of the pain we have with INTO has to do
with the semantics we've chosen for INTO in a set-operation nest.
We think you can write something like

   SELECT ... INTO foo FROM ... UNION SELECT ... FROM ...

but we insist on the INTO being in the first component SELECT.
I'd like to know exactly how much of that messiness is shared
by SQL Server.

(FWIW, I think the fact that SELECT INTO means something entirely
different in plpgsql is a good reason for killing off one version
or the other.  As things stand, it's mighty confusing.)

regards, tom lane




Re: Corner-case bug in pg_rewind

2020-12-03 Thread Pavel Borisov
чт, 3 дек. 2020 г. в 19:15, Heikki Linnakangas :

> On 03/12/2020 16:49, Pavel Borisov wrote:
> > Ok, pushed and backpatched this now.
> >
> > Very nice!
> > Thanks to you all!
>
> Thanks for the review, Pavel! I just realized that I forgot to credit
> you in the commit message. I'm sorry.
>
Don't worry, Heikki. No problem.
-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Tom Lane
Krunal Bauskar  writes:
> Any updates or further inputs on this.

As far as LSE goes: my take is that tampering with the
compiler/platform's default optimization options requires *very*
strong evidence, which we have not got and likely won't get.  Users
who are building for specific hardware can choose to supply custom
CFLAGS, of course.  But we shouldn't presume to do that for them,
because we don't know what they are building for, or with what.

I'm very willing to consider the CAS spinlock patch, but it still
feels like there's not enough evidence to show that it's a universal
win.  The way to move forward on that is to collect more measurements
on additional ARM-based platforms.  And I continue to think that
pgbench is only a very crude tool for testing spinlock performance;
we should look at other tests.

From a system structural standpoint, I seriously dislike that lwlock.c
patch: putting machine-specific variant implementations into that file
seems like a disaster for maintainability.  So it would need to show a
very significant gain across a range of hardware before I'd want to
consider adopting it ... and it has not shown that.

regards, tom lane




Re: Commitfest 2020-11 is closed

2020-12-03 Thread Nikolay Samokhvalov
On Wed, Dec 2, 2020 at 2:36 PM Andrew Dunstan  wrote:

>
> On 12/2/20 5:13 PM, Thomas Munro wrote:
> >
> > I'm experimenting with Github's built in CI.  All other ideas welcome.
>
>
> I'd look very closely at gitlab.
>

+1.

Why:
- we have had a great experience with it for more than 2 years
- if needed, there is an open-source version of it
- it's possible to set up your own [custom] CI runners even when you're
working with their SaaS
- finally, it runs on Postgres itself


Re: Minor documentation error regarding streaming replication protocol

2020-12-03 Thread Jeff Davis
On Wed, 2020-12-02 at 15:16 -0500, Bruce Momjian wrote:
> Yes, we could, but I thought the format code was not something we set
> at
> this level.  Looking at byteasend() it is true it just sends the
> bytes.

It can be set along with the type. Attached an example.

Andres objected (in a separate conversation) to forcing a binary-format 
value on a client that didn't ask for one. He suggested that we mandate
that the data is ASCII-only (for both filename and content), closing
the gap Michael raised[1]; and then just declare all values to be text
format.

I am fine with either approach; but in any case, I don't see the point
in sending an incorrect RowDescription.

Regards,
Jeff Davis

[1] https://postgr.es/m/20201008235250.ga1...@paquier.xyz

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 4899bacda7b..f383223f462 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1778,7 +1778,7 @@ The commands accepted in replication mode are:
   
   
   
-   systemid (text)
+   systemid (text, text format)
   
   
   
@@ -1791,7 +1791,7 @@ The commands accepted in replication mode are:
 
   
   
-   timeline (int4)
+   timeline (int4, text format)
   
   
   
@@ -1803,7 +1803,7 @@ The commands accepted in replication mode are:
 
   
   
-   xlogpos (text)
+   xlogpos (text, text format)
   
   
   
@@ -1815,7 +1815,7 @@ The commands accepted in replication mode are:
 
   
   
-   dbname (text)
+   dbname (text, text format)
   
   
   
@@ -1861,27 +1861,26 @@ The commands accepted in replication mode are:
  
   Requests the server to send over the timeline history file for timeline
   tli.  Server replies with a
-  result set of a single row, containing two fields.  While the fields
-  are labeled as text, they effectively return raw bytes,
-  with no encoding conversion:
+  result set of a single row, containing two fields:
  
 
  
   
   
   
-   filename (text)
+   filename (bytea, binary format)
   
   
   
File name of the timeline history file, e.g., 0002.history.
+   No encoding conversion is performed.
   
   
   
 
   
   
-   content (text)
+   content (bytea, binary format)
   
   
   
@@ -1975,7 +1974,7 @@ The commands accepted in replication mode are:
 
   

-slot_name (text)
+slot_name (text, text format)
 
  
   The name of the newly-created replication slot.
@@ -1984,7 +1983,7 @@ The commands accepted in replication mode are:

 

-consistent_point (text)
+consistent_point (text, text format)
 
  
   The WAL location at which the slot became consistent.  This is the
@@ -1995,7 +1994,7 @@ The commands accepted in replication mode are:

 

-snapshot_name (text)
+snapshot_name (text, text format)
 
  
   The identifier of the snapshot exported by the command.  The
@@ -2007,7 +2006,7 @@ The commands accepted in replication mode are:

 

-output_plugin (text)
+output_plugin (text, text format)
 
  
   The name of the output plugin used by the newly-created replication
@@ -2636,7 +2635,7 @@ The commands accepted in replication mode are:
   The fields in this row are:
   

-spcoid (oid)
+spcoid (oid, text format)
 
  
   The OID of the tablespace, or null if it's the base
@@ -2645,7 +2644,7 @@ The commands accepted in replication mode are:
 


-spclocation (text)
+spclocation (text, text format)
 
  
   The full path of the tablespace directory, or null
@@ -2654,7 +2653,7 @@ The commands accepted in replication mode are:
 


-size (int8)
+size (int8, text format)
 
  
   The approximate size of the tablespace, in kilobytes (1024 bytes),
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2eb19ad2936..089b3692dec 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -487,19 +487,19 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
 	pq_sendstring(&buf, "filename");	/* col name */
 	pq_sendint32(&buf, 0);		/* table oid */
 	pq_sendint16(&buf, 0);		/* attnum */
-	pq_sendint32(&buf, TEXTOID);	/* type oid */
+	pq_sendint32(&buf, BYTEAOID);	/* type oid */
 	pq_sendint16(&buf, -1);		/* typlen */
 	pq_sendint32(&buf, 0);		/* typmod */
-	pq_sendint16(&buf, 0);		/* format code */
+	pq_sendint16(&buf, 1);		/* format code */
 
 	/* second field */
 	pq_sendstring(&buf, "content"); /* col name */
 	pq_sendint32(&buf, 0);		/* table oid */
 	pq_sendint16(&buf

Re: Add Information during standby recovery conflicts

2020-12-03 Thread Fujii Masao



On 2020/12/01 17:29, Drouvot, Bertrand wrote:

Hi,

On 12/1/20 12:35 AM, Masahiko Sawada wrote:




On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera  wrote:

On 2020-Dec-01, Fujii Masao wrote:


+ if (proc)
+ {
+ if (nprocs == 0)
+ appendStringInfo(&buf, "%d", proc->pid);
+ else
+ appendStringInfo(&buf, ", %d", proc->pid);
+
+ nprocs++;

What happens if all the backends in wait_list have gone? In other words,
how should we handle the case where nprocs == 0 (i.e., nprocs has not been
incrmented at all)? This would very rarely happen, but can happen.
In this case, since buf.data is empty, at least there seems no need to log
the list of conflicting processes in detail message.

Yes, I noticed this too; this can be simplified by changing the
condition in the ereport() call to be "nprocs > 0" (rather than
wait_list being null), otherwise not print the errdetail.  (You could
test buf.data or buf.len instead, but that seems uglier to me.)

+1

Maybe we can also improve the comment of this function from:

+ * This function also reports the details about the conflicting
+ * process ids if *wait_list is not NULL.

to " This function also reports the details about the conflicting
process ids if exist" or something.


Thank you all for the review/remarks.

They have been addressed in the new attached patch version.


Thanks for updating the patch! I read through the patch again
and applied the following chages to it. Attached is the updated
version of the patch. Could you review this version? If there is
no issue in it, I'm thinking to commit this version.

+   if (waitStart > 0 && !logged_recovery_conflict)
+   {
+       TimestampTz cur_ts = GetCurrentTimestamp();
+       if (TimestampDifferenceExceeds(waitStart, cur_ts, DeadlockTimeout))

On the first time through, this is executed before we have started
actually waiting. Which is a bit wasteful. So I changed LockBufferForCleanup()
and ResolveRecoveryConflictWithVirtualXIDs() so that the code for logging
the recovery conflict is executed after the function to wait  is executed.

+   ereport(LOG,
+           errmsg("recovery still waiting after %ld.%03d ms: %s",
+                  msecs, usecs, _(get_recovery_conflict_desc(reason))),
+           wait_list > 0 ? errdetail_log_plural("Conflicting process: %s.",
+                                                "Conflicting processes: %s.",
+                                                nprocs, buf.data) : 0);

Seems "wait_list > 0" should be "nprocs > 0". So I changed the code that way.

+   if (waitStart > 0)
{
-   const char *old_status;

I added "(!logged_recovery_conflict || new_status == NULL)" into
the above if-condition, to avoid executing again the code for logging
after PS title was updated and the recovery conflict was logged.

+   timeouts[cnt].id = STANDBY_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;

Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
I changed the code that way.

+   /*
+    * Log the recovery conflict if there is still virtual transaction
+    * conflicting with the lock.
+    */
+   if (cnt > 0)
+   {
+       LogRecoveryConflict(PROCSIG_RECOVERY_CONFLICT_LOCK,
+                           standbyWaitStart, cur_ts, vxids);
+       logged_recovery_conflict = true;
+   }

I think that ProcSleep() should log the recovery conflict even if
there are no conflicting virtual transactions, because the startup
process there has already waited longer than deadlock_timeout,
whether or not conflicting virtual transactions are still running.

Also LogRecoveryConflict() logs the recovery conflict even if it
finds that there are no conflicting active backends. So the rule
about whether to log the conflict when there are no conflicting
bac

Re: Minor documentation error regarding streaming replication protocol

2020-12-03 Thread Bruce Momjian
On Thu, Dec  3, 2020 at 09:04:21AM -0800, Jeff Davis wrote:
> On Wed, 2020-12-02 at 15:16 -0500, Bruce Momjian wrote:
> > Yes, we could, but I thought the format code was not something we set
> > at
> > this level.  Looking at byteasend() it is true it just sends the
> > bytes.
> 
> It can be set along with the type. Attached an example.
> 
> Andres objected (in a separate conversation) to forcing a binary-format 
> value on a client that didn't ask for one. He suggested that we mandate
> that the data is ASCII-only (for both filename and content), closing
> the gap Michael raised[1]; and then just declare all values to be text
> format.

How do we mandate that?  Just mention it in the docs and C comments?

> I am fine with either approach; but in any case, I don't see the point
> in sending an incorrect RowDescription.

Yeah, I can see that argument, particularly since you are setting binary
for the entire row, which in this case is valid, but still, kind of odd.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Alexander Korotkov
On Thu, Dec 3, 2020 at 7:02 PM Tom Lane  wrote:
> From a system structural standpoint, I seriously dislike that lwlock.c
> patch: putting machine-specific variant implementations into that file
> seems like a disaster for maintainability.  So it would need to show a
> very significant gain across a range of hardware before I'd want to
> consider adopting it ... and it has not shown that.

The current shape of the lwlock patch is experimental.  I had quite a
beautiful (in my opinion) idea to wrap platform-dependent parts of
CAS-loops into macros.  Then we could provide different low-level
implementations of CAS-loops for Power, ARM, and the remaining platforms with
a single code path for LWLockAttemptLock() and others.  However, I see that
modern ARM tends to implement LSE efficiently.  Power doesn't seem to
be very popular.  So, I'm going to give up on this for now.

--
Regards,
Alexander Korotkov




Re: SELECT INTO deprecation

2020-12-03 Thread Peter Eisentraut

On 2020-12-03 16:34, Tom Lane wrote:

As I recall, a whole lot of the pain we have with INTO has to do
with the semantics we've chosen for INTO in a set-operation nest.
We think you can write something like

SELECT ... INTO foo FROM ... UNION SELECT ... FROM ...

but we insist on the INTO being in the first component SELECT.
I'd like to know exactly how much of that messiness is shared
by SQL Server.


On sqlfiddle.com, this works:

select a into t3 from t1 union select a from t2;

but this gets an error:

select a from t1 union select a into t4 from t2;

SELECT INTO must be the first query in a statement containing a UNION, 
INTERSECT or EXCEPT operator.





Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-03 Thread Peter Eisentraut
A side comment on this patch:  I think using enums as bit mask values is 
bad style.  So changing this:


-/* Reindex options */
-#define REINDEXOPT_VERBOSE (1 << 0) /* print progress info */
-#define REINDEXOPT_REPORT_PROGRESS (1 << 1) /* report pgstat progress */
-#define REINDEXOPT_MISSING_OK (1 << 2) /* skip missing relations */
-#define REINDEXOPT_CONCURRENTLY (1 << 3)   /* concurrent mode */

to this:

+typedef enum ReindexOption
+{
+   REINDEXOPT_VERBOSE = 1 << 0,/* print progress info */
+   REINDEXOPT_REPORT_PROGRESS = 1 << 1,/* report pgstat progress */
+   REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */
+   REINDEXOPT_CONCURRENTLY = 1 << 3/* concurrent mode */
+} ReindexOption;

seems wrong.

There are a couple more places like this, including the existing
ClusterOption that this patch moved around, but we should be removing
those.


My reasoning is that if you look at a value of this enum type, say in a
switch statement or in a debugger, the value might not be any of the
defined symbols.  That way you lose all the type checking that an enum
might give you.


Let's just keep the #define's like it is done in almost all other places.
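
As a minimal, hypothetical illustration (made-up code, not code from the 
patch): once two of these flags are OR'ed together, the variable holds a value 
that is not any named enumerator, so neither a switch statement nor a debugger 
can map it back to a symbol:

#include <stdio.h>

typedef enum DemoOption         /* hypothetical, for illustration only */
{
    DEMOOPT_VERBOSE = 1 << 0,
    DEMOOPT_MISSING_OK = 1 << 1
} DemoOption;

int
main(void)
{
    DemoOption  opts = DEMOOPT_VERBOSE | DEMOOPT_MISSING_OK;   /* value 3 */

    switch (opts)
    {
        case DEMOOPT_VERBOSE:       /* never reached: opts is 3 */
            printf("verbose only\n");
            break;
        case DEMOOPT_MISSING_OK:
            printf("missing_ok only\n");
            break;
        default:                    /* the combined value ends up here */
            printf("not a named enumerator: %d\n", (int) opts);
            break;
    }
    return 0;
}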




Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Alexander Korotkov
On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar  wrote:
> 1. CAS patch (applied on the baseline)
>- Kunpeng: 10-45% improvement observed [1]
>- Graviton2: 30-50% improvement observed [2]

What does the lower boundary of improvement mean?  Does it mean the minimal
improvement observed?  Obviously not, because there is no improvement
with a low number of clients or with a read-only benchmark.

>- M1: Only select results are available cas continue to maintain a 
> marginal gain but not significant. [3]

A read-only benchmark doesn't involve spinlocks (I've just rechecked
this).  So, this difference is purely speculative.

Also, Tom observed the regression [1].  The benchmark is artificial,
but it may correspond to some real workload with heavily-loaded
spinlocks.  And that might have an explanation.  ldrex/strex
themselves don't work as memory barrier (this is why compiler adds
explicit memory barrier afterwards).  And I bet ldrex unpaired with
strex could see an outdated value.  On high-contended spinlocks that
may cause too pessimistic waits.
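
For readers who don't have the patch in front of them, here is a minimal
sketch of the CAS-based test-and-set idea, written with GCC/Clang atomic
builtins; this is only an illustration under my own assumptions (the names
are made up), not the actual patch:

static volatile int demo_lock = 0;

/*
 * Try to acquire: atomically set *lock to 1 only if it is currently 0.
 * Returns 0 on success (lock acquired), 1 if the lock was already held.
 */
static inline int
demo_tas_via_cas(volatile int *lock)
{
    int         expected = 0;

    return !__atomic_compare_exchange_n(lock, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}

static void
demo_acquire_release(void)
{
    while (demo_tas_via_cas(&demo_lock))
        ;                       /* spin until acquired */
    /* ... critical section ... */
    __atomic_store_n(&demo_lock, 0, __ATOMIC_RELEASE);  /* release */
}

The question in this thread is whether such a single-CAS acquisition behaves
better under contention than the LL/SC (ldrex/strex) sequences discussed
above.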

> 2. Let's ignore CAS for a sec and just think of LSE independently
>- Kunpeng: regression observed

Yeah, it's sad that there are ARM chips where performing the same action
in a single instruction is slower than performing it in multiple
instructions.

>- Graviton2: gain observed

Have to say it's 5.5x improvement, which is dramatically more than
CAS-loop patch can give.

>- M1: regression observed
>  [while lse probably is default explicitly enabling it with +lse causes 
> regression on the head itself [4].
>   client=2/4: 1816/714  vs   892/610]

This is plainly false.  LSE was enabled in all the versions tested on M1
[2].  So it was not the LSE instructions or the +lse option that caused the
regression, but the lack of other options enabled by default in Apple's clang.

> Let me know what do you think about this analysis and any specific direction 
> that we should consider to help move forward.

In order to move forward with ARM-optimized spinlocks I would
investigate other options (at least [3]).

Regarding CAS spinlock patch I can propose the following steps to
clarify the things.
1. Run Tom's spinlock test on ARM machines other than M1.
2. Try to construct a pgbench script, which produces heavily-loaded
spinlock without custom C-function, and also run it across various ARM
machines.

Links
1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us
2. https://www.postgresql.org/message-id/1274781.1606760475%40sss.pgh.pa.us
3. 
https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html

--
Regards,
Alexander Korotkov




Re: Github Actions (CI)

2020-12-03 Thread Thomas Munro
On Fri, Dec 4, 2020 at 2:55 AM Josef Šimánek  wrote:
> Any chance to also share links to failing/passing testing builds?

https://github.com/macdice/postgres/runs/1490727809
https://github.com/macdice/postgres/runs/1490727838

However, looking at these in a clean/cookieless browser, I see that it
won't show you anything useful unless you're currently logged in to
Github, so that's a major point against using it for cfbot (it's
certainly not my intention to make cfbot only useful to people who
have Github accounts).




Re: Renaming cryptohashes.c to cryptohashfuncs.c

2020-12-03 Thread Daniel Gustafsson
> On 3 Dec 2020, at 03:03, Michael Paquier  wrote:

> Any objections to rename that to cryptohashfuncs.c?  That would be
> much more consistent with the surroundings (cleaning up anythig
> related to MD5 is on my TODO list).  Patch is attached.

+1 on this proposed rename.

cheers ./daniel




copy.sgml and partitioned tables

2020-12-03 Thread Justin Pryzby
https://www.postgresql.org/docs/current/sql-copy.html
|. COPY FROM can be used with plain, foreign, or partitioned tables or with 
views that have INSTEAD OF INSERT triggers.
|. COPY only deals with the specific table named; IT DOES NOT COPY DATA TO OR 
FROM CHILD TABLES. ...

That language from commit 854b5eb51 has never been updated since partitioning
was added, so I propose this.

I'm not sure, but maybe it should still say that "COPY TO does not copy data to
child tables of inheritance hierarchies."

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 369342b74d..0631dfe6b3 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -414,9 +414,14 @@ COPY count
 

 COPY TO can only be used with plain tables, not
-with views.  However, you can write COPY (SELECT * FROM
-viewname) TO ...
-to copy the current contents of a view.
+views, and does not copy data from child tables or partitions.
+Thus for example
+COPY table 
TO
+shows the same data as SELECT * FROM ONLY table.  But COPY
+(SELECT * FROM table) TO 
...
+can be used to dump all of the data in an inheritance hierarchy,
+partitioned table, or view.

 

@@ -425,16 +430,6 @@ COPY count
 INSTEAD OF INSERT triggers.

 
-   
-COPY only deals with the specific table named;
-it does not copy data to or from child tables.  Thus for example
-COPY table 
TO
-shows the same data as SELECT * FROM ONLY table.  But COPY
-(SELECT * FROM table) TO 
...
-can be used to dump all of the data in an inheritance hierarchy.
-   
-

 You must have select privilege on the table
 whose values are read by COPY TO, and




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-03 Thread Alexander Korotkov
On Wed, Dec 2, 2020 at 10:18 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Wed, Dec 02, 2020 at 11:52:54AM -0500, Tom Lane wrote:
> > Dmitry Dolgov <9erthali...@gmail.com> writes:
> > >> On Mon, Nov 30, 2020 at 02:26:19PM +0100, Dmitry Dolgov wrote:
> > >>> On Mon, Nov 30, 2020 at 04:12:29PM +0300, Alexander Korotkov wrote:
> > >>> The idea of an opaque field in SubscriptingRef structure is more
> > >>> attractive to me.  Could you please implement it?
> >
> > >> Sure, doesn't seem to be that much work.
> >
> > I just happened to notice this bit.  This idea is a complete nonstarter.
> > You cannot have an "opaque" field in a parsetree node, because then the
> > backend/nodes code has no idea what to do with it for
> > copy/compare/outfuncs/readfuncs.  The patch seems to be of the opinion
> > that "do nothing" is adequate, which it completely isn't.
> >
> > Perhaps this is a good juncture at which to remind people that parse
> > tree nodes are read-only so far as the executor is concerned, so
> > storing something there only at execution time won't work either.
>
> Oh, right, stupid of me. Then I'll just stick with the original
> Alexanders suggestion.

Stupid me too :)

I didn't get that we can't add an opaque field to SubscriptingRefState without
adding it to SubscriptingRef, which has to support
copy/compare/outfuncs/readfuncs.

--
Regards,
Alexander Korotkov




Re: please update ps display for recovery checkpoint

2020-12-03 Thread Justin Pryzby
On Thu, Dec 03, 2020 at 09:18:07PM +, Bossart, Nathan wrote:
> I considered also checking that update_process_title was enabled, but
> I figured that these ps display updates should happen sparsely enough
> that it wouldn't make much of an impact.

Since bf68b79e5, update_ps_display is responsible for checking
update_process_title.  Its other, remaining uses are apparently just acting as
minor optimizations to guard against useless snprintf's.

See also 
https://www.postgresql.org/message-id/flat/1288021.1600178478%40sss.pgh.pa.us
in which (I just saw) Tom wrote:

> Seems like a good argument, but you'd have to be careful about the
> final state when you stop overriding update_process_title --- it can't
> be left looking like it's still-in-progress on some random WAL file.

I think that's a live problem, not just a concern for that patch.
It was exactly my complaint leading to this thread:

> But runs a checkpoint, which can take a long time, while the "ps" display 
> still
> says "recovering ".

-- 
Justin




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-12-03 Thread Daniel Gustafsson
> On 3 Dec 2020, at 02:47, Michael Paquier  wrote:
> 
> On Wed, Dec 02, 2020 at 12:03:49PM +0900, Michael Paquier wrote:
>> Thanks.  0001 has been applied and the buildfarm does not complain, so
>> it looks like we are good (I'll take care of any issues, like the one
>> Fujii-san has just reported).  Attached are new patches for 0002, the
>> EVP switch.  One thing I noticed is that we need to free the backup
>> manifest a bit earlier once we begin to use resource owner in
>> basebackup.c as there is a specific step that may do a double-free.
>> This would not happen when not using OpenSSL or on HEAD.  It would be
>> easy to separate the resowner and cryptohash portions of the patch
>> here, but both are tightly linked, so I'd prefer to keep them
>> together.
> 
> Attached is a rebased version to take care of the conflicts introduced
> by 91624c2f.

This version looks good to me, and builds/tests without any issues.  While I
didn't try to adapt the libnss patch to the resowner machinery, I don't see any
reasons off the cuff why it wouldn't work with the scaffolding provided here.
My only question is:

+#ifndef FRONTEND
+   elog(ERROR, "out of memory");
Shouldn't that be an ereport using ERRCODE_OUT_OF_MEMORY?
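
For reference, the usual backend form this would become is something like the
following (a sketch of the suggestion, not the actual committed code):

    ereport(ERROR,
            (errcode(ERRCODE_OUT_OF_MEMORY),
             errmsg("out of memory")));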

cheers ./daniel



Re: Corner-case bug in pg_rewind

2020-12-03 Thread Heikki Linnakangas

On 03/12/2020 16:10, Heikki Linnakangas wrote:

On 02/12/2020 15:26, Ian Barwick wrote:

On 02/12/2020 20:13, Heikki Linnakangas wrote:

Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.


Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.


Ok, pushed and backpatched this now.


The buildfarm is reporting sporadic failures in the new regression test. 
I suspect it's because of timing issues, where a server is promoted or 
shut down before some data has been replicated. I'll fix that tomorrow 
morning.


- Heikki




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-03 Thread Tom Lane
Alexander Korotkov  writes:
> I didn't get we can't add opaque field to SubscriptingRefState without
> adding it to SubscriptingRef, which has to support
> copy/compare/outfuncs/readfuncs

Umm ... all depends on what you envision putting in there.  There
certainly can be an opaque field in SubscriptingRefState as long
as the subscript-mechanism-specific code is responsible for setting
it up.  You just can't pass such a thing through the earlier phases.

regards, tom lane




Re: please update ps display for recovery checkpoint

2020-12-03 Thread Bossart, Nathan
On 12/3/20, 1:58 PM, "Justin Pryzby"  wrote:
> On Thu, Dec 03, 2020 at 09:18:07PM +, Bossart, Nathan wrote:
>> I considered also checking that update_process_title was enabled, but
>> I figured that these ps display updates should happen sparsely enough
>> that it wouldn't make much of an impact.
>
> Since bf68b79e5, update_ps_display is responsible for checking
> update_process_title.  Its other, remaining uses are apparently just acting as
> minor optimizations to guard against useless snprintf's.
>
> See also 
> https://www.postgresql.org/message-id/flat/1288021.1600178478%40sss.pgh.pa.us
> in which (I just saw) Tom wrote:
>
>> Seems like a good argument, but you'd have to be careful about the
>> final state when you stop overriding update_process_title --- it can't
>> be left looking like it's still-in-progress on some random WAL file.
>
> I think that's a live problem, not just a concern for that patch.
> It was exactly my complaint leading to this thread:
>
>> But runs a checkpoint, which can take a long time, while the "ps" display 
>> still
>> says "recovering ".

Ah, I see.  Thanks for pointing this out.

Nathan



Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Kyotaro Horiguchi
At Thu, 3 Dec 2020 21:40:29 +0900, Amit Langote  wrote 
in 
> On Thu, Dec 3, 2020 at 5:13 PM Kyotaro Horiguchi
>  wrote:
> > At Thu, 3 Dec 2020 16:41:45 +0900, Amit Langote  
> > wrote in
> > > Maybe I misread but I think you did in your email dated Dec 1 where you 
> > > said:
> > >
> > > "After an off-list discussion, we confirmed that even in that case the
> > > patch works as is because fk_attnum (or contuple.conkey) always stores
> > > key attnums compatible to the topmost parent when conparent has a
> > > valid value (assuming the current usage of fk_attnum), but I still
> > > feel uneasy to rely on that unclear behavior."
> >
> > fk_attnums *doesn't* refer to the top parent table of the referencing
> > side. It refers to attributes of the partition that are compatible with
> > the same element of fk_attnums of the topmost parent.  Maybe I'm
> > misreading.
> 
> Yeah, no I am confused.  Reading what I wrote, it seems I implied that
> the referenced (PK relation's) partitions have RI_ConstraintInfo which
> makes no sense, although there indeed is one pg_constraint entry that
> is defined on the FK root table for every PK partition with its OID as
> confrelid, which is in addition to an entry containing the root PK
> table's OID as confrelid.  I confused those PK-partition-referencing
> entries as belonging to the partitions themselves.  Although in my
> defence, all of those entries' conkey contains the FK root table's
> attributes, so at least that much holds. :)

Yes. I think that confusion doesn't hurt the correctness of the
discussion :)

> > > > > On the topic of how we'd be able to share even the RI_ConstraintInfos
> > > > > among partitions, that would indeed look a bit more elaborate than the
> > > > > patch we have right now.
> > > >
> > > > Maybe just letting the hash entry for the child riinfo point to the
> > > > parent riinfo if all members (other than constraint_id, of course)
> > > > share the exactly the same values.  No need to count references since
> > > > we don't going to remove riinfos.
> > >
> > > Ah, something maybe worth trying.  Although the memory we'd save by
> > > sharing the RI_ConstraintInfos would not add that much to the savings
> > > we're having by sharing the plan, because it's the plans that are a
> > > memory hog AFAIK.
> >
> > I agree that plans are rather large but the sharable part of the
> > RI_ConstraintInfos is 536 bytes, I'm not sure it is small enough
> > comparing to the plans.  But that has somewhat large footprint.. (See
> > the attached)
> 
> Thanks for the patch.

That's only to show what that looks like.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add Information during standby recovery conflicts

2020-12-03 Thread Masahiko Sawada
On Fri, Dec 4, 2020 at 2:54 AM Fujii Masao  wrote:
>
>
>
> On 2020/12/01 17:29, Drouvot, Bertrand wrote:
> > Hi,
> >
> > On 12/1/20 12:35 AM, Masahiko Sawada wrote:
> >>
> >>
> >>
> >> On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera  
> >> wrote:
> >>> On 2020-Dec-01, Fujii Masao wrote:
> >>>
>  + if (proc)
>  + {
>  + if (nprocs == 0)
>  + appendStringInfo(&buf, "%d", 
>  proc->pid);
>  + else
>  + appendStringInfo(&buf, ", %d", 
>  proc->pid);
>  +
>  + nprocs++;
> 
>  What happens if all the backends in wait_list have gone? In other words,
>  how should we handle the case where nprocs == 0 (i.e., nprocs has not 
>  been
>  incrmented at all)? This would very rarely happen, but can happen.
>  In this case, since buf.data is empty, at least there seems no need to 
>  log
>  the list of conflicting processes in detail message.
> >>> Yes, I noticed this too; this can be simplified by changing the
> >>> condition in the ereport() call to be "nprocs > 0" (rather than
> >>> wait_list being null), otherwise not print the errdetail.  (You could
> >>> test buf.data or buf.len instead, but that seems uglier to me.)
> >> +1
> >>
> >> Maybe we can also improve the comment of this function from:
> >>
> >> + * This function also reports the details about the conflicting
> >> + * process ids if *wait_list is not NULL.
> >>
> >> to " This function also reports the details about the conflicting
> >> process ids if exist" or something.
> >>
> > Thank you all for the review/remarks.
> >
> > They have been addressed in the new attached patch version.
>
> Thanks for updating the patch! I read through the patch again
> and applied the following chages to it. Attached is the updated
> version of the patch. Could you review this version? If there is
> no issue in it, I'm thinking to commit this version.

Thank you for updating the patch! I have one question.

>
> +   timeouts[cnt].id = STANDBY_TIMEOUT;
> +   timeouts[cnt].type = TMPARAM_AFTER;
> +   timeouts[cnt].delay_ms = DeadlockTimeout;
>
> Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
> I changed the code that way.

As the comment of ResolveRecoveryConflictWithLock() says the
following, a deadlock is detected by the ordinary backend process:

 * Deadlocks involving the Startup process and an ordinary backend proces
 * will be detected by the deadlock detector within the ordinary backend.

If we use STANDBY_DEADLOCK_TIMEOUT,
SendRecoveryConflictWithBufferPin() will be called after
DeadlockTimeout has passed, but I think that's not necessary for the startup
process in this case. If we just want to wake up the startup process,
maybe we can use STANDBY_TIMEOUT here?

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: On login trigger: take three

2020-12-03 Thread Greg Nancarrow
On Tue, Sep 15, 2020 at 2:12 AM Pavel Stehule  wrote:
>
>>
>> It is always possible to login by disabling startup triggers using 
>> disable_session_start_trigger GUC:
>>
>> psql "dbname=postgres options='-c disable_session_start_trigger=true'"
>
>
> sure, I know. Just this behavior can be a very unpleasant surprise, and my 
> question is if it can be fixed.  Creating custom libpq variables can be the 
> stop for people that use pgAdmin.
>

Hi,

I thought in the case of using pgAdmin (assuming you can connect as
superuser to a database, say the default "postgres" maintenance
database, that doesn't have an EVENT TRIGGER defined for the
session_start event) you could issue the query "ALTER SYSTEM SET
disable_session_start_trigger TO true;"  and then reload the
configuration?

Anyway, I am wondering if this patch is still being actively developed/improved?

Regarding the last-posted patch, I'd like to give some feedback. I
found that the documentation part wouldn't build because of errors in
the SGML tags. There are some grammatical errors too, and some minor
inconsistencies with the current documentation, and some descriptions
could be improved. I think that a colon separator should be added to
the NOTICE message for superuser, so it's clear exactly where the text
of the underlying error message starts. Also, I think that
"client_connection" is perhaps a better and more intuitive event name
than "session_start", or the suggested "user_backend_start".
I've therefore attached an updated patch with these suggested minor
improvements, please take a look and see what you think (please
compare with the original patch).

Regards,
Greg Nancarrow
Fujitsu Australia


on_connect_event_trigger_WITH_SUGGESTED_UPDATES.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2020-12-03 Thread Peter Smith
On Thu, Dec 3, 2020 at 6:21 PM Peter Smith  wrote:
> Sorry for any inconvenience. I will add the missing functionality to
> 0009 as soon as I can.
>

PSA a **replacement** patch for the previous v29-0009.

This should correct the recently reported trouble [1]
[1] = 
https://www.postgresql.org/message-id/CAD21AoBnZ6dYffVjOCdSvSohR_1ZNedqmb%3D6P9w_H6W0bK1s6g%40mail.gmail.com

I observed after this patch:
make check is all OK.
cd src/test/subscription, then make check is all OK.

~

Note that the tablesync worker's (temporary) slot always uses
two_phase *off*, regardless of the user setting.

e.g.

CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub
application_name=tap_sub' PUBLICATION tap_pub WITH (streaming = on,
two_phase = on);

will show in the logs that only the apply worker slot enabled the two_phase.

STATEMENT:  START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
(proto_version '2', streaming 'on', two_phase 'on', publication_names
'"tap_pub"')
STATEMENT:  START_REPLICATION SLOT "tap_sub_16395_sync_16385" LOGICAL
0/16076D8 (proto_version '2', streaming 'on', publication_names
'"tap_pub"')

---

Kind Regards,
Peter Smith.
Fujitsu Australia


v29-0009-Support-2PC-txn-Subscription-option.patch
Description: Binary data


Re: Support for NSS as a libpq TLS backend

2020-12-03 Thread Jacob Champion
On Nov 17, 2020, at 7:00 AM, Daniel Gustafsson  wrote:
> 
> Nice, thanks for the fix!  I've incorporated your patch into the attached v20
> which also fixes client side error reporting to be more readable.

I was testing handshake failure modes and noticed that some FATAL
messages are being sent through to the client in cleartext. The OpenSSL
implementation doesn't do this, because it logs handshake problems at
COMMERROR level. Should we switch all those ereport() calls in the NSS
be_tls_open_server() to COMMERROR as well (and return explicitly), to
avoid this? Or was there a reason for logging at FATAL/ERROR level?
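
For comparison, the OpenSSL backend reports handshake problems roughly like
this (a sketch from memory with a made-up message, not an exact quote), which
keeps the detail in the server log without sending an ERROR/FATAL to the
not-yet-encrypted client:

    ereport(COMMERROR,
            (errcode(ERRCODE_PROTOCOL_VIOLATION),
             errmsg("could not accept SSL connection: %s", "some detail")));
    return -1;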

Related note, at the end of be_tls_open_server():

> ...
> port->ssl_in_use = true;
> return 0;
> 
> error:
> return 1;
> }

This needs to return -1 in the error case; the only caller of
secure_open_server() does a direct `result == -1` comparison rather than
checking `result != 0`.

--Jacob



Re: Add Information during standby recovery conflicts

2020-12-03 Thread Fujii Masao




On 2020/12/04 9:28, Masahiko Sawada wrote:

On Fri, Dec 4, 2020 at 2:54 AM Fujii Masao  wrote:




On 2020/12/01 17:29, Drouvot, Bertrand wrote:

Hi,

On 12/1/20 12:35 AM, Masahiko Sawada wrote:




On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera  wrote:

On 2020-Dec-01, Fujii Masao wrote:


+ if (proc)
+ {
+ if (nprocs == 0)
+ appendStringInfo(&buf, "%d", proc->pid);
+ else
+ appendStringInfo(&buf, ", %d", proc->pid);
+
+ nprocs++;

What happens if all the backends in wait_list have gone? In other words,
how should we handle the case where nprocs == 0 (i.e., nprocs has not been
incrmented at all)? This would very rarely happen, but can happen.
In this case, since buf.data is empty, at least there seems no need to log
the list of conflicting processes in detail message.

Yes, I noticed this too; this can be simplified by changing the
condition in the ereport() call to be "nprocs > 0" (rather than
wait_list being null), otherwise not print the errdetail.  (You could
test buf.data or buf.len instead, but that seems uglier to me.)

+1

Maybe we can also improve the comment of this function from:

+ * This function also reports the details about the conflicting
+ * process ids if *wait_list is not NULL.

to " This function also reports the details about the conflicting
process ids if exist" or something.


Thank you all for the review/remarks.

They have been addressed in the new attached patch version.


Thanks for updating the patch! I read through the patch again
and applied the following chages to it. Attached is the updated
version of the patch. Could you review this version? If there is
no issue in it, I'm thinking to commit this version.


Thank you for updating the patch! I have one question.



+   timeouts[cnt].id = STANDBY_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;

Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
I changed the code that way.


As the comment of ResolveRecoveryConflictWithLock() says the
following, a deadlock is detected by the ordinary backend process:

  * Deadlocks involving the Startup process and an ordinary backend proces
  * will be detected by the deadlock detector within the ordinary backend.

If we use STANDBY_DEADLOCK_TIMEOUT,
SendRecoveryConflictWithBufferPin() will be called after
DeadlockTimeout passed, but I think it's not necessary for the startup
process in this case.


Thanks for pointing this! You are right.



If we want to just wake up the startup process
maybe we can use STANDBY_TIMEOUT here?


When STANDBY_TIMEOUT fires, a request to release conflicting buffer pins is
sent. Right? If so, we should not use STANDBY_TIMEOUT there either?

Or, come to think of it, do we need to enable the deadlock timer at all? Since what
we'd like to do is wake up after deadlock_timeout passes, we could do that by
changing ProcWaitForSignal() so that it accepts a timeout and passing
deadlock_timeout to it. If we do this, maybe we can get rid of
STANDBY_LOCK_TIMEOUT from ResolveRecoveryConflictWithLock(). Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-03 Thread Justin Pryzby
On Thu, Dec 03, 2020 at 04:12:53PM +0900, Michael Paquier wrote:
> > +typedef struct ReindexParams {
> > +   bool concurrently;
> > +   bool verbose;
> > +   bool missingok;
> > +
> > +   int options;/* bitmask of lowlevel REINDEXOPT_* */
> > +} ReindexParams;
> > +
> 
> By moving everything into indexcmds.c, keeping ReindexParams within it
> makes sense to me.  Now, there is no need for the three booleans
> because options stores the same information, no?

 I liked the bools, but dropped them so the patch is smaller.

> >  struct ReindexIndexCallbackState
> >  {
> > -   int options;/* options from 
> > statement */
> > +   boolconcurrently;
> > Oid locked_table_oid;   /* tracks previously 
> > locked table */
> >  };
> 
> Here also, I think that we should just pass down the full
> ReindexParams set.

Ok.

Regarding the REINDEX patch, I think this comment is misleading:

|/*
| * If the relation has a secondary toast rel, reindex that too while we
| * still hold the lock on the main table.
| */
|if ((flags & REINDEX_REL_PROCESS_TOAST) && OidIsValid(toast_relid))
|{
|/*
| * Note that this should fail if the toast relation is 
missing, so
| * reset REINDEXOPT_MISSING_OK.
|+*
|+* Even if table was moved to new tablespace, normally toast 
cannot move.
| */
|+   Oid toasttablespaceOid = allowSystemTableMods ? tablespaceOid 
: InvalidOid;
|result |= reindex_relation(toast_relid, flags,
|-  options & 
~(REINDEXOPT_MISSING_OK));
|+  options & 
~(REINDEXOPT_MISSING_OK),
|+  
toasttablespaceOid);
|}

I think it ought to say "Even if a table's indexes were moved to a new
tablespace, its toast table's index is not normally moved"
Right ?

Also, I don't know whether we should check for GLOBALTABLESPACE_OID after
calling get_tablespace_oid(), or in the lowlevel routines.  Note that
reindex_relation is called during cluster/vacuum, and in the later patches, I
moved the test from from cluster() and ExecVacuum() to rebuild_relation().

-- 
Justin
>From df43fe542081178ea74ffb2d1d77342e6c657c2f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 2 Dec 2020 20:54:47 -0600
Subject: [PATCH v32 1/5] ExecReindex and ReindexParams

TODO: typedef
---
 src/backend/commands/indexcmds.c | 151 ---
 src/backend/tcop/utility.c   |  40 +---
 src/include/commands/defrem.h|   7 +-
 3 files changed, 101 insertions(+), 97 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 14d24b3cc4..f0456dcbef 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -67,6 +67,10 @@
 #include "utils/syscache.h"
 
 
+typedef struct ReindexParams {
+	int options;	/* bitmask of lowlevel REINDEXOPT_* */
+} ReindexParams;
+
 /* non-export function prototypes */
 static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
 static void CheckPredicate(Expr *predicate);
@@ -86,12 +90,17 @@ static char *ChooseIndexName(const char *tabname, Oid namespaceId,
 			 bool primary, bool isconstraint);
 static char *ChooseIndexNameAddition(List *colnames);
 static List *ChooseIndexColumnNames(List *indexElems);
+
+static void ReindexIndex(RangeVar *indexRelation, ReindexParams *params, bool isTopLevel);
+static Oid ReindexTable(RangeVar *relation, ReindexParams *params, bool isTopLevel);
+static void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind, ReindexParams *params);
+
 static void RangeVarCallbackForReindexIndex(const RangeVar *relation,
 			Oid relId, Oid oldRelId, void *arg);
 static void reindex_error_callback(void *args);
-static void ReindexPartitions(Oid relid, int options, bool isTopLevel);
-static void ReindexMultipleInternal(List *relids, int options);
-static bool ReindexRelationConcurrently(Oid relationOid, int options);
+static void ReindexPartitions(Oid relid, ReindexParams *params, bool isTopLevel);
+static void ReindexMultipleInternal(List *relids, ReindexParams *params);
+static bool ReindexRelationConcurrently(Oid relationOid, ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
 static inline void set_indexsafe_procflags(void);
 
@@ -100,7 +109,7 @@ static inline void set_indexsafe_procflags(void);
  */
 struct ReindexIndexCallbackState
 {
-	int			options;		/* options from statement */
+	ReindexParams		*params;
 	Oid			locked_table_oid;	/* tracks previously locked table */
 };
 
@@ -2452,16 +2461,19 @@ ChooseIndexColumnNames(List *indexElems)
 }
 
 /*
- * ReindexParseOptions
- *		Parse list of REINDE

Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-12-03 Thread Michael Paquier
On Thu, Dec 03, 2020 at 10:58:39PM +0100, Daniel Gustafsson wrote:
> This version looks good to me, and builds/tests without any issues.  While I
> didn't try to adapt the libnss patch to the resowner machinery, I don't see 
> any
> reasons off the cuff why it wouldn't work with the scaffolding provided here.

Based on my read of the code in lib/freebl/, SHA256ContextStr & co
hold the context data for SHA2, but are headers like sha256.h
installed?  I don't know enough about NSS to be able to answer
that.  If, like OpenSSL, the context internals are not provided, I
think that you could use SHA256_NewContext() and track the allocation
with the resource owner callbacks, but doing a palloc() would be 
much simpler if the context internals are available.

> My only question is:
> 
> +#ifndef FRONTEND
> +   elog(ERROR, "out of memory");
> Shouldn't that be an ereport using ERRCODE_OUT_OF_MEMORY?

That makes sense, fixed.

I have done more testing across all versions of OpenSSL, and applied
this one, meaning that we are done for SHA2.  Thanks for the reviews!
Now, moving back to MD5..
--
Michael


signature.asc
Description: PGP signature


Re: Single transaction in the tablesync worker?

2020-12-03 Thread Craig Ringer
On Thu, 3 Dec 2020 at 17:25, Amit Kapila  wrote:

> Is there any fundamental problem if
> we commit the transaction after initial copy and slot creation in
> LogicalRepSyncTableStart and then allow the apply of transactions as
> it happens in apply worker?

No fundamental problem. Both approaches are fine. Committing the
initial copy then doing the rest in individual txns means an
incomplete sync state for the table becomes visible, which may not be
ideal. Ideally we'd do something like sync the data into a clone of
the table then swap the table relfilenodes out once we're synced up.

IMO the main advantage of committing as we go is that it would let us
use a non-temporary slot and support recovering an incomplete sync and
finishing it after interruption by connection loss, crash, etc. That
would be advantageous for big table syncs or where the sync has lots
of lag to replay. But it means we have to remember sync states, and
give users a way to cancel/abort them. Otherwise forgotten temp slots
for syncs will cause a mess on the upstream.

It also allows the sync slot to advance, freeing any held upstream
resources before the whole sync is done, which is good if the upstream
is busy and generating lots of WAL.

Finally, committing as we go means we won't exceed the cid increment
limit in a single txn.

> The reason why I am looking into this area is to support the logical
> decoding of prepared transactions. See the problem [1] reported by
> Peter Smith. Basically, when we stream prepared transactions in the
> tablesync worker, it will simply commit the same due to the
> requirement of maintaining a single transaction for the entire
> duration of copy and streaming of transactions. Now, we can fix that
> problem by disabling the decoding of prepared xacts in tablesync
> worker.

Tablesync should indeed only receive a txn when the commit arrives; it
should not attempt to handle uncommitted prepared xacts.

> But that will arise to a different kind of problems like the
> prepare will not be sent by the publisher but a later commit might
> move lsn to a later step which will allow it to catch up till the
> apply worker. So, now the prepared transaction will be skipped by both
> tablesync and apply worker.

I'm not sure I understand. If what you describe is possible then
there's already a bug in prepared xact handling. Prepared xact commit
progress should be tracked by commit lsn, not by prepare lsn.

Can you set out the ordering of events in more detail?

> I think apart from unblocking the development of 'logical decoding of
> prepared xacts', it will make the code consistent between apply and
> tablesync worker and reduce the chances of future bugs in this area.
> Basically, it will reduce the checks related to am_tablesync_worker()
> at various places in the code.

I think we made similar changes in pglogical to switch to applying
sync work in individual txns.




Re: Single transaction in the tablesync worker?

2020-12-03 Thread Amit Kapila
On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
 wrote:
>
> On Thu, 3 Dec 2020 at 17:25, Amit Kapila  wrote:
>
>
> > The reason why I am looking into this area is to support the logical
> > decoding of prepared transactions. See the problem [1] reported by
> > Peter Smith. Basically, when we stream prepared transactions in the
> > tablesync worker, it will simply commit the same due to the
> > requirement of maintaining a single transaction for the entire
> > duration of copy and streaming of transactions. Now, we can fix that
> > problem by disabling the decoding of prepared xacts in tablesync
> > worker.
>
> Tablesync should indeed only receive a txn when the commit arrives, it
> should not attempt to handle uncommitted prepared xacts.
>

Why? If we go with the commit-as-we-go approach for individual
transactions in the tablesync worker, then this shouldn't be a problem.

> > But that will give rise to a different kind of problem: the
> > prepare will not be sent by the publisher, but a later commit might
> > move the LSN to a later position, allowing the tablesync worker to catch
> > up to the apply worker. So, now the prepared transaction will be skipped
> > by both tablesync and apply worker.
>
> I'm not sure I understand. If what you describe is possible then
> there's already a bug in prepared xact handling. Prepared xact commit
> progress should be tracked by commit lsn, not by prepare lsn.
>

Oh no, I am talking about commit of some other transaction.

> Can you set out the ordering of events in more detail?
>

Sure. It will be something like the following, where the apply worker is ahead of the sync worker:

Assume t1 has some data which tablesync worker has to first copy.

tx1
Begin;
Insert into t1
Prepare Transaction 'foo'

tx2
Begin;
Insert into t1
Commit

apply worker
• tx1: replays - does not apply anything because
should_apply_changes_for_rel thinks relation is not ready
• tx2: replays - does not apply anything because
should_apply_changes_for_rel thinks relation is not ready

tablesync worker
• tx1: handles: BEGIN - INSERT - PREPARE 'foo';  (but tablesync gets
nothing because, say, we disable 2-PC for it)
• tx2: handles: BEGIN - INSERT - COMMIT;
• tablesync exits

Now the situation is that the apply worker has skipped the prepared
xact data and tablesync worker has not received it, so not applied it.
Next, when we get Commit Prepared for tx1, it will silently commit the
prepared transaction without any data being updated. The commit
prepared won't error out in subscriber because the prepare would have
been successful even though the data is skipped via
should_apply_changes_for_rel.

> > I think apart from unblocking the development of 'logical decoding of
> > prepared xacts', it will make the code consistent between apply and
> > tablesync worker and reduce the chances of future bugs in this area.
> > Basically, it will reduce the checks related to am_tablesync_worker()
> > at various places in the code.
>
> I think we made similar changes in pglogical to switch to applying
> sync work in individual txns.
>

oh, cool. Did you make some additional changes as you have mentioned
in the earlier part of the email?

-- 
With Regards,
Amit Kapila.




RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-03 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> 1) What happens if a partitioned table has a foreign partition along
> with few other local partitions[1]? Currently, if we try to set
> logged/unlogged of a foreign table, then an "ERROR:  "" is not a
> table" is thrown. This makes sense. With your patch also we see the
> same error, but I'm not quite sure, whether it is setting the parent
> and local partitions to logged/unlogged and then throwing the error?
> Or do we want the parent relation and the all the local partitions
> become logged/unlogged and give a warning saying foreign table can not
> be made logged/unlogged?

Good point, thanks.  I think the foreign partitions should be ignored, 
otherwise the user would have to ALTER each local partition manually or detach 
the foreign partitions temporarily.  Done like that.


> 2) What is the order in which the parent table and the partitions are
> made logged/unlogged? Is it that first the parent table and then all
> the partitions? What if an error occurs as in above point for a
> foreign partition or a partition having foreign key reference to a
> logged table? During the rollback of the txn, will we undo the setting
> logged/unlogged?

The parent is not changed because it does not have storage.
If some partition has undesirable foreign key relationship, the entire ALTER 
command fails.  All the effects are undone when the transaction rolls back.


> 3) Say, I have two logged tables t1 and t2. I altered t1 to be
> unlogged, and then I attach logged table t2 as a partition to t1, then
> what's the expectation? While attaching the partition, should we also
> make t2 as unlogged?

The attached partition retains its property.


> 4) Currently, if we try to set logged/unlogged of a foreign table,
> then an "ERROR:  "" is not a table" is thrown. I also see that, in
> general ATWrongRelkindError() throws an error saying the given
> relation is not of expected types, but it doesn't say what is the
> given relation kind. Should we improve ATWrongRelkindError() by
> passing the actual relation type along with the allowed relation types
> to show a bit more informative error message, something like "ERROR:
> "" is a foreign table" with a hint "Allowed relation types are
> table, view, index."

Ah, maybe that's a bit more friendly.  But I don't think it's worth bothering 
to mess ATWrongRelkindError() with a long switch statement to map a relation 
kind to its string representation.  Anyway, I'd like it to be a separate topic.


> 5) Coming to the patch, it is missing to add test cases.

Yes, added in the revised patch.

I added this to the next CF.


Regards
Takayuki Tsunakawa



v2-0001-Make-ALTER-TABLE-SET-LOGGED-UNLOGGED-on-a-partiti.patch
Description:  v2-0001-Make-ALTER-TABLE-SET-LOGGED-UNLOGGED-on-a-partiti.patch


Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Keisuke Kuroda
Hi Amit,

> I have attached a patch in which I've tried to merge the ideas from
> both my patch and Kuroda-san's.  I liked that his patch added
> conparentid to RI_ConstraintInfo because that saves a needless
> syscache lookup for constraints that don't have a parent.  I've kept
> my idea to compute the root constraint id only once in
> ri_LoadConstraint(), not on every invocation of ri_BuildQueryKey().
> Kuroda-san, anything you'd like to add to that?

Thank you for the merge! It looks good to me.
I think a fix for InvalidateConstraintCacheCallBack() is also good.

I also confirmed that the patch passed the make check-world.

Best Regards,
-- 
Keisuke Kuroda
NTT Software Innovation Center
keisuke.kuroda.3...@gmail.com




Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-03 Thread Bharath Rupireddy
On Fri, Dec 4, 2020 at 8:22 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Bharath Rupireddy 
> > 1) What happens if a partitioned table has a foreign partition along
> > with few other local partitions[1]? Currently, if we try to set
> > logged/unlogged of a foreign table, then an "ERROR:  "" is not a
> > table" is thrown. This makes sense. With your patch also we see the
> > same error, but I'm not quite sure, whether it is setting the parent
> > and local partitions to logged/unlogged and then throwing the error?
> > Or do we want the parent relation and the all the local partitions
> > become logged/unlogged and give a warning saying foreign table can not
> > be made logged/unlogged?
>
> Good point, thanks.  I think the foreign partitions should be ignored, 
> otherwise the user would have to ALTER each local partition manually or 
> detach the foreign partitions temporarily.  Done like that.
>
>
> > 2) What is the order in which the parent table and the partitions are
> > made logged/unlogged? Is it that first the parent table and then all
> > the partitions? What if an error occurs as in above point for a
> > foreign partition or a partition having foreign key reference to a
> > logged table? During the rollback of the txn, will we undo the setting
> > logged/unlogged?
>
> The parent is not changed because it does not have storage.
> If some partition has undesirable foreign key relationship, the entire ALTER 
> command fails.  All the effects are undone when the transaction rolls back.
>
>
> > 3) Say, I have two logged tables t1 and t2. I altered t1 to be
> > unlogged, and then I attach logged table t2 as a partition to t1, then
> > what's the expectation? While attaching the partition, should we also
> > make t2 as unlogged?
>
> The attached partition retains its property.
>
>
> > 4) Currently, if we try to set logged/unlogged of a foreign table,
> > then an "ERROR:  "" is not a table" is thrown. I also see that, in
> > general ATWrongRelkindError() throws an error saying the given
> > relation is not of expected types, but it doesn't say what is the
> > given relation kind. Should we improve ATWrongRelkindError() by
> > passing the actual relation type along with the allowed relation types
> > to show a bit more informative error message, something like "ERROR:
> > "" is a foreign table" with a hint "Allowed relation types are
> > table, view, index."
>
> Ah, maybe that's a bit more friendly.  But I don't think it's worth bothering 
> to mess ATWrongRelkindError() with a long switch statement to map a relation 
> kind to its string representation.  Anyway, I'd like it to be a separate 
> topic.
>
>
> > 5) Coming to the patch, it is missing to add test cases.
>
> Yes, added in the revised patch.
>
> I added this to the next CF.
>

Thanks! I will review the v2 patch and provide my thoughts.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Kyotaro Horiguchi
Thanks, but sorry for the confusion.

I intended just to show how it would look if we shared
RI_ConstraintInfo among partition relations.

At Thu, 3 Dec 2020 10:22:47 -0300, Alvaro Herrera  
wrote in 
> Hello
> 
> I haven't followed this thread's latest posts, but I'm unclear on the
> lifetime of the new struct that's being allocated in TopMemoryContext.
> At what point are those structs freed?

The choice of memory context is tentative, made mainly to keep the
patch's footprint small. I don't think we can use CurrentDynaHashCxt for
the additional struct, so a context for this purpose is needed.

The struct is freed only when the parent struct (RI_ConstraintInfo) is
found to be able to share the child struct (RI_ConstraintParam) with
the parent constraint.  That seems inefficient (or prone to leaving
"holes" in the heap area), but I chose it just to shrink the footprint.

We could create the new RI_ConstraintInfo on the stack and then copy it
into the cache once we find that the RI_ConstraintInfo needs its own
RI_ConstraintParam.
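
To illustrate one possible shape of that idea (a sketch only, not the
patch: fill_param_from_catalog() and find_shareable_param() are
hypothetical stand-ins for the catalog-reading and lookup code), the
sharable part could be filled in on the stack first and copied into a
long-lived allocation only when sharing turns out to be impossible:

    /* Hypothetical sketch of the stack-then-copy approach. */
    RI_ConstraintParam tmp;
    RI_ConstraintParam *shared;

    fill_param_from_catalog(&tmp, conForm);     /* stand-in helper */

    shared = find_shareable_param(&tmp);        /* stand-in lookup */
    if (shared != NULL)
        riinfo->param = shared;                 /* share an existing param */
    else
    {
        /* Pay for a long-lived allocation only when it is really needed. */
        riinfo->param = MemoryContextAlloc(ri_constraint_cache_cxt,
                                           sizeof(RI_ConstraintParam));
        memcpy(riinfo->param, &tmp, sizeof(RI_ConstraintParam));
        riinfo->param->ownerinfo = riinfo;
    }

That way the common, sharable case never touches the cache context at all.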

> Also, the comment that was in RI_ConstraintInfo now appears in
> RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> undocumented.  What is the relationship between those two structs?  I
> see that they have pointers to each other, but I think the relationship
> should be documented more clearly.

I'm not sure the footprint of this patch is worth it, but here is a bit
more polished version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 55946e09d33fc7fa43bed04ef548bf8a3f67155d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 3 Dec 2020 16:15:57 +0900
Subject: [PATCH 1/2] separte riinfo

---
 src/backend/utils/adt/ri_triggers.c | 290 
 1 file changed, 168 insertions(+), 122 deletions(-)

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 02b1a3868f..0306bf7739 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -97,14 +97,29 @@
  * Information extracted from an FK pg_constraint entry.  This is cached in
  * ri_constraint_cache.
  */
+struct RI_ConstraintParam;
+
 typedef struct RI_ConstraintInfo
 {
-	Oid			constraint_id;	/* OID of pg_constraint entry (hash key) */
+	Oid			constraint_id;
 	bool		valid;			/* successfully initialized? */
 	uint32		oidHashValue;	/* hash value of pg_constraint OID */
 	NameData	conname;		/* name of the FK constraint */
-	Oid			pk_relid;		/* referenced relation */
 	Oid			fk_relid;		/* referencing relation */
+	struct RI_ConstraintParam *param;	/* sharable part  */
+	dlist_node	valid_link;		/* Link in list of valid entries */
+} RI_ConstraintInfo;
+
+/*
+ * RI_ConstraintParam
+ *
+ * The part sharable among relations in a partitioned table of the cached
+ * constraint information.
+ */
+typedef struct RI_ConstraintParam
+{
+	/* begin with identity members */
+	Oid			pk_relid;		/* referenced relation */
 	char		confupdtype;	/* foreign key's ON UPDATE action */
 	char		confdeltype;	/* foreign key's ON DELETE action */
 	char		confmatchtype;	/* foreign key's match type */
@@ -114,8 +129,12 @@ typedef struct RI_ConstraintInfo
 	Oid			pf_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (PK = FK) */
 	Oid			pp_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (PK = PK) */
 	Oid			ff_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (FK = FK) */
-	dlist_node	valid_link;		/* Link in list of valid entries */
-} RI_ConstraintInfo;
+
+	/* These should be at the end of struct, see ri_LoadConstraintInfo */
+	Oid			query_key;		/* key for planned statment */
+	RI_ConstraintInfo *ownerinfo; /* owner RI_ConstraintInfo of this param */
+} RI_ConstraintParam;
+
 
 /*
  * RI_QueryKey
@@ -163,6 +182,7 @@ typedef struct RI_CompareHashEntry
 /*
  * Local data
  */
+static MemoryContext ri_constraint_cache_cxt = NULL;
 static HTAB *ri_constraint_cache = NULL;
 static HTAB *ri_query_cache = NULL;
 static HTAB *ri_compare_cache = NULL;
@@ -264,7 +284,7 @@ RI_FKey_check(TriggerData *trigdata)
 	 * SELECT FOR KEY SHARE will get on it.
 	 */
 	fk_rel = trigdata->tg_relation;
-	pk_rel = table_open(riinfo->pk_relid, RowShareLock);
+	pk_rel = table_open(riinfo->param->pk_relid, RowShareLock);
 
 	switch (ri_NullCheck(RelationGetDescr(fk_rel), newslot, riinfo, false))
 	{
@@ -283,7 +303,7 @@ RI_FKey_check(TriggerData *trigdata)
 			 * This is the only case that differs between the three kinds of
 			 * MATCH.
 			 */
-			switch (riinfo->confmatchtype)
+			switch (riinfo->param->confmatchtype)
 			{
 case FKCONSTR_MATCH_FULL:
 
@@ -364,17 +384,17 @@ RI_FKey_check(TriggerData *trigdata)
 		appendStringInfo(&querybuf, "SELECT 1 FROM %s%s x",
 		 pk_only, pkrelname);
 		querysep = "WHERE";
-		for (int i = 0; i < riinfo->nkeys; i++)
+		for (int i = 0; i < riinfo->param->nkeys; i++)
 		{
-			Oid			pk_type = RIAttType(pk_rel, riinfo->pk_attnums[i]);
-			Oid			fk_type = RIAttType(fk_rel, riinfo->fk_attnums[i]);
+			Oid			pk_type = RIAttType(pk_rel, rii

RE: A new function to wait for the backend exit after termination

2020-12-03 Thread Hou, Zhijie
Hi,

-however only superusers can terminate superuser backends.
+however only superusers can terminate superuser backends. When no
+wait and timeout are
+provided, only SIGTERM is sent to the backend with the given process
+ID and false is returned immediately. But the

I tested the case where no wait and timeout are provided.
True is returned, as shown below, which seems different from the doc.

postgres=# select pg_terminate_backend(pid);
 pg_terminate_backend 
--
 t
(1 row)

Best regards,
houzj





Re: autovac issue with large number of tables

2020-12-03 Thread Kasahara Tatsuhito
Hi,

On Thu, Dec 3, 2020 at 9:09 PM Fujii Masao  wrote:
>
>
>
> On 2020/12/03 11:46, Kasahara Tatsuhito wrote:
> > On Wed, Dec 2, 2020 at 7:11 PM Masahiko Sawada  
> > wrote:
> >>
> >> On Wed, Dec 2, 2020 at 3:33 PM Fujii Masao  
> >> wrote:
> >>>
> >>>
> >>>
> >>> On 2020/12/02 12:53, Masahiko Sawada wrote:
>  On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada  
>  wrote:
> >
> > On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao 
> >  wrote:
> >>
> >>
> >>
> >> On 2020/12/01 16:23, Masahiko Sawada wrote:
> >>> On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito
> >>>  wrote:
> 
>  Hi,
> 
>  On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao 
>   wrote:
> >
> >
> >
> > On 2020/11/30 10:43, Masahiko Sawada wrote:
> >> On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito
> >>  wrote:
> >>>
> >>> Hi, Thanks for you comments.
> >>>
> >>> On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao 
> >>>  wrote:
> 
> 
> 
>  On 2020/11/27 18:38, Kasahara Tatsuhito wrote:
> > Hi,
> >
> > On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao 
> >  wrote:
> >>
> >>
> >>
> >> On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
> >>> On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada 
> >>>  wrote:
> 
>  On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
>   wrote:
> >
> > Hi,
> >
> > On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada 
> >  wrote:
> >>
> >> On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
> >>  wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> >>>  wrote:
> > I wonder if we could have table_recheck_autovac do two 
> > probes of the stats
> > data.  First probe the existing stats data, and if it 
> > shows the table to
> > be already vacuumed, return immediately.  If not, 
> > *then* force a stats
> > re-read, and check a second time.
>  Does the above mean that the second and subsequent 
>  table_recheck_autovac()
>  will be improved to first check using the previous 
>  refreshed statistics?
>  I think that certainly works.
> 
>  If that's correct, I'll try to create a patch for the PoC
> >>>
> >>> I still don't know how to reproduce Jim's troubles, but I 
> >>> was able to reproduce
> >>> what was probably a very similar problem.
> >>>
> >>> This problem seems to be more likely to occur in cases 
> >>> where you have
> >>> a large number of tables,
> >>> i.e., a large amount of stats, and many small tables need 
> >>> VACUUM at
> >>> the same time.
> >>>
> >>> So I followed Tom's advice and created a patch for the 
> >>> PoC.
> >>> This patch will enable a flag in the 
> >>> table_recheck_autovac function to use
> >>> the existing stats next time if VACUUM (or ANALYZE) has 
> >>> already been done
> >>> by another worker on the check after the stats have been 
> >>> updated.
> >>> If the tables continue to require VACUUM after the 
> >>> refresh, then a refresh
> >>> will be required instead of using the existing statistics.
> >>>
> >>> I did simple test with HEAD and HEAD + this PoC patch.
> >>> The tests were conducted in two cases.
> >>> (I changed few configurations. see attached scripts)
> >>>
> >>> 1. Normal VACUUM case
> >>> - SET autovacuum = off
> >>> - CREATE tables with 100 rows
> >>> - DELETE 90 rows for each tables
> >>> - SET autovacuum = on and restart PostgreSQL
> >>> - Measure the time it takes for all tables to be 
> >>> VACUUMed
> >>>
> >>> 2. Anti-wraparound VACUUM case
> >>> - CREATE blank tables
> >>> - SELECT all of these tables (to generate stats)
> >>

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-03 Thread Tang, Haiying
Hello, Kirk

Thanks for providing the new patches.
I did the recovery performance test on them, the results look good. I'd like to 
share them with you and everyone else. 
(I also record VACUUM and TRUNCATE execution time on master/primary in case you 
want to have a look.)  

1. VACUUM and Failover test results (average of 15 times)
[VACUUM] ---execution time on master/primary
shared_buffers    master(sec)    patched(sec)    %reg=((patched-master)/master)
--------------------------------------------------------------------------------
128M                    9.440           9.483      0%
10G                    74.689          76.219      2%
20G                   152.538         138.292     -9%

[Failover] ---execution time on standby
shared_buffers    master(sec)    patched(sec)    %reg=((patched-master)/master)
--------------------------------------------------------------------------------
128M                    3.629           2.961    -18%
10G                    82.443           2.627    -97%
20G                   171.388           2.607    -98%

2. TRUNCATE and Failover test results (average of 15 times)
[TRUNCATE] ---execution time on master/primary
shared_buffers    master(sec)    patched(sec)    %reg=((patched-master)/master)
--------------------------------------------------------------------------------
128M                   49.271          49.867      1%
10G                   172.437         175.197      2%
20G                   279.658         278.752      0%

[Failover] ---execution time on standby
shared_buffers    master(sec)    patched(sec)    %reg=((patched-master)/master)
--------------------------------------------------------------------------------
128M                    4.877           3.989    -18%
10G                    92.680           3.975    -96%
20G                   182.035           3.962    -98%

[Machine spec]
CPU : 40 processors  (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 64G
OS: CentOS 8

[Failover test data]
Total table Size: 700M
Table: 1 tables (1000 rows per table)

If you have question on my test, please let me know.

Regards,
Tang






RE: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-12-03 Thread Shinya11.Kato
When I executed pg_waldump, I found that XLOG/SWITCH_JUNK appears twice.
Is this problem solved by the same fix as the previously discussed
Transaction/COMMIT issue?

$ ../bin/pg_waldump --stats=record ../data/pg_wal/00010001
Type                          N      (%)    Record size    (%)    FPI size    (%)    Combined size    (%)
----                          -      ---    -----------    ---    --------    ---    -------------    ---
XLOG/CHECKPOINT_SHUTDOWN      5 (  0.01)            570 (  0.01)         0 (  0.00)            570 (  0.01)
XLOG/CHECKPOINT_ONLINE        6 (  0.02)            684 (  0.02)         0 (  0.00)            684 (  0.01)
XLOG/NEXTOID                  3 (  0.01)             90 (  0.00)         0 (  0.00)             90 (  0.00)
XLOG/FPI                    290 (  0.80)          14210 (  0.34)    638216 ( 40.72)         652426 ( 11.30)
Transaction/COMMIT           12 (  0.03)            408 (  0.01)         0 (  0.00)            408 (  0.01)
Transaction/COMMIT          496 (  1.36)         134497 (  3.20)         0 (  0.00)         134497 (  2.33)
Storage/CREATE               13 (  0.04)            546 (  0.01)         0 (  0.00)            546 (  0.01)
CLOG/ZEROPAGE                 1 (  0.00)             30 (  0.00)         0 (  0.00)             30 (  0.00)
Database/CREATE               2 (  0.01)             84 (  0.00)         0 (  0.00)             84 (  0.00)
Standby/LOCK                142 (  0.39)           5964 (  0.14)         0 (  0.00)           5964 (  0.10)
Standby/RUNNING_XACTS        13 (  0.04)            666 (  0.02)         0 (  0.00)            666 (  0.01)
Standby/INVALIDATIONS       136 (  0.37)          12416 (  0.30)         0 (  0.00)          12416 (  0.22)
Heap2/CLEAN                 132 (  0.36)           8994 (  0.21)         0 (  0.00)           8994 (  0.16)
Heap2/FREEZE_PAGE           245 (  0.67)         168704 (  4.01)         0 (  0.00)         168704 (  2.92)
Heap2/CLEANUP_INFO            2 (  0.01)             84 (  0.00)         0 (  0.00)             84 (  0.00)
Heap2/VISIBLE               424 (  1.16)          25231 (  0.60)    352256 ( 22.48)         377487 (  6.54)
XLOG/SWITCH_JUNK              0 (  0.00)              0 (  0.00)         0 (  0.00)              0 (  0.00)
Heap2/MULTI_INSERT         1511 (  4.15)         287727 (  6.84)     12872 (  0.82)         300599 (  5.21)
Heap2/MULTI_INSERT+INIT      46 (  0.13)          71910 (  1.71)         0 (  0.00)          71910 (  1.25)
Heap/INSERT                8849 ( 24.31)        1288414 ( 30.62)     25648 (  1.64)        1314062 ( 22.76)
Heap/DELETE                  25 (  0.07)           1350 (  0.03)         0 (  0.00)           1350 (  0.02)
Heap/UPDATE                 173 (  0.48)          55238 (  1.31)      5964 (  0.38)          61202 (  1.06)
Heap/HOT_UPDATE             257 (  0.71)          27585 (  0.66)      1300 (  0.08)          28885 (  0.50)
XLOG/SWITCH_JUNK              0 (  0.00)              0 (  0.00)         0 (  0.00)              0 (  0.00)
Heap/LOCK                   180 (  0.49)           9800 (  0.23)    129812 (  8.28)         139612 (  2.42)
Heap/INPLACE                214 (  0.59)          44520 (  1.06)     40792 (  2.60)          85312 (  1.48)
Heap/INSERT+INIT            171 (  0.47)         171318 (  4.07)         0 (  0.00)         171318 (  2.97)

Regards,
Shinya Kato


Re: Single transaction in the tablesync worker?

2020-12-03 Thread Amit Kapila
On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
 wrote:
>
> On Thu, 3 Dec 2020 at 17:25, Amit Kapila  wrote:
>
> > Is there any fundamental problem if
> > we commit the transaction after initial copy and slot creation in
> > LogicalRepSyncTableStart and then allow the apply of transactions as
> > it happens in apply worker?
>
> No fundamental problem. Both approaches are fine. Committing the
> initial copy then doing the rest in individual txns means an
> incomplete sync state for the table becomes visible, which may not be
> ideal. Ideally we'd do something like sync the data into a clone of
> the table then swap the table relfilenodes out once we're synced up.
>
> IMO the main advantage of committing as we go is that it would let us
> use a non-temporary slot and support recovering an incomplete sync and
> finishing it after interruption by connection loss, crash, etc. That
> would be advantageous for big table syncs or where the sync has lots
> of lag to replay. But it means we have to remember sync states, and
> give users a way to cancel/abort them. Otherwise forgotten temp slots
> for syncs will cause a mess on the upstream.
>
> It also allows the sync slot to advance, freeing any held upstream
> resources before the whole sync is done, which is good if the upstream
> is busy and generating lots of WAL.
>
> Finally, committing as we go means we won't exceed the cid increment
> limit in a single txn.
>


Yeah, all of these are advantages of processing
transaction-by-transaction. IIUC, we primarily need to do two things
to achieve it: one is to have an additional state in the catalog (say
"catchup") which says that the initial copy is done. Then we need to
have a permanent slot whose progress we can track, so that after a
restart (due to crash, connection break, etc.) we can start from the
appropriate position.

Apart from the above, I think that with the current design of tablesync
we can see partial data of transactions, because we allow all the
tablesync workers to run in parallel. Consider the below scenario:

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

Tx1
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl;

CREATE SUBSCRIPTION mysub
 CONNECTION 'host=localhost port=5432 dbname=postgres'
PUBLICATION mypublication;

Tx2
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
INSERT INTO mytbl2(somedata, text) VALUES (1, 2);
Commit;

Tx3
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT INTO mytbl2(somedata, text) VALUES (1, 3);
Commit;

Now, I could see the below results on subscriber:

postgres=# select * from mytbl1;
 id | somedata | text
+--+--
(0 rows)


postgres=# select * from mytbl2;
 id | somedata | text
+--+--
  1 |1 | 1
  2 |1 | 2
  3 |1 | 3
(3 rows)

Basically, the results for Tx1, Tx2, Tx3 are visible for mytbl2 but
not for mytbl1. To reproduce this I have stopped the tablesync workers
(via debugger) for mytbl1 and mytbl2 in LogicalRepSyncTableStart
before it changes the relstate to SUBREL_STATE_SYNCWAIT. Then allowed
Tx2 and Tx3 to be processed by apply worker and then allowed tablesync
worker for mytbl2 to proceed. After that, I can see the above state.

Now, won't this behavior be considered a transaction inconsistency,
where partial transaction data or later transaction data is visible? I
don't think we can have such a situation on the master (publisher)
node or on a physical standby.

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2020-12-03 Thread Amit Kapila
On Fri, Dec 4, 2020 at 10:29 AM Amit Kapila  wrote:
>
> On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
>  wrote:
> >
> > On Thu, 3 Dec 2020 at 17:25, Amit Kapila  wrote:
> >
> > > Is there any fundamental problem if
> > > we commit the transaction after initial copy and slot creation in
> > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > it happens in apply worker?
> >
> > No fundamental problem. Both approaches are fine. Committing the
> > initial copy then doing the rest in individual txns means an
> > incomplete sync state for the table becomes visible, which may not be
> > ideal. Ideally we'd do something like sync the data into a clone of
> > the table then swap the table relfilenodes out once we're synced up.
> >
> > IMO the main advantage of committing as we go is that it would let us
> > use a non-temporary slot and support recovering an incomplete sync and
> > finishing it after interruption by connection loss, crash, etc. That
> > would be advantageous for big table syncs or where the sync has lots
> > of lag to replay. But it means we have to remember sync states, and
> > give users a way to cancel/abort them. Otherwise forgotten temp slots
> > for syncs will cause a mess on the upstream.
> >
> > It also allows the sync slot to advance, freeing any held upstream
> > resources before the whole sync is done, which is good if the upstream
> > is busy and generating lots of WAL.
> >
> > Finally, committing as we go means we won't exceed the cid increment
> > limit in a single txn.
> >
>
> Yeah, all of these are advantages of processing
> transaction-by-transaction. IIUC, we primarily need to do two things
> to achieve it: one is to have an additional state in the catalog (say
> "catchup") which says that the initial copy is done. Then we need to
> have a permanent slot whose progress we can track, so that after a
> restart (due to crash, connection break, etc.) we can start from the
> appropriate position.
>
> Apart from the above, I think that with the current design of tablesync
> we can see partial data of transactions, because we allow all the
> tablesync workers to run in parallel. Consider the below scenario:
>
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
> Tx1
> BEGIN;
> INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
> COMMIT;
>
> CREATE PUBLICATION mypublication FOR TABLE mytbl;
>

oops, the above statement should be CREATE PUBLICATION mypublication
FOR TABLE mytbl1, mytbl2;

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-03 Thread Kyotaro Horiguchi
Thanks for the new version.

This contains only replies. I'll send some further comments in another
mail later.
   
At Thu, 3 Dec 2020 03:49:27 +, "k.jami...@fujitsu.com" 
 wrote in 
> On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote:
> > Hello, Kirk. Thank you for the new version.
> 
> Apologies for the delay, but attached are the updated versions to simplify 
> the patches.
> The changes reflected most of your comments/suggestions.
> 
> Summary of changes in the latest versions.
> 1. Updated the function description of DropRelFileNodeBuffers in 0003.
> 2. Updated the commit logs of 0003 and 0004.
> 3, FindAndDropRelFileNodeBuffers is now called for each relation fork,
>   instead of for all involved forks.
> 4. Removed the unnecessary palloc() and subscripts like forks[][],
>firstDelBlock[], nforks, as advised by Horiguchi-san. The memory
>allocation for block[][] was also simplified.
>So 0004 became simpler and more readable.
...
> > > a reliable size of nblocks for supplied relation's fork at that time,
> > > and it's safe because DropRelFileNodeBuffers() relies on the behavior
> > > that cached nblocks will not be invalidated by file extension during
> > > recovery.  Otherwise, or if not in recovery, proceed to sequential
> > > search of the whole buffer pool.
> > 
> > This sentence seems confusing. It reads as if "we can rely on it
> > because we're relying on it".  And "the cached value won't be invalidated"
> > doesn't explain the reason precisely. The reason, I think, is that the cached
> > value is guaranteed to be the maximum page we have in shared buffers, at
> > least during recovery, and that guarantee holds because we don't ask fseek
> > again once we have cached the value.
> 
> Fixed the commit log of 0003.

Thanks!

...
> > +   nforks = palloc(sizeof(int) * n);
> > +   forks = palloc(sizeof(ForkNumber *) * n);
> > +   blocks = palloc(sizeof(BlockNumber *) * n);
> > +   firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM
> > + 1));
> > +   for (i = 0; i < n; i++)
> > +   {
> > +   forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM +
> > 1));
> > +   blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM
> > + 1));
> > +   }
> > 
> > We can allocate the whole array at once like this.
> > 
> >  BlockNumber (*blocks)[MAX_FORKNUM+1] =
> >   (BlockNumber (*)[MAX_FORKNUM+1])
> >   palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1))
> 
> Thank you for suggesting to reduce the lines for the 2d dynamic memory alloc.
> I followed this way in 0004, but it's my first time to see it written this 
> way.
> I am very glad it works, though is it okay to write it this way since I 
> cannot find
> a similar code of declaring and allocating 2D arrays like this in Postgres 
> source code?

Actually, it may be somewhat novel for some people, but it is
fundamentally the same as function pointers.  Hard to write from
scratch, but I suppose not so hard to read :)

int (*func_char_to_int)(char x) = some_func;

FWIW isn.c has the following part:

> static bool
> check_table(const char *(*TABLE)[2], const unsigned TABLE_index[10][2])
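
For readers less used to that notation, here is a tiny stand-alone sketch
(illustrative only, not code from the patch) of carving an
n x (MAX_FORKNUM + 1) matrix out of a single palloc by using a
pointer-to-array type, which is the idiom being discussed:

    /* n is the number of relations, assumed to be set already. */
    BlockNumber (*blocks)[MAX_FORKNUM + 1];

    blocks = (BlockNumber (*)[MAX_FORKNUM + 1])
        palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

    for (int i = 0; i < n; i++)
        for (int j = 0; j <= MAX_FORKNUM; j++)
            blocks[i][j] = InvalidBlockNumber;  /* mark the fork as absent */

    /* ... fill in and use blocks[i][j] with normal 2D indexing ... */

    pfree(blocks);      /* one pfree releases the whole matrix */

Compared with one palloc per row, the data stays contiguous and cleanup
is a single call.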


> > +   nBlocksToInvalidate += blocks[i][numForks];
> > +
> > +   forks[i][numForks++] = j;
> > 
> > We can signal to the later code the absense of a fork by setting
> > InvalidBlockNumber to blocks. Thus forks[], nforks and numForks can be
> > removed.
> 
> Followed it in 0004.

Looks fine to me, thanks.

> > +   /* Zero the array of blocks because these will all be dropped anyway
> > */
> > +   MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n *
> > (MAX_FORKNUM +
> > +1));
> > 
> > We don't need to prepare nforks, forks and firstDelBlocks for all relations
> > before looping over relations.  In other words, we can fill in the arrays 
> > for a
> > relation at every iteration of relations.
> 
> Followed your advice. Although I now drop the buffers per fork, which now
> removes forks[][], nforks, firstDelBlocks[].

That's fine for me.

> > +* We enter the optimization iff we are in recovery and the number of
> > +blocks to
> > 
> > This comment ticks out of 80 columns. (I'm not sure whether that convention
> > is still valid..)
> 
> Fixed.
>  
> > +   if (InRecovery && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD)
> > 
> > We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't do
> > that.
> 
> 
> As for DropRelFileNodesAllBuffers use case, I used InRecovery
> so that the optimization still works.
>   Horiguchi-san also wrote in another mail:
> > A bit different from the point, but if some tuples have been inserted to the
> > truncated table, XLogReadBufferExtended() is called for the table and the
> > length is cached.
> I was wrong in my previous claim that the "cached" value always return false.
> When I checked the recovery test log from recovery tap test, there was only
> one example when "cached" became true (script below) and entered the

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-03 Thread Kyotaro Horiguchi
At Thu, 3 Dec 2020 07:18:16 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Jamison, Kirk/ジャミソン カーク 
> > Apologies for the delay, but attached are the updated versions to simplify 
> > the
> > patches.
> 
> Looks good to me.  Thanks to Horiguchi-san and Andres-san, the code became
> more compact and easier to read.  I've marked this ready for committer.
> 
> 
> To the committer:
> I don't think it's necessary to refer to COMMIT/ROLLBACK PREPARED in the 
> following part of the 0003 commit message.  They surely call 
> DropRelFileNodesAllBuffers(), but COMMIT/ROLLBACK also call it.
> 
> the full scan threshold. This improves the DropRelationFiles()
> performance when the TRUNCATE command truncated off any of the empty
> pages at the end of relation, and when dropping relation buffers if a
> commit/rollback transaction has been prepared in FinishPreparedTransaction().

I think it is still in doubt whether we can use this optimization only by
looking at InRecovery.  Or, if we can decide that on that criterion, 0003
can also be simplified using the same assumption.


Separate from the maybe-remaining discussion, I have a comment on the
revised code in 0004.

+* equal to the full scan threshold.
+*/
+   if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
+   {
+   pfree(block);
+   goto buffer_full_scan;
+   }

I don't particularly hate the goto statement, but we can easily avoid it
by reversing the condition here.  You might be concerned about the length
of the line calling "FindAndDropRelFileNodeBuffers", but the indentation
can be lowered by inverting the condition on BlockNumberIsValid.

!| if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
 | {
 |  for (i = 0; i < n; i++)
 |  {
 |  /*
 |   * If block to drop is valid, drop the buffers of the fork.
 |   * Zero the firstDelBlock because all buffers will be
 |   * dropped anyway.
 |   */
 |  for (j = 0; j <= MAX_FORKNUM; j++)
 |  {
!|  if (!BlockNumberIsValid(block[i][j]))
!|  continue;
 | 
 |  
FindAndDropRelFileNodeBuffers(smgr_reln[i]->smgr_rnode.node,
 |  
  j, block[i][j], 0);
 |  }
 |  }
 |  pfree(block);
 |  return;
 | }
 | 
 | pfree(block);

Or we can separate the calculation part and the execution part by
introducing a flag "do_fullscan".

 |  /*
 |   * We enter the optimization iff we are in recovery.  Otherwise,
 |   * we proceed to full scan of the whole buffer pool.
 |   */
 |  if (InRecovery)
 |  {
...
!|  if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
!|  do_fullscan = false;
!|  }
!|
!|  if (!do_fullscan)
!|  {
 |  for (i = 0; i < n; i++)
 |  {
 |  /*
 |   * If block to drop is valid, drop the buffers of the 
fork.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Is it useful to record whether plans are generic or custom?

2020-12-03 Thread Fujii Masao




On 2020/11/30 15:24, Tatsuro Yamada wrote:

Hi Torikoshi-san,



In this patch, exposing new columns is mandatory, but I think
it's better to make it optional by adding a GUC something
like 'pgss.track_general_custom_plans.

I also feel it makes the number of columns too many.
Just adding the total time may be sufficient.



I think this feature is useful for DBA. So I hope that it gets
committed to PG14. IMHO, many columns are Okay because DBA can
select specific columns by their query.
Therefore, it would be better to go with the current design.


But that design may waste lots of memory. No? For example, when
plan_cache_mode=force_custom_plan, the memory used for the columns
for generic plans is not used.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-03 Thread Michael Paquier
On Thu, Dec 03, 2020 at 08:46:09PM +0100, Peter Eisentraut wrote:
> There are a couple of more places like this, including the existing
> ClusterOption that this patched moved around, but we should be removing
> those.
> 
> My reasoning is that if you look at an enum value of this type, either say
> in a switch statement or a debugger, the enum value might not be any of the
> defined symbols.  So that way you lose all the type checking that an enum
> might give you.

VacuumOption does that since 6776142, and ClusterOption since 9ebe057,
so switching ReindexOption to just match the two others still looks
like the most consistent move.  Please note that there is more than
that, like ScanOptions, relopt_kind, RVROption, InstrumentOption,
TableLikeOption.

I would not mind changing that, though I am not sure that this
improves readability.  And if we'd do it, it may make sense to extend
that even more to the places where it would apply like the places
mentioned one paragraph above.
--
Michael




Re: Renaming cryptohashes.c to cryptohashfuncs.c

2020-12-03 Thread Michael Paquier
On Thu, Dec 03, 2020 at 10:10:44PM +0100, Daniel Gustafsson wrote:
> +1 on this proposed rename.

Thanks.  I have been able to get that done as of bd94a9c.
--
Michael




Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Kyotaro Horiguchi
At Fri, 4 Dec 2020 12:00:09 +0900, Keisuke Kuroda 
 wrote in 
> Hi Amit,
> 
> > I have attached a patch in which I've tried to merge the ideas from
> > both my patch and Kuroda-san's.  I liked that his patch added
> > conparentid to RI_ConstraintInfo because that saves a needless
> > syscache lookup for constraints that don't have a parent.  I've kept
> > my idea to compute the root constraint id only once in
> > ri_LoadConstraint(), not on every invocation of ri_BuildQueryKey().
> > Kuroda-san, anything you'd like to add to that?
> 
> Thank you for the merge! It looks good to me.
> I think a fix for InvalidateConstraintCacheCallBack() is also good.
> 
> I also confirmed that the patch passed the make check-world.

It's fine that constraint_root_id overrides constraint_id, but how
about having constraint_root_id store constraint_id when the constraint
is not a partition?  That change makes the patch a bit simpler.
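
In other words, a rough sketch of the suggestion (the field names follow
this discussion, and get_root_constraint_oid() is a stand-in for however
the root OID is actually obtained; the final patch may differ):
ri_LoadConstraintInfo() would always fill constraint_root_id, so that
ri_BuildQueryKey() needs no special case at all.

    /* Sketch: constraint_root_id is always usable, partition or not. */
    if (OidIsValid(conForm->conparentid))
        riinfo->constraint_root_id =
            get_root_constraint_oid(conForm->conparentid);  /* stand-in */
    else
        riinfo->constraint_root_id = riinfo->constraint_id;

    /* ... and later, in ri_BuildQueryKey(), unconditionally: */
    key->constr_id = riinfo->constraint_root_id;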

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove incorrect assertion in reorderbuffer.c.

2020-12-03 Thread Dilip Kumar
On Thu, Dec 3, 2020 at 5:33 PM Amit Kapila  wrote:
>
> We start recording changes in ReorderBufferTXN even before we reach
> SNAPBUILD_CONSISTENT state so that if the commit is encountered after
> reaching that we should be able to send the changes of the entire
> transaction. Now, while recording changes if the reorder buffer memory
> has exceeded logical_decoding_work_mem then we can start streaming if
> it is allowed and we haven't yet streamed that data. However, we must
> not allow streaming to start unless the snapshot has reached
> SNAPBUILD_CONSISTENT state.
>
> I have also improved the comments atop ReorderBufferResetTXN to
> mention the case when we need to continue streaming after getting an
> error.
>
> Attached patch for the above changes.
>
> Thoughts?

LGTM.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Is it useful to record whether plans are generic or custom?

2020-12-03 Thread torikoshia

On 2020-12-04 14:29, Fujii Masao wrote:

On 2020/11/30 15:24, Tatsuro Yamada wrote:

Hi Torikoshi-san,



In this patch, exposing new columns is mandatory, but I think
it's better to make it optional by adding a GUC something
like 'pgss.track_general_custom_plans.

I also feel it makes the number of columns too many.
Just adding the total time may be sufficient.



I think this feature is useful for DBA. So I hope that it gets
committed to PG14. IMHO, many columns are Okay because DBA can
select specific columns by their query.
Therefore, it would be better to go with the current design.


But that design may waste lots of memory. No? For example, when
plan_cache_mode=force_custom_plan, the memory used for the columns
for generic plans is not used.



Yeah.

ISTM now that creating pg_stat_statements_xxx views
for both generic and custom plans is better than my PoC patch.

And I'm also struggling with the following.

| However, I also began to wonder how effective it would be to just
| distinguish between generic and custom plans.  Custom plans can
| include all sorts of plans. and thinking cache validation, generic
| plans can also include various plans.

| Considering this, I'm starting to feel that it would be better to
| keep not just whether the plan is generic or custom but the plan itself,
| as discussed in the below thread.


Yamada-san,

Do you think it's effective to just distinguish between generic
and custom plans?

Regards,




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-03 Thread Fujii Masao




On 2020/11/27 11:12, Bharath Rupireddy wrote:

On Wed, Nov 25, 2020 at 12:13 AM Alexey Kondratov
 wrote:


1) Return 'true' if there were open connections and we successfully
closed them.
2) Return 'false' in the no-op case, i.e. there were no open
connections.
3) Rise an error if something went wrong. And non-existing server case
belongs to this last category, IMO.



Done this way.



I am not sure, but I think, that instead of adding this additional flag
into ConnCacheEntry structure we can look on entry->xact_depth and use
local:

bool used_in_current_xact = entry->xact_depth > 0;

for exactly the same purpose. Since we set entry->xact_depth to zero at
the end of xact, then it was used if it is not zero. It is set to 1 by
begin_remote_xact() called by GetConnection(), so everything seems to be
fine.



Done.



In the case of pgfdw_inval_callback() this argument makes sense, since
syscache callbacks work that way, but here I can hardly imagine a case
where we can use it. Thus, it still looks as a preliminary complication
for me, since we do not have plans to use it, do we? Anyway, everything
seems to be working fine, so it is up to you to keep this additional
argument.



Removed the cacheid variable.



Following this logic:

1) If keep_connections == true, then per-server keep_connection has a
*higher* priority, so one can disable caching of a single foreign
server.

2) But if keep_connections == false, then it works like a global switch
off indifferently of per-server keep_connection's, i.e. they have a
*lower* priority.

It looks fine for me, at least I cannot propose anything better, but
maybe it should be documented in 0004?



Done.



I think that GUC acronym is used widely only in the source code and
Postgres docs tend to do not use it at all, except from acronyms list
and a couple of 'GUC parameters' collocation usage. And it never used in
a singular form there, so I think that it should be rather:

A configuration parameter,
postgres_fdw.keep_connections, default being...



Done.



The whole paragraph is really difficult to follow. It could be something
like that:

   
Note that setting postgres_fdw.keep_connections
to
off does not discard any previously made and still open
connections immediately.
They will be closed only at the end of a future transaction, which
operated on them.

To close all connections immediately use
postgres_fdw_disconnect function.
   



Done.

Attaching the v2 patch set. Please review it further.


Regarding the 0001 patch, shouldn't we also add a function that returns
the information of cached connections, like dblink_get_connections(),
together with the 0001 patch? Otherwise it's not easy for users to
see how many cached connections there are and determine whether to
disconnect them or not. Sorry if this was already discussed before.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-12-03 Thread Kyotaro Horiguchi
Thanks for taking a look on this.

At Fri, 4 Dec 2020 04:20:47 +,  wrote in 
> When I execute pg_waldump, I found that XLOG/SWITCH_JUNK appears twice.
> Is this problem solved by the way of correcting the previously discussed 
> Transaction/COMMIT?
> 
> $ ../bin/pg_waldump --stats=record ../data/pg_wal/00010001
> Type                        N  (%)    Record size  (%)    FPI size  (%)    Combined size  (%)
> ----                        -  ---    -----------  ---    --------  ---    -------------  ---
..
> XLOG/SWITCH_JUNK            0 (  0.00)        0 (  0.00)       0 (  0.00)           0 (  0.00)
...
> XLOG/SWITCH_JUNK            0 (  0.00)        0 (  0.00)       0 (  0.00)           0 (  0.00)

Yeah, that's because XLogDumpDisplayStats forgets to consider ri
(the rmgr id) when showing the lines. If there's a record with info = 0x40
for a resource manager other than RM_XLOG_ID, the spurious line is shown.

The first one is for XLOG_HEAP2_VISIBLE and the latter is for
XLOG_HEAP_HOT_UPDATE, neither of which is XLOG_SWITCH.
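
Just to sketch the kind of guard that seems to be missing (illustrative
only, with the printing elided; this is not the actual pg_waldump code),
the SWITCH_JUNK row should be tied to the resource manager as well, not
only to the info value:

    /* Sketch: only RM_XLOG_ID may own an XLOG/SWITCH_JUNK row. */
    int         ri,
                rj;

    for (ri = 0; ri <= RM_MAX_ID; ri++)
    {
        for (rj = 0; rj < MAX_XLINFO_TYPES; rj++)
        {
            /* ... emit the ordinary per-record-type row here ... */

            /* assuming the info high bits are bucketed by rj */
            if (ri == RM_XLOG_ID && (rj << 4) == XLOG_SWITCH)
            {
                /* ... emit the XLOG/SWITCH_JUNK row here ... */
            }
        }
    }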

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A new function to wait for the backend exit after termination

2020-12-03 Thread Bharath Rupireddy
On Fri, Dec 4, 2020 at 8:44 AM Hou, Zhijie 
wrote:
>
> -however only superusers can terminate superuser backends.
> +however only superusers can terminate superuser backends. When no
> +wait and timeout
are
> +provided, only SIGTERM is sent to the backend with the given
process
> +ID and false is returned immediately. But the
>
> I test the case when no wait and timeout are provided.
> True is returned as the following which seems different from the doc.
>
> postgres=# select pg_terminate_backend(pid);
>  pg_terminate_backend
> --
>  t
> (1 row)
>

Thanks for pointing that out. I reworded that statement. Attaching v5
patch. Please have a look.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v5-0001-pg_terminate_backend-with-wait-timeout-and-pg_wai.patch
Description: Binary data


Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Amit Langote
On Fri, Dec 4, 2020 at 2:48 PM Kyotaro Horiguchi
 wrote:
> At Fri, 4 Dec 2020 12:00:09 +0900, Keisuke Kuroda 
>  wrote in
> > Hi Amit,
> >
> > > I have attached a patch in which I've tried to merge the ideas from
> > > both my patch and Kuroda-san's.  I liked that his patch added
> > > conparentid to RI_ConstraintInfo because that saves a needless
> > > syscache lookup for constraints that don't have a parent.  I've kept
> > > my idea to compute the root constraint id only once in
> > > ri_LoadConstraint(), not on every invocation of ri_BuildQueryKey().
> > > Kuroda-san, anything you'd like to add to that?
> >
> > Thank you for the merge! It looks good to me.
> > I think a fix for InvalidateConstraintCacheCallBack() is also good.
> >
> > I also confirmed that the patch passed the make check-world.
>
> It's fine that constraint_root_id overrides constraint_id, but how
> about having constraint_root_id store constraint_id when the constraint
> is not a partition?  That change makes the patch a bit simpler.

My patch was like that before posting to this thread, but keeping
constraint_id and constraint_root_id separate looked better for
documenting the partitioning case as working differently from the
regular table case.  I guess a comment in ri_BuildQueryKey is enough
for that though and it's not like we're using constraint_root_id in
any other place to make matters confusing, so I changed it as you
suggest.  Updated patch attached.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v3-0001-ri_triggers.c-Use-root-constraint-OID-as-key-to-r.patch
Description: Binary data


Re: Is it useful to record whether plans are generic or custom?

2020-12-03 Thread Kyotaro Horiguchi
At Fri, 04 Dec 2020 15:03:25 +0900, torikoshia  
wrote in 
> On 2020-12-04 14:29, Fujii Masao wrote:
> > On 2020/11/30 15:24, Tatsuro Yamada wrote:
> >> Hi Torikoshi-san,
> >> 
> >>> In this patch, exposing new columns is mandatory, but I think
> >>> it's better to make it optional by adding a GUC something
> >>> like 'pgss.track_general_custom_plans.
> >>> I also feel it makes the number of columns too many.
> >>> Just adding the total time may be sufficient.
> >> I think this feature is useful for DBA. So I hope that it gets
> >> committed to PG14. IMHO, many columns are Okay because DBA can
> >> select specific columns by their query.
> >> Therefore, it would be better to go with the current design.
> > But that design may waste lots of memory. No? For example, when
> > plan_cache_mode=force_custom_plan, the memory used for the columns
> > for generic plans is not used.
> > 
> 
> Yeah.
> 
> ISTM now that creating pg_stat_statements_xxx views
> for both generic and custom plans is better than my PoC patch.
> 
> And I'm also struggling with the following.
> 
> | However, I also began to wonder how effective it would be to just
> | distinguish between generic and custom plans.  Custom plans can
> | include all sorts of plans. and thinking cache validation, generic
> | plans can also include various plans.
> 
> | Considering this, I'm starting to feel that it would be better to
> | keep not just whether the plan is generic or custom but the plan itself,
> | as discussed in the below thread.

FWIW, that seems to me similar to some existing extension modules,
pg_stat_plans or pg_store_plans.  The former is faster but may lose
plans; the latter doesn't lose plans but is slower.  I feel that we'd
better consider a simpler feature if we are intending it to be part of
a contrib module.

> Yamada-san,
> 
> Do you think it's effective to just distinguish between generic
> and custom plans?
> 
> Regards,

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Refactor MD5 implementations and switch to EVP for OpenSSL

2020-12-03 Thread Michael Paquier
On Tue, Nov 10, 2020 at 01:28:09PM +0900, Michael Paquier wrote:
> The CF bot has been complaining on Windows and this issue is fixed in
> the attached.  A refresh of src/tools/msvc for pgcrypto was just
> missing.

Now that HEAD has the necessary infrastructure to be able to plug in
easily new cryptohash routines, here is a rebased patch for MD5.  The
basics are unchanged.  Here is a summary:
- The duplication with MD5 implementations (pgcrypto, src/common/) is
removed, and gets only used when not building with OpenSSL.
- MD5 uses EVP when building with OpenSSL.
- Similarly to SHA2, the fallback implementation of MD5 is kept
internal to src/common/, with an internal header called md5_int.h.
The routines for init, update and final calls are similar to the SHA2
equivalents, making the changes of cryptohash.c straight-forward.

The amount of code shaved is still nice:
13 files changed, 641 insertions(+), 775 deletions(-) 
--
Michael
From 30ede1b29bfce6b5dbfb3a7d1e4243577e8657d6 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 4 Dec 2020 15:50:48 +0900
Subject: [PATCH v3] Refactor MD5 implementations in the tree

This removes the duplicated MD5 implementations present in both
src/common/ and contrib/pgcrypto/, moving it to src/common/.

Similarly to the fallback implementation used for SHA2, the fallback
implementation of MD5 is moved to src/common/md5.c with an internal
header called md5_int.h for the init, update and final routines.  This
gets consumed by cryptohash.c.

When building with OpenSSL, EVP is used for MD5.  With the recent
refactoring work for cryptohash functions, this change is
straight-forward.

The original routines used for MD5-hashed passwords are moved to a
separate file called md5_common.c, also in src/common/, aimed at being
shared between all MD5 implementations.
---
 src/include/common/cryptohash.h |   3 +-
 src/include/common/md5.h|   3 +-
 src/common/Makefile |   3 +-
 src/common/cryptohash.c |  13 +
 src/common/cryptohash_openssl.c |   3 +
 src/common/md5.c| 646 ++--
 src/common/md5_common.c | 145 +++
 src/common/md5_int.h|  85 +
 contrib/pgcrypto/Makefile   |   2 +-
 contrib/pgcrypto/internal.c |  25 +-
 contrib/pgcrypto/md5.c  | 397 
 contrib/pgcrypto/md5.h  |  79 
 src/tools/msvc/Mkvcbuild.pm |  12 +-
 13 files changed, 641 insertions(+), 775 deletions(-)
 create mode 100644 src/common/md5_common.c
 create mode 100644 src/common/md5_int.h
 delete mode 100644 contrib/pgcrypto/md5.c
 delete mode 100644 contrib/pgcrypto/md5.h

diff --git a/src/include/common/cryptohash.h b/src/include/common/cryptohash.h
index 0e4a6631a3..6ead1cb8e5 100644
--- a/src/include/common/cryptohash.h
+++ b/src/include/common/cryptohash.h
@@ -18,7 +18,8 @@
 /* Context Structures for each hash function */
 typedef enum
 {
-	PG_SHA224 = 0,
+	PG_MD5 = 0,
+	PG_SHA224,
 	PG_SHA256,
 	PG_SHA384,
 	PG_SHA512
diff --git a/src/include/common/md5.h b/src/include/common/md5.h
index 8695f10dff..b15635d600 100644
--- a/src/include/common/md5.h
+++ b/src/include/common/md5.h
@@ -1,7 +1,7 @@
 /*-
  *
  * md5.h
- *	  Interface to libpq/md5.c
+ *	  Constants and common utilities related to MD5.
  *
  * These definitions are needed by both frontend and backend code to work
  * with MD5-encrypted passwords.
@@ -19,6 +19,7 @@
 #define MD5_PASSWD_CHARSET	"0123456789abcdef"
 #define MD5_PASSWD_LEN	35
 
+/* Utilities common to all the MD5 implementations, as of md5_common.c */
 extern bool pg_md5_hash(const void *buff, size_t len, char *hexsum);
 extern bool pg_md5_binary(const void *buff, size_t len, void *outbuf);
 extern bool pg_md5_encrypt(const char *passwd, const char *salt,
diff --git a/src/common/Makefile b/src/common/Makefile
index b8f5187282..af891cb0ce 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -63,7 +63,7 @@ OBJS_COMMON = \
 	keywords.o \
 	kwlookup.o \
 	link-canary.o \
-	md5.o \
+	md5_common.o \
 	pg_get_line.o \
 	pg_lzcompress.o \
 	pgfnames.o \
@@ -86,6 +86,7 @@ OBJS_COMMON += \
 else
 OBJS_COMMON += \
 	cryptohash.o \
+	md5.o \
 	sha2.o
 endif
 
diff --git a/src/common/cryptohash.c b/src/common/cryptohash.c
index a61091f456..5cc2572eb6 100644
--- a/src/common/cryptohash.c
+++ b/src/common/cryptohash.c
@@ -24,6 +24,7 @@
 #include 
 
 #include "common/cryptohash.h"
+#include "md5_int.h"
 #include "sha2_int.h"
 
 /*
@@ -57,6 +58,9 @@ pg_cryptohash_create(pg_cryptohash_type type)
 
 	switch (type)
 	{
+		case PG_MD5:
+			ctx->data = ALLOC(sizeof(pg_md5_ctx));
+			break;
 		case PG_SHA224:
 			ctx->data = ALLOC(sizeof(pg_sha224_ctx));
 			break;
@@ -95,6 +99,9 @@ pg_cryptohash_init(pg_cryptohash_ctx *ctx)
 
 	switch (ctx->type)
 	{
+		case PG_MD5:
+			pg_md5_init((pg_md5_ctx *) ctx->data);
+			break;
 		case PG_SHA224:
 			pg_sha224_init((pg_sha224_ctx *) ctx->data);

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-03 Thread k.jami...@fujitsu.com
On Friday, December 4, 2020 12:42 PM, Tang, Haiying wrote:
> Hello, Kirk
> 
> Thanks for providing the new patches.
> I did the recovery performance test on them; the results look good. I'd like
> to share them with you and everyone else.
> (I also recorded VACUUM and TRUNCATE execution time on master/primary in
> case you want to have a look.)

Hi, Tang.
Thank you very much for verifying the performance using the latest set of 
patches.
Although it's not supposed to affect the non-recovery path (execution on the
primary), it's good to see those results too.

> 1. VACUUM and Failover test results (average of 15 times)
> 
> [VACUUM] ---execution time on master/primary
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> --------------------------------------------------------------------------
> 128M                  9.440         9.483      0%
> 10G                  74.689        76.219      2%
> 20G                 152.538       138.292     -9%
> 
> [Failover] ---execution time on standby
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> --------------------------------------------------------------------------
> 128M                  3.629         2.961    -18%
> 10G                  82.443         2.627    -97%
> 20G                 171.388         2.607    -98%
> 
> 2. TRUNCATE and Failover test results (average of 15 times)
> 
> [TRUNCATE] ---execution time on master/primary
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> --------------------------------------------------------------------------
> 128M                 49.271        49.867      1%
> 10G                 172.437       175.197      2%
> 20G                 279.658       278.752      0%
> 
> [Failover] ---execution time on standby
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> --------------------------------------------------------------------------
> 128M                  4.877         3.989    -18%
> 10G                  92.680         3.975    -96%
> 20G                 182.035         3.962    -98%
> 
> [Machine spec]
> CPU : 40 processors  (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
> Memory: 64G
> OS: CentOS 8
> 
> [Failover test data]
> Total table Size: 700M
> Table: 1 tables (1000 rows per table)
> 
> If you have any questions about my test, please let me know.

Looks great.
That was helpful for checking whether there are any performance differences
compared with the previous versions' results, and I am glad it turned out
great again.

Regards,
Kirk Jamison




Logical archiving

2020-12-03 Thread Andrey Borodin
Hi all

I was discussing the problems of CDC with the scientific community, and they
asked this simple question: "So you have an efficient WAL archive on very
cheap storage; why don't you have a logical archive too?"
This seems like a wild idea. But really, we have super expensive NVMe drives
for the OLTP workload, and we use these devices to buffer data to be dumped
into a MapReduce\YT analytical system.
If the OLAP side cannot consume data fast enough, we run out of space because
of the replication slot.
If we have a WAL-based HA switchover, the OLAP side has a hole in the stream
and has to resync the data from scratch.

If we could just run an archive command like ```archive-tool wal-push
00090F2C00E1.logical``` on the contents of logical replication, that
would be super cool for OLAP. I'd prefer to even avoid writing
00090F2C00E1.logical to disk, i.e. to push the data over stdio or something
like that.

What do you think?

Best regards, Andrey Borodin.



RE: In-place persistance change of a relation

2020-12-03 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> > No, not really.  The issue is more around what happens if we crash
> > part way through.  At crash recovery time, the system catalogs are not
> > available, because the database isn't consistent yet and, anyway, the
> > startup process can't be bound to a database, let alone every database
> > that might contain unlogged tables.  So the sentinel that's used to
> > decide whether to flush the contents of a table or index is the
> > presence or absence of an _init fork, which the startup process
> > obviously can see just fine.  The _init fork also tells us what to
> > stick in the relation when we reset it; for a table, we can just reset
> > to an empty file, but that's not legal for indexes, so the _init fork
> > contains a pre-initialized empty index that we can just copy over.
> >
> > Now, to make an unlogged table logged, you've got to at some stage
> > remove those _init forks.  But this is not a transactional operation.
> > If you remove the _init forks and then the transaction rolls back,
> > you've left the system an inconsistent state.  If you postpone the
> > removal until commit time, then you have a problem if it fails,
> 
> It's true. Those are the cause of the headache.
...
> The current implementation is simple.  It's enough to just discard the old
> or new relfilenode according to whether the current transaction ends with
> commit or abort. Tweaking a relfilenode that is in use leads to some skew
> in some places.  I used the pendingDelete mechanism in a somewhat
> complexified way, violated an abstraction (I think calling AM routines
> from storage.c is not good), and even introduced a new fork kind only to
> mark an init fork as "not committed yet".  There might be a better way,
> but I haven't found it.

I have no alternative idea yet, either.  I agree that we want to avoid them, 
especially introducing the inittmp fork...  Anyway, below are the rest of my 
review comments for 0001.  I want to review 0002 once we have decided to go with 0001.


(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:

[src/include/catalog/storage_xlog.h]
/*
 * Declarations for smgr-related XLOG records
 *
 * Note: we log file creation and truncation here, but logging of deletion
 * actions is handled by xact.c, because it is part of transaction commit.
 */

[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives.  Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway.  (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)


(3)
+/* This is bit-map, not ordianal numbers  */

There seem to be no existing comments that use "bit-map"; "Flags for ..." can be 
seen here and there.


(4)
Some wrong spellings:

+   /* we flush this buffer when swithing to PERMANENT */

swithing -> switching

+* alredy flushed out by RelationCreate(Drop)InitFork called 
just

alredy -> already

+* relation content to be WAL-logged to recovery the table.

recovery -> recover

+* The inittmp fork works as the sentinel to identify that situaton.

situaton -> situation


(5)
+   table_close(classRel, NoLock);
+
+
+
+
 }

These empty lines can be deleted.


(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+* Make an XLOG entry reporting the file unlink.

Not unlink but buffer persistence?


(7)
+   /*
+* index-init fork needs further initialization. ambuildempty shoud do
+* WAL-log and file sync by itself but otherwise we do that by myself.
+*/
+   if (rel->rd_rel->relkind == RELKIND_INDEX)
+   rel->rd_indam->ambuildempty(rel);
+   else
+   {
+   log_smgrcreate(&rnode, INIT_FORKNUM);
+   smgrimmedsync(srel, INIT_FORKNUM);
+   }
+
+   /*
+* We have created the init fork. If server crashes before the current
+* transaction ends the init fork left alone corrupts data while 
recovery.
+* The inittmp fork works as the sentinel to identify that situaton.
+*/
+   smgrcreate(srel, INITTMP_FORKNUM, false);
+   log_smgrcreate(&rnode, INITTMP_FORKNUM);
+   smgrimmedsync(srel, INITTMP_FORKNUM);

If the server crashes between these two steps, only the init fork exists. 
Would it be correct to create the inittmp fork first instead?
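
For discussion, here is a minimal sketch of the ordering this question implies,
reusing only the helpers that appear in the quoted hunk (purely illustrative,
not a proposed replacement hunk; it assumes srel, rnode and rel are set up and
the init fork itself is created beforehand, as in the patch):

	/*
	 * Illustrative reordering only: make the INITTMP sentinel durable
	 * before the init fork is WAL-logged and synced, so that a crash in
	 * between leaves the sentinel behind rather than an init fork with
	 * no sentinel.
	 */
	smgrcreate(srel, INITTMP_FORKNUM, false);
	log_smgrcreate(&rnode, INITTMP_FORKNUM);
	smgrimmedsync(srel, INITTMP_FORKNUM);

	/* then initialize and sync the init fork, as in the quoted hunk */
	if (rel->rd_rel->relkind == RELKIND_INDEX)
		rel->rd_indam->ambuildempty(rel);
	else
	{
		log_smgrcreate(&rnode, INIT_FORKNUM);
		smgrimmedsync(srel, INIT_FORKNUM);
	}
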


(8)
+   if (inxact_created)
+   {
+   SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+   smgrclose(srel);
+   log_smgrunlink(