date:20200708

On Wed, Jul 8, 2020 at 11:28 AM Masahiko Sawada
 wrote:
>
> On Tue, 7 Jul 2020 at 17:50, Magnus Hagander  wrote:
> >
> >
> >
> > On Tue, Jul 7, 2020 at 5:10 AM Amit Kapila  wrote:
> >>
> >> On Tue, Jul 7, 2020 at 7:07 AM Masahiko Sawada
> >>  wrote:
> >> >
> >> > On Mon, 6 Jul 2020 at 20:45, Amit Kapila  wrote:
> >> > >
> >> > > > Second, as long as the unique identifier is the slot name there is no
> >> > > > convenient way to distinguish between the same name old and new
> >> > > > replication slots, so the backend process or wal sender process sends
> >> > > > a message to the stats collector to drop the replication slot at
> >> > > > ReplicationSlotDropPtr(). This strategy differs from what we do for
> >> > > > table, index, and function statistics. It might not be a problem but
> >> > > > I’m thinking a better way.
> >> > > >
> >> > >
> >> > > Can we rely on message ordering in the transmission mechanism (UDP)
> >> > > for stats?  The wiki suggests [1] we can't.  If so, then this might
> >> > > not work.
> >> >
> >> > Yeah, I'm also concerned about this. Another idea would be to have
> >> > another unique identifier to distinguish old and new replication slots
> >> > with the same name. For example, creation timestamp. And then we
> >> > reclaim the stats of unused slots later like table and function
> >> > statistics.
> >> >
> >>
> >> So, we need to have 'creation timestamp' as persistent data for slots
> >> to achieve this?  I am not sure of adding creation_time as a parameter
> >> to identify for this case because users can change timings on systems
> >> so it might not be a bullet-proof method but I agree that it can work
> >> in general.
> >
> >
> > If we need them to be persistent across time like that, perhaps we simply 
> > need to assign oids to replication slots? That might simplify this problem 
> > quite a bit?
>
> Yeah, I guess assigning oids to replication slots in the same way of
> oids in system catalogs might not work because physical replication
> slot can be created even during recovery. But using a
> monotonically-increasing integer as id seems better and straight
> forward.
>

But don't we need to make it WAL logged as well similar to what we do
in GetNewObjectId?  I am thinking do we really need Oids for slots or
is it okay to have some approximate stats in boundary cases?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [doc] modifying unit from characters to bytes

> On 8 Jul 2020, at 04:25, Fujii Masao  wrote:
> 
> On 2020/07/08 10:54, torikoshia wrote:
>> Hi,
>> The manual describes the size of pg_stat_activity.query
>> as below:
>>  | By default the query text is truncated at 1024 characters;
>> When considering multibyte characters, it seems more
>> accurate to change the unit from "characters" to "bytes".
> 
> Agreed. Barring any objection, I will commit this patch.

+1 to commit this patch, following the link to track_activity_query_size it's
even specified to be bytes there.  IIRC the NULL terminator is also included in
the 1024 bytes which prevents it from being 1024 characters even for
non-multibyte.

cheers ./daniel

Re: Is it useful to record whether plans are generic or custom?





On 2020/07/08 10:14, torikoshia wrote:

On 2020-07-06 22:16, Fujii Masao wrote:

On 2020/06/11 14:59, torikoshia wrote:

On 2020-06-10 18:00, Kyotaro Horiguchi wrote:



+    TupleDescInitEntry(tupdesc, (AttrNumber) 8, "last_plan",

This could be a problem if we showed the last plan in this view.  I
think "last_plan_type" would be better.

+    if (prep_stmt->plansource->last_plan_type == 
PLAN_CACHE_TYPE_CUSTOM)
+    values[7] = CStringGetTextDatum("custom");
+    else if (prep_stmt->plansource->last_plan_type == 
PLAN_CACHE_TYPE_GENERIC)
+    values[7] = CStringGetTextDatum("generic");
+    else
+    nulls[7] = true;

Using swith-case prevents future additional type (if any) from being
unhandled.  I think we are recommending that as a convension.


Thanks for your reviewing!

I've attached a patch that reflects your comments.


Thanks for the patch! Here are the comments.


Thanks for your review!


+    Number of times generic plan was choosen
+    Number of times custom plan was choosen

Typo: "choosen" should be "chosen"?


Thanks, fixed them.


+  
+   last_plan_type text
+  
+  
+    Tells the last plan type was generic or custom. If the prepared
+    statement has not executed yet, this field is null
+  

Could you tell me how this information is expected to be used?
I think that generic_plans and custom_plans are useful when investigating
the cause of performance drop by cached plan mode. But I failed to get
how much useful last_plan_type is.


This may be an exceptional case, but I once had a case needed
to ensure whether generic or custom plan was chosen for specific
queries in a development environment.


In your case, probably you had to ensure that the last multiple (or every)
executions chose generic or custom plan? If yes, I'm afraid that displaying
only the last plan mode is not enough for your case. No?
So it seems better to check generic_plans or custom_plans columns in the
view rather than last_plan_type even in your case. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Resetting spilled txn statistics in pg_stat_replication

2020-07-08 Thread Masahiko Sawada

On Wed, 8 Jul 2020 at 16:04, Amit Kapila  wrote:
>
> On Wed, Jul 8, 2020 at 11:28 AM Masahiko Sawada
>  wrote:
> >
> > On Tue, 7 Jul 2020 at 17:50, Magnus Hagander  wrote:
> > >
> > >
> > >
> > > On Tue, Jul 7, 2020 at 5:10 AM Amit Kapila  
> > > wrote:
> > >>
> > >> On Tue, Jul 7, 2020 at 7:07 AM Masahiko Sawada
> > >>  wrote:
> > >> >
> > >> > On Mon, 6 Jul 2020 at 20:45, Amit Kapila  
> > >> > wrote:
> > >> > >
> > >> > > > Second, as long as the unique identifier is the slot name there is 
> > >> > > > no
> > >> > > > convenient way to distinguish between the same name old and new
> > >> > > > replication slots, so the backend process or wal sender process 
> > >> > > > sends
> > >> > > > a message to the stats collector to drop the replication slot at
> > >> > > > ReplicationSlotDropPtr(). This strategy differs from what we do for
> > >> > > > table, index, and function statistics. It might not be a problem 
> > >> > > > but
> > >> > > > I’m thinking a better way.
> > >> > > >
> > >> > >
> > >> > > Can we rely on message ordering in the transmission mechanism (UDP)
> > >> > > for stats?  The wiki suggests [1] we can't.  If so, then this might
> > >> > > not work.
> > >> >
> > >> > Yeah, I'm also concerned about this. Another idea would be to have
> > >> > another unique identifier to distinguish old and new replication slots
> > >> > with the same name. For example, creation timestamp. And then we
> > >> > reclaim the stats of unused slots later like table and function
> > >> > statistics.
> > >> >
> > >>
> > >> So, we need to have 'creation timestamp' as persistent data for slots
> > >> to achieve this?  I am not sure of adding creation_time as a parameter
> > >> to identify for this case because users can change timings on systems
> > >> so it might not be a bullet-proof method but I agree that it can work
> > >> in general.
> > >
> > >
> > > If we need them to be persistent across time like that, perhaps we simply 
> > > need to assign oids to replication slots? That might simplify this 
> > > problem quite a bit?
> >
> > Yeah, I guess assigning oids to replication slots in the same way of
> > oids in system catalogs might not work because physical replication
> > slot can be created even during recovery. But using a
> > monotonically-increasing integer as id seems better and straight
> > forward.
> >
>
> But don't we need to make it WAL logged as well similar to what we do
> in GetNewObjectId?

Yes. I was thinking that assigning (the maximum number of the existing
slot id + 1) to a new slot without WAL logging.

>  I am thinking do we really need Oids for slots or
> is it okay to have some approximate stats in boundary cases?

I think that using oids has another benefit that we don't need to send
slot name to the stats collector along with the stats. Since the
maximum size of slot name is NAMEDATALEN and we don't support the
pgstat message larger than PGSTAT_MAX_MSG_SIZE (1000 bytes), if the
user wants to increase NAMEDATALEN they might not be able to build.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Typo in pgstat.c

Spotted a small typo in pgstat.c this morning, attached patch fixes this.

cheers ./daniel



pgstat_proejects.diff
Description: Binary data

Re: [doc] modifying unit from characters to bytes





On 2020/07/08 16:17, Daniel Gustafsson wrote:

On 8 Jul 2020, at 04:25, Fujii Masao  wrote:

On 2020/07/08 10:54, torikoshia wrote:

Hi,
The manual describes the size of pg_stat_activity.query
as below:
  | By default the query text is truncated at 1024 characters;
When considering multibyte characters, it seems more
accurate to change the unit from "characters" to "bytes".


Agreed. Barring any objection, I will commit this patch.


+1 to commit this patch, following the link to track_activity_query_size it's
even specified to be bytes there.  IIRC the NULL terminator is also included in
the 1024 bytes which prevents it from being 1024 characters even for
non-multibyte.


Yes, so we should document "truncated at 1023 bytes" for accuracy, instead?
This might be more confusing for users, though

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [doc] modifying unit from characters to bytes

> On 8 Jul 2020, at 10:05, Fujii Masao  wrote:
> 
> On 2020/07/08 16:17, Daniel Gustafsson wrote:
>>> On 8 Jul 2020, at 04:25, Fujii Masao  wrote:
>>> 
>>> On 2020/07/08 10:54, torikoshia wrote:
 Hi,
 The manual describes the size of pg_stat_activity.query
 as below:
  | By default the query text is truncated at 1024 characters;
 When considering multibyte characters, it seems more
 accurate to change the unit from "characters" to "bytes".
>>> 
>>> Agreed. Barring any objection, I will commit this patch.
>> +1 to commit this patch, following the link to track_activity_query_size it's
>> even specified to be bytes there.  IIRC the NULL terminator is also included 
>> in
>> the 1024 bytes which prevents it from being 1024 characters even for
>> non-multibyte.
> 
> Yes, so we should document "truncated at 1023 bytes" for accuracy, instead?
> This might be more confusing for users, though

I think that's overcomplicating things, since we do (will) specify bytes and
not characters.

cheers ./daniel

Re: Typo in pgstat.c

2020-07-08 Thread Magnus Hagander

On Wed, Jul 8, 2020 at 10:05 AM Daniel Gustafsson  wrote:

> Spotted a small typo in pgstat.c this morning, attached patch fixes this.
>

Thanks, applied.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila  wrote:
>
> On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar  wrote:
> >
> > I have compared the changes logged at command end vs logged at commit
> > time.  I have ignored the invalidation for the transaction which has
> > any aborted subtransaction in it.  While testing this I found one
> > issue, the issue is that if there are some invalidation generated
> > between last command counter increment and the commit transaction then
> > those were not logged.  I have fixed the issue by logging the pending
> > invalidation in RecordTransactionCommit.  I will include the changes
> > in the next patch set.
> >
>
> I think it would have been better if you could have given examples for
> such cases where you need this extra logging.  Anyway, below are few
> minor comments on this patch:
>
> 1.
> + /*
> + * Log any pending invalidations which are adding between the last
> + * command counter increment and the commit.
> + */
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
>
> I think we can change this comment slightly and extend a bit to say
> for which kind of special cases we are adding this. "Log any pending
> invalidations which are added between the last CommandCounterIncrement
> and the commit.  Normally for DDLs, we log this at each command end,
> however for certain cases where we directly update the system table
> the invalidations were not logged at command end."
>
> Something like above based on cases that are not covered by command
> end WAL logging.
>
> 2.
> + * Emit WAL for invalidations.  This is currently only used for logging
> + * invalidations at the command end.
> + */
> +void
> +LogLogicalInvalidations()
>
> After this is getting used at a new place, it is better to modify the
> above comment to something like: "Emit WAL for invalidations.  This is
> currently only used for logging invalidations at the command end or at
> commit time if any invalidations are pending."
>

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
--
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';

+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well?  I was thinking why not add a new parameter (streaming
boolean) instead of adding the new APIs.  This could be an optional
parameter which if user doesn't specify will be considered as false.
We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml.  This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regressions, it fails
primarily because streaming got enabled by default.  And then when I
applied this patch, the tests passed because it disables streaming by
default.  I think this should be patch 0007.

Replication Origins
--
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want but not sure if that is of use.
If we need to send it earlier then we need to record it with other WAL
records.  The point is that those are set with
pg_replication_origin_xact_setup but not sure how and when that
function is called.  The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it.  What do you
think?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Quick doc patch

2020-07-08 Thread Alexander Korotkov

Hi!

On Wed, Jul 8, 2020 at 4:43 AM Michael Paquier  wrote:
> On Tue, Jul 07, 2020 at 06:36:10PM +0900, Michael Paquier wrote:
> > On Tue, Jul 07, 2020 at 09:58:59AM +0200, Daniel Gustafsson wrote:
> >> I agree, it looks like a copy-pasteo in 15cb2bd2700 which introduced the
> >> paragraph for both GIN and BRIN.  LGTM.  Adding Alexander who committed in 
> >> on
> >> cc.
> >
> > +1.
>
> Alexander does not seem to be around, so I have just applied the fix.
> There were more inconsistencies in gin.sgml and spgist.sgml missed in
> 14903f2, making the docs of GIN/SP-GiST less in line with the BRIN
> equivalent, so I have fixed both while on it.

I just read this thread.
Thank you for fixing this!

--
Regards,
Alexander Korotkov

Add session statistics to pg_stat_database

2020-07-08 Thread Laurenz Albe

Here is a patch that adds the following to pg_stat_database:
- number of connections
- number of sessions that were not disconnected regularly
- total time spent in database sessions
- total time spent executing queries
- total idle in transaction time

This is useful to check if connection pooling is working.
It also helps to estimate the size of the connection pool
required to keep the database busy, which depends on the
percentage of the transaction time that is spent idling.

Yours,
Laurenz Albe
From: Laurenz Albe 
Date: Wed, 8 Jul 2020 13:12:42 +0200
Subject: [PATCH] Add session statistics to pg_stat_database

If "track_counts" is active, track the following per database:
- number of connections
- number of sessions that were not disconnected regularly
- total time spent in database sessions
- total time spent executing queries
- total idle in transaction time

This is useful to check if connection pooling is working.
It also helps to estimate the size of the connection pool
required to keep the database busy, which depends on the
percentage of the transaction time that is spent idling.
---
 doc/src/sgml/monitoring.sgml |  46 +
 src/backend/catalog/system_views.sql |   5 +
 src/backend/postmaster/pgstat.c  | 138 ++-
 src/backend/tcop/postgres.c  |   5 +
 src/backend/utils/adt/pgstatfuncs.c  |  78 +++
 src/include/catalog/pg_proc.dat  |  20 
 src/include/pgstat.h |  29 +-
 src/test/regress/expected/rules.out  |   5 +
 8 files changed, 324 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dfa9d0d641..da66808f02 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3519,6 +3519,52 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   session_time double precision
+  
+  
+   Time spent in database sessions in this database, in milliseconds.
+  
+ 
+
+ 
+  
+   active_time double precision
+  
+  
+   Time spent executing SQL statements in this database, in milliseconds.
+  
+ 
+
+ 
+  
+   idle_in_transaction_time double precision
+  
+  
+   Time spent idling while in a transaction in this database, in milliseconds.
+  
+ 
+
+ 
+  
+   connections bigint
+  
+  
+   Number of connections established to this database.
+  
+ 
+
+ 
+  
+   aborted_sessions bigint
+  
+  
+   Number of database sessions to this database that did not end
+   with a regular client disconnection.
+  
+ 
+
  
   
stats_reset timestamp with time zone
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5314e9348f..64a4e5f0d4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -909,6 +909,11 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_checksum_last_failure(D.oid) AS checksum_last_failure,
 pg_stat_get_db_blk_read_time(D.oid) AS blk_read_time,
 pg_stat_get_db_blk_write_time(D.oid) AS blk_write_time,
+pg_stat_get_db_session_time(D.oid) AS session_time,
+pg_stat_get_db_active_time(D.oid) AS active_time,
+pg_stat_get_db_idle_in_transaction_time(D.oid) AS idle_in_transaction_time,
+pg_stat_get_db_connections(D.oid) AS connections,
+pg_stat_get_db_aborted_sessions(D.oid) AS aborted_sessions,
 pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
 FROM (
 SELECT 0 AS oid, NULL::name AS datname
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..7b62028358 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -247,6 +247,11 @@ static int	pgStatXactCommit = 0;
 static int	pgStatXactRollback = 0;
 PgStat_Counter pgStatBlockReadTime = 0;
 PgStat_Counter pgStatBlockWriteTime = 0;
+static TimestampTz pgStatActiveStart = DT_NOBEGIN;
+static PgStat_Counter pgStatActiveTime = 0;
+static TimestampTz pgStatTransactionIdleStart = DT_NOBEGIN;
+static PgStat_Counter pgStatTransactionIdleTime = 0;
+bool pgStatSessionDisconnected = false;
 
 /* Record that's written to 2PC state file when pgstat state is persisted */
 typedef struct TwoPhasePgStatRecord
@@ -326,6 +331,7 @@ static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
 static void pgstat_send_slru(void);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static void pgstat_send_connstats(bool force);
 
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
@@ -359,6 +365,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *m

Re: Modifying data type of slot_keep_segs from XLogRecPtr to XLogSegNo





On 2020/07/08 15:22, Fujii Masao wrote:



On 2020/07/08 11:55, torikoshia wrote:

On 2020-07-08 11:15, Fujii Masao wrote:

On 2020/07/08 11:02, torikoshia wrote:

Hi,

Currently, slot_keep_segs is defined as "XLogRecPtr" in KeepLogSeg(),
but it seems that should be "XLogSegNo" because this variable is
segment number.

How do you think?


I agree that using XLogRecPtr for slot_keep_segs is incorrect.
But this variable indicates the number of segments rather than
segment no, uint64 seems better. Thought?


That makes sense.
The number of segments and segment number are different.


Yes, so patch attached. I will commit it later.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Global snapshots

On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
 wrote:
>
> On Tue, 7 Jul 2020 at 15:40, Amit Kapila  wrote:
> >
> >
> > Okay, but isn't there some advantage with this approach (manage 2PC at
> > postgres_fdw level) as well which is that any node will be capable of
> > handling global transactions rather than doing them via central
> > coordinator?  I mean any node can do writes or reads rather than
> > probably routing them (at least writes) via coordinator node.
>
> The postgres server where the client started the transaction works as
> the coordinator node. I think this is true for both this patch and
> that 2PC patch. From the perspective of atomic commit, any node will
> be capable of handling global transactions in both approaches.
>

Okay, but then probably we need to ensure that GID has to be unique
even if that gets generated on different nodes?  I don't know if that
is ensured.

> >  Now, I
> > agree that even if this advantage is there in the current approach, we
> > can't lose the crash-safety aspect of other approach.  Will you be
> > able to summarize what was the problem w.r.t crash-safety and how your
> > patch has dealt it?
>
> Since this patch proceeds 2PC without any logging, foreign
> transactions prepared on foreign servers are left over without any
> clues if the coordinator crashes during commit. Therefore, after
> restart, the user will need to find and resolve in-doubt foreign
> transactions manually.
>

Okay, but is it because we can't directly WAL log in postgres_fdw or
there is some other reason for not doing so?

>
> >
> > > Looking at the commit procedure with this patch:
> > >
> > > When starting a new transaction on a foreign server, postgres_fdw
> > > executes pg_global_snapshot_import() to import the global snapshot.
> > > After some work, in pre-commit phase we do:
> > >
> > > 1. generate global transaction id, say 'gid'
> > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > 3. prepare global snapshot locally, if the local node also involves
> > > the transaction
> > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > >
> > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > returned from each pg_global_snapshot_prepare() executions.
> > >
> > > 5. assign global snapshot locally, if the local node also involves the
> > > transaction
> > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> > >
> > > Then, we commit locally (i.g. mark the current transaction as
> > > committed in clog).
> > >
> > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > participants.
> > >
> >
> > As per my current understanding, the overall idea is as follows.  For
> > global transactions, pg_global_snapshot_prepare('gid') will set the
> > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > at the node where that function is executed, it also returns the
> > NodeCSN to the coordinator.  Then the coordinator (the current
> > postgres_fdw node on which write transaction is being executed)
> > computes MaxCSN based on the return value (NodeCSN) of prepare
> > (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> > to each node.  Finally, when Commit Prepared is issued for each node
> > that MaxCSN will be written to each node including the current node.
> > So, with this idea, each node will have the same view of CSN value
> > corresponding to any particular transaction.
> >
> > For Snapshot management, the node which receives the query generates a
> > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > xid with CSN lesser than CurrentCSN will be visible.  Now, it is
> > possible that when we are examining a tuple, the CSN corresponding to
> > xid that has written the tuple has a value as INDOUBT which will
> > indicate that the transaction is yet not committed on all nodes.  And
> > we wait till we get the valid CSN value corresponding to xid and then
> > use it to check if the tuple is visible.
> >
> > Now, one thing to note here is that for global transactions we
> > primarily rely on CSN value corresponding to a transaction for its
> > visibility even though we still maintain CLOG for local transaction
> > status.
> >
> > Leaving aside the incomplete parts and or flaws of the current patch,
> > does the above match the top-level idea of this patch?
>
> I'm still studying this patch but your understanding seems right to me.
>

Cool. While studying, if you can try to think whether this approach is
different from the global coordinator based approach then it would be
great.  Here is my initial thought apart from other reasons the global
coordinator based design can help us to do the global transaction
management and snapshots.  It can allocate xids for each transaction
and then collect the list of running xacts (or CSN) from each node and
then prepare a global snapshot that can be used to perform any
transaction.

OTOH, in the design

Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-07-08 Thread Bharath Rupireddy

Hi,

Currently with the postgres_fdw remote connections cached in the local
backend, the queries that use the cached connections from local
backend will not check whether the remote backend is killed or gone
away, and it goes tries to submit the query and fails if the remote
backend is killed.

This problem was found during the discussions made in [1].

One way, we could solve the above problem is that, upon firing the new
foreign query from local backend using the cached connection,
(assuming the remote backend that was cached in the local backed got
killed for some reasons), instead of failing the query in the local
backend, upon detecting error from the remote backend, we could just
delete the cached old entry and try getting another connection to
remote backend, cache it and proceed to submit the query. This has to
happen only at the beginning of remote xact.

This way, instead of the query getting failed, the query succeeds if
the local backend is able to get a new remote backend connection.

Attaching the patch that implements the above retry mechanism.

The way I tested the patch:
1. select * from foreign_tbl; /*from local backend - this results in a
remote connection being cached in the postgres_fdw connection cache
and a remote backend is opened.*/
2. (intentionally) kill the remote backend, just to simulate the scenario.
3. select * from foreign_tbl; /*from local backend - without patch
this throws error "ERROR: server closed the connection unexpectedly".
with patch - try to use the cached connection at the beginning of
remote xact, upon receiving error from remote postgres server, instead
of aborting the query, delete the cached entry, try to get a new
connection, if it gets, caches it and uses that for executing the
query, query succeeds.

I couldn't think of adding a test case to the existing postgres_fdw
regression test suite with an automated scenario of the remote backend
getting killed.

I would like to thank Ashutosh Bapat (ashutosh.bapat@gmail.com)
for the suggestion to fix this and the review of my initial patch
attached in [2]. I tried to address the review comments provided on my
initial patch [3].

For, one of the Ashutosh's review comments from [3] "the fact that the
same connection may be used by multiple plan nodes", I tried to have
few use cases where there exist joins on two foreign tables on the
same remote server, in a single query, so essentially, the same
connection was used for multiple plan nodes. In this case we avoid
retrying for the second GetConnection() request for the second foreign
table, with the check entry->xact_depth <= 0 , xact_depth after the
first GetConnection() and the first remote query will become 1 and we
don't hit the retry logic and seems like we are safe here. Please add
If I'm missing something here.

Request the community to consider the patch for further review if the
overall idea seems beneficial.

[1]
https://www.postgresql.org/message-id/CAExHW5t21B_XPQy_hownm1Qq%3DhMrgOhX%2B8gDj3YEKFvpk%3DVRgw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/CALj2ACXp6DQ3iLGx5g%2BLgVtGwC4F6K9WzKQJpyR4FfdydQzC_g%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/CAExHW5u3Gyv6Q1BEr6zMg0t%2B59e8c4KMfKVrV3Z%3D4UKKjJ19nQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v1-Retry-Cached-Remote-Connections-For-postgres_fdw.patch
Description: Binary data

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2020-07-08 Thread Paul Guo

Looks like my previous reply was held for moderation (maybe due to my new email 
address).
I configured my pg account today using the new email address. I guess this 
email would be
held for moderation.

I’m now replying my previous reply email and attaching the new patch series.


On Jul 6, 2020, at 10:18 AM, Paul Guo 
mailto:gu...@vmware.com>> wrote:

Thanks for the review. I’m now re-picking up the work. I modified the code 
following the comments.
Besides, I tweaked the test code a bit. There are several things I’m not 100% 
sure. Please see
my replies below.

On Jan 27, 2020, at 11:24 PM, Fujii Masao 
mailto:masao.fu...@oss.nttdata.com>> wrote:

On 2020/01/15 19:18, Paul Guo wrote:
I further fixed the last test failure (due to a small bug in the test, not in 
code). Attached are the new patch series. Let's see the CI pipeline result.

Thanks for updating the patches!

I started reading the 0003 patch.

The approach that the 0003 patch uses is not the perfect solution.
If the standby crashes after tblspc_redo() removes the directory and before
its subsequent COMMIT record is replayed, PANIC error would occur since
there can be some unresolved missing directory entries when we reach the
consistent state. The problem would very rarely happen, though...
Just idea; calling XLogFlush() to update the minimum recovery point just
before tblspc_redo() performs destroy_tablespace_directories() may be
safe and helpful for the problem?

Yes looks like an issue. My understanding is the below scenario.

XLogLogMissingDir()

XLogFlush() in redo (e.g. in a commit redo).  <- create a minimum recovery 
point (we call it LSN_A).

tblspc_redo()->XLogForgetMissingDir()
   <- If we panic immediately after we remove the directory in tblspc_redo()
   <- when we do replay during crash-recovery, we will check consistency at 
LSN_A and thus PANIC inXLogCheckMissingDirs()

commit

We should add a XLogFlush() in tblspc_redo(). This brings several other 
questions to my minds also.


1. Should we call XLogFlush() in dbase_redo()  for XLOG_DBASE_DROP also?
   It calls both XLogDropDatabase() and XLogForgetMissingDir, which seem to 
have this issue also?

2. xact_redo_abort() calls DropRelationFiles() also. Why do not we call 
XLogFlush() there?



- appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-  xlrec->src_tablespace_id, xlrec->src_db_id,
-  xlrec->tablespace_id, xlrec->db_id);
+ dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+ dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+ appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+ pfree(dbpath2);
+ pfree(dbpath1);

If the patch is for the bug fix and would be back-ported, the above change
would lead to change pg_waldump's output for CREATE/DROP DATABASE between
minor versions. IMO it's better to avoid such change and separate the above
as a separate patch only for master.

I know we do not want wal format between minor releases, but does wal 
description string change
between minor releases affect users? Anyone I’ll extract this part into a 
separate patch in the series
since this change is actually independent of the other changes..


- appendStringInfo(buf, " %u/%u",
-  xlrec->tablespace_ids[i], xlrec->db_id);
+ {
+ dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+ appendStringInfo(buf,  "%s", dbpath1);
+ pfree(dbpath1);
+ }

Same as above.

BTW, the above "%s" should be " %s", i.e., a space character needs to be
appended to the head of "%s”.

OK


+ get_parent_directory(parent_path);
+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ {
+ XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);

The third argument of XLogLogMissingDir() should be parent_path instead of
dst_path?

The argument is for debug message printing so both should be fine, but 
admittedly we are
logging for the tablespace directory so parent_path might be better.


+ if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+ elog(DEBUG2, "dir %s tablespace %d database %d is not missing",
+  path, spcNode, dbNode);

I think that this elog() is useless and rather confusing.

OK. Modified.


+ XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");

The third argument should be set to the actual path instead of an empty
string. Otherwise XLogForgetMissingDir() may emit a confusing DEBUG2
message. Or the third argument of XLogForgetMissingDir() should be removed
and the path in the DEBUG2 message should be calculated from the spcNode
and dbNode in the hash entry in XLogForgetMissingDir().

I’m now removing the third argument. Use GetDatabasePath() to get the path if 
database did I snot InvalidOid.


+#include "common/file_perm.h"

This seems not necessary.

Right.



v10-0001-Support-node-initialization-from-backup-with-tab.patch
Description:  v10-0001-Support-node-initialization-from-backup-with-tab.patch


v10-0002-Tests-to-replay-create-database-operation-on-sta.patch
Description:  v10-0002-Tests-to-repla

Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-07-08 Thread Michael Paquier

On Wed, Jul 01, 2020 at 09:46:34AM -0700, Mark Dilger wrote:
> Rebased patch attached.  Thanks for mentioning it!

There are two patches on this thread v2-0001 being much smaller than
v2-0002.  I have looked at 0001 for now, and, like Alvaro, this
renaming makes sense to me.  Those commands work on objects that are
relkinds, except for one OBJECT_TYPE.  So, let's get 0001 patch
merged.  Any objections from others?
--
Michael


signature.asc
Description: PGP signature

[PATCH v2] Allow COPY "text" to output a header and add header matching mode to COPY FROM

2020-07-08 Thread Rémi Lapeyre

Hi, here's a new version of the patch that should apply cleanly. I'll monitor
the status on http://cfbot.cputube.org/

Rémi

---
 contrib/file_fdw/input/file_fdw.source  |  7 +-
 contrib/file_fdw/output/file_fdw.source | 13 ++--
 doc/src/sgml/ref/copy.sgml  |  9 ++-
 src/backend/commands/copy.c | 93 ++---
 src/test/regress/input/copy.source  | 71 ++-
 src/test/regress/output/copy.source | 58 ++-
 6 files changed, 202 insertions(+), 49 deletions(-)

diff --git a/contrib/file_fdw/input/file_fdw.source b/contrib/file_fdw/input/file_fdw.source
index 45b728eeb3..7a3983c785 100644
--- a/contrib/file_fdw/input/file_fdw.source
+++ b/contrib/file_fdw/input/file_fdw.source
@@ -37,7 +37,6 @@ CREATE USER MAPPING FOR regress_no_priv_user SERVER file_server;

 -- validator tests
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'xml');  -- ERROR
-CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', header 'true');  -- ERROR
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', quote ':');  -- ERROR
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', escape ':'); -- ERROR
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'binary', header 'true');-- ERROR
@@ -80,6 +79,12 @@ CREATE FOREIGN TABLE agg_bad (
 OPTIONS (format 'csv', filename '@abs_srcdir@/data/agg.bad', header 'true', delimiter ';', quote '@', escape '"', null '');
 ALTER FOREIGN TABLE agg_bad ADD CHECK (a >= 0);

+-- test header matching
+CREATE FOREIGN TABLE header_match ("1" int, foo text) SERVER file_server
+OPTIONS (format 'csv', filename '@abs_srcdir@/data/list1.csv', delimiter ',', header 'match');
+CREATE FOREIGN TABLE header_dont_match (a int, foo text) SERVER file_server
+OPTIONS (format 'csv', filename '@abs_srcdir@/data/list1.csv', delimiter ',', header 'match');	-- ERROR
+
 -- per-column options tests
 CREATE FOREIGN TABLE text_csv (
 word1 text OPTIONS (force_not_null 'true'),
diff --git a/contrib/file_fdw/output/file_fdw.source b/contrib/file_fdw/output/file_fdw.source
index 52b4d5f1df..d76a3dc6f8 100644
--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -33,14 +33,12 @@ CREATE USER MAPPING FOR regress_no_priv_user SERVER file_server;
 -- validator tests
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'xml');  -- ERROR
 ERROR:  COPY format "xml" not recognized
-CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', header 'true');  -- ERROR
-ERROR:  COPY HEADER available only in CSV mode
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', quote ':');  -- ERROR
 ERROR:  COPY quote available only in CSV mode
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'text', escape ':'); -- ERROR
 ERROR:  COPY escape available only in CSV mode
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'binary', header 'true');-- ERROR
-ERROR:  COPY HEADER available only in CSV mode
+ERROR:  COPY HEADER available only in CSV and text mode
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'binary', quote ':');-- ERROR
 ERROR:  COPY quote available only in CSV mode
 CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'binary', escape ':');   -- ERROR
@@ -95,6 +93,11 @@ CREATE FOREIGN TABLE agg_bad (
 ) SERVER file_server
 OPTIONS (format 'csv', filename '@abs_srcdir@/data/agg.bad', header 'true', delimiter ';', quote '@', escape '"', null '');
 ALTER FOREIGN TABLE agg_bad ADD CHECK (a >= 0);
+-- test header matching
+CREATE FOREIGN TABLE header_match ("1" int, foo text) SERVER file_server
+OPTIONS (format 'csv', filename '/Users/remi/src/postgresql/contrib/file_fdw/data/list1.csv', delimiter ',', header 'match');
+CREATE FOREIGN TABLE header_dont_match (a int, foo text) SERVER file_server
+OPTIONS (format 'csv', filename '/Users/remi/src/postgresql/contrib/file_fdw/data/list1.csv', delimiter ',', header 'match');	-- ERROR
 -- per-column options tests
 CREATE FOREIGN TABLE text_csv (
 word1 text OPTIONS (force_not_null 'true'),
@@ -441,12 +444,14 @@ SET ROLE regress_file_fdw_superuser;
 -- cleanup
 RESET ROLE;
 DROP EXTENSION file_fdw CASCADE;
-NOTICE:  drop cascades to 7 other objects
+NOTICE:  drop cascades to 9 other objects
 DETAIL:  drop cascades to server file_server
 drop cascades to user mapping for regress_file_fdw_superuser on server file_server
 drop cascades to user mapping for regress_no_priv_user on server file_server
 drop cascades to foreign table agg_text
 drop cascades to foreign table agg_csv
 drop cascades to foreign table agg_bad
+drop cascades to foreign table header_match
+drop cascades to foreign table header_dont_match
 drop cascades to foreign table text_csv
 DROP ROLE regress_file_fdw_superuser, regress_file_fdw_user, regress_no_priv_user;
diff --git a/doc/src/sgml/ref/

Re: Creating a function for exposing memory usage of backend process





On 2020/07/07 22:02, torikoshia wrote:

On 2020-07-06 15:16, Fujii Masao wrote:

On 2020/07/06 12:12, torikoshia wrote:

On Fri, Jul 3, 2020 at 7:33 PM Fujii Masao  wrote:

Thanks for your review!


I like more specific name like pg_backend_memory_contexts.


Agreed.

When I was trying to add this function as statistics function,
I thought that naming pg_stat_getbackend_memory_context()
might make people regarded it as a "per-backend statistics
function", whose parameter is backend ID number.
So I removed "backend", but now there is no necessity to do
so.


But I'd like to hear more opinions about the name from others.


I changed the name to pg_backend_memory_contexts for the time
being.


+1



- function name: pg_get_memory_contexts()
- source file: mainly src/backend/utils/mmgr/mcxt.c




+       Identification information of the memory context. This field is 
truncated if the identification field is longer than 1024 characters


"characters" should be "bytes"?


Fixed, but I used "characters" while referring to the
descriptions on the manual of pg_stat_activity.query
below.

| By default the query text is truncated at 1024 characters;

It has nothing to do with this thread, but considering
multibyte characters, it also may be better to change it
to "bytes".


Yeah, I agree we should write the separate patch fixing that. You will?
If not, I will do that later.


Thanks, I will try it!


Thanks!





Regarding the other comments, I revised the patch as you pointed.


Thanks for updating the patch! The patch basically looks good to me/
Here are some minor comments:

+#define MEMORY_CONTEXT_IDENT_SIZE    1024

This macro varible name sounds like the maximum allowed length of ident that
each menory context has. But actually this limits the maximum bytes of ident
to display. So I think that it's better to rename this macro to something like
MEMORY_CONTEXT_IDENT_DISPLAY_SIZE. Thought?


Agreed.
MEMORY_CONTEXT_IDENT_DISPLAY_SIZE seems more accurate.


+#define PG_GET_MEMORY_CONTEXTS_COLS    9
+    Datum    values[PG_GET_MEMORY_CONTEXTS_COLS];
+    bool    nulls[PG_GET_MEMORY_CONTEXTS_COLS];

This macro variable name should be PG_GET_BACKEND_MEMORY_CONTEXTS_COLS
for the consistency with the function name?


Thanks! Fixed it.


Thanks for updating the patch! It basically looks good to me.

+  
+   backend memory contexts
+  

Do we need this indexterm?






+{ oid => '2282', descr => 'statistics: information about all memory
contexts of local backend',

Isn't it better to remove "statistics: " from the above description?


Yeah, it's my oversight.



+ 
+  
+   parent text

There can be multiple memory contexts with the same name. So I'm afraid
that it's difficult to identify the actual parent memory context from this
"parent" column. This is ok when logging memory contexts by calling
MemoryContextStats() via gdb. Because child memory contexts are printed
just under their parent, with indents. But this doesn't work in the view.
To identify the actual parent memory or calculate the memory contexts tree
from the view, we might need to assign unique ID to each memory context
and display it. But IMO this is overkill. So I'm fine with current "parent"
column. Thought? Do you have any better idea?


Indeed.
I also feel it's not usual to assign a unique ID, which
can vary every time the view displayed.


Agreed. Displaying such ID would be more confusing to users.
Ok, let's leave the code as it is.

Another comment about parent column is: dynahash can be parent?
If yes, its indent instead of name should be displayed in parent column?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH v2] Allow COPY "text" to output a header and add header matching mode to COPY FROM

> On 8 Jul 2020, at 13:45, Rémi Lapeyre  wrote:
> 
> Hi, here's a new version of the patch that should apply cleanly. I'll monitor
> the status on http://cfbot.cputube.org/

Please reply to the old thread about this, as that's the one connected to the
Commitfest entry and thats where all the discussion has happened.  While
additional threads can be attached to a CF entry, it's for when multiple
discussions are relevant to a patch, a single discussion should not be broken
into multiple threads.

cheers ./daniel

Re: Multi-byte character case-folding

On Mon, Jul 6, 2020 at 8:32 PM Tom Lane  wrote:
> test=# create table MYÉCLASS (f1 text);
> CREATE TABLE
> test=# \dt
>   List of relations
>  Schema |   Name   | Type  |  Owner
> +--+---+--
>  public | myÉclass | table | postgres
> (1 row)
>
> pg_dump will render this as
>
> CREATE TABLE public."myÉclass" (
> f1 text
> );
>
> If we start to case-fold É, then the only way to access this table will
> be by double-quoting its name, which the application probably is not
> expecting (else it would have double-quoted in the original CREATE TABLE).

While this is true, it's also pretty hard to imagine a user being
satisfied with a table that ends up with this kind of mixed-case name.

That's not to say that I have any good idea what to do about this. I
just disagree with labelling the above case as a success.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-07-08 Thread Bharath Rupireddy

> > If I understand it correctly, your suggestion is to add
> > keep_connection option and use that while defining the server object.
> > IMO having keep_connection option at the server object level may not
> > serve the purpose being discussed here.
> > For instance, let's say I create a foreign server in session 1 with
> > keep_connection on, and I want to use that
> > server object in session 2 with keep_connection off and session 3 with
> > keep_connection on and so on.
> > One way we can change the server's keep_connection option is to alter
> > the server object, but that's not a good choice,
> > as we have to alter it at the system level.
> >
> > Overall, though we define the server object in a single session, it
> > will be used in multiple sessions, having an
> > option at the per-server level would not be a good idea.
>
> You present this here as if it should be a Boolean (on or off) but I
> don't see why that should be the case. You can imagine trying to close
> connections if they have been idle for a certain length of time, or if
> there are more than a certain number of them, rather than (or in
> addition to) always/never. Which one is best, and why?
>
If the cached connection idle time property is used (I'm thinking we
can define it per server object) then the local backend might have to
close the connections which are lying unused more than idle time. To
perform this task, the local backend might have to do it before it
goes into idle state(as suggested by you in [1]).  Please correct, if
my understanding/thinking is wrong here.

If the connection clean up is to be done by the local backend, then a
point can be - let say a local session initially issues few foreign
queries for which connections are cached, and it keeps executing all
local queries, without never going to idle mode(I think this scenario
looks too much impractical to me), then we may never clean the unused
cached connections. If this scenario is really impractical if we are
sure that there are high chances that the local backend goes to idle
mode, then the idea of having per-server-object idle time and letting
the local backend clean it up before it goes to idle mode looks great
to me.

>
> I tend to think this is better as an FDW property rather than a core
> facility, but I'm not 100% sure of that and I think it likely depends
> somewhat on the answers we choose to the questions in the preceding
> paragraph.
>
I completely agree on having it as a FDW property.

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmob_ksTOgmbXhno%2Bk5XXPOK%2B-JYYLoU3MpXuutP4bH7gzA%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: TAP tests and symlinks on Windows

On 7/3/20 10:11 AM, Peter Eisentraut wrote:
> On 2020-06-30 14:13, Michael Paquier wrote:
>> Attached is an updated patch, where I have tried to use a better
>> wording in all the code paths involved.
>
> This new patch doesn't work for me on MSYS2 yet.
>
> It fails right now in 010_pg_basebackup.pl at
>
>     my $realTsDir  = TestLib::perl2host("$shorter_tempdir/tblspc1");
>
> with chdir: No such file or directory.  Because perl2host requires the
> parent directory of the argument to exist, but here it doesn't.

Yeah. I have a fix for that, which also checks to see if the grandparent
directory exists:

-   chdir $parent or die "could not chdir \"$parent\": $!";
+   if (! chdir $parent)
+   {
+   $leaf = '/' . basename ($parent) . $leaf;
+   $parent = dirname $parent;
+   chdir $parent or die "could not chdir
\"$parent\": $!";
+   }

We could generalize it to walk all the way up the path, but I think this
is sufficient.

Incidentally, perl2host is arguably a bad name for this routine - there
is nothing perl-specific about the paths, they are provided by the msys
environment. Maybe virtual2host or some such would be a better name, or
even just host_path or native_path.

>
> If I add
>
>     mkdir $shorter_tempdir;
>
> above it, then it proceeds past that point, but then the CREATE
> TABLESPACE command fails with No such file or directory.  I think the
> call
>
>     symlink "$tempdir", $shorter_tempdir;
>
> creates a directory inside $shorter_tempdir, since it now exists, per
> my above change, rather than in place of $shorter_tempdir.
>
> I think all of this is still a bit too fragile it needs further
> consideration.

The symlink built into msys2 perl is distinctly fragile. I was only able
to get it to work by doing this:

+   mkdir "$tempdir/tblspc1";
+   mkdir "$tempdir/tbl=spc2";
+   mkdir "$tempdir/$superlongname";
+   open (my $x, '>', "$tempdir/tblspc1/stuff") || die $!; print $x
"hi\n"; close($x);
+   open ($x, '>', "$tempdir/tbl=spc2/stuff") || die $!; print $x
"hi\n"; close($x);
+   open ($x, '>', "$tempdir/$superlongname/stuff") || die $!; print
$x "hi\n"; close($x);
    symlink "$tempdir", $shorter_tempdir;
+   unlink "$tempdir/tblspc1/stuff";
+   unlink "$tempdir/tbl=spc2/stuff";
+   unlink "$tempdir/$superlongname/stuff";

which is sufficiently ugly that I don't think we should contemplate it.

Luckily there is an alternative, which doesn't require the use of
Win32::Symlink. Windows can be made to create junctions that function
exactly as expected quite easily - it's a builtin of the cmd.exe
processor, and it's been supported in at least every release since
Windows Vista. Here's a perl function that calls it:

sub dsymlink
{
    my $oldname = shift;
    my $newname = shift;
    if ($windows_os)
    {
    $oldname = TestLib::perl2host($oldname);
    $newname = TestLib::perl2host($newname);
    $oldname =~ s,/,\\,g;
    $newname =~ s,/,\\,g;
    my $cmd = "cmd //c 'mklink /j $newname $oldname'";
    system($cmd);
    }
    else
    {
    symlink $oldname, $newname;
    }
    die "No $newname" unless -e $newname;
}

It might need a little more quoting to be more robust.

Give that, we can simply replace

symlink "$tempdir", $shorter_tempdir;

with

dsymlink "$tempdir", $shorter_tempdir;

Then, with a little more sprinkling of perl2host the pg_basebackup tests
can be made to work on msys2.

I'm going to prepare patches along these lines.

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Default setting for enable_hashagg_disk

2020-07-08 Thread Stephen Frost

Greetings,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> On 2020-Jun-25, Andres Freund wrote:
> 
> > >My point here is that maybe we don't need to offer a GUC to explicitly
> > >turn spilling off; it seems sufficient to let users change work_mem so
> > >that spilling will naturally not occur.  Why do we need more?
> > 
> > That's not really a useful escape hatch, because I'll often lead to
> > other nodes using more memory.
> 
> Ah -- other nodes in the same query -- you're right, that's not good.

It's exactly how the system has been operating for, basically, forever,
for everything.  Yes, it'd be good to have a way to manage the
overall amount of memory that a query is allowed to use but that's a
huge change and inventing some new 'hash_mem' or some such GUC doesn't
strike me as a move in the right direction- are we going to have
sort_mem next?  What if having one large hash table for aggregation
would be good, but having the other aggregate use a lot of memory would
run the system out of memory?  Yes, we need to do better, but inventing
new node_mem GUCs isn't the direction to go in.

That HashAgg previously didn't care that it was going way over
work_mem was, if anything, a bug.  Inventing new GUCs late in the
cycle like this under duress seems like a *really* bad idea.  Yes,
people are going to have to adjust work_mem if they want these queries
to continue using a ton of memory to run when the planner didn't think
it'd actually take that much memory- but then, in lots of the kinds of
cases that I think you're worrying about, the stats aren't actually that
far off and people did increase work_mem to get the HashAgg plan in the
first place.

I'm also in support of having enable_hashagg_disk set to true as the
default, just like all of the other enable_*'s.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-08 Thread Ajin Cherian

I was going through this thread and testing and reviewing the patches, I
think this is a great feature to have and one which customers would
appreciate. I wanted to help out, and I saw a request for a test patch for
a GUC to always enable streaming on logical replication. Here's one on top
of patchset v31, just in case you still need it. By default the GUC is
turned on, I ran the regression tests with it and didn't see any errors.

thanks,
Ajin
Fujitsu Australia

On Wed, Jul 8, 2020 at 8:02 PM Amit Kapila  wrote:

> On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila 
> wrote:
> >
> > On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar 
> wrote:
> > >
> > > I have compared the changes logged at command end vs logged at commit
> > > time.  I have ignored the invalidation for the transaction which has
> > > any aborted subtransaction in it.  While testing this I found one
> > > issue, the issue is that if there are some invalidation generated
> > > between last command counter increment and the commit transaction then
> > > those were not logged.  I have fixed the issue by logging the pending
> > > invalidation in RecordTransactionCommit.  I will include the changes
> > > in the next patch set.
> > >
> >
> > I think it would have been better if you could have given examples for
> > such cases where you need this extra logging.  Anyway, below are few
> > minor comments on this patch:
> >
> > 1.
> > + /*
> > + * Log any pending invalidations which are adding between the last
> > + * command counter increment and the commit.
> > + */
> > + if (XLogLogicalInfoActive())
> > + LogLogicalInvalidations();
> >
> > I think we can change this comment slightly and extend a bit to say
> > for which kind of special cases we are adding this. "Log any pending
> > invalidations which are added between the last CommandCounterIncrement
> > and the commit.  Normally for DDLs, we log this at each command end,
> > however for certain cases where we directly update the system table
> > the invalidations were not logged at command end."
> >
> > Something like above based on cases that are not covered by command
> > end WAL logging.
> >
> > 2.
> > + * Emit WAL for invalidations.  This is currently only used for logging
> > + * invalidations at the command end.
> > + */
> > +void
> > +LogLogicalInvalidations()
> >
> > After this is getting used at a new place, it is better to modify the
> > above comment to something like: "Emit WAL for invalidations.  This is
> > currently only used for logging invalidations at the command end or at
> > commit time if any invalidations are pending."
> >
>
> I have done some more review and below are my comments:
>
> Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
>
> --
> 1.
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
>  VOLATILE ROWS 1000 COST 1000
>  AS 'pg_logical_slot_get_changes';
>
> +CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
> +IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
> VARIADIC options text[] DEFAULT '{}',
> +OUT lsn pg_lsn, OUT xid xid, OUT data text)
> +RETURNS SETOF RECORD
> +LANGUAGE INTERNAL
> +VOLATILE ROWS 1000 COST 1000
> +AS 'pg_logical_slot_get_streaming_changes';
>
> If we are going to add a new streaming API for get_changes, don't we
> need for pg_logical_slot_get_binary_changes,
> pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
> as well?  I was thinking why not add a new parameter (streaming
> boolean) instead of adding the new APIs.  This could be an optional
> parameter which if user doesn't specify will be considered as false.
> We already have optional parameters for APIs like
> pg_create_logical_replication_slot.
>
> 2. You forgot to update sgml/func.sgml.  This will be required even if
> we decide to add a new parameter instead of a new API.
>
> 3.
> + /* If called has not asked for streaming changes then disable it. */
> + ctx->streaming &= streaming;
>
> /If called/If the caller
>
> 4.
> diff --git a/.gitignore b/.gitignore
> index 794e35b..6083744 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -42,3 +42,4 @@ lib*.pc
>  /Debug/
>  /Release/
>  /tmp_install/
> +/build/
>
> Why the patch contains this change?
>
> 5. If I apply the first six patches and run the regressions, it fails
> primarily because streaming got enabled by default.  And then when I
> applied this patch, the tests passed because it disables streaming by
> default.  I think this should be patch 0007.
>
> Replication Origins
> --
> I think we also need to conclude on origins related discussion [1].
> As far as I can see, the origin_id can be sent with the first startup
> message. The origin_lsn and origin_commit can be sent with the last
> start of streaming commit if we want but not sure if that is of use.
> If we need to send it

Re: [PATCH v2] Allow COPY "text" to output a header and add header matching mode to COPY FROM

2020-07-08 Thread Rémi Lapeyre

> 
> Please reply to the old thread about this, as that's the one connected to the
> Commitfest entry and thats where all the discussion has happened.  While
> additional threads can be attached to a CF entry, it's for when multiple
> discussions are relevant to a patch, a single discussion should not be broken
> into multiple threads.

Sorry about this, I thought setting the In-Reply-To like so:

git send-email -v2 --to=pgsql-hack...@postgresql.org 
--in-reply-to=4e31e7aa-bfc6-47ed-90e1-3838e4d1f...@yesql.se HEAD^

There is some nice informations about how to write a good commit but I could 
not find exactly how to send it so I probably did something wrong.

> 
> cheers ./daniel

Re: [PATCH v2] Allow COPY "text" to output a header and add header matching mode to COPY FROM

2020-07-08 Thread Justin Pryzby

On Wed, Jul 08, 2020 at 03:21:48PM +0200, Daniel Gustafsson wrote:
> > On 8 Jul 2020, at 13:45, Rémi Lapeyre  wrote:
> > 
> > Hi, here's a new version of the patch that should apply cleanly. I'll 
> > monitor
> > the status on http://cfbot.cputube.org/
> 
> Please reply to the old thread about this, as that's the one connected to the

Actually, it seems like the subject was changed, but it was correctly
associated with the existing thread, as defined by:
|In-Reply-To: <4e31e7aa-bfc6-47ed-90e1-3838e4d1f...@yesql.se>

And ML/pglister and cfbot see it as such
"In response to Re: [PATCH v1] Allow COPY "text" to output a header and add 
header matching mode to COPY FROM at 2020-07-01 09:04:21 from Daniel Gustafsson"
https://commitfest.postgresql.org/28/2504/

-- 
Justin

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

On Wed, Jul 8, 2020 at 9:26 AM Bharath Rupireddy
 wrote:
> If the cached connection idle time property is used (I'm thinking we
> can define it per server object) then the local backend might have to
> close the connections which are lying unused more than idle time. To
> perform this task, the local backend might have to do it before it
> goes into idle state(as suggested by you in [1]).  Please correct, if
> my understanding/thinking is wrong here.
>
> If the connection clean up is to be done by the local backend, then a
> point can be - let say a local session initially issues few foreign
> queries for which connections are cached, and it keeps executing all
> local queries, without never going to idle mode(I think this scenario
> looks too much impractical to me), then we may never clean the unused
> cached connections. If this scenario is really impractical if we are
> sure that there are high chances that the local backend goes to idle
> mode, then the idea of having per-server-object idle time and letting
> the local backend clean it up before it goes to idle mode looks great
> to me.

If it just did it before going idle, then what about sessions that
haven't reached the timeout at the point when we go idle, but do reach
the timeout later? And how would the FDW get control at the right time
anyway?

> > I tend to think this is better as an FDW property rather than a core
> > facility, but I'm not 100% sure of that and I think it likely depends
> > somewhat on the answers we choose to the questions in the preceding
> > paragraph.
> >
> I completely agree on having it as a FDW property.

Right, but not everyone does. It looks to me like there have been
votes on both sides.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: OpenSSL 3.0.0 compatibility

On Tue, Jul 7, 2020 at 1:46 PM Peter Eisentraut
 wrote:
> Trying to move this along, where would be a good place to define
> OPENSSL_API_COMPAT?  The only place that's shared between frontend and
> backend code is c.h.  The attached patch does it that way.

So, if we go this way, does that mean that we're not going to pursue
removing dependencies on the deprecated interfaces? I wonder if we
really ought to be doing that too, with preprocessor conditionals.
Otherwise, aren't we putting ourselves in an uncomfortable situation
when the deprecated stuff eventually goes away upstream?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: TAP tests and symlinks on Windows

2020-07-08 Thread Juan José Santamaría Flecha

On Wed, Jul 8, 2020 at 3:54 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> Incidentally, perl2host is arguably a bad name for this routine - there
> is nothing perl-specific about the paths, they are provided by the msys
> environment. Maybe virtual2host or some such would be a better name, or
> even just host_path or native_path.
>

There is a utility cygpath [1] meant for the conversion between Unix and
Windows path formats, that might be a meaningful name also.

[1] http://cygwin.net/cygwin-ug-net/cygpath.html

Regards,

Juan José Santamaría Flecha

Re: pg_resetwal --next-transaction-id may cause database failed to restart.

On Wed, Jun 24, 2020 at 11:04 AM Alvaro Herrera
 wrote:
> ISTM that a reasonable compromise is that if you use -x (or -c, -m, -O)
> and the input value is outside the range supported by existing files,
> then it's a fatal error; unless you use --force, which turns it into
> just a warning.

One potential problem is that you might be using --force for some
other reason and end up forcing this, too. But maybe that's OK.

Perhaps we should consider the idea of having pg_resetwal create the
relevant clog file and zero-fill it, if it doesn't exist already,
rather than leaving that to to the DBA or the postmaster binary to do
it. It seems like that is what people would want to happen in this
situation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: max_slot_wal_keep_size and wal_keep_segments




On 2020/07/02 2:18, David Steele wrote:

On 7/1/20 10:54 AM, Alvaro Herrera wrote:

On 2020-Jul-01, Fujii Masao wrote:


On 2020/07/01 12:26, Alvaro Herrera wrote:

On 2020-Jun-30, Fujii Masao wrote:


When I talked about max_slot_wal_keep_size as new feature in v13
at the conference, I received the question like "Why are the units of
setting values in max_slot_wal_keep_size and wal_keep_segments different?"
from audience. That difference looks confusing for users and
IMO it's better to use the same unit for them. Thought?


Do we still need wal_keep_segments for anything?


Yeah, personally I like wal_keep_segments because its setting is very
simple and no extra operations on replication slots are necessary.


Okay.  In that case I +1 the idea of renaming to wal_keep_size.


+1 for renaming to wal_keep_size.


I attached the patch that renames wal_keep_segments to wal_keep_size.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index e9cdff4864..3d973a5c8e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11249,7 +11249,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts 
ppx
  extended means
   that max_wal_size is exceeded but the files are
   still retained, either by the replication slot or
-  by wal_keep_segments.
+  by wal_keep_size.
  
 
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 02909b1e66..33c3ccb923 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3124,7 +3124,7 @@ include_dir 'conf.d'
 checkpoints. This is a soft limit; WAL size can exceed
 max_wal_size under special circumstances, such as
 heavy load, a failing archive_command, or a high
-wal_keep_segments setting.
+wal_keep_size setting.
 If this value is specified without units, it is taken as megabytes.
 The default is 1 GB.
 Increasing this parameter can increase the amount of time needed for
@@ -3751,21 +3751,21 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" 
"%p"'  # Windows

   
 
-  
-   wal_keep_segments (integer)
+  
+   wal_keep_size (integer)

-wal_keep_segments configuration 
parameter
+wal_keep_size configuration 
parameter




-Specifies the minimum number of past log file segments kept in the
+Specifies the minimum size of past log file segments kept in the
 pg_wal
 directory, in case a standby server needs to fetch them for streaming
-replication. Each segment is normally 16 megabytes. If a standby
+replication. If a standby
 server connected to the sending server falls behind by more than
-wal_keep_segments segments, the sending server 
might remove
-a WAL segment still needed by the standby, in which case the
+wal_keep_size megabytes, the sending server might
+remove a WAL segment still needed by the standby, in which case the
 replication connection will be terminated.  Downstream connections
 will also eventually fail as a result.  (However, the standby
 server can recover by fetching the segment from archive, if WAL
@@ -3773,14 +3773,15 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" 
"%p"'  # Windows

 

-This sets only the minimum number of segments retained in
+This sets only the minimum size of segments retained in
 pg_wal; the system might need to retain more 
segments
 for WAL archival or to recover from a checkpoint. If
-wal_keep_segments is zero (the default), the system
+wal_keep_size is zero (the default), the system
 doesn't keep any extra segments for standby purposes, so the number
 of old WAL segments available to standby servers is a function of
 the location of the previous checkpoint and status of WAL
 archiving.
+If this value is specified without units, it is taken as megabytes.
 This parameter can only be set in the
 postgresql.conf file or on the server command 
line.

diff --git a/doc/src/sgml/high-availability.sgml 
b/doc/src/sgml/high-availability.sgml
index 65c3fc62a9..2d4f9f03c7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -785,7 +785,7 @@ archive_cleanup_command = 'pg_archivecleanup 
/path/to/archive %r'
 archiving, the server might recycle old WAL segments before the standby
 has received them.  If this occurs, the standby will need to be
 reinitialized from a new base backup.  You can avoid this by setting
-wal_keep_segments to a value large enough to ensure that
+wal_keep_size to a value large enough to ensure that
 WAL segments are

Re: OpenSSL 3.0.0 compatibility

2020-07-08 Thread Peter Eisentraut


On 2020-07-08 16:51, Robert Haas wrote:

On Tue, Jul 7, 2020 at 1:46 PM Peter Eisentraut
 wrote:

Trying to move this along, where would be a good place to define
OPENSSL_API_COMPAT?  The only place that's shared between frontend and
backend code is c.h.  The attached patch does it that way.


So, if we go this way, does that mean that we're not going to pursue
removing dependencies on the deprecated interfaces? I wonder if we
really ought to be doing that too, with preprocessor conditionals.
Otherwise, aren't we putting ourselves in an uncomfortable situation
when the deprecated stuff eventually goes away upstream?


I don't think there is a rush.  The 3.0.0 alphas still support 
interfaces deprecated in 0.9.8 (released 2005).  AFAICT, nothing tagged 
under this API compatibility scheme has ever been removed.  If they 
started doing so, they would presumably do it step by step at the tail 
end, which would still give us several steps before it catches up with us.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: TAP tests and symlinks on Windows



On 7/8/20 11:07 AM, Juan José Santamaría Flecha wrote:
>
> On Wed, Jul 8, 2020 at 3:54 PM Andrew Dunstan
>  > wrote:
>
>
> Incidentally, perl2host is arguably a bad name for this routine -
> there
> is nothing perl-specific about the paths, they are provided by the
> msys
> environment. Maybe virtual2host or some such would be a better
> name, or
> even just host_path or native_path.
>
>
> There is a utility cygpath [1] meant for the conversion between Unix
> and Windows path formats, that might be a meaningful name also.
>
> [1] http://cygwin.net/cygwin-ug-net/cygpath.html
>
>


Oh, good find. But unfortunately it's not present in my msys1 installations.


So we'll still need to use the 'pwd -W` trick for those.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: max_slot_wal_keep_size and wal_keep_segments

On 2020-Jul-09, Fujii Masao wrote:

> I attached the patch that renames wal_keep_segments to wal_keep_size.

Looks good in a quick once-over.  Just two small wording comments:

>
> Independently of max_wal_size,
> -+ 1 most recent WAL files are
> +   most recent  megabytes
> +   WAL files plus one WAL file are
> kept at all times. Also, if WAL archiving is used, old segments can not be
> removed or recycled until they are archived. If WAL archiving cannot keep 
> up
> with the pace that WAL is generated, or if 
> archive_command

This reads a little strange to me.  Maybe "the N most recent megabytes
plus ..."

>   /* determine how many segments slots can be kept by 
> slots ... */
> - keepSegs = XLogMBVarToSegs(max_slot_wal_keep_size_mb, 
> wal_segment_size);
> - /* ... and override by wal_keep_segments as needed */
> - keepSegs = Max(keepSegs, wal_keep_segments);
> + slotKeepSegs = 
> XLogMBVarToSegs(max_slot_wal_keep_size_mb, wal_segment_size);
> + /* ... and override by wal_keep_size as needed */
> + keepSegs = XLogMBVarToSegs(wal_keep_size_mb, 
> wal_segment_size);

Since you change the way these two variables are used, I would not say
"override" in the above comment, nor keep the ellipses; perhaps just
change them to "determine how many segments can be kept by slots" and
"ditto for wal_keep_size".

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: TAP tests and symlinks on Windows

On 2020-Jul-08, Andrew Dunstan wrote:

> On 7/8/20 11:07 AM, Juan José Santamaría Flecha wrote:

> > There is a utility cygpath [1] meant for the conversion between Unix
> > and Windows path formats, that might be a meaningful name also.
> >
> > [1] http://cygwin.net/cygwin-ug-net/cygpath.html
> 
> Oh, good find. But unfortunately it's not present in my msys1 installations.
> 
> So we'll still need to use the 'pwd -W` trick for those.

I think his point is not to use that utility, just to use its name as
the name of the perl routine.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-07-08 Thread Justin Pryzby

On Fri, Jun 19, 2020 at 02:02:29PM +1200, David Rowley wrote:
> if (es->format == EXPLAIN_FORMAT_TEXT)
> {
> ExplainIndentText(es);
> appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
...

> As for this patch, I don't think it's unreasonable to have the 3
> possible HashAgg properties on a single line Other people might
> disagree, so here's an example of what the patch changes it to:
> 
> postgres=# explain analyze select a,sum(b) from ab group by a;
>  QUERY PLAN
> 
>  HashAggregate  (cost=175425.12..194985.79 rows=988 width=12) (actual
> time=1551.920..5281.670 rows=1000 loops=1)
>Group Key: a
>Peak Memory Usage: 97 kB Disk Usage: 139760 kB HashAgg Batches: 832
>->  Seq Scan on ab  (cost=0.00..72197.00 rows=5005000 width=8) (actual 
> time=0.237..451.228 rows=5005000 loops=1)

I've just noticed another inconsistency:
For "Sort", there's no space before "kB", but your patch (9bdb300d) uses a
space for text mode.

+   appendStringInfo(es->str, "Peak Memory Usage: " INT64_FORMAT " 
kB",
+memPeakKb);
+
+   if (aggstate->hash_batches_used > 0)
+   appendStringInfo(es->str, "  Disk Usage: " 
UINT64_FORMAT " kB  HashAgg Batches: %d",

...

+   appendStringInfo(es->str, "Peak Memory Usage: " 
INT64_FORMAT " kB",
+memPeakKb);
+
+   if (hash_batches_used > 0)
+   appendStringInfo(es->str, "  Disk 
Usage: " UINT64_FORMAT " kB  HashAgg Batches: %d",

Hopefully there's the final whitespace inconsistency for this release.

-- 
Justin

Re: Multi-byte character case-folding

On 2020-Jul-08, Robert Haas wrote:

> That's not to say that I have any good idea what to do about this. I
> just disagree with labelling the above case as a success.

Yeah, particularly since it works differently in single-char encodings.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-byte character case-folding

Robert Haas  writes:
> That's not to say that I have any good idea what to do about this. I
> just disagree with labelling the above case as a success.

I can't say that I like it either.  But I'm afraid that changing it now
will create many more problems than it solves.

regards, tom lane

Re: posgres 12 bug (partitioned table)

2020-07-08 Thread Soumyadeep Chakraborty

Hey Amit,

On Tue, Jul 7, 2020 at 7:17 PM Amit Langote  wrote:
> Ah, I see.  You might've noticed that ExecInsert() only ever sees leaf
> partitions, because tuple routing would've switched the result
> relation to a leaf partition by the time we are in ExecInsert().  So,
> table_tuple_insert() always refers to a leaf partition's AM.   Not
> sure if you've also noticed but each leaf partition gets to own a slot
> (PartitionRoutingInfo.pi_PartitionTupleSlot), but currently it is only
> used if the leaf partition attribute numbers are not the same as the
> root partitioned table.  How about we also use it if the leaf
> partition AM's table_slot_callbacks() differs from the root
> partitioned table's slot's tts_ops?  That would be the case, for
> example, if the leaf partition is of Zedstore AM.  In the more common
> cases where all leaf partitions are of heap AM, this would mean the
> original slot would be used as is, that is, if we accept hard-coding
> table_slot_callbacks() to return a "heap" slot for partitioned tables
> as I suggest.

Even then, we will still need to fill in the system columns explicitly as
pi_PartitionTupleSlot will not be filled with system columns after
it comes back out of table_tuple_insert() if we have a non-heap AM.


Regards,
Soumyadeep (VMware)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-07-08 Thread Andres Freund

Hi,

On 2020-05-20 23:44:27 +1200, David Rowley wrote:
> I've attached a patch which implements this.  The new node type is
> called "Result Cache".  I'm not particularly wedded to keeping that
> name, but if I change it, I only want to do it once. I've got a few
> other names I mind, but I don't feel strongly or confident enough in
> them to go and do the renaming.

I'm not convinced it's a good idea to introduce a separate executor node
for this. There's a fair bit of overhead in them, and they will only be
below certain types of nodes afaict. It seems like it'd be better to
pull the required calls into the nodes that do parametrized scans of
subsidiary nodes. Have you considered that?

Greetings,

Andres Freund

Re: Creating a function for exposing memory usage of backend process

2020-07-08 Thread Andres Freund

Hi,

I think this is an incredibly useful feature.


On 2020-07-07 22:02:10 +0900, torikoshia wrote:
> > There can be multiple memory contexts with the same name. So I'm afraid
> > that it's difficult to identify the actual parent memory context from
> > this
> > "parent" column. This is ok when logging memory contexts by calling
> > MemoryContextStats() via gdb. Because child memory contexts are printed
> > just under their parent, with indents. But this doesn't work in the
> > view.
> > To identify the actual parent memory or calculate the memory contexts
> > tree
> > from the view, we might need to assign unique ID to each memory context
> > and display it. But IMO this is overkill. So I'm fine with current
> > "parent"
> > column. Thought? Do you have any better idea?
> 
> Indeed.
> I also feel it's not usual to assign a unique ID, which
> can vary every time the view displayed.

Hm. I wonder if we just could include the address of the context
itself. There might be reasons not to do so (e.g. security concerns
about leaked pointers making attacks easier), but I think it's worth
considering.


> We show each context using a recursive function and this is
> a kind of depth-first search. So, as far as I understand,
> we can identify its parent using both the "parent" column
> and the order of the rows.
> 
> Documenting these things may worth for who want to identify
> the relation between parents and children.
> 
> Of course, in the relational model, the order of relation
> does not have meaning so it's also unusual in this sense..

It makes it pretty hard to write summarizing queries, so I am not a huge
fan of just relying on the order.


> +/*
> + * PutMemoryContextsStatsTupleStore
> + *   One recursion level for pg_get_backend_memory_contexts.
> + */
> +static void
> +PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
> + TupleDesc 
> tupdesc, MemoryContext context,
> + MemoryContext 
> parent, int level)
> +{
> +#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS  9
> + Datum   values[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
> + boolnulls[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
> + MemoryContextCounters stat;
> + MemoryContext child;
> + const char *name = context->name;
> + const char *ident = context->ident;
> +
> + if (context == NULL)
> + return;
> +
> + /*
> +  * To be consistent with logging output, we label dynahash contexts
> +  * with just the hash table name as with MemoryContextStatsPrint().
> +  */
> + if (ident && strcmp(name, "dynahash") == 0)
> + {
> + name = ident;
> + ident = NULL;
> + }
> +
> + /* Examine the context itself */
> + memset(&stat, 0, sizeof(stat));
> + (*context->methods->stats) (context, NULL, (void *) &level, &stat);
> +
> + memset(values, 0, sizeof(values));
> + memset(nulls, 0, sizeof(nulls));
> +
> + values[0] = CStringGetTextDatum(name);
> +
> + if (ident)
> + {
> + int idlen = strlen(ident);
> + char
> clipped_ident[MEMORY_CONTEXT_IDENT_DISPLAY_SIZE];
> +
> + /*
> +  * Some identifiers such as SQL query string can be very long,
> +  * truncate oversize identifiers.
> +  */
> + if (idlen >= MEMORY_CONTEXT_IDENT_DISPLAY_SIZE)
> + idlen = pg_mbcliplen(ident, idlen, 
> MEMORY_CONTEXT_IDENT_DISPLAY_SIZE - 1);
> +

Why?

Greetings,

Andres Freund

Re: TAP tests and symlinks on Windows



On 7/8/20 12:22 PM, Alvaro Herrera wrote:
> On 2020-Jul-08, Andrew Dunstan wrote:
>
>> On 7/8/20 11:07 AM, Juan José Santamaría Flecha wrote:
>>> There is a utility cygpath [1] meant for the conversion between Unix
>>> and Windows path formats, that might be a meaningful name also.
>>>
>>> [1] http://cygwin.net/cygwin-ug-net/cygpath.html
>> Oh, good find. But unfortunately it's not present in my msys1 installations.
>>
>> So we'll still need to use the 'pwd -W` trick for those.
> I think his point is not to use that utility, just to use its name as
> the name of the perl routine.
>


That would be wholly misleading, since it's not needed at all when we're
running running under cygwin.


cheers


andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: min_safe_lsn column in pg_replication_slots view

Alvaro Herrera  writes:
> In absence of other suggestions I gave this the name XLogMBVarToSegs,
> and redefined ConvertToXSegs to use that.  Didn't touch callers in
> xlog.c to avoid pointless churn.  Pushed to both master and 13.

The buildfarm's sparc64 members seem unhappy with this.

regards, tom lane

Reigning in ExecParallelHashRepartitionFirst

2020-07-08 Thread Melanie Plageman

Last week as I was working on adaptive hash join [1] and trying to get
parallel adaptive hash join batch 0 to spill correctly, I noticed what
seemed like a problem with the code to repartition batch 0.

If we run out of space while inserting tuples into the hashtable during
the build phase of parallel hash join and proceed to increase the number
of batches, we need to repartition all of the tuples from the old
generation (when nbatch was x) and move them to their new homes in the
new generation (when nbatch is 2x). Before we do this repartitioning we
disable growth in the number of batches.

Then we repartition the tuples from the hashtable, inserting them either
back into the hashtable or into a batch file. While inserting them into
the hashtable, we call ExecParallelHashTupleAlloc(), and, if there is no
space for the current tuple in the current chunk and growth in the
number of batches is disabled, we go ahead and allocate a new chunk of
memory -- regardless of whether or not we will exceed the space limit.

Below, I've included a test case, which, on master, results in an error
while trying to allocate shared memory. I use a custom data type whose
hash function ensures that the tuples will go to batch 0. With my
attached patch, this test case no longer errors out.

I discussed with Thomas Munro, and it seems this is not the desired
behavior.

We discussed how abandoning a repartitioning effort once we know it is
doomed is an optimization anyway.

To start with, I've attached a patch which bails out of the
ExecParallelHashRepartitionFirst() attempt when allocating a new chunk
of memory would exceed the space limit. We skip
ExecParallelHashRepartitionRest() and engage in the deciding phase as
before.

This means that we will disable growth in the number of batches
if all of the tuples that we attempted to load back into the hashtable
from the evicted tuple queue would stay resident in the hashtable.
Otherwise, we will set growth to indicate we need to try increasing the
number of batches and return, eventually returning NULL to the original
allocation function call and indicating we need to retry repartitioning.

It's important to note that if we disable growth in the deciding phase
due to skew, batch 0, and subsequent batches that had too many tuples to
fit in the space allowed, will simply exceed the space limit while
building the hashtable. This patch does not fix that.

Thomas and I also discussed the potential optimization of bailing out of
repartitioning during repartitioning of all of the other batches (after
batch 0) in ExecParallelHashRepartitionRest(). This would be a good
optimization, however, it isn't addressing a "bug" in the same way that
bailing out in ExecParallelHashRepartitionFirst() is. Also, I hacked on
a few versions of this optimization and it requires more thought. I
would like to propose that as a separate patch and thread.

One note about the code of the attached patch, I added a variable to the
ParallelHashJoinState structure indicating that repartitioning should be
abandoned. Workers only need to check it before allocating a new chunk of
memory during repartitioning. I thought about whether or not it would be
better to make it a ParallelHashGrowth stage, but I wasn't sure whether
or not that made sense.


Test Case


DROP TYPE stub CASCADE;
CREATE TYPE stub AS (value CHAR(8098));

CREATE FUNCTION stub_hash(item stub)
RETURNS INTEGER AS $$
BEGIN
  RETURN 0;
END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;

CREATE FUNCTION stub_eq(item1 stub, item2 stub)
RETURNS BOOLEAN AS $$
BEGIN
  RETURN item1.value = item2.value;
END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;

CREATE OPERATOR = (
  FUNCTION = stub_eq,
  LEFTARG = stub,
  RIGHTARG = stub,
  COMMUTATOR = =,
  HASHES, MERGES
);

CREATE OPERATOR CLASS stub_hash_ops
DEFAULT FOR TYPE stub USING hash AS
  OPERATOR 1 =(stub, stub),
  FUNCTION 1 stub_hash(stub);

DROP TABLE IF EXISTS probeside_batch0;
CREATE TABLE probeside_batch0(a stub);
ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
INSERT INTO probeside_batch0 SELECT '("")' FROM generate_series(1, 13);

DROP TABLE IF EXISTS hashside_wide_batch0;
CREATE TABLE hashside_wide_batch0(a stub, id int);
ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
INSERT INTO hashside_wide_batch0 SELECT '("")', 22 FROM generate_series(1,
200);
ANALYZE probeside_batch0, hashside_wide_batch0;

set min_parallel_table_scan_size = 0;
set parallel_setup_cost = 0;
set enable_hashjoin = on;

set max_parallel_workers_per_gather = 1;
set enable_parallel_hash = on;
set work_mem = '64kB';

explain (analyze, costs off)
SELECT TRIM((probeside_batch0.a).value),
  hashside_wide_batch0.id,
  hashside_wide_batch0.ctid as innerctid,
  TRIM((hashside_wide_batch0.a).value), probeside_batch0.ctid as outerctid
FROM probeside_batch0
LEFT OUTER JOIN hashside_wide_batch0 USING (a);

-

Re: language cleanups in code and docs

2020-07-08 Thread Andres Freund

Hi,

I've pushed most of the changes.

On 2020-06-16 18:59:25 -0400, David Steele wrote:
> On 6/16/20 6:27 PM, Andres Freund wrote:
> > On 2020-06-16 17:14:57 -0400, David Steele wrote:
> > > On 6/15/20 2:22 PM, Andres Freund wrote:
> > 
> > > > 0008: docs: WIP multi-master rephrasing.
> > > > I like neither the new nor the old language much. I'd welcome input.
> > > 
> > > Why not multi-primary?
> > 
> > My understanding of primary is that there really can't be two things
> > that are primary in relation to each other.
> 
> Well, I think the same is true for multi-master and that's pretty common.
> 
> > active/active is probably
> > the most common term in use besides multi-master.
> 
> Works for me and can always be updated later if we come up with something
> better. At least active-active will be easier to search for.

What do you think about the attached?

Greetings,

Andres Freund
>From 254c1f45beb8e9dee840a497d5c040b00015a8f3 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 15 Jun 2020 10:39:15 -0700
Subject: [PATCH] docs: use active-active instead of multi-master replication.

Author: Andres Freund
Reviewed-By: David Steele
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4...@alap3.anarazel.de
---
 doc/src/sgml/high-availability.sgml | 34 +++--
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6a9184f314e..2e090419f18 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -38,12 +38,14 @@
  
   Some solutions deal with synchronization by allowing only one
   server to modify the data.  Servers that can modify data are
-  called read/write, master or primary servers.
-  Servers that track changes in the master are called standby
-  or secondary servers. A standby server that cannot be connected
-  to until it is promoted to a master server is called a warm
-  standby server, and one that can accept connections and serves read-only
-  queries is called a hot standby server.
+  called read/write, primary,
+  active (or master) servers.
+  Servers that track changes in the primary are called
+  standby or secondary servers.
+  A standby server that cannot be connected to until it is promoted to a
+  primary server is called a warm standby server, and
+  one that can accept connections and serves read-only queries is called a
+  hot standby server.
  
 
  
@@ -177,14 +179,14 @@ protocol to make nodes agree on a serializable transactional order.
   
 
   
-   Trigger-Based Master-Standby Replication
+   Trigger-Based Primary-Standby Replication

 
 
- A master-standby replication setup sends all data modification
- queries to the master server.  The master server asynchronously
+ A primary-standby replication setup sends all data modification
+ queries to the primary server.  The primary server asynchronously
  sends data changes to the standby server.  The standby can answer
- read-only queries while the master server is running.  The
+ read-only queries while the primary server is running.  The
  standby server is ideal for data warehouse queries.
 
 
@@ -233,14 +235,14 @@ protocol to make nodes agree on a serializable transactional order.
   
 
   
-   Asynchronous Multimaster Replication
+   Asynchronous Active-Active Replication

 
 
  For servers that are not regularly connected or have slow
  communication links, like laptops or
  remote servers, keeping data consistent among servers is a
- challenge.  Using asynchronous multimaster replication, each
+ challenge.  Using asynchronous active-active replication, each
  server works independently, and periodically communicates with
  the other servers to identify conflicting transactions.  The
  conflicts can be resolved by users or conflict resolution rules.
@@ -250,17 +252,17 @@ protocol to make nodes agree on a serializable transactional order.
   
 
   
-   Synchronous Multimaster Replication
+   Synchronous Active-Active Replication

 
 
- In synchronous multimaster replication, each server can accept
+ In synchronous active-active replication, each server can accept
  write requests, and modified data is transmitted from the
  original server to every other server before each transaction
  commits.  Heavy write activity can cause excessive locking and
  commit delays, leading to poor performance.  Read requests can
  be sent to any server.  Some implementations use shared disk
- to reduce the communication overhead.  Synchronous multimaster
+ to reduce the communication overhead.  Synchronous active-active
  replication is best for mostly read workloads, though its big
  advantage is that any server can accept write requests —
  there is no need to partition workloads between primary and
@@ -351,7 +353,7 @@ protocol to make nodes

Postgres is not able to handle more than 4k tables!?

2020-07-08 Thread Konstantin Knizhnik

I want to explain one bad situation we have encountered with one of our 
customers.
There are ~5000 tables in their database. And what is worse - most of 
them are actively used.

Then several flaws of Postgres make their system almost stuck.

Autovacuum is periodically processing all this 5k relations (because 
them are actively updated).
And as far as most of this tables are small enough autovacuum complete 
processing of them almost in the same time.
As a result autovacuum workers produce ~5k invalidation messages in 
short period of time.


There are several thousand clients, most of which are executing complex 
queries.
So them are not able to process all this invalidation messages and their 
invalidation message buffer is overflown.
Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be 
changed without recompilation of Postgres.

This is problem N1.

As a result resetState is set to true, forcing backends to invalidate 
their caches.
So most of backends loose there cached metadata and have to access 
system catalog trying to reload it.

But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:

#define LOG2_NUM_LOCK_PARTITIONS  4
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

Having just 16 LW-Locks greatly increase conflict probability (taken in 
account that there are 5k tables and totally about 25k relations).
It cause huge lw-lock acquisition time for heap_open and planning stage 
of some queries is increased from milliseconds to several minutes!

Koda!

This is problem number 2. But there is one more flaw we have faced with. 
We have increased LOG2_NUM_LOCK_PARTITIONS to 8

and ... clients start to report "too many LWLocks taken" error.
There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
which relation with NUM_LOCK_PARTITIONS  was not mentioned anywhere.

But there are several places in Postgres where it tries to hold all 
partition locks (for example in deadlock detector).

Definitely if NUM_LOCK_PARTITIONS > MAX_SIMUL_LWLOCKS we get this error.

So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have to 
be replaced with GUCs.
To avoid division, we can specify log2 of this values, so shift can be 
used instead.
And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + 
NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.

Re: language cleanups in code and docs

2020-07-08 Thread David Steele


On 7/8/20 4:39 PM, Andres Freund wrote:

Hi,

I've pushed most of the changes.

On 2020-06-16 18:59:25 -0400, David Steele wrote:

On 6/16/20 6:27 PM, Andres Freund wrote:

On 2020-06-16 17:14:57 -0400, David Steele wrote:

On 6/15/20 2:22 PM, Andres Freund wrote:



0008: docs: WIP multi-master rephrasing.
 I like neither the new nor the old language much. I'd welcome input.


Why not multi-primary?


My understanding of primary is that there really can't be two things
that are primary in relation to each other.


Well, I think the same is true for multi-master and that's pretty common.


active/active is probably
the most common term in use besides multi-master.


Works for me and can always be updated later if we come up with something
better. At least active-active will be easier to search for.


What do you think about the attached?


I think this phrasing in the original/updated version is pretty awkward:

+  A standby server that cannot be connected to until it is promoted to a
+  primary server is called a ...

How about:

+ A standby server that must be promoted to a primary server before
+ accepting connections is called a ...

Other than that it looks good to me.

Regards,
--
-David
da...@pgmasters.net

Re: language cleanups in code and docs

On 2020-Jul-08, David Steele wrote:

> On 7/8/20 4:39 PM, Andres Freund wrote:

> I think this phrasing in the original/updated version is pretty awkward:
> 
> +  A standby server that cannot be connected to until it is promoted to a
> +  primary server is called a ...

Yeah.

> How about:
> 
> + A standby server that must be promoted to a primary server before
> + accepting connections is called a ...

How about just reducing it to "A standby server that doesn't accept
connection is called a ..."?  We don't really need to explain that if
you do promote the standby it will start accept connections -- do we?
It should be pretty obvious if you promote a standby, it will cease to
be a standby in the first place.  This verbiage seems excessive.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: language cleanups in code and docs

2020-07-08 Thread David Steele


On 7/8/20 5:17 PM, Alvaro Herrera wrote:

On 2020-Jul-08, David Steele wrote:


On 7/8/20 4:39 PM, Andres Freund wrote:



I think this phrasing in the original/updated version is pretty awkward:

+  A standby server that cannot be connected to until it is promoted to a
+  primary server is called a ...


Yeah.


How about:

+ A standby server that must be promoted to a primary server before
+ accepting connections is called a ...


How about just reducing it to "A standby server that doesn't accept
connection is called a ..."?  We don't really need to explain that if
you do promote the standby it will start accept connections -- do we?
It should be pretty obvious if you promote a standby, it will cease to
be a standby in the first place.  This verbiage seems excessive.


Works for me.

Regards,
--
-David
da...@pgmasters.net

Re: Is this a bug in pg_current_logfile() on Windows?

[ redirecting to pghackers ]

Thomas Kellerer  writes:
> Tom Lane schrieb am 08.07.2020 um 18:41:
>> Somehow, the reading file is being left in binary mode and thus it's
>> failing to convert \r\n back to plain \n.
>> Now the weird thing about that is I'd have expected "r" and "w" modes
>> to imply Windows text mode already, so that I'd have figured that
>> _setmode call to be a useless no-op.  Apparently on some Windows libc
>> implementations, it's not.  How was your installation built exactly?

> That's the build from EnterpriseDB

What I'd momentarily forgotten is that we don't use Windows' native
fopen().  On that platform, we use pgwin32_fopen which defaults to
binary mode (because _open_osfhandle does).  So the _setmode calls in
syslogger.c are *not* no-ops, and the failure in pg_current_logfile()
is clearly explained by the fact that it's doing nothing to strip
carriage returns.

However ... I put in a test case to try to expose this failure, and
our Windows buildfarm critters remain perfectly happy.  So what's up
with that?  After some digging around, I believe the reason is that
PostgresNode::psql is stripping the \r from pg_current_logfile()'s
result, here:

$$stdout =~ s/\r//g if $TestLib::windows_os;

I'm slightly tempted to extend the test case by verifying on the
server side that the result ends in ".log" with no extra characters.
More generally, I wonder if the above behavior is really a good idea.
It seems to have been added in commit 33f3bbc6d as a hack to avoid
having to think too hard about mingw's behavior, but now I wonder if
it isn't masking other bugs too.  At the very least I think we ought
to tighten the coding to

$$stdout =~ s/\r\n/\n/g if $TestLib::windows_os;

so that it won't strip carriage returns at random.

Meanwhile, back at the ranch, how shall we fix pg_current_logfile()?
I see two credible alternatives:

1. Insert
#ifdef WIN32
_setmode(_fileno(fd), _O_TEXT);
#endif
to make this function match the coding in syslogger.c.

2. Manually strip '\r' if present, independent of platform.

The second alternative would conform to the policy we established in
commit b654714f9, that newline-chomping code should uniformly drop \r.
However, that policy is mainly intended to allow non-Windows builds
to cope with text files that might have been made with a Windows text
editor.  Surely we don't need to worry about a cross-platform source
for the log metafile.  So I'm leaning a bit to the first alternative,
so as not to add useless overhead and complexity on non-Windows builds.

Thoughts?

regards, tom lane

Re: Postgres is not able to handle more than 4k tables!?

Konstantin Knizhnik  writes:
> There are several thousand clients, most of which are executing complex 
> queries.

So, that's really the core of your problem.  We don't promise that
you can run several thousand backends at once.  Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.

> So them are not able to process all this invalidation messages and their 
> invalidation message buffer is overflown.
> Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be 
> changed without recompilation of Postgres.
> This is problem N1.

No, this isn't a problem.  Or at least you haven't shown a reason to
think it is.  Sinval overruns are somewhat routine, and we certainly
test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).

> But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
> It is also hardcoded and can't be changed without recompilation:

> #define LOG2_NUM_LOCK_PARTITIONS  4
> #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

> Having just 16 LW-Locks greatly increase conflict probability (taken in 
> account that there are 5k tables and totally about 25k relations).

> It cause huge lw-lock acquisition time for heap_open and planning stage 
> of some queries is increased from milliseconds to several minutes!

Really?

> This is problem number 2. But there is one more flaw we have faced with. 
> We have increased LOG2_NUM_LOCK_PARTITIONS to 8
> and ... clients start to report "too many LWLocks taken" error.
> There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
> which relation with NUM_LOCK_PARTITIONS  was not mentioned anywhere.

Seems like self-inflicted damage.  I certainly don't recall anyplace
in the docs where we suggest that you can alter that constant without
worrying about consequences.

> So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have to 
> be replaced with GUCs.

I seriously doubt we'd do that.

regards, tom lane

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

2020-07-08 Thread Peter Geoghegan

Hi Michail,

On Sun, Apr 5, 2020 at 10:04 AM Michail Nikolaev
 wrote:
> >  I added
> >  something about this to the nbtree README in commit 9f83468b353.
>
> I have added some updates to your notes in the updated patch version.

My apologies for the extended delay here.

My intention is to commit this patch to the master branch only. While
it seems low risk, I don't see any reason to accept even a small risk
given the lack of complaints from users. We know that this bug existed
many years before you discovered it.

Thanks
-- 
Peter Geoghegan

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

On Thu, 9 Jul 2020 at 04:30, Justin Pryzby  wrote:
> I've just noticed another inconsistency:
> For "Sort", there's no space before "kB", but your patch (9bdb300d) uses a
> space for text mode.

Thanks for the report. I just pushed a fix for that.

David

Re: jsonpath versus NaN

Alexander Korotkov  writes:
> The patchset is attached, sorry for the delay.

> The first patch improves error messages, which appears to be unclear
> for me.  If one applies .double() method to a numeric value, we
> restrict that this numeric value should fit to double precision type.
> If it doesn't fit, the current error message just says the following.

> ERROR: jsonpath item method .double() can only be applied to a numeric value

> But that's confusing, because .double() method is naturally applied to
> a numeric value.  Patch makes this message explicitly report that
> numeric value is out of range for double type.  This patch also adds
> test exercising this error.  When string can't be converted to double
> precision, I think it's better to explicitly say that we expected
> valid string representation of double precision type.

I see your point here, but the English of the replacement error messages
could be improved.  I suggest respectively

numeric argument of jsonpath item method .%s() is out of range for type double 
precision

string argument of jsonpath item method .%s() is not a valid representation of 
a double precision number

As for 0002, I'd rather see the convertJsonbScalar() code changed back
to the way it was, ie just

case jbvNumeric:
numlen = VARSIZE_ANY(scalarVal->val.numeric);
padlen = padBufferToInt(buffer);
...

There is no argument for having an its-not-NaN assertion here when
there aren't similar assertions throughout the jsonb code.

Also, it seems like it'd be smart to reject isinf() and isnan() results
from float8in_internal_opt_error in both executeItemOptUnwrapTarget code
paths, ie numeric source as well as string source.  Yeah, we don't expect
to see those cases in a jbvNumeric (so I wouldn't change the error message
text), but it's cheap insurance.

No other comments.

regards, tom lane

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

On Thu, 9 Jul 2020 at 04:53, Andres Freund  wrote:
>
> On 2020-05-20 23:44:27 +1200, David Rowley wrote:
> > I've attached a patch which implements this.  The new node type is
> > called "Result Cache".  I'm not particularly wedded to keeping that
> > name, but if I change it, I only want to do it once. I've got a few
> > other names I mind, but I don't feel strongly or confident enough in
> > them to go and do the renaming.
>
> I'm not convinced it's a good idea to introduce a separate executor node
> for this. There's a fair bit of overhead in them, and they will only be
> below certain types of nodes afaict. It seems like it'd be better to
> pull the required calls into the nodes that do parametrized scans of
> subsidiary nodes. Have you considered that?

I see 41 different node types mentioned in ExecReScan().  I don't
really think it would be reasonable to change all those.

Here are a couple of examples, one with a Limit below the Result Cache
and one with a GroupAggregate.

postgres=# explain (costs off) select * from pg_Class c1 where relname
= (select relname from pg_Class c2 where c1.relname = c2.relname
offset 1 limit 1);
 QUERY PLAN
-
 Seq Scan on pg_class c1
   Filter: (relname = (SubPlan 1))
   SubPlan 1
 ->  Result Cache
   Cache Key: c1.relname
   ->  Limit
 ->  Index Only Scan using pg_class_relname_nsp_index
on pg_class c2
   Index Cond: (relname = c1.relname)
(8 rows)


postgres=# explain (costs off) select * from pg_Class c1 where relname
= (select relname from pg_Class c2 where c1.relname = c2.relname group
by 1 having count(*) > 1);
 QUERY PLAN
-
 Seq Scan on pg_class c1
   Filter: (relname = (SubPlan 1))
   SubPlan 1
 ->  Result Cache
   Cache Key: c1.relname
   ->  GroupAggregate
 Group Key: c2.relname
 Filter: (count(*) > 1)
 ->  Index Only Scan using pg_class_relname_nsp_index
on pg_class c2
   Index Cond: (relname = c1.relname)
(10 rows)

As for putting the logic somewhere like ExecReScan() then the first
paragraph in [1] are my thoughts on that.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvr-yx9DEJ1Lc9aAy8QZkgEZkTP=3hBRBe83Vwo=kan...@mail.gmail.com

Re: Index Skip Scan (new UniqueKeys)

2020-07-08 Thread Peter Geoghegan

On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> * Btree-implementation contains btree specific code to implement amskip,
>   introduced in the previous patch.

The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:

> +   currItem = &so->currPos.items[so->currPos.lastItem];
> +   itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> +   nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);

But I wonder more generally what the idea here is. The following
comments that immediately follow provide some hints:

> +   /*
> +* To check if we returned the same tuple, try to find a
> +* startItup on the current page. For that we need to update
> +* scankey to match the whole tuple and set nextkey to return
> +* an exact tuple, not the next one. If the nextOffset is the
> +* same as before, it means we are in the loop, return offnum
> +* to the original position and jump further
> +*/

Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:

> +   startOffset = _bt_binsrch(scan->indexRelation,
> + so->skipScanKey,
> + so->currPos.buf);
> +
> +   page = BufferGetPage(so->currPos.buf);
> +   maxoff = PageGetMaxOffsetNumber(page);
> +
> +   if (nextOffset <= startOffset)
> +   {

Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.

-- 
Peter Geoghegan

Re: min_safe_lsn column in pg_replication_slots view

On 2020-Jul-08, Tom Lane wrote:

> Alvaro Herrera  writes:
> > In absence of other suggestions I gave this the name XLogMBVarToSegs,
> > and redefined ConvertToXSegs to use that.  Didn't touch callers in
> > xlog.c to avoid pointless churn.  Pushed to both master and 13.
> 
> The buildfarm's sparc64 members seem unhappy with this.

Hmm.  Some of them are, yeah, but it's not universal.  For example
mussurana and ibisbill are not showing failures.

Anyway the error is pretty strange: only GetWALAvailability is showing a
problem, but the size calculation in the view function def is returning
a negative number, as expected.

So looking at the code in GetWALAvailability, what happens is that
targetSeg >= oldestSlotSeg, but we expect the opposite.  I'd bet for
targetSeg to be correct, since its input is just the slot LSN -- pretty
easy.  But for oldestSlotSeg, we have KeepLogSeg involved.

No immediate ideas ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Is this a bug in pg_current_logfile() on Windows?



On 7/8/20 5:26 PM, Tom Lane wrote:
>
> However ... I put in a test case to try to expose this failure, and
> our Windows buildfarm critters remain perfectly happy.  So what's up
> with that?  After some digging around, I believe the reason is that
> PostgresNode::psql is stripping the \r from pg_current_logfile()'s
> result, here:
>
>   $$stdout =~ s/\r//g if $TestLib::windows_os;
>
> I'm slightly tempted to extend the test case by verifying on the
> server side that the result ends in ".log" with no extra characters.
> More generally, I wonder if the above behavior is really a good idea.
> It seems to have been added in commit 33f3bbc6d as a hack to avoid
> having to think too hard about mingw's behavior, but now I wonder if
> it isn't masking other bugs too.  At the very least I think we ought
> to tighten the coding to
>
>   $$stdout =~ s/\r\n/\n/g if $TestLib::windows_os;
>
> so that it won't strip carriage returns at random.
>

Seems reasonable. If we rip it out completely we'll have to find all the
places it breaks and fix them. And we'll almost certainly get new
breakage. If it's hiding a real bug we'll have to do that, but I'd be
reluctant unless there's actual proof.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: min_safe_lsn column in pg_replication_slots view

Alvaro Herrera  writes:
> On 2020-Jul-08, Tom Lane wrote:
>> The buildfarm's sparc64 members seem unhappy with this.

> Hmm.  Some of them are, yeah, but it's not universal.  For example
> mussurana and ibisbill are not showing failures.

Ah, right, I was thinking they hadn't run since this commit, but they
have.

> Anyway the error is pretty strange: only GetWALAvailability is showing a
> problem, but the size calculation in the view function def is returning
> a negative number, as expected.

We've previously noted what seem to be compiler optimization bugs on
both sparc32 and sparc64; the latest thread about that is
https://www.postgresql.org/message-id/flat/f28f842d-e82b-4e30-a81a-2a1f9fa4a8e1%40www.fastmail.com

This is looking uncomfortably like the same thing.  Tom, could you
experiment with different -O levels on those animals?

regards, tom lane

Re: Is this a bug in pg_current_logfile() on Windows?

Andrew Dunstan  writes:
> On 7/8/20 5:26 PM, Tom Lane wrote:
>> However ... I put in a test case to try to expose this failure, and
>> our Windows buildfarm critters remain perfectly happy.  So what's up
>> with that?  After some digging around, I believe the reason is that
>> PostgresNode::psql is stripping the \r from pg_current_logfile()'s
>> result, here:
>>  $$stdout =~ s/\r//g if $TestLib::windows_os;
>> I'm slightly tempted to extend the test case by verifying on the
>> server side that the result ends in ".log" with no extra characters.
>> More generally, I wonder if the above behavior is really a good idea.
>> It seems to have been added in commit 33f3bbc6d as a hack to avoid
>> having to think too hard about mingw's behavior, but now I wonder if
>> it isn't masking other bugs too.  At the very least I think we ought
>> to tighten the coding to
>>  $$stdout =~ s/\r\n/\n/g if $TestLib::windows_os;
>> so that it won't strip carriage returns at random.

> Seems reasonable. If we rip it out completely we'll have to find all the
> places it breaks and fix them. And we'll almost certainly get new
> breakage. If it's hiding a real bug we'll have to do that, but I'd be
> reluctant unless there's actual proof.

Hard to tell.  What I propose to do right now is change the \r filters
as shown above, and see if the test I added in 004_logrotate.pl starts
to show failures on Windows.  If it does, and no other place does,
I'm willing to be satisfied with that.  If we see *other* failures then
that'd prove that the problem is real, no?

regards, tom lane

Re: min_safe_lsn column in pg_replication_slots view

On 2020-Jul-08, Tom Lane wrote:
> Alvaro Herrera  writes:

> > Anyway the error is pretty strange: only GetWALAvailability is showing a
> > problem, but the size calculation in the view function def is returning
> > a negative number, as expected.
> 
> We've previously noted what seem to be compiler optimization bugs on
> both sparc32 and sparc64; the latest thread about that is
> https://www.postgresql.org/message-id/flat/f28f842d-e82b-4e30-a81a-2a1f9fa4a8e1%40www.fastmail.com
> 
> This is looking uncomfortably like the same thing.

Ouch.  So 12 builds with -O0 but 13 does not?  Did we do something to
sequence.c to work around this problem?  I cannot find anything.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: min_safe_lsn column in pg_replication_slots view

Alvaro Herrera  writes:
> On 2020-Jul-08, Tom Lane wrote:
>> We've previously noted what seem to be compiler optimization bugs on
>> both sparc32 and sparc64; the latest thread about that is
>> https://www.postgresql.org/message-id/flat/f28f842d-e82b-4e30-a81a-2a1f9fa4a8e1%40www.fastmail.com
>> This is looking uncomfortably like the same thing.

> Ouch.  So 12 builds with -O0 but 13 does not?

Unless Tom's changed the animal's config since that thread, yes.

> Did we do something to
> sequence.c to work around this problem?  I cannot find anything.

We did not.  If it's a compiler bug, and one as phase-of-the-moon-
dependent as this seems to be, I'd have zero confidence that any
specific source code change would fix it (barring someone confidently
explaining exactly what the compiler bug is, anyway).  The best we
can do for now is hope that backing off the -O level avoids the bug.

regards, tom lane

RE: Postgres is not able to handle more than 4k tables!?

2020-07-08 Thread tsunakawa.ta...@fujitsu.com

From: Konstantin Knizhnik 
> Autovacuum is periodically processing all this 5k relations (because
> them are actively updated).
> And as far as most of this tables are small enough autovacuum complete
> processing of them almost in the same time.
> As a result autovacuum workers produce ~5k invalidation messages in
> short period of time.

How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)?  It's 
available since PG 12.  It causes autovacuum to not truncate the relation.  
It's the relation truncation what produces those shared invalidation messages.


> But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
> It is also hardcoded and can't be changed without recompilation:
> 
> #define LOG2_NUM_LOCK_PARTITIONS  4
> #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
> 
> Having just 16 LW-Locks greatly increase conflict probability (taken in
> account that there are 5k tables and totally about 25k relations).
> It cause huge lw-lock acquisition time for heap_open and planning stage
> of some queries is increased from milliseconds to several minutes!
> Koda!

The vacuum's relation truncation is also the culprit here, and it can be 
eliminated by the above storage parameter.  It acquires Access Exclusive lock 
on the relation.  Without the strong Access Exclusive lock, just running DML 
statements use the fast path locking, which doesn't acquire the lock manager 
partition lock.

The long lwlock wait is a sad story.  The victim is probably exclusive lockers. 
 When someone holds a shared lock on a lwlock, the exclusive locker has to 
wait.  That's OK.  However, if another share locker comes later, it acquires 
the lwlock even though there're waiting exclusive lockers.  That's unfair, but 
this is the community decision.


Regards
Takayuki Tsunakawa

Re: min_safe_lsn column in pg_replication_slots view

On 2020-Jul-08, Tom Lane wrote:

> Alvaro Herrera  writes:

> > Ouch.  So 12 builds with -O0 but 13 does not?
> 
> Unless Tom's changed the animal's config since that thread, yes.

I verified the configs in branches 12 and 13 in one of the failing
animals, and yes that's still the case.

> > Did we do something to
> > sequence.c to work around this problem?  I cannot find anything.
> 
> We did not.  If it's a compiler bug, and one as phase-of-the-moon-
> dependent as this seems to be, I'd have zero confidence that any
> specific source code change would fix it (barring someone confidently
> explaining exactly what the compiler bug is, anyway).  The best we
> can do for now is hope that backing off the -O level avoids the bug.

An easy workaround might be to add -O0 for that platform in that
directory's Makefile.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Reigning in ExecParallelHashRepartitionFirst

2020-07-08 Thread Melanie Plageman

s/reign/rein/ in $subject
https://www.merriam-webster.com/words-at-play/do-you-rein-in-or-reign-in-something

Re: min_safe_lsn column in pg_replication_slots view

Alvaro Herrera  writes:
> On 2020-Jul-08, Tom Lane wrote:
>> We did not.  If it's a compiler bug, and one as phase-of-the-moon-
>> dependent as this seems to be, I'd have zero confidence that any
>> specific source code change would fix it (barring someone confidently
>> explaining exactly what the compiler bug is, anyway).  The best we
>> can do for now is hope that backing off the -O level avoids the bug.

> An easy workaround might be to add -O0 for that platform in that
> directory's Makefile.

"Back off the -O level in one directory" seems about as misguided as
"back off the -O level in one branch", if you ask me.  There's no
reason to suppose that the problem won't bite us somewhere else next
week.

The previous sparc32 bug that we'd made some effort to run to ground
is described here:
https://www.postgresql.org/message-id/15142.1498165...@sss.pgh.pa.us
We really don't know what aspects of the source code trigger that.
I'm slightly suspicious that we might be seeing the same bug in the
sparc64 builds, but it's just a guess.

regards, tom lane

Re: jsonpath versus NaN

2020-07-08 Thread Alexander Korotkov

On Thu, Jul 9, 2020 at 1:20 AM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > The patchset is attached, sorry for the delay.
>
> > The first patch improves error messages, which appears to be unclear
> > for me.  If one applies .double() method to a numeric value, we
> > restrict that this numeric value should fit to double precision type.
> > If it doesn't fit, the current error message just says the following.
>
> > ERROR: jsonpath item method .double() can only be applied to a numeric value
>
> > But that's confusing, because .double() method is naturally applied to
> > a numeric value.  Patch makes this message explicitly report that
> > numeric value is out of range for double type.  This patch also adds
> > test exercising this error.  When string can't be converted to double
> > precision, I think it's better to explicitly say that we expected
> > valid string representation of double precision type.
>
> I see your point here, but the English of the replacement error messages
> could be improved.  I suggest respectively
>
> numeric argument of jsonpath item method .%s() is out of range for type 
> double precision
>
> string argument of jsonpath item method .%s() is not a valid representation 
> of a double precision number

Good, thank you for corrections!

> As for 0002, I'd rather see the convertJsonbScalar() code changed back
> to the way it was, ie just
>
> case jbvNumeric:
> numlen = VARSIZE_ANY(scalarVal->val.numeric);
> padlen = padBufferToInt(buffer);
> ...
>
> There is no argument for having an its-not-NaN assertion here when
> there aren't similar assertions throughout the jsonb code.
>
> Also, it seems like it'd be smart to reject isinf() and isnan() results
> from float8in_internal_opt_error in both executeItemOptUnwrapTarget code
> paths, ie numeric source as well as string source.  Yeah, we don't expect
> to see those cases in a jbvNumeric (so I wouldn't change the error message
> text), but it's cheap insurance.

OK, corrected as you proposed.

> No other comments.

Revised patches are attached.

I understand both patches as fixes and propose to backpatch them to 12
if no objections.

--
Regards,
Alexander Korotkov


0001-Improve-error-reporting-for-jsonpath-.double-metho-2.patch
Description: Binary data


0002-Forbid-numeric-NaN-in-jsonpath-2.patch
Description: Binary data

Re: pg_resetwal --next-transaction-id may cause database failed to restart.

2020-07-08 Thread movead...@highgo.ca


>> ISTM that a reasonable compromise is that if you use -x (or -c, -m, -O)
>> and the input value is outside the range supported by existing files,
>> then it's a fatal error; unless you use --force, which turns it into
>> just a warning.
 
>One potential problem is that you might be using --force for some
>other reason and end up forcing this, too. But maybe that's OK.
Yes it's true, so I try to add a new option to control this behavior, you
can see it in the last mail with attach.
 
>Perhaps we should consider the idea of having pg_resetwal create the
>relevant clog file and zero-fill it, if it doesn't exist already,
>rather than leaving that to to the DBA or the postmaster binary to do
>it. It seems like that is what people would want to happen in this
>situation.
I have considered this idea, but I think it produces files uncontrolled
by postmaster, so I think it may be unacceptable and give up.

In the case we force to write an unsafe value, we can create or extend
related files I think. Do you have any further idea, I can work out a new
patch.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: A patch for get origin from commit_ts.

2020-07-08 Thread movead...@highgo.ca


>There is a conflict in catversion.h.  If you wish to test the patch,
>please feel free to use the attached where I have updated the
>attribute name to roident.
I think everything is ok, but be careful the new patch is in Windows
format now.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian  wrote:
>
> I was going through this thread and testing and reviewing the patches, I 
> think this is a great feature to have and one which customers would 
> appreciate. I wanted to help out, and I saw a request for a test patch for a 
> GUC to always enable streaming on logical replication. Here's one on top of 
> patchset v31, just in case you still need it. By default the GUC is turned 
> on, I ran the regression tests with it and didn't see any errors.
>

Thanks for showing the interest in patch.  How have you ensured that
streaming is happening?  I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
think with your patch it will allow streaming for cases where we have
large amount of WAL to decode.

I feel you need to add some DEBUG messages (or some other way) to
ensure that all existing and new test cases related to logical
decoding will perform the streaming.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Implement UNLOGGED clause for COPY FROM

2020-07-08 Thread osumi.takami...@fujitsu.com

Hello, hackers.


I've been thinking to suggest
a peformance-oriented feature for COPY FROM.
It's UNLOGGED clause, which means data loading skipping WAL generation.

How to make it work is the following.
1. Aquire ACCESS EXCLUSIVE mode to lock the target table and its indexes.
2. Mark those relations 'unrecoverable' in pg_class/pg_index.
3. Issue one WAL to indicate when COPY UNLOGGED is executed.  
4. Execute the data loading, bypassing WAL generation for data.
5. Sync the data to disk by performing checkpoint.

During the recovery,
I'd like to make postgres recognize both marked 'unrecoverable' flags of the 
second step
and when the command was issued by the third step
in order to recover data that the target table had before the execution of COPY 
UNLOGGED.

Oracle's SQL*Loader has
such a feature called UNRECOVERABLE to boost the loading speed
for severe time limit of workload.


Best,
Takamichi Osumi

Re: Is this a bug in pg_current_logfile() on Windows?

I wrote:
> Andrew Dunstan  writes:
>> Seems reasonable. If we rip it out completely we'll have to find all the
>> places it breaks and fix them. And we'll almost certainly get new
>> breakage. If it's hiding a real bug we'll have to do that, but I'd be
>> reluctant unless there's actual proof.

> Hard to tell.  What I propose to do right now is change the \r filters
> as shown above, and see if the test I added in 004_logrotate.pl starts
> to show failures on Windows.  If it does, and no other place does,
> I'm willing to be satisfied with that.  If we see *other* failures then
> that'd prove that the problem is real, no?

So I did that, and the first report is from bowerbird and it's still
green.  Unless I'm completely misinterpreting what's happening (always
a possibility), that means we're still managing to remove "data"
occurrences of \r.

The most likely theory about that, I think, is that IPC::Run::run already
translated any \r\n occurrences in the psql command's output to plain \n.
Then, the \r generated by pg_current_logfile() would butt up against the
line-ending \n, allowing the "fix" in sub psql to remove valid data.

One thing I noticed while making 91bdf499b is that some of these
substitutions are conditioned on "if $TestLib::windows_os" while others
are conditioned on "if $Config{osname} eq 'msys'".  What is the reason
for this difference?  Is it possible that we only really need to do it
in the latter case?

regards, tom lane

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-08 Thread Ajin Cherian

On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila  wrote:

> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian  wrote:
>
> Thanks for showing the interest in patch.  How have you ensured that
> streaming is happening?  I don't think the proposed patch can ensure
> it for every case because we also rely on logical_decoding_work_mem to
> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
> think with your patch it will allow streaming for cases where we have
> large amount of WAL to decode.
>
>
Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit,
even there it checks the same function ReorderBufferCanStream () and
decides whether to stream or spill. Did I miss something?

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
 * Pick the largest transaction (or subtransaction) and evict it from
 * memory by streaming, if supported. Otherwise, spill to disk.
 */
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
{
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
Assert(txn->total_size > 0);
Assert(rb->size >= txn->total_size);

ReorderBufferStreamTXN(rb, txn);
}
else
{

I will also add debug and test as you suggested.

regards,
Ajin Cherian
Fujitsu Australia

Re: Resetting spilled txn statistics in pg_stat_replication

On Wed, Jul 8, 2020 at 1:14 PM Masahiko Sawada
 wrote:
>
> On Wed, 8 Jul 2020 at 16:04, Amit Kapila  wrote:
> >
> > On Wed, Jul 8, 2020 at 11:28 AM Masahiko Sawada
> >  wrote:
> > >
> > > >
> > > > If we need them to be persistent across time like that, perhaps we 
> > > > simply need to assign oids to replication slots? That might simplify 
> > > > this problem quite a bit?
> > >
> > > Yeah, I guess assigning oids to replication slots in the same way of
> > > oids in system catalogs might not work because physical replication
> > > slot can be created even during recovery. But using a
> > > monotonically-increasing integer as id seems better and straight
> > > forward.
> > >
> >
> > But don't we need to make it WAL logged as well similar to what we do
> > in GetNewObjectId?
>
> Yes. I was thinking that assigning (the maximum number of the existing
> slot id + 1) to a new slot without WAL logging.
>
> >  I am thinking do we really need Oids for slots or
> > is it okay to have some approximate stats in boundary cases?
>
> I think that using oids has another benefit that we don't need to send
> slot name to the stats collector along with the stats. Since the
> maximum size of slot name is NAMEDATALEN and we don't support the
> pgstat message larger than PGSTAT_MAX_MSG_SIZE (1000 bytes), if the
> user wants to increase NAMEDATALEN they might not be able to build.
>

I think NAMEDATALEN is used for many other objects as well and I don't
think we want to change it in foreseeable future, so that doesn't
sound to be a good reason to invent OIDs for slots.  OTOH, I do
understand it would be better to send OIDs than names for slots but I
am just not sure if it is a good idea to invent a new way to generate
OIDs (which is different from how we do it for other objects in the
system) for this purpose.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pgsql-hackers archive broken?

2020-07-08 Thread Tatsuo Ishii

It seems I cannot access to pgsql-hackers archives.
https://www.postgresql.org/list/pgsql-hackers/2020-07/

Error 503 Backend fetch failed

Backend fetch failed
Guru Meditation:

XID: 68609318

Varnish cache server
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian  wrote:
>
> On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila  wrote:
>>
>> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian  wrote:
>>
>> Thanks for showing the interest in patch.  How have you ensured that
>> streaming is happening?  I don't think the proposed patch can ensure
>> it for every case because we also rely on logical_decoding_work_mem to
>> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
>> think with your patch it will allow streaming for cases where we have
>> large amount of WAL to decode.
>>
>
> Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even 
> there it checks the same function ReorderBufferCanStream () and decides 
> whether to stream or spill. Did I miss something?
>
> while (rb->size >= logical_decoding_work_mem * 1024L)
> {

There is a check before above loop:

ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
ReorderBufferTXN *txn;

/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;

This will prevent the streaming/spill to occur.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Implement UNLOGGED clause for COPY FROM





On 2020/07/09 11:36, osumi.takami...@fujitsu.com wrote:

Hello, hackers.


I've been thinking to suggest
a peformance-oriented feature for COPY FROM.
It's UNLOGGED clause, which means data loading skipping WAL generation.


This feature can work safely with wal_level=replica or logical?
Or it can work only with wal_level=minimal? If yes, what is the main
difference between this method and wal_skip_threshold?



How to make it work is the following.
1. Aquire ACCESS EXCLUSIVE mode to lock the target table and its indexes.
2. Mark those relations 'unrecoverable' in pg_class/pg_index.
3. Issue one WAL to indicate when COPY UNLOGGED is executed.
4. Execute the data loading, bypassing WAL generation for data.
5. Sync the data to disk by performing checkpoint.


What happens if the server crashes before #5? Since no WAL for
data-loading can be replayed, the target table should be truncated?

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-08 Thread Dilip Kumar

On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila  wrote:
>
> On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian  wrote:
> >
> > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila  wrote:
> >>
> >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian  wrote:
> >>
> >> Thanks for showing the interest in patch.  How have you ensured that
> >> streaming is happening?  I don't think the proposed patch can ensure
> >> it for every case because we also rely on logical_decoding_work_mem to
> >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
> >> think with your patch it will allow streaming for cases where we have
> >> large amount of WAL to decode.
> >>
> >
> > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, 
> > even there it checks the same function ReorderBufferCanStream () and 
> > decides whether to stream or spill. Did I miss something?
> >
> > while (rb->size >= logical_decoding_work_mem * 1024L)
> > {
>
> There is a check before above loop:
>
> ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> {
> ReorderBufferTXN *txn;
>
> /* bail out if we haven't exceeded the memory limit */
> if (rb->size < logical_decoding_work_mem * 1024L)
> return;
>
> This will prevent the streaming/spill to occur.

I think if the GUC is set then maybe we can bypass this check so that
it can try to stream every single change?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)

On Thu, Jul 2, 2020 at 9:07 AM Amit Kapila  wrote:
>
> On Tue, Jun 30, 2020 at 9:30 AM Amit Kapila  wrote:
> >
> > >
> > > If I am not missing anything then that change was in
> > > lazy_cleanup_index and after this patch, it won't be required because
> > > we are using a different variable name.
> > >
> > > I have combined both the patches now.
> > >
> >
> > I am planning to push this tomorrow if there are no further
> > suggestions/comments.
> >
>
> Pushed.  Now, coming back to the question of the back patch.  I see a
> point in deferring this for 3-6 months or maybe more after PG13 is
> released.  OTOH, this implementation is mainly triggered by issues
> reported in this area and this doesn't seem to be a very invasive
> patch which can cause some de-stabilization in back-branches. I am not
> in a hurry to get this backpatched but still, it would be good if this
> can be backpatched earlier as quite a few people (onlist and EDB
> customers) have reported issues that could have been narrowed down if
> this patch is present in back-branches.
>
> It seems Alvaro and I are in favor of backpatch whereas Andres and
> Justin seem to think it should be deferred until this change has seen
> some real-world exposure.
>
> Anyone else wants to weigh in?
>

Seeing no more responses, it seems better to defer this backpatch till
PG13 is out and we get some confidence in this functionality.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 8:55 AM Dilip Kumar  wrote:
>
> On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila  wrote:
> >
> > On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian  wrote:
> > >
> > > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila  
> > > wrote:
> > >>
> > >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian  wrote:
> > >>
> > >> Thanks for showing the interest in patch.  How have you ensured that
> > >> streaming is happening?  I don't think the proposed patch can ensure
> > >> it for every case because we also rely on logical_decoding_work_mem to
> > >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
> > >> think with your patch it will allow streaming for cases where we have
> > >> large amount of WAL to decode.
> > >>
> > >
> > > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, 
> > > even there it checks the same function ReorderBufferCanStream () and 
> > > decides whether to stream or spill. Did I miss something?
> > >
> > > while (rb->size >= logical_decoding_work_mem * 1024L)
> > > {
> >
> > There is a check before above loop:
> >
> > ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> > {
> > ReorderBufferTXN *txn;
> >
> > /* bail out if we haven't exceeded the memory limit */
> > if (rb->size < logical_decoding_work_mem * 1024L)
> > return;
> >
> > This will prevent the streaming/spill to occur.
>
> I think if the GUC is set then maybe we can bypass this check so that
> it can try to stream every single change?
>

Yeah and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

ALTER TABLE validate foreign key dependency problem

Hi,

I had an ALTER TABLE dependency problem reported to me.  Here's a
simplified version of it:

CREATE TABLE t (a INT, PRIMARY KEY(a));
ALTER TABLE t ADD CONSTRAINT t_fkey FOREIGN KEY (a) REFERENCES t(a) NOT VALID;
ALTER TABLE t VALIDATE CONSTRAINT t_fkey, ALTER a TYPE BIGINT;

Which results in:

ERROR:  could not read block 0 in file "base/12854/16411": read only 0
of 8192 bytes
CONTEXT:  SQL statement "SELECT fk."a" FROM ONLY "public"."t" fk LEFT
OUTER JOIN ONLY "public"."t" pk ON ( pk."a" OPERATOR(pg_catalog.=)
fk."a") WHERE pk."a" IS NULL AND (fk."a" IS NOT NULL)"

What's going on here is that due to the ALTER TYPE, a table rewrite is
pending. The primary key index of the table is also due to be
rewritten which ATExecAddIndex() delays due to the pending table
rewrite.  When we process AT_PASS_MISC level changes and attempt to
validate the foreign key constraint, the table is still pending a
rewrite and the new index still does not exist.
validateForeignKeyConstraint() executes regardless of the pending
rewrite and bumps into the above error during the SPI call while
trying to check the _bt_getrootheight() in get_relation_info().

I think the fix is just to delay the foreign key validation when
there's a rewrite pending until the rewrite is complete.

I also considered that we could just delay all foreign key validations
until phase 3, but I ended up just doing then only when a rewrite is
pending.

David


delay_altertable_foreignkey_validation_during_table_rewrite.patch
Description: Binary data

Re: pgsql-hackers archive broken?

On Thu, 9 Jul 2020 at 15:15, Tatsuo Ishii  wrote:
>
> It seems I cannot access to pgsql-hackers archives.
> https://www.postgresql.org/list/pgsql-hackers/2020-07/
>
> Error 503 Backend fetch failed

I just hit this too.  Cross-posting to www.

David

Re: pgsql-hackers archive broken?

2020-07-08 Thread vignesh C

On Thu, Jul 9, 2020 at 9:28 AM David Rowley  wrote:
>
> On Thu, 9 Jul 2020 at 15:15, Tatsuo Ishii  wrote:
> >
> > It seems I cannot access to pgsql-hackers archives.
> > https://www.postgresql.org/list/pgsql-hackers/2020-07/
> >
> > Error 503 Backend fetch failed
>
> I just hit this too.  Cross-posting to www.
>

I'm also facing this problem:
Error 503 Backend fetch failed
Backend fetch failed
Guru Meditation:
XID: 67771345

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Modifying data type of slot_keep_segs from XLogRecPtr to XLogSegNo

2020-07-08 Thread Kyotaro Horiguchi

At Wed, 8 Jul 2020 21:27:04 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/07/08 15:22, Fujii Masao wrote:
> > On 2020/07/08 11:55, torikoshia wrote:
> >> On 2020-07-08 11:15, Fujii Masao wrote:
> >>> On 2020/07/08 11:02, torikoshia wrote:
>  Hi,
> 
>  Currently, slot_keep_segs is defined as "XLogRecPtr" in KeepLogSeg(),
>  but it seems that should be "XLogSegNo" because this variable is
>  segment number.
> 
>  How do you think?

Yeah, that's my mistake while made bouncing back and forth between
segments and LSN in the code.  I noticed that once but forgotten until
now. Thanks for finding it.

> >>> I agree that using XLogRecPtr for slot_keep_segs is incorrect.
> >>> But this variable indicates the number of segments rather than
> >>> segment no, uint64 seems better. Thought?
> >>
> >> That makes sense.
> >> The number of segments and segment number are different.
> > Yes, so patch attached. I will commit it later.
> 
> Pushed. Thanks!

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: TAP tests and symlinks on Windows

2020-07-08 Thread Juan José Santamaría Flecha

On Wed, Jul 8, 2020 at 7:18 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> On 7/8/20 12:22 PM, Alvaro Herrera wrote:
> > On 2020-Jul-08, Andrew Dunstan wrote:
> >
> >> On 7/8/20 11:07 AM, Juan José Santamaría Flecha wrote:
> >>> There is a utility cygpath [1] meant for the conversion between Unix
> >>> and Windows path formats, that might be a meaningful name also.
> >>>
> >>> [1] http://cygwin.net/cygwin-ug-net/cygpath.html
> >> Oh, good find. But unfortunately it's not present in my msys1
> installations.
> >>
> >> So we'll still need to use the 'pwd -W` trick for those.
> > I think his point is not to use that utility, just to use its name as
> > the name of the perl routine.
>
> That would be wholly misleading, since it's not needed at all when we're
> running running under cygwin.
>

MSYS does not include cygpath(), but MSYS2 does. I see why the name could
be confusing outside cygwin, but that is a given and it would point to a
utility that it mimics. Maybe a note for future reference could be enough.

msys_to_windows_path() seems too long, but is hard to misunderstand.

Regards,

Juan José Santamaría Flecha

Re: [doc] modifying unit from characters to bytes





On 2020/07/08 17:12, Daniel Gustafsson wrote:

On 8 Jul 2020, at 10:05, Fujii Masao  wrote:

On 2020/07/08 16:17, Daniel Gustafsson wrote:

On 8 Jul 2020, at 04:25, Fujii Masao  wrote:

On 2020/07/08 10:54, torikoshia wrote:

Hi,
The manual describes the size of pg_stat_activity.query
as below:
  | By default the query text is truncated at 1024 characters;
When considering multibyte characters, it seems more
accurate to change the unit from "characters" to "bytes".


Agreed. Barring any objection, I will commit this patch.

+1 to commit this patch, following the link to track_activity_query_size it's
even specified to be bytes there.  IIRC the NULL terminator is also included in
the 1024 bytes which prevents it from being 1024 characters even for
non-multibyte.


Yes, so we should document "truncated at 1023 bytes" for accuracy, instead?
This might be more confusing for users, though


I think that's overcomplicating things, since we do (will) specify bytes and
not characters.


Agreed. So I pushed the proposed patch. Thanks!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: max_slot_wal_keep_size and wal_keep_segments

2020-07-08 Thread Kyotaro Horiguchi

At Thu, 9 Jul 2020 00:37:57 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/07/02 2:18, David Steele wrote:
> > On 7/1/20 10:54 AM, Alvaro Herrera wrote:
> >> On 2020-Jul-01, Fujii Masao wrote:
> >>
> >>> On 2020/07/01 12:26, Alvaro Herrera wrote:
>  On 2020-Jun-30, Fujii Masao wrote:
> 
> > When I talked about max_slot_wal_keep_size as new feature in v13
> > at the conference, I received the question like "Why are the units of
> > setting values in max_slot_wal_keep_size and wal_keep_segments
> > different?"
> > from audience. That difference looks confusing for users and
> > IMO it's better to use the same unit for them. Thought?
> 
>  Do we still need wal_keep_segments for anything?
> >>>
> >>> Yeah, personally I like wal_keep_segments because its setting is very
> >>> simple and no extra operations on replication slots are necessary.
> >>
> >> Okay.  In that case I +1 the idea of renaming to wal_keep_size.
> > +1 for renaming to wal_keep_size.
> 
> I attached the patch that renames wal_keep_segments to wal_keep_size.

It fails on 019_replslot_limit.pl for uncertain reason to me..


@@ -11323,7 +11329,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, 
TimeLineID *stoptli_p)
 * If archiving is enabled, wait for all the required WAL files to be
 * archived before returning. If archiving isn't enabled, the required 
WAL
 * needs to be transported via streaming replication (hopefully with
-* wal_keep_segments set high enough), or some more exotic mechanism 
like
+* wal_keep_size set high enough), or some more exotic mechanism like
 * polling and copying files from pg_wal with script. We have no 
knowledge

Isn't this time a good chance to mention replication slots?


-   "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();");
+   "ALTER SYSTEM SET wal_keep_size to '128MB'; SELECT pg_reload_conf();");

wal_segment_size to 1MB here so, that conversion is not correct.
(However, that test works as long as it is more than
max_slot_wal_keep_size so it's practically no problem.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pgsql-hackers archive broken?

On 2020-Jul-09, vignesh C wrote:

> On Thu, Jul 9, 2020 at 9:28 AM David Rowley  wrote:
> >
> > On Thu, 9 Jul 2020 at 15:15, Tatsuo Ishii  wrote:
> > >
> > > It seems I cannot access to pgsql-hackers archives.
> > > https://www.postgresql.org/list/pgsql-hackers/2020-07/
> > >
> > > Error 503 Backend fetch failed

Should be fixed now.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Implement UNLOGGED clause for COPY FROM

2020-07-08 Thread David G. Johnston

On Wednesday, July 8, 2020, osumi.takami...@fujitsu.com <
osumi.takami...@fujitsu.com> wrote:
>
> 5. Sync the data to disk by performing checkpoint.
>

This step seems to invalidate the idea outright.  The checkpoint command is
superuser only and isn’t table specific.  This seems to require both those
things to be changed.

Aside from that, though, how does this improve upon the existing capability
to copy into an unlogged temporary table?

David J.

Re: pg_resetwal --next-transaction-id may cause database failed to restart.