Re: Unnecessary delay in streaming replication due to replay lag

2020-11-20 Thread Michael Paquier
On Tue, Sep 15, 2020 at 05:30:22PM +0800, lchch1...@sina.cn wrote:
> I read the code and tested the patch; it runs well on my side, and I have
> several issues with the patch.

+   RequestXLogStreaming(ThisTimeLineID,
+startpoint,
+PrimaryConnInfo,
+PrimarySlotName,
+wal_receiver_create_temp_slot);

This patch thinks that it is fine to request streaming even if
PrimaryConnInfo is not set, but that's not fine.

Anyway, I don't quite understand what you are trying to achieve here.
"startpoint" is used to request the beginning of streaming.  It is
roughly the consistency LSN + some alpha with some checks on WAL
pages (those WAL page checks are not acceptable as they make
maintenance harder).  What about the case where consistency is
reached but there are many segments still ahead that need to be
replayed?  Your patch would cause streaming to begin too early, and
a manual copy of segments is not a rare thing as in some environments
a bulk copy of segments can make the catchup of a standby faster than
streaming.

It seems to me that what you are looking for here is some kind of
pre-processing before entering the redo loop to determine the LSN
that could be reused for the fast streaming start, which should match
the end of the WAL present locally.  In short, you would need a
XLogReaderState that begins a scan of WAL from the redo point until it
cannot find anything more, and use the last LSN found as a base to
begin requesting streaming.  The question of timeline jumps can also
be very tricky, but it could also be possible to not allow this option
if a timeline jump happens while attempting to guess the end of WAL
ahead of time.  Another thing: could it be useful to have an extra
mode to begin streaming without waiting for consistency to finish?
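
Just to illustrate the pre-scan idea above, here is a rough sketch
(hypothetical code, not from the patch; it assumes an xlogreader already set
up to read the local pg_wal, follows the v13-era xlogreader API, and ignores
the timeline-jump question entirely):

/*
 * Scan forward from the redo point and remember the end of the last
 * record that can be read from local WAL.  That LSN would be the base
 * for requesting streaming.
 */
static XLogRecPtr
guess_end_of_local_wal(XLogReaderState *xlogreader, XLogRecPtr redoptr)
{
	char	   *errormsg;
	XLogRecPtr	endptr = redoptr;

	XLogBeginRead(xlogreader, redoptr);
	while (XLogReadRecord(xlogreader, &errormsg) != NULL)
		endptr = xlogreader->EndRecPtr;	/* end of last readable record */

	return endptr;
}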
--
Michael




Re: Different results between PostgreSQL and Oracle for "for update" statement

2020-11-20 Thread Andy Fan
On Thu, Nov 19, 2020 at 11:49 PM Tom Lane  wrote:

> Andy Fan  writes:
> > create table su (a int, b int);
> > insert into su values(1, 1);
>
> > - session 1:
> > begin;
> > update su set b = 2 where b = 1;
>
> > - sess 2:
> > select * from su where a in (select a from su where b = 1) for update;
>
> This'd probably work the way you expect if there were "for update"
> in the sub-select as well.  As is, the sub-select will happily
> return "1".
>
> regards, tom lane
>

Hi Tom:

Thank you for your attention. Your suggestion would fix the issue.  However,
the difference will cause some risks when users move their applications from
Oracle to PostgreSQL, so I'd like to think about which behavior is more
reasonable.
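
For reference, this is the session-2 query that your suggestion amounts to
(illustration only):

select * from su
 where a in (select a from su where b = 1 for update)
   for update;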

In our current pg strategy, we would get

select * from su where a in (select a from su where b = 1) for update;

 a | b
---+---
 1 | 2
(1 row)

The data is obtained in 4 steps:

1. In the subquery, it gets su(a=1, b=1).
2. In the outer query, it is blocked.
3. Session 1 commits, so session 2 can continue.
4. The outer query gets su(a=1, b=2).

Comparing the results of step 1 and step 4 in the same query, it
looks like we are using 2 different snapshots for the same query,
which I think is a bit strange.

If we think it is an issue, I think we can fix it with something like this:

+/*
+ * We should use the same RowMarkType for the RangeTblEntry
+ * if the underlying relation is the same. Doesn't handle SubQuery for now.
+ */
+static RowMarkType
+select_rowmark_type_for_rangetable(RangeTblEntry *rte,
+                                   List *rowmarks,
+                                   LockClauseStrength strength)
+{
+   ListCell *lc;
+   foreach(lc, rowmarks)
+   {
+       PlanRowMark *rc = lfirst_node(PlanRowMark, lc);
+       if (rc->relid == rte->relid)
+           return rc->markType;
+   }
+   return select_rowmark_type(rte, strength);
+}
+
 /*
  * preprocess_rowmarks - set up PlanRowMarks if needed
  */
@@ -2722,6 +2743,7 @@ preprocess_rowmarks(PlannerInfo *root)
newrc->strength = rc->strength;
newrc->waitPolicy = rc->waitPolicy;
newrc->isParent = false;
+  newrc->relid = rte->relid;

prowmarks = lappend(prowmarks, newrc);
}
@@ -2742,18 +2764,18 @@ preprocess_rowmarks(PlannerInfo *root)
newrc = makeNode(PlanRowMark);
newrc->rti = newrc->prti = i;
newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-   newrc->markType = select_rowmark_type(rte, LCS_NONE);
+   newrc->markType = select_rowmark_type_for_rangetable(rte, prowmarks, LCS_NONE);
newrc->allMarkTypes = (1 << newrc->markType);
newrc->strength = LCS_NONE;
newrc->waitPolicy = LockWaitBlock;  /* doesn't matter */
newrc->isParent = false;
-
+   newrc->relid = rte->relid;
prowmarks = lappend(prowmarks, newrc);
}
-
root->rowMarks = prowmarks;
 }

+
 /*
  * Select RowMarkType to use for a given table
  */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index ac5685da64..926086a69a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1103,6 +1103,7 @@ typedef struct PlanRowMark
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
+	Oid			relid;			/* relation oid */
 } PlanRowMark;

Do you think it is a bug, and is the above method the right way to go?

-- 
Best Regards
Andy Fan


Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted

2020-11-20 Thread Nail Carpenter
Yes, it's my fault. You're right.

Pavel Borisov  wrote on Thursday, November 19, 2020 at 11:55 PM:

>> One minor thing: the modification here seems unnecessary,
>> right?
>>
>> > mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
>> > {
>> > char *path;
>> > - int ret;
>> > + int ret = 0;
>> > path = relpath(rnode, forkNum
>
>
> I suppose it is indeed necessary, as otherwise the result of the comparison
> is not defined in the case of the 'else' block in mdunlinkfork():
> 346 else
> 347 {
> 348 /* Prevent other backends' fds from holding on to the disk space */
> 349 do_truncate(path);
> .
> 356  * Delete any additional segments.
> 357  */
> 358 if (ret >= 0)
> --^^^
>
> --
> Best regards,
> Pavel Borisov
>
> Postgres Professional: http://postgrespro.com 
>

So in the present logic, *ret* is always 0 if we are not in recovery mode
(and the other *if* conditions are not satisfied). But when the *if* condition
is satisfied, it is possible to skip the deletion of additional segments.
Considering that our goal is to always try to unlink additional segments,
*ret* seems unnecessary here. The code flow would look like:

> if (isRedo || .)
> {
> int ret;  /* move to here */
> 
> }
> else
> { }
>
> /* Delete any additional segments. */
> if (true)
> ...

Or is there any reason to allow us to skip the attempt to delete additional
segments in recovery mode?


Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-20 Thread Bharath Rupireddy
On Fri, Nov 20, 2020 at 12:59 PM Peter Eisentraut
 wrote:
>
> On 2020-11-20 06:37, Michael Paquier wrote:
> >>> But if you consider materialized views as a variant of normal views,
> >>> then the INSERT privilege would be applicable if you pass an INSERT on
> >>> the materialized view through to the underlying tables, like for a view.
> > INSERT to materialized views is not supported, but perhaps you mean
> > having a variant of auto updatable for matviews?  I am not sure how to
> > clearly define that.
>
> Not currently, but it could be a future feature.  Basically an insert
> would be passed on to the underlying tables (using INSTEAD triggers),
> and then a refresh would be triggered automatically.
>

Sounds interesting! Just a thought: instead of auto updating/refreshing the
materialized view for every single row inserted, maybe we could do it for a
batch of rows.

If not with triggers, another way to achieve the auto-updatable matviews
functionality is to have a dedicated bgworker (switched off/not spawned by
default). This worker can get the list of matviews and, if the number of
rows changed in the underlying tables crosses a certain configurable limit,
refresh them using the existing refresh-matview infrastructure, as sketched
below. Thoughts?
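
As a very rough sketch (the matview name is made up, and the threshold
bookkeeping is omitted), the worker could enumerate the matviews and then
call into the existing refresh path for each one:

-- list all materialized views in the database (relkind 'm')
SELECT c.oid::regclass AS matview
FROM pg_class c
WHERE c.relkind = 'm';

-- existing infrastructure the worker would invoke for a qualifying matview
REFRESH MATERIALIZED VIEW some_matview;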

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-20 Thread Daniel Gustafsson
> On 20 Nov 2020, at 03:31, Michael Paquier  wrote

> pg_strong_random.c needs a pgindent run, there are two inconsistent
> diffs.  Looks fine except for those nits.

Agreed, this is the least complicated (and most readable) we can make this
file, especially if we add more providers.  +1.

cheers ./daniel




Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-20 Thread Bharath Rupireddy
On Fri, Nov 20, 2020 at 11:07 AM Michael Paquier  wrote:
>
> Thanks.  Based on what Peter has said, the ACL_INSERT check in
> intorel_startup() could just be removed, and the tests of matview.sql
> and select_into.sql would need some cleanup.  We could keep around
> some scenarios with some follow-up INSERT queries after the initial
> creation.
>

Thanks! Attaching the patch. Please review it.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 1275f235611a4f8badb1edd6f999b557bb1ed291 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 20 Nov 2020 14:36:33 +0530
Subject: [PATCH v1] Do not check INSERT privilege in CTAS and MatView

To keep CTAS in sync with the SQL standard, do not check for INSERT
privilege.
---
 .../sgml/ref/create_materialized_view.sgml|   3 +-
 doc/src/sgml/ref/create_table_as.sgml |   5 +-
 src/backend/commands/createas.c   |  26 +
 src/test/regress/expected/matview.out |  30 +++--
 src/test/regress/expected/select_into.out | 109 +-
 src/test/regress/sql/matview.sql  |  16 ++-
 src/test/regress/sql/select_into.sql  |  59 --
 7 files changed, 111 insertions(+), 137 deletions(-)

diff --git a/doc/src/sgml/ref/create_materialized_view.sgml b/doc/src/sgml/ref/create_materialized_view.sgml
index e4ea049eff..c8dee0f99e 100644
--- a/doc/src/sgml/ref/create_materialized_view.sgml
+++ b/doc/src/sgml/ref/create_materialized_view.sgml
@@ -52,8 +52,7 @@ CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
   
CREATE MATERIALIZED VIEW requires
CREATE privilege on the schema used for the materialized
-   view.  If using WITH DATA, the default,
-   INSERT privilege is also required.
+   view. INSERT privilege is not required.
   
  
 
diff --git a/doc/src/sgml/ref/create_table_as.sgml b/doc/src/sgml/ref/create_table_as.sgml
index 2cac4e3ec0..6fb4fa6f44 100644
--- a/doc/src/sgml/ref/create_table_as.sgml
+++ b/doc/src/sgml/ref/create_table_as.sgml
@@ -56,9 +56,8 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
 
   
CREATE TABLE AS requires CREATE
-   privilege on the schema used for the table.  If using
-   WITH DATA, the default, INSERT
-   privilege is also required.
+   privilege on the schema used for the table. However, 
+   INSERT privilege is not required as per the SQL standard.
   
  
 
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 37649eafa8..1f47ea99b5 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -432,7 +432,6 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	DR_intorel *myState = (DR_intorel *) self;
 	IntoClause *into = myState->into;
 	bool		is_matview;
-	char		relkind;
 	List	   *attrList;
 	ObjectAddress intoRelationAddr;
 	Relation	intoRelationDesc;
@@ -443,7 +442,6 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 
 	/* This code supports both CREATE TABLE AS and CREATE MATERIALIZED VIEW */
 	is_matview = (into->viewQuery != NULL);
-	relkind = is_matview ? RELKIND_MATVIEW : RELKIND_RELATION;
 
 	/*
 	 * Build column definitions using "pre-cooked" type and collation info. If
@@ -505,29 +503,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	 */
 	intoRelationDesc = table_open(intoRelationAddr.objectId, AccessExclusiveLock);
 
-	/*
-	 * Check INSERT permission on the constructed table.  Skip this check if
-	 * WITH NO DATA is specified as only a table gets created with no tuples
-	 * inserted, that is a case possible when using EXPLAIN ANALYZE or
-	 * EXECUTE.
-	 */
-	if (!into->skipData)
-	{
-		RangeTblEntry *rte;
-
-		rte = makeNode(RangeTblEntry);
-		rte->rtekind = RTE_RELATION;
-		rte->relid = intoRelationAddr.objectId;
-		rte->relkind = relkind;
-		rte->rellockmode = RowExclusiveLock;
-		rte->requiredPerms = ACL_INSERT;
-
-		for (attnum = 1; attnum <= intoRelationDesc->rd_att->natts; attnum++)
-			rte->insertedCols = bms_add_member(rte->insertedCols,
-			   attnum - FirstLowInvalidHeapAttributeNumber);
-
-		ExecCheckRTPerms(list_make1(rte), true);
-	}
+	/* We do not check INSERT privilege here for SQL standard compatibility. */
 
 	/*
 	 * Make sure the constructed table does not have RLS enabled.
diff --git a/src/test/regress/expected/matview.out b/src/test/regress/expected/matview.out
index 328c3118b6..da4c73aa4f 100644
--- a/src/test/regress/expected/matview.out
+++ b/src/test/regress/expected/matview.out
@@ -589,22 +589,29 @@ SELECT * FROM mvtest2;
 ERROR:  materialized view "mvtest2" has not been populated
 HINT:  Use the REFRESH MATERIALIZED VIEW command.
 ROLLBACK;
--- INSERT privileges if relation owner is not allowed to insert.
+--
+-- INSERT privilege for CREATE and REFRESH MATERIALIZED VIEW.
+-- Queries succeed even though the owner has no INSERT privilege as we do not
+-- check for it.
+--
 CREATE SCHEMA matview_schema;
 CREATE USER regress_

Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada  wrote:
>
> On Thu, Nov 19, 2020 at 8:02 PM Simon Riggs  wrote:
> >
> > On Wed, 18 Nov 2020 at 17:59, Robert Haas  wrote:
> > >
> > > On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs  
> > > wrote:
> > > > Patches attached.
> > > > 1. vacuum_anti_wraparound.v2.patch
> > > > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1)
> > >
> > > I don't like the use of ANTI_WRAPAROUND as a name for this new option.
> > > Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something
> > > else, but I dislike anti-wraparound.
> >
> > -1 for using the term AGGRESSIVE, which seems likely to offend people.
> > I'm sure a more descriptive term exists.
>
> Since we use the term aggressive scan in the docs, I personally don't
> feel it's unnatural. But since this option also disables index
> cleanup when not enabled explicitly, I'm concerned a bit that users might
> get confused. I came up with some names like FREEZE_FAST and
> FREEZE_MINIMAL but I'm not sure these are better.

FREEZE_FAST seems good.

> BTW if this option also disables index cleanup for faster freezing,
> why don't we disable heap truncation as well?

Good idea

-- 
Simon Riggs    http://www.2ndQuadrant.com/




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-20 Thread Bharath Rupireddy
On Wed, Nov 18, 2020 at 5:52 PM Fujii Masao 
wrote:
>
> > handle_sigterm() and die() are similar except that die() has extra
> > handling (below) for exiting immediately when waiting for input/command
> > from the client.
> >  /*
> >   * If we're in single user mode, we want to quit immediately - we can't
> >   * rely on latches as they wouldn't work when stdin/stdout is a file.
> >   * Rather ugly, but it's unlikely to be worthwhile to invest much more
> >   * effort just for the benefit of single user mode.
> >   */
> >  if (DoingCommandRead && whereToSendOutput != DestRemote)
> >  ProcessInterrupts();
> >
> > Having this extra handling is correct for normal backends: since they can
> > connect directly to clients for reading input commands, the above if
> > condition may become true and ProcessInterrupts() may be called to
> > handle the SIGTERM immediately.
> >
> > Note that DoingCommandRead can never be true in bg workers, as it is
> > set to true only in the normal backend's PostgresMain(). If we use
> > die() (instead of custom SIGTERM handlers such as handle_sigterm()) in
> > a bg worker/non-backend process, there is no chance that the above
> > part of the code gets hit and the interrupts are processed
> > immediately. Also, here are the bg worker processes that use die()
> > instead of their own custom handlers: parallel workers, autoprewarm
> > worker, autovacuum worker, logical replication launcher and apply
> > worker, wal sender process.
> >
> > I think we can also use die() instead of handle_sigterm() in
> > test_shm_mq/worker.c.
> >
> > I attached a patch for this change.
>
> Thanks for the patch! It looks good to me.
>

Thanks.

>
> BTW, I tried to find the past discussion about why die() was not used,
> but I could not yet.
>

I think the commit 2505ce0 that altered the comment explains it:
handle_sigterm() is die() stripped down to remove the if (DoingCommandRead &&
whereToSendOutput != DestRemote) part, since the test_shm_mq bg worker (or,
for that matter, any bg worker) does not have an equivalent of the regular
backend's command-read loop.

--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -60,13 +60,9 @@ test_shm_mq_main(Datum main_arg)
 	 *
 	 * We want CHECK_FOR_INTERRUPTS() to kill off this worker process just as
 	 * it would a normal user backend.  To make that happen, we establish a
-	 * signal handler that is a stripped-down version of die().  We don't have
-	 * any equivalent of the backend's command-read loop, where interrupts can
-	 * be processed immediately, so make sure ImmediateInterruptOK is turned
-	 * off.
+	 * signal handler that is a stripped-down version of die().
 	 */
 	pqsignal(SIGTERM, handle_sigterm);
-	ImmediateInterruptOK = false;
 	BackgroundWorkerUnblockSignals();

>
> > Attaching another patch that has replaced custom SIGTERM handler
> > worker_spi_sigterm() with die() and custom SIGHUP handler
> > worker_spi_sighup() with standard SignalHandlerForConfigReload().
>
> This patch also looks good to me.
>

Thanks.
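
For reference, in both modules the change essentially boils down to
registering the standard handlers instead of the custom ones, roughly like
this (a sketch, not the actual hunks):

	/* use the stock handlers; die() just latches the terminate request,
	 * and CHECK_FOR_INTERRUPTS() acts on it at the next safe point */
	pqsignal(SIGHUP, SignalHandlerForConfigReload);
	pqsignal(SIGTERM, die);
	BackgroundWorkerUnblockSignals();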

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 03:54, Masahiko Sawada  wrote:

> > > So +1 for this idea.
> >
> > Patch to do this attached, for discussion.
>
> Thank you for the patch!
>
> +*
> +* Once we decide to dirty the data block we may as well 
> freeze
> +* any tuples that are visible to all, since the additional
> +* cost of freezing multiple tuples is low.
>
> I'm concerned that always freezing all tuples when we're going to make
> the page dirty would affect the existing vacuum workload much. The
> additional cost of freezing multiple tuples would be low but if we
> freeze tuples we would also need to write WAL, which is not negligible
> overhead I guess. In the worst case, if a table has dead tuples on all
> pages we process them, but with this patch, in addition to that, we
> will end up not only freezing all live tuples but also writing
> XLOG_HEAP2_FREEZE_PAGE WAL for all pages. So I think it would be
> better either to freeze all tuples if we find a tuple that needs to be
> frozen or to make this behavior work only if the new VACUUM option is
> specified.

The additional cost of freezing is sizeof(xl_heap_freeze_tuple) = 11 bytes

I guess there is some overhead for writing the WAL record itself, the
only question is how much. If that is a concern then we definitely
don't want to do that only when using FAST_FREEZE, since that would
slow it down when we want to speed it up.

I've updated the patch to match your proposal, so we can compare. It
seems to be a shorter patch.

(This patch is an optimization that is totally independent of the
other proposals on this thread.)

-- 
Simon Riggs    http://www.EnterpriseDB.com/


one_freeze_then_max_freeze.v6.patch
Description: Binary data


Re: Asynchronous Append on postgres_fdw nodes.

2020-11-20 Thread Kyotaro Horiguchi
Hello.

I looked through the nodeAppend.c and postgres_fdw.c parts, which I think
are the core of this patch.

-* figure out which subplan we are currently processing
+* try to get a tuple from async subplans
+*/
+   if (!bms_is_empty(node->as_needrequest) ||
+       (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
+   {
+   if (ExecAppendAsyncGetNext(node, &result))
+   return result;

The function ExecAppendAsyncGetNext() is called only here and contains
only 31 lines.  It doesn't seem to me that the separation makes the code
more readable.

-   /* choose new subplan; if none, we're done */
-   if (!node->choose_next_subplan(node))
+   /* wait or poll async events */
+   if (!bms_is_empty(node->as_asyncpending))
+   {
+   Assert(!node->as_syncdone);
+   Assert(bms_is_empty(node->as_needrequest));
+   ExecAppendAsyncEventWait(node);

You moved the function to wait for events from execAsync to
nodeAppend. The former is a generic module that can be used from any
kind of executor nodes, but the latter is specialized for nodeAppend.
In other words, the abstraction level is lowered here.  What is the
reason for the change?


+   /* Perform the actual callback. */
+   ExecAsyncRequest(areq);
+   if (ExecAppendAsyncResponse(areq))
+   {
+   Assert(!TupIsNull(areq->result));
+   *result = areq->result;

Putting aside the names of the functions, the first two functions are
used only this way, at only two places. ExecAsyncRequest(areq) tells the
fdw to store the first tuple among the already received ones into areq,
and ExecAppendAsyncResponse(areq) checks that the result is actually
set. Finally the result is retrieved directly from areq->result.
What is the reason that the two functions exist separately?


+   /* Perform the actual callback. */
+   ExecAsyncNotify(areq);

Mmm. The usage of the function (or its name) looks completely reversed
to me.  I think the FDW should NOTIFY the exec nodes that a new tuple
has become available, but the reverse is nonsense. What the function is
actually doing is to REQUEST the fdw to fetch tuples that are expected to
have arrived, which is different from what the name suggests.


postgres_fdw.c

> postgresIterateForeignScan(ForeignScanState *node)
> {
>   PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
>   TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
> 
>   /*
>* If this is the first call after Begin or ReScan, we need to create 
> the
>* cursor on the remote side.
>*/
>   if (!fsstate->cursor_exists)
>   create_cursor(node);

With the patch, cursors are also created in another place, so at least
the comment is wrong.  That being said, I think we should unify the
code except for the differences between async and sync.  For example, if
fetch_more_data_begin() needs to be called only for async
fetching, the cursor should be created before calling the function, in
the code path common with sync fetching.


+
+   /* If this was the second part of an async request, we must fetch until NULL. */
+   if (fsstate->async_aware)
+   {
+       /* call once and raise error if not NULL as expected? */
+       while (PQgetResult(conn) != NULL)
+           ;
+       fsstate->conn_state->async_query_sent = false;
+   }

PQgetResult() receives the result of one query at a time. This code means
several queries (FETCHes) were queued up, and we discard every result
except the last one.  Actually the res is just PQclear'd right after, so
this just discards *all* results of possibly more than one FETCH.  I
think something's wrong if we need this.
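
For reference, the usual libpq idiom for draining whatever results are still
pending, without leaking them, is along these lines (a generic sketch, not
code from the patch):

	PGresult   *res;

	/* PQgetResult() returns NULL once no more results are pending */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);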

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-11-20 Thread Masahiko Sawada
On Fri, Nov 20, 2020 at 12:56 PM Peter Geoghegan  wrote:
>
> On Thu, Nov 19, 2020 at 5:58 PM Masahiko Sawada  wrote:
> > Several months passed from the discussion. We decided not to do
> > anything on back branches but IIUC the fundamental issue is not fixed
> > yet. The issue pointed out by Andres that we should leave the index AM
> > to decide whether doing vacuum cleanup or not when INDEX_CLEANUP is
> > specified is still valid. Is that right?
>
> I don't remember if Andres actually said that publicly, but I
> definitely did. I do remember discussing this with him privately, at
> which point he agreed with what I said. Which you just summarized
> well.
>
> > For HEAD, there was a discussion that we change lazy vacuum and
> > bulkdelete and vacuumcleanup APIs so that it calls these APIs even
> > when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is
> > specified, it collects dead tuple TIDs into maintenance_work_mem space
> > and passes the flag indicating INDEX_CLEANUP is specified or not to
> > index AMs.
>
> Right.
>
> > Index AM decides whether doing bulkdelete/vacuumcleanup. A
> > downside of this idea would be that we will end up using
> > maintenance_work_mem even if all index AMs of the table don't do
> > bulkdelete/vacuumcleanup at all.
>
> That is a downside, but I don't think that it's a serious downside.
> But it may not matter, because there are lots of reasons to move in
> this direction.
>
> > The second idea I came up with is to add an index AM API  (say,
> > amcanskipindexcleanup = true/false) telling index cleanup is skippable
> > or not. Lazy vacuum checks this flag for each index on the table
> > before starting. If index cleanup is skippable in all indexes, it can
> > choose one-pass vacuum, meaning no need to collect dead tuple TIDs in
> > maintenance_work_mem. All in-core index AM will set to true. Perhaps
> > it’s true (skippable) by default for backward compatibility.
>
> (The terminology here is very confusing, because the goal of the
> INDEX_CLEANUP feature in v12 is not really to skip a call to
> btvacuumcleanup(). The goal is really to skip a call to
> btbulkdelete(). I will try to be careful.)
>
> I think that the ideal design here would be a new hybrid of two
> existing features:
>
> 1.) Your INDEX_CLEANUP feature from Postgres 12.
>
> and:
>
> 2.) Your vacuum_cleanup_index_scale_factor feature from Postgres 11.
>
> The INDEX_CLEANUP feature is very useful, because skipping indexes
> entirely can be very helpful for many reasons (e.g. faster manual
> VACUUM in the event of wraparound related emergencies).  But
> INDEX_CLEANUP has 2 problems today:
>
> A. It doesn't interact well with vacuum_cleanup_index_scale_factor.
> This is the problem that has been discussed on this thread.
>
> and:
>
> B. It is an "all or nothing" thing. Unlike the
> vacuum_cleanup_index_scale_factor feature, it does not care about what
> the index AM/individual index wants. But it should.

Agreed.

>
> (**Thinks some more***)
>
> Actually, on second thought, maybe INDEX_CLEANUP only has one problem.
> Problem A is actually just a special case of problem B. There are many
> interesting opportunities created by solving problem B
> comprehensively.
>
> So, what useful enhancements to VACUUM are possible once we have
> something like INDEX_CLEANUP, that is sensitive to the needs of
> indexes? Well, you already identified one yourself, so obviously
> you're thinking about this in a similar way already:
>
> > The in-core AMs including btree indexes will work same as before. This
> > fix is to make it more desirable behavior and possibly to help other
> > AMs that require to call vacuumcleanup in all cases. Once we fix it I
> > wonder if we can disable index cleanup when autovacuum’s
> > anti-wraparound vacuum.
>
> Obviously this is a good idea. The fact that anti-wraparound vacuum
> isn't really special compared to regular autovacuum is *bad*.
> Obviously anti-wraparound is in some sense more important than regular
> vacuum. Making it as similar as possible to vacuum simply isn't
> helpful. Maybe it is slightly more elegant in theory, but in the real
> world it is a poor design. (See also: every single PostgreSQL post
> mortem that has ever been written.)
>
> But why stop with this? There are other big advantages to allowing
> individual indexes/index AMs influence of the INDEX_CLEANUP behavior.
> Especially if they're sensitive to the needs of particular indexes on
> a table (not just all of the indexes on the table taken together).
>
> As you may know, my bottom-up index deletion patch can more or less
> eliminate index bloat in indexes that don't get "logically changed" by
> many non-HOT updates. It's *very* effective with non-HOT updates and
> lots of indexes. See this benchmark result for a recent example:
>
> https://postgr.es/m/CAGnEbohYF_K6b0v=2uc289=v67qnhc3n01ftic8x94zp7kk...@mail.gmail.com
>
> The feature is effective enough to make it almost unnecessary to
> VACUUM certain index

Re: Asynchronous Append on postgres_fdw nodes.

2020-11-20 Thread Kyotaro Horiguchi
At Fri, 20 Nov 2020 20:16:42 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
me> +   /* If this was the second part of an async request, we must fetch until NULL. */
me> +   if (fsstate->async_aware)
me> +   {
me> +       /* call once and raise error if not NULL as expected? */
me> +       while (PQgetResult(conn) != NULL)
me> +           ;
me> +       fsstate->conn_state->async_query_sent = false;
me> +   }
me> 
me> PQgetResult() receives the result of one query at a time. This code means
me> several queries (FETCHes) were queued up, and we discard every result
me> except the last one.  Actually the res is just PQclear'd right after, so
me> this just discards *all* results of possibly more than one FETCH.  I
me> think something's wrong if we need this.

I was wrong, it is worse. That leaks the returned PGresult.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 10:15, Simon Riggs  wrote:
>
> On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada  wrote:
> >
> > On Thu, Nov 19, 2020 at 8:02 PM Simon Riggs  wrote:
> > >
> > > On Wed, 18 Nov 2020 at 17:59, Robert Haas  wrote:
> > > >
> > > > On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs  
> > > > wrote:
> > > > > Patches attached.
> > > > > 1. vacuum_anti_wraparound.v2.patch
> > > > > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1)
> > > >
> > > > I don't like the use of ANTI_WRAPAROUND as a name for this new option.
> > > > Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something
> > > > else, but I dislike anti-wraparound.
> > >
> > > -1 for using the term AGGRESSIVE, which seems likely to offend people.
> > > I'm sure a more descriptive term exists.
> >
> > Since we use the term aggressive scan in the docs, I personally don't
> > feel it's unnatural. But since this option also disables index
> > cleanup when not enabled explicitly, I'm concerned a bit that users might
> > get confused. I came up with some names like FREEZE_FAST and
> > FREEZE_MINIMAL but I'm not sure these are better.
>
> FREEZE_FAST seems good.
>
> > BTW if this option also disables index cleanup for faster freezing,
> > why don't we disable heap truncation as well?
>
> Good idea

Patch attached, using the name "FAST_FREEZE" instead.
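
With the patch applied, the invocation would look like this (the table name
is made up, and of course the option name itself is still open for
discussion):

VACUUM (FAST_FREEZE) mytable;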

-- 
Simon Riggs    http://www.EnterpriseDB.com/


vacuum_fast_freeze.v3.patch
Description: Binary data


Re: a misbehavior of partition row movement (?)

2020-11-20 Thread Amit Langote
On Sat, Oct 3, 2020 at 8:26 PM Amit Langote  wrote:
> On Sat, Oct 3, 2020 at 8:15 PM Tomas Vondra  
> wrote
> > On Sat, Oct 03, 2020 at 11:42:21AM +0900, Amit Langote wrote:
> > >On Fri, Oct 2, 2020 at 11:32 PM David G. Johnston
> > > wrote:
> > >> On Friday, October 2, 2020, Amit Langote 
> > >> wrote:
> > >>>
> > >>>
> > >>> Reporter on that thread says that the last update should have failed
> > >>> and I don't quite see a workable alternative to that.
> > >>
> > >>
> > >> To be clear the OP would rather have it just work, the same as the
> > >> non-row-movement version.  Maybe insert the new row first, execute
> > >> the on update trigger chained from the old row, then delete the old
> > >> row?
> > >
> > >I was thinking yesterday about making it just work, but considering the
> > >changes that would need to be made to how the underlying triggers fire,
> > >it does not seem we would be able to back-port the solution.
> > >
> >
> > I think we need to differentiate between master and backbranches. IMO we
> > should try to make it "just work" in master, and the amount of code
> > should not be an issue there I think (no opinion on whether insert and
> > update trigger is the way to go). For backbranches we may need to do
> > something less intrusive, of course.
>
> Sure, that makes sense.  I will try making a patch for HEAD to make it
> just work unless someone beats me to it.

After working on this for a while, here is my proposal for HEAD.

To reiterate, the problem occurs when an UPDATE of a partitioned PK
table is turned into a DELETE + INSERT.  In that case, the UPDATE RI
triggers are not fired at all, but DELETE ones are, so the foreign key
may fail to get enforced correctly.  For example, even though the new
row after the update is still logically present in the PK table, it
would wrongly get deleted because of the DELETE RI trigger firing if
there's a ON DELETE CASCADE clause on the foreign key.

To fix that, I propose that we teach trigger.c to skip queuing the
events that would be dangerous to fire, such as that for the DELETE on
the source leaf partition mentioned above.  Instead, queue an UPDATE
event on the root target table, matching the actual operation being
performed.  Note though that this new arrangement doesn't affect the
firing of any other triggers except those that are relevant to the
reported problem, viz. the PK-side RI triggers.  All other triggers
fire exactly as they did before.

To make that happen, I had to:

1. Make RI triggers available on partitioned tables at all, which they
are not today.  Also, mark the RI triggers in partitions correctly as
*clones* of the RI triggers in their respective parents.

2. Make it possible to allow AfterTriggerSaveEvent() to access the
query's actual target relation, that is, in addition to the target
relation on which an event was fired.  Also, added a bunch of code to
AFTER TRIGGER infrastructure to handle events fired on partitioned
tables.  Because those events cannot contain physical references to
affected tuples, I generalized the code currently used to handle after
triggers on foreign tables by storing the tuples in and retrieving
them from a tuple store.  I read about a bunch of caveats of that
implementation (such as its uselessness for deferred triggers), but
for the limited cases in which it will be used for partitioned
tables it seems safe, because it won't be used for deferred triggers
on partitioned tables.

Attached patches 0001 and 0002 implement 1 and 2, respectively.
Later, I will post an updated version of the patch for the
back-branches, which, as mentioned upthread, is to prevent the
cross-partition updates on foreign key PK tables.

--
Amit Langote
EDB: http://www.enterprisedb.com


v1-0001-Create-foreign-key-triggers-in-partitioned-tables.patch
Description: Binary data


v1-0002-Enforce-foreign-key-correctly-during-cross-partit.patch
Description: Binary data


RE: Disable WAL logging to speed up data loading

2020-11-20 Thread osumi.takami...@fujitsu.com
Hi

This time, I updated my patch to address comments below only.
Please have a look at updated one.

On Thursday, Nov 19, 2020 4:51 PM Tsunakawa, Takayuki 

> (1)
>  #define RelationNeedsWAL(relation) \
> + (wal_level != WAL_LEVEL_NONE && \
>   ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
>    (XLogIsNeeded() || \
>     (relation->rd_createSubid == InvalidSubTransactionId && \
> -    relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
> +    relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId
>
> You can remove one pair of parentheses for readability as follows:
>
>  #define RelationNeedsWAL(relation) \
> + (wal_level != WAL_LEVEL_NONE && \
>   (relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
>    (XLogIsNeeded() || \
>     (relation->rd_createSubid == InvalidSubTransactionId && \
> -    relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
> +    relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
Done. 

> (2)
>   /*
> +  * Detect if the server previously crashed under wal_level='none' or not.
> +  */
> + if (ControlFile->wal_level == WAL_LEVEL_NONE &&
> +     (ControlFile->state != DB_SHUTDOWNED &&
> +      ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY))
> +     ereport(ERROR,
> +             (errmsg("detected an unexpected server shutdown when WAL logging was disabled"),
> +              errhint("It looks like you need to deploy a new cluster from your full backup again.")));
> +
> + /*
>    * Check that contents look valid.
>    */
>   if (!XRecOffIsValid(ControlFile->checkPoint))
> 
> I think the above safeguard should be placed right before the following code,
> because it's better to first check control file contents and emit the message
> for the previous shutdown state.
> 
> #ifdef XLOG_REPLAY_DELAY
> if (ControlFile->state != DB_SHUTDOWNED)
> pg_usleep(6000L);
> #endif
Thank you. Perfectly makes sense. Fixed.


> (3)
> @@ -449,6 +449,13 @@ XLogInsert(RmgrId rmid, uint8 info)
>   return EndPos;
>   }
> 
> + /* Issues WAL only for XLOG resources */
> + if (wal_level == WAL_LEVEL_NONE && rmid != RM_XLOG_ID)
> + {
> + XLogResetInsertion();
> + return GetLatestCheckPointLSN();
> + }
> 
> It should be correct to use GetXLogInsertRecPtr() instead of
> GetLatestCheckPointLSN(), because what XLogInsert() should return is the
> end position of last WAL record inserted (= start position of next WAL).
Ah, sorry. My patch was not correct.


> Why don't we add transaction completion WAL records?  That is, add
> "rmid != RM_XACT_ID" to the if condition.  What we really want to eliminate
> for performance is the data modification WAL records.  This might allow us
> to insert special checks to prevent PREPARE TRANSACTION and other stuff.
I apologize; you mentioned this point in your past review, but I took it
wrongly. I fixed the condition and the comment accordingly.


> What happens without these checks? 
> (4)
> @@ -404,6 +404,10 @@ pg_logical_emit_message_bytea(PG_FUNCTION_ARGS)
>   bytea  *data = PG_GETARG_BYTEA_PP(2);
>   XLogRecPtr  lsn;
> 
> + if (wal_level < WAL_LEVEL_REPLICA)
> +     ereport(ERROR,
> +             errmsg("propagating a message requires wal_level \"replica\" or \"logical\""));
> +
>   lsn = LogLogicalMessage(prefix, VARDATA_ANY(data), VARSIZE_ANY_EXHDR(data),
>                           transactional);
>   PG_RETURN_LSN(lsn);
> 
> @@ -192,6 +192,10 @@ replorigin_check_prerequisites(bool check_slots, bool recoveryOK)
> 
>   (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
>    errmsg("cannot manipulate replication origins during recovery")));
> 
> + if (wal_level < WAL_LEVEL_REPLICA)
> +     ereport(ERROR,
> +             errmsg("creating replication origins requires wal_level \"replica\" or \"logical\""));
> +
>  }
Those functions don't do anything meaningful when wal_level <= minimal.
For example, pg_logical_emit_message_bytea calls XLogInsert with the
RM_LOGICALMSG_ID rmid, which doesn't generate WAL under wal_level=none,
making the usage of the function pointless.

However, the latest patch I attached removed this part of the modification,
because it applies to wal_level = minimal as well, and I should keep the
patch focused on its purpose only.


> @@ -625,6 +625,9 @@ standard_ProcessUtility(PlannedStmt *pstmt,
>   

Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-20 Thread Magnus Hagander
On Fri, Nov 20, 2020 at 3:31 AM Michael Paquier  wrote:
>
> On Thu, Nov 19, 2020 at 09:49:05PM +0100, Magnus Hagander wrote:
> > Ugh, that's pretty terrible.
>
> I have spent some time testing this part this morning, and I can see
> that genhtml does not complain with your patch.  It looks like in the
> case of 2771fce the tool got confused because the same function was
> getting compiled twice for the backend and the frontend, but here you
> only get one code path compiled depending on the option used.
>
> +#else /* not OPENSSL or WIN32 */
> I think you mean USE_OPENSSL or just OpenSSL here, but not "OPENSSL".

Yeah. Well, I either meant "OpenSSL or Win32" or "USE_OPENSSL or
WIN32", and ended up with some incorrect mix :) Thanks, fixed.


> pg_strong_random.c needs a pgindent run, there are two inconsistent
> diffs.  Looks fine except for those nits.

I saw only one after this, but maybe I ended up auto-fixing it when I
changed that define.

That said, pgindent now run, and patch pushed.

Thanks!


--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Improve handling of parameter differences in physical replication

2020-11-20 Thread Peter Eisentraut

On 2020-11-19 20:17, Sergei Kornilov wrote:

> Seems WAIT_EVENT_RECOVERY_PAUSE addition was lost during patch simplification.

added

> ereport(FATAL,
> (errmsg("recovery aborted because of insufficient parameter settings"),
>  errhint("You can restart the server after making the necessary configuration changes.")));
>
> I think we should repeat here the conflicting param_name and minValue.
> pg_wal_replay_resume can be called days after recovery has been paused. The
> initial message can be difficult to find.

done

> errmsg("recovery will be paused")
>
> Maybe use the same "recovery has paused" as in recoveryPausesHere? It doesn't
> seem to make any difference since we set pause right after that, but there will be a
> little less work for translators.

done

> Not sure about "If recovery is unpaused". The word "resumed" seems to have
> usually been used in the docs.

I think I like "unpaused" better here, because "resumed" would seem to
imply that recovery can actually continue.

One thing that has not been added to my patch is the equivalent of
496ee647ecd2917369ffcf1eaa0b2cdca07c8730, which allows promotion while
recovery is paused.  I'm not sure that would be necessary, and it
doesn't look easy to add either.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
From 99724e2ee14b5f3ec926c7afdc056863a7e2294f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 20 Nov 2020 13:39:09 +0100
Subject: [PATCH v5] Pause recovery for insufficient parameter settings

When certain parameters are changed on a physical replication primary,
this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
record.  The standby then checks whether its own settings are at least
as big as the ones on the primary.  If not, the standby shuts down
with a fatal error.

This patch changes this behavior to pause recovery at that point
instead.  That allows read traffic on the standby to continue while
database administrators figure out next steps.  When recovery is
unpaused, the server shuts down (as before).  The idea is to fix the
parameters while recovery is paused and then restart when there is a
maintenance window.

Discussion: 
https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36...@2ndquadrant.com
---
 doc/src/sgml/high-availability.sgml | 48 +
 src/backend/access/transam/xlog.c   | 38 ---
 2 files changed, 69 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/high-availability.sgml 
b/doc/src/sgml/high-availability.sgml
index 19d7bd2b28..e9a30dd88b 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2120,18 +2120,14 @@ Administrator's Overview

 

-The setting of some parameters on the standby will need reconfiguration
-if they have been changed on the primary. For these parameters,
-the value on the standby must
-be equal to or greater than the value on the primary.
-Therefore, if you want to increase these values, you should do so on all
-standby servers first, before applying the changes to the primary server.
-Conversely, if you want to decrease these values, you should do so on the
-primary server first, before applying the changes to all standby servers.
-If these parameters
-are not set high enough then the standby will refuse to start.
-Higher values can then be supplied and the server
-restarted to begin recovery again.  These parameters are:
+The settings of some parameters determine the size of shared memory for
+tracking transaction IDs, locks, and prepared transactions.  These shared
+memory structures must be no smaller on a standby than on the primary in
+order to ensure that the standby does not run out of shared memory during
+recovery.  For example, if the primary had used a prepared transaction but
+the standby had not allocated any shared memory for tracking prepared
+transactions, then recovery could not continue until the standby's
+configuration is changed.  The parameters affected are:
 
   

@@ -2160,6 +2156,34 @@ Administrator's Overview
 

   
+
+The easiest way to ensure this does not become a problem is to have these
+parameters set on the standbys to values equal to or greater than on the
+primary.  Therefore, if you want to increase these values, you should do
+so on all standby servers first, before applying the changes to the
+primary server.  Conversely, if you want to decrease these values, you
+should do so on the primary server first, before applying the changes to
+all standby servers.  Keep in mind that when a standby is promoted, it
+becomes the new reference for the required parameter settings for the
+standbys that follow it.  Therefore, to avoid this becoming a problem
+during a switchover or failover, it

Re: don't allocate HashAgg hash tables when running explain only

2020-11-20 Thread Heikki Linnakangas

On 20/11/2020 08:31, Michael Paquier wrote:

On Thu, Nov 19, 2020 at 08:47:51AM +0200, Heikki Linnakangas wrote:

Yeah, I believe it's always been like that. Yeah, arguably it should be
backpatched. I felt conservative and didn't backpatch, but feel free to do
it if you think it should be.


+1 for a backpatch.  The difference in runtime for EXPLAIN in this case
is quite something.


That's two votes for backpatching. Ok, I'll go do it.

- Heikki




Re: parsing pg_ident.conf

2020-11-20 Thread Andrew Dunstan


On 11/20/20 2:19 AM, Fabien COELHO wrote:
>
> Hello Andrew,
>
>> I noticed somewhat to my surprise as I was prepping the tests for the
>> "match the whole DN" patch that pg_ident.conf is parsed using the same
>> routines used for pg_hba.conf, and as a result the DN almost always
>> needs to be quoted, because they almost all contain a comma e.g.
>> "O=PGDG,OU=Testing". Even if we didn't break on commas we would probably
>> need to quote most of them, because it's very common to include spaces
>> e.g. "O=Acme Corp,OU=Marketing". Nevertheless it seems rather odd to
>> break on commas, since nothing in the file is meant to be a list - this
>> is a file with exactly three single-valued fields. Not sure if it's
>> worth doing anything about this, or we should just live with it the way
>> it is.
>
> My 0.02 €:
>
> ISTM that having to quote long strings which may contains space or
> other separators is a good thing from a readability point of view,
> even if it would be possible to parse it without the quotes.
>
> So I'm in favor of keeping it that way.
>
> Note that since 8f8154a503, continuations are allowed in "pg_hba.conf"
> and "pg_ident.conf", and that you may continue a line within a quoted
> string, which may be of interest when considering LDAP links.


Maybe we should add a comment at the top of the file about when quoting
is needed.
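
Something along these lines, perhaps (a hypothetical entry; the map name and
user names are made up):

# Quote the system user name whenever it contains spaces or commas,
# e.g. a full certificate DN.
# MAPNAME   SYSTEM-USERNAME                       PG-USERNAME
dn_map      "CN=alice,O=Acme Corp,OU=Marketing"   alice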


cheers


andrew





Re: Different results between PostgreSQL and Oracle for "for update" statement

2020-11-20 Thread Andreas Karlsson

On 11/20/20 9:57 AM, Andy Fan wrote:

Thank you for your attention. Your suggestion would fix the issue.  However,
the difference will cause some risks when users move their applications
from Oracle to PostgreSQL, so I'd like to think about which behavior is
more reasonable.


I think PostgreSQL's behavior is more reasonable since it only locks the
rows it claims to lock and no extra rows. This makes the code easy to
reason about. And PostgreSQL does not re-evaluate subqueries after
grabbing the lock, which, while it might be surprising to some people, is
also a quite nice, consistent behavior in practice as long as you are
aware of it.


I do not see why these two scenarios should behave differently (which I 
think they would with your proposed patch):


== Scenario 1

create table su (a int, b int);
insert into su values(1, 1);

- session 1:
begin;
update su set b = 2 where b = 1;

- sess 2:
select * from su where a in (select a from su where b = 1) for update;

- sess 1:
commit;

== Scenario 2

create table su (a int, b int);
insert into su values(1, 1);

create table su2 (a int, b int);
insert into su2 values(1, 1);

- session 1:
begin;
update su set b = 2 where b = 1;
update su2 set b = 2 where b = 1;

- sess 2:
select * from su where a in (select a from su2 where b = 1) for update;

- sess 1:
commit;

Andreas





Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Alvaro Herrera
On 2020-Nov-20, Masahiko Sawada wrote:

> I'm concerned that always freezing all tuples when we're going to make
> the page dirty would affect the existing vacuum workload much. The
> additional cost of freezing multiple tuples would be low but if we
> freeze tuples we would also need to write WAL, which is not negligible
> overhead I guess. In the worst case, if a table has dead tuples on all
> pages we process them, but with this patch, in addition to that, we
> will end up not only freezing all live tuples but also writing
> XLOG_HEAP2_FREEZE_PAGE WAL for all pages. So I think it would be
> better either to freeze all tuples if we find a tuple that needs to be
> frozen or to make this behavior work only if the new VACUUM option is
> specified.

There are two costs associated with this processing.  One is dirtying
the page (which means it needs to be written out when evicted), and the
other is writing WAL records for each change.  The cost for the latter
is going to be the same in both cases (with this change and without)
because the same WAL will have to be written -- the only difference is
*when* do you pay it.  The cost of the former is quite different; with
Simon's patch we dirty the page once, and without the patch we may dirty
it several times before it becomes "stable" and no more writes are done
to it.

(If you have tables whose pages change all the time, there would be no
difference with or without the patch.)

Dirtying the page less times means less full-page images to WAL, too,
which can be significant.




Re: Refactor pg_rewind code and make it work against a standby

2020-11-20 Thread Heikki Linnakangas

On 20/11/2020 02:38, Andres Freund wrote:

I locally, on a heavily modified branch (AIO support), started to get
consistent failures in this test. I *suspect*, but am not sure, that
it's the test's fault, not the fault of modifications.

As far as I can tell, after the pg_rewind call, there's no guarantee
that node_c has fully caught up to the 'in A, after C was promoted'
insertion on node a. Thus at the check_query() I sometimes get just 'in
A, before promotion' back.

After adding a wait that problem seems to be fixed. Here's what I did

diff --git i/src/bin/pg_rewind/t/007_standby_source.pl 
w/src/bin/pg_rewind/t/007_standby_source.pl
index f6abcc2d987..48898bef2f5 100644
--- i/src/bin/pg_rewind/t/007_standby_source.pl
+++ w/src/bin/pg_rewind/t/007_standby_source.pl
@@ -88,6 +88,7 @@ $node_c->safe_psql('postgres', "checkpoint");
  # - you need to rewind.
  $node_a->safe_psql('postgres',
  "INSERT INTO tbl1 VALUES ('in A, after C was promoted')");
+$lsn = $node_a->lsn('insert');
  
  # Also insert a new row in the standby, which won't be present in the

  # old primary.
@@ -142,6 +143,8 @@ $node_primary = $node_c;
  # Run some checks to verify that C has been successfully rewound,
  # and connected back to follow B.
  
+$node_b->wait_for_catchup('node_c', 'replay', $lsn);

+
  check_query(
  'SELECT * FROM tbl1',
  qq(in A


Yes, I was able to reproduce that by inserting a strategic sleep in the
test and pausing replication by attaching gdb to the walsender process.


Pushed a fix similar to your patch, but I put the wait_for_catchup() 
before running pg_rewind. The point of inserting the 'in A, after C was 
promoted' row is that it's present in B when pg_rewind runs.


Thanks!

- Heikki




Re: Different results between PostgreSQL and Oracle for "for update" statement

2020-11-20 Thread Andy Fan
Hi Andreas:

Thanks for your input.

On Fri, Nov 20, 2020 at 9:37 PM Andreas Karlsson  wrote:

> On 11/20/20 9:57 AM, Andy Fan wrote:
> > Thank you for your attention. Your suggestion would fix the issue.  However,
> > the difference will cause some risks when users move their applications
> > from Oracle to PostgreSQL, so I'd like to think about which behavior is
> > more reasonable.
>
> I think PostgreSQL's behavior is more reasonable since it only locks the
> rows it claims to lock and no extra rows. This makes the code easy to
> reason about. And PostgreSQL does not re-evaluate subqueries after
> grabbing the lock, which, while it might be surprising to some people, is
> also a quite nice, consistent behavior in practice as long as you are
> aware of it.
>

I admit my way is bad after reading your question below, but I
would not think that *it might be surprising to some people* is a good
signal for a design.  Do you think re-evaluating the quals after grabbing
the lock would be a good idea? And do you know whether any other database
uses PostgreSQL's way or Oracle's way?  Just some minutes before reading
your reply I heard that Oracle might do the re-check, and I also found in
my test that Oracle doesn't lock the extra rows.



> I do not see why these two scenarios should behave differently (which I
> think they would with your proposed patch):
>
>
Good question!  I think my approach doesn't make sense now!



-- 
Best Regards
Andy Fan


Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-11-20 Thread Peter Eisentraut

On 2020-11-03 21:53, David Rowley wrote:

On Tue, 3 Nov 2020 at 20:08, Peter Eisentraut
 wrote:


On 2020-09-29 11:26, David Rowley wrote:

I've marked this patch back as waiting for review. It would be good if
someone could run some tests on some intel hardware and see if they
can see any speedup.


What is the way forward here?  What exactly would you like to have tested?


It would be good to see some small scale bench -S tests with and
without -M prepared.

Also, small scale TPC-H tests would be good. I really only did
testing on new AMD hardware, so some testing on intel hardware would
be good.


I did tests of elog_ereport_attribute_cold_v4.patch on an oldish Mac 
Intel laptop with pgbench scale 1 (default), and then:


pgbench -S -T 60

master:  tps = 8251.883229 (excluding connections establishing)
patched: tps = 9556.836232 (excluding connections establishing)

pgbench -S -T 60 -M prepared

master:  tps = 14713.821837 (excluding connections establishing)
patched: tps = 16200.066185 (excluding connections establishing)

So from that this seems like an easy win.

I also tested on a newish Mac ARM laptop, and there the patch did not do 
anything, but that was because clang does not support the cold 
attribute, so that part works as well. ;-)
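
For anyone following along, the technique the patch relies on is the
compiler's "cold" function attribute; a self-contained illustration of the
idea outside PostgreSQL (GCC-style attribute syntax, illustration only)
looks like this:

#include <stdio.h>
#include <stdlib.h>

/* Calls to a cold function are treated as unlikely, so the compiler lays
 * the error branch out away from the hot code path. */
static void report_failure(const char *msg) __attribute__((cold, noreturn));

static void
report_failure(const char *msg)
{
	fprintf(stderr, "%s\n", msg);
	abort();
}

static int
checked_div(int a, int b)
{
	if (b == 0)
		report_failure("division by zero");	/* kept off the hot path */
	return a / b;
}

int
main(void)
{
	printf("%d\n", checked_div(10, 2));
	return 0;
}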


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: [bug+patch] Inserting DEFAULT into generated columns from VALUES RTE

2020-11-20 Thread Dean Rasheed
On Sun, 6 Sept 2020 at 22:42, Tom Lane  wrote:
>
> I think you'd be better off to make transformInsertStmt(), specifically
> its multi-VALUES-rows code path, check for all-DEFAULT columns and adjust
> the tlist itself.  Doing it there might be a good bit less inefficient
> for very long VALUES lists, too, which is a case that we do worry about.
> Since that's already iterating through the sub-lists, you could track
> whether all rows seen so far contain SetToDefault in each column position,
> and avoid extra scans of the sublists.  (A BitmapSet might be a convenient
> representation of that, though you could also use a bool array I suppose.)
>
> I do not care for what you did in rewriteValuesRTE() either: just removing
> a sanity check isn't OK, unless you've done something to make the sanity
> check unnecessary which you surely have not.  Perhaps you could extend
> the initial scan of the tlist (lines 1294-1310) to notice SetToDefault
> nodes as well as Var nodes and keep track of which columns have those.
> Then you could cross-check that one or the other case applies whenever
> you see a SetToDefault in the VALUES lists.

That's not quite right because by the time rewriteValuesRTE() sees the
tlist, it will contain already-rewritten generated column expressions,
not SetToDefault nodes. If we're going to keep that sanity check (and
I think that we should), I think that the way to do it is to have
rewriteTargetListIU() record which columns it has expanded defaults
for, and pass that information to rewriteValuesRTE(). Those columns of
the VALUES RTE are no longer used in the query, so the sanity check
can be amended to ignore them while continuing to check the other
columns.

Incidentally, there is another way of causing that sanity check to
fail -- an INSERT ... OVERRIDING USER VALUE query with some DEFAULTS
in the VALUES RTE (but not necessarily all DEFAULTs) will trigger it.
So even if the parser completely removed any all-DEFAULT columns from
the VALUES RTE, some work in the rewriter would still be necessary.


> BTW, another thing that needs checking is whether a rule containing
> an INSERT like this will reverse-list sanely.  The whole idea of
> replacing some of the Vars might not work so far as ruleutils.c is
> concerned.  In that case I think we might have to implement this
> by having transformInsertStmt restructure the VALUES lists to
> physically remove the all-DEFAULT column, and adjust the target column
> list accordingly --- that is, make a parse-time transformation of
> INSERT INTO gtest0 VALUES (1, DEFAULT), (2, DEFAULT);
> into
> INSERT INTO gtest0(a) VALUES (1), (2);
> That'd have the advantage that you'd not have to hack up the
> rewriter at all.

I think it's actually easier to just do it all in the rewriter -- at
the point where we see that we're about to insert potentially illegal
values from a VALUES RTE into a generated column, scan it to see if
all the values in that column are DEFAULTs, and if so trigger the
existing code to replace the Var in the tlist with the generated
column expression. That way we only do extra work in the case for
which we're currently throwing an error, rather than for every query.
Also, I think that scanning the VALUES lists in this way is likely to
be cheaper than rebuilding them to remove a column.
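
For readers who haven't followed the earlier messages, a small SQL sketch of
the kind of statement under discussion (the table is invented for
illustration; it is the multi-row VALUES form that currently throws the
error):

CREATE TABLE gtest (
    a int,
    b int GENERATED ALWAYS AS (a * 2) STORED
);

-- Every row supplies DEFAULT for the generated column, so this can be
-- treated the same as the single-row case ...
INSERT INTO gtest VALUES (1, DEFAULT), (2, DEFAULT);

-- ... i.e. effectively the same as:
INSERT INTO gtest (a) VALUES (1), (2);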

Attached is a patch doing it that way, along with additional
regression tests that trigger both the original error and the
sanity-check error triggered by INSERT ... OVERRIDING USER VALUES. I
also added a few additional comments where I found the existing code a
little non-obvious.

I haven't touched the existing error messages. I think that's a
subject for a separate patch.

Regards,
Dean
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
new file mode 100644
index 41dd670..966cab5
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -69,13 +69,18 @@ static List *rewriteTargetListIU(List *t
  CmdType commandType,
  OverridingKind override,
  Relation target_relation,
- int result_rti);
+ int result_rti,
+ RangeTblEntry *values_rte,
+ int values_rte_index,
+ Bitmapset **unused_values_attrnos);
 static TargetEntry *process_matched_tle(TargetEntry *src_tle,
 		TargetEntry *prior_tle,
 		const char *attrName);
 static Node *get_assignment_input(Node *node);
+static Bitmapset *findDefaultOnlyColumns(RangeTblEntry *rte);
 static bool rewriteValuesRTE(Query *parsetree, RangeTblEntry *rte, int rti,
-			 Relation target_relation, bool force_nulls);
+			 Relation target_relation, bool force_nulls,
+			 Bitmapset *unused_cols);
 static void markQueryForLocking(Query *qry, Node *jtnode,
 LockClauseStrength strength, LockWaitPolicy waitPolicy,
 bool pushedDown);
@@ -708,13 +713,25 @@ adjustJoinTreeList(Query *parsetree, boo
  * is incorrect by this light, since child relations

Re: Add session statistics to pg_stat_database

2020-11-20 Thread Laurenz Albe
On Tue, 2020-11-17 at 17:33 +0100, Magnus Hagander wrote:
> I've taken a look as well, and here are a few short notes:

Much appreciated!

> * It talks about "number of connections" but "number of aborted sessions". We 
> should probably
>   be consistent about talking either about connections or sessions? In 
> particular, connections
>   seems wrong in this case, because it only starts counting after 
> authentication is complete
>   (since otherwise we send no stats)? (This goes for both docs and actual 
> function names)

Yes, that is true.  I have changed "connections" to "sessions" and renamed the 
new
column "connections" to "session_count".

I think that most people will understand a session as started after a successful
connection.
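
As a rough sketch of how these counters might be consumed (the column names
below follow the naming discussed in this thread and may well differ in the
final patch):

-- Compare how many sessions were opened versus how much time they spent
-- doing useful work; many short sessions with little active time hints
-- that a connection pooler would help.
SELECT datname,
       session_count,
       active_time,
       idle_in_transaction_time
FROM pg_stat_database
WHERE datname = current_database();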

> * Is there a reason we're counting active and idle in transaction (including 
> aborted),
>   but not fastpath? In particular, we seem to ignore fastpath -- if we don't 
> want to single
>   it out specifically, it should probably be included in active?

The only reason is that I didn't think of it.  Fixed.

> * pgstat_send_connstat() but pgstat_recv_connection(). Let's call both 
> connstat or both
>   connection (I'd vote connstat)?

Agreed, done.

> * Is this actually a fix that's independent of the new stats? It seems in 
> general to be
>   changing the behaviour of "force", which is more generic?
> -   !have_function_stats)
> +   !have_function_stats && !force)

The comment right above that reads:
/* Don't expend a clock check if nothing to do */
So it is just a quick exit if there is nothing to do.

But with that patch we have something to do if "force" (see below) is true:
Report the remaining session duration and if the session was closed normally.

Thus the additional check.

> * in pgstat_send_connstat() you pass the parameter "force" in as "disconnect".
>   That behaviour at least requires a comment saying why, I think. My 
> understanding is
>   it relies on that "force" means this is a "backend is shutting down", but 
> that is not
>   actually documented anywhere. Maybe the "force" parameter should actually 
> be renamed
>   to indicate this is really what it means, to avoid a future mistake in the 
> area?
>   But even with that, how does that turn into disconnect?

"pgstat_report_stat(true)" is only called from "pgstat_beshutdown_hook()", so
it is currently only called when the backend is about to exit.

According to the comments, the flag means that "caller wants to force stats 
out".
I guess that the author thought that there may arise other reasons to force 
sending
statistics in the future (commit 641912b4d from 2007).

However, since that has not happened, I have renamed the flag to "disconnect" 
and
adapted the documentation.  This doesn't change the current behavior, but 
establishes
a new rule.

> * Maybe rename pgStatSessionDisconnected to pgStatSessionNormalDisconnected?
>   To avoid having to go back to the setting point and look it up in a comment.

Long, descriptive names are a good thing.
I have decided to use "pgStatSessionDisconnectedNormally", since that is even 
longer
and seems to fit the "yes or no" category better.
 
> I wonder if there would also be a way to count "sessions that crashed" as 
> well.
>  That is,the ones that failed in a way that caused the postmaster to restart 
> the system.
>  But that's information we'd have to send from the postmaster, but I'm 
> actually unsure
>  if we're "allowed" to send things to the stats collector from the postmaster.
>  But I think it could be quite useful information to have. Maybe we can find 
> some way
>  to piggyback on the fact that we're restarting the stats collector as a 
> result?

Sure, a crash count would be useful.  I don't know if it is easy for the stats 
collector
to tell the difference between a start after a backend crash and - say - 
starting from
a base backup.

Patch v6 attached.

I think that that would be material for another patch, and I don't think it 
should go
to "pg_stat_database", because a) it might be hard to tell to which database 
the crashed
backend was attached, b) it might be a background process that doesn't belong 
to a database
and c) if the crash were caused by - say - corruption in a shared catalog, it 
would be
misleading.

Yours,
Laurenz Albe
From 8feed416f91a5de9011616c1545156b9c8f28943 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Fri, 20 Nov 2020 15:11:57 +0100
Subject: [PATCH] Add session statistics to pg_stat_database

If "track_counts" is active, track the following per database:
- total number of connections
- number of sessions that ended other than with a client disconnect
- total time spent in database sessions
- total time spent executing queries
- total idle in transaction time

This is useful to check if connection pooling is working.
It also helps to estimate the size of the connection pool
required to keep the database busy, which depends on the
percentage of the transaction time that is spent idling.

Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 14:07, Alvaro Herrera  wrote:
>
> On 2020-Nov-20, Masahiko Sawada wrote:
>
> > I'm concerned that always freezing all tuples when we're going to make
> > the page dirty would affect the existing vacuum workload much. The
> > additional cost of freezing multiple tuples would be low but if we
> > freeze tuples we would also need to write WAL, which is not negligible
> > overhead I guess. In the worst case, if a table has dead tuples on all
> > pages we process them, but with this patch, in addition to that, we
> > will end up not only freezing all live tuples but also writing
> > XLOG_HEAP2_FREEZE_PAGE WAL for all pages. So I think it would be
> > better either to freeze all tuples if we find a tuple that needs to be
> > frozen or to make this behavior work only if the new VACUUM option is
> > specified.
>
> There are two costs associated with this processing.  One is dirtying
> the page (which means it needs to be written down when evicted), and the
> other is to write WAL records for each change.  The cost for the latter
> is going to be the same in both cases (with this change and without)
> because the same WAL will have to be written -- the only difference is
> *when* do you pay it.  The cost of the former is quite different; with
> Simon's patch we dirty the page once, and without the patch we may dirty
> it several times before it becomes "stable" and no more writes are done
> to it.
>
> (If you have tables whose pages change all the time, there would be no
> difference with or without the patch.)
>
> Dirtying the page less times means less full-page images to WAL, too,
> which can be significant.

Summary of patch effects:

one_freeze_then_max_freeze.v5.patch
doesn't increase/decrease the number of pages that are dirtied by
freezing, but when a page will be dirtied **by any activity** it
maximises the number of rows frozen, so in some cases it may write a
WAL freeze record when it would not have done previously.

one_freeze_then_max_freeze.v6.patch
doesn't increase/decrease the number of pages that are dirtied by
freezing, but when a page will be dirtied **by freezing only** it
maximises the number of rows frozen, so the number of WAL records for
freezing is the same, but they may contain more tuples than before and
will increase the probability that the page is marked all_frozen.

So yes, as you say, the net effect will be to reduce the number of
write I/Os in subsequent vacuums required to move forward
relfrozenxid.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Online verification of checksums

2020-11-20 Thread David Steele

Hi Michael,

On 11/20/20 2:28 AM, Michael Paquier wrote:

On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:

I was referring to the latest patch on the thread. But as I said, I have
not read up on all the different issues raised in the thread, so take it
with a big grain of salt.

And I would also echo the previous comment that this code was adapted from
what the pgbackrest folks do. As such, it would be good to get a comment
from for example David on that -- I don't see any of them having commented
after that was mentioned?


Agreed.  I am adding Stephen as well in CC.  From the code of
backrest, the same logic happens in src/command/backup/pageChecksum.c
(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
happen before verifying the checksum.  So, if the page header finishes
with random junk because of some kind of corruption, even corrupted
pages would be incorrectly considered as correct if the random data
passes the pd_upper and pg_lsn checks :/


Indeed, this is not good, as Andres pointed out some time ago. My 
apologies for not getting to this sooner.


Our current plan for pgBackRest:

1) Remove the LSN check as you have done in your patch and when 
rechecking see if the page has become valid *or* the LSN is ascending.
2) Check the LSN against the max LSN reported by PostgreSQL to make sure 
it is valid.


These do completely rule out any type of corruption, but they certainly 
narrows the possibility by a lot.


In the future we would also like to scan the WAL to verify that the page 
is definitely being written to.


As for your patch, it mostly looks good but my objection is that a page 
may be reported as invalid after 5 retries when in fact it may just be 
very hot.


Maybe checking for an ascending LSN is a good idea there as well? At 
least in that case we could issue a different warning, instead of 
"checksum verification failed" perhaps "checksum verification skipped 
due to concurrent modifications".
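
To make that concrete, here is a rough, hedged sketch of the retry logic being
discussed (this is neither pgBackRest's nor the patch's actual code;
read_block(), verify_checksum() and page_lsn() are assumed helpers with
invented signatures):

#include <stdint.h>

extern void read_block(int fd, unsigned blkno, char *buf);
extern int	verify_checksum(const char *buf, unsigned blkno);
extern uint64_t page_lsn(const char *buf);

enum page_result { PAGE_OK, PAGE_SKIPPED_HOT, PAGE_CHECKSUM_FAILED };

#define BLCKSZ 8192
#define MAX_RETRIES 5

static enum page_result
verify_page_with_retries(int fd, unsigned blkno)
{
	uint64_t	first_lsn = 0;
	char		buf[BLCKSZ];

	for (int attempt = 0; attempt <= MAX_RETRIES; attempt++)
	{
		read_block(fd, blkno, buf);

		if (verify_checksum(buf, blkno))
			return PAGE_OK;

		if (attempt == 0)
			first_lsn = page_lsn(buf);
		else if (page_lsn(buf) > first_lsn)
			/* LSN keeps ascending: the page is hot, not corrupted. */
			return PAGE_SKIPPED_HOT;
	}

	return PAGE_CHECKSUM_FAILED;
}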


Regards,
--
-David
da...@pgmasters.net




Re: parsing pg_ident.conf

2020-11-20 Thread Fabien COELHO




I noticed somewhat to my surprise as I was prepping the tests for the
"match the whole DN" patch that pg_ident.conf is parsed using the same
routines used for pg_hba.conf, and as a result the DN almost always
needs to be quoted, because they almost all contain a comma e.g.
"O=PGDG,OU=Testing". Even if we didn't break on commas we would probably
need to quote most of them, because it's very common to include spaces


Maybe we should add a comment at the top of the file about when quoting
is needed.


Yes, that is a good place to point that out. Possibly it would also be 
worth adding something in 20.2, including an example?
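
For example, something along these lines could be shown (map name and DN are
invented for illustration); because the DN contains commas and spaces, it has
to be quoted:

# MAPNAME   SYSTEM-USERNAME                              PG-USERNAME
sslmap      "CN=test.example.com,O=PGDG,OU=Testing"      testuser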


--
Fabien.




Re: abstract Unix-domain sockets

2020-11-20 Thread Peter Eisentraut

On 2020-11-18 04:35, David G. Johnston wrote:
Given that "port" is a postgresql.conf setting its use here (and 
elsewhere) should be taken to mean the value of that specific variable.  
To that end, I find the current description of port to be lacking - it 
should mention its usage as a qualifier when dealing with unix socket 
files (in addition to the existing wording under unix_socket_directories).


If we are going to debate semantics here, "bind unix address" doesn't 
seem correct.  Something like: "could not create Unix socket file 
/tmp/.s.PGSQL.5432, it already exists." would read better.


The hint would be better written: Is another postmaster running with 
unix_socket_directories = /tmp and port = 5432?  If not, remove the unix 
socket file /tmp/.s.PGSQL.5432 and retry.


I don't see much benefit in trying to share logic/wording between the 
various messages and hints for the different ways the server can 
establish communication points.


I agree that there isn't a useful hint for the abstract case as it 
shouldn't happen unless there is indeed another running instance with 
the same configuration.  Though a hint similar to the above, but without 
the "remove and retry" bit, probably wouldn't hurt.


I think we are getting a bit sidetracked here with the message wording. 
The reason I looked at this was that "remove socket file and retry" is 
never an appropriate action with abstract sockets.  And on further 
analysis, it is never an appropriate action with any Unix-domain socket 
(because with file system namespace sockets, you never get an 
EADDRINUSE, so it's dead code).  So my proposal here is to just delete 
that line from the hint and leave the rest the same.  There could be 
value in further refining and rephrasing this, but it ought to be a 
separate thread.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
From a4967dd79e490390bc148d3bae92452ba4a34c9c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 22 Oct 2020 08:47:52 +0200
Subject: [PATCH v3 1/2] Add support for abstract Unix-domain sockets

This is a variant of the normal Unix-domain sockets that don't use the
file system but a separate "abstract" namespace.  At the user
interface, such sockets are represented by names starting with "@".
Supported on Linux and Windows right now.

Discussion: 
https://www.postgresql.org/message-id/flat/6dee8574-b0ad-fc49-9c8c-2edc796f0...@2ndquadrant.com
---
 doc/src/sgml/config.sgml  | 24 +++-
 doc/src/sgml/libpq.sgml   |  5 -
 src/backend/libpq/pqcomm.c|  8 
 src/bin/psql/command.c| 15 +--
 src/bin/psql/prompt.c |  3 ++-
 src/common/ip.c   | 24 +++-
 src/include/libpq/pqcomm.h|  9 +
 src/interfaces/libpq/fe-connect.c |  4 ++--
 8 files changed, 76 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a632cf98ba..d05045fb20 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -749,6 +749,21 @@ Connection Settings
 An empty value
 specifies not listening on any Unix-domain sockets, in which case
 only TCP/IP sockets can be used to connect to the server.
+   
+
+   
+A value that starts with @ specifies that a
+Unix-domain socket in the abstract namespace should be created
+(currently supported on Linux and Windows).  In that case, this value
+does not specify a directory but a prefix from which
+the actual socket name is computed in the same manner as for the
+file-system namespace.  While the abstract socket name prefix can be
+chosen freely, since it is not a file-system location, the convention
+is to nonetheless use file-system-like values such as
+@/tmp.
+   
+
+   
 The default value is normally
 /tmp, but that can be changed at build time.
 On Windows, the default is empty, which means no Unix-domain socket is
@@ -763,6 +778,7 @@ Connection Settings
 named .s.PGSQL..lock 
will be
 created in each of the unix_socket_directories 
directories.
 Neither file should ever be removed manually.
+For sockets in the abstract namespace, no lock file is created.

   
  
@@ -787,7 +803,8 @@ Connection Settings
 

 This parameter is not supported on Windows.  Any setting will be
-ignored.
+ignored.  Also, sockets in the abstract namespace have no file owner,
+so this setting is also ignored in that case.

   
  
@@ -834,6 +851,11 @@ Connection Settings
 similar effect by pointing unix_socket_directories 
to a
 directory having search permission limited to the desired audience.

+
+   
+Sockets in the abstract namespace have no file permissions, so this
+setting is also ignored in that case.
+   
   
  

Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-11-20 Thread Peter Eisentraut

On 2020-09-03 12:14, Michael Paquier wrote:

On Fri, Jun 26, 2020 at 01:11:55PM +0900, Michael Paquier wrote:

 From what I can see on this thread, we could just remove currtid() per
the arguments of the RETURNING ctid clause supported since PG 8.2, but
it would make more sense to me to just remove both currtid/currtid2()
at once.


The CF bot is complaining, so here is a rebase for the main patch.
Opinions are welcome about the arguments of upthread.


It appears that currtid2() is still used, so we ought to keep it.




Re: Issue with server side statement-level rollback

2020-11-20 Thread Gilles Darold

Hi,

Le 19/11/2020 à 21:43, Andres Freund a écrit :

Hi,

On 2020-11-12 11:40:22 +0100, Gilles Darold wrote:

The problem we are encountering is when PostgreSQL is compiled in debug
mode with --enable-cassert. At line 1327 of src/backend/tcop/pquery.c
the following assert fail:

     /*
      * Clear subsidiary contexts to recover temporary memory.
      */
     Assert(portal->portalContext == CurrentMemoryContext);

     MemoryContextDeleteChildren(portal->portalContext);

This extension, although it is a risky implementation, works extremely
well when used in a fully controlled environment. It avoids the latency
of the extra round trips for the RELEASE+SAVEPOINT commands usually handled on the
client side. The client is only responsible for issuing the "ROLLBACK TO
autosavepoint"
when needed.  The extension allows a large performance gain for this feature,
which helps customers using Oracle or DB2 migrate to PostgreSQL.


Actually with the extension the memory context is not CurrentMemoryContext
as expected by the assert.

What is it instead? I don't think you really can safely be in a
different context at this point. There's risks of CurrentMemoryContext
pointing to a deleted context, and risks of memory leaks, depending on
the situation.


This is a PortalContext. Yes this implementation has some risks but 
until now I have not met any problem because its use and the environment 
are fully controlled.






So my question is should we allow such use through an extension and in
this case what is the change to PostgreSQL code that could avoid the
assert crash? Or perhaps we have missed something in this extension to
be able to make the assert happy but I don't think so.

Without more detail of what you actually are precisely doing, and what
the hooks / integration you'd like would look like, it's hard to comment
usefully here.



We have implemented an extension to allow server-side "statement-level 
rollback" with what is possible to do now with PG, but the objective was 
to do the same thing as the core patch submitted by 
Takayuki Tsunakawa [1].  That patch will not be included in core, and 
what I'm trying to do now is to have some facilities that allow this 
feature to be provided through an extension that does not suffer from the 
same limitations as pg_statement_rollback.



Looking at this patch, for example, if we had hooks on 
finish_xact_command() and 
AbortCurrentTransaction(), I think we could probably implement 
the feature through an extension in a "safer" way. A hook on 
start_xact_command() seems useless, as it looks like it is executed before the 
ProcessUtility and Executor* hooks. See the attached patch for an example of 
what could be useful for this kind of extension. Unfortunately my 
knowledge doesn't allow me to see further, especially whether there are 
drawbacks. I hope this is clearer; I will work later on a POC to 
demonstrate the use case I want to implement.



[1] 
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F6A9286%40G01JPEXMBYT05
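
To illustrate the idea, a minimal sketch of how an extension could latch onto
the hook added by the attached patch (the void(void) signature and the
behaviour inside the callback are assumptions, not the actual API):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/*
 * XactCommandFinish_hook_type / finish_xact_command_hook come from the
 * attached patch; the header that would declare them is omitted here.
 */
static XactCommandFinish_hook_type prev_finish_xact_command_hook = NULL;

static void
autosavepoint_finish_xact_command(void)
{
	/*
	 * Here the extension would release (or re-establish) its automatic
	 * savepoint now that the statement has completed, instead of doing
	 * that work from the Executor* hooks.
	 */
	if (prev_finish_xact_command_hook)
		prev_finish_xact_command_hook();
}

void
_PG_init(void)
{
	prev_finish_xact_command_hook = finish_xact_command_hook;
	finish_xact_command_hook = autosavepoint_finish_xact_command;
}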



--
Gilles Darold
http://www.darold.net/

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 03c553e7ea..81e1df27ef 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -314,6 +314,9 @@ typedef struct SubXactCallbackItem
 static SubXactCallbackItem *SubXact_callbacks = NULL;
 
 
+/* Hook for plugins to get control of at end of AbortCurrentTransaction */
+AbortCurrentTransaction_hook_type abort_current_transaction_hook = NULL;
+
 /* local function prototypes */
 static void AssignTransactionId(TransactionState s);
 static void AbortTransaction(void);
@@ -3358,6 +3361,9 @@ AbortCurrentTransaction(void)
 			AbortCurrentTransaction();
 			break;
 	}
+
+	if (abort_current_transaction_hook)
+		(*abort_current_transaction_hook) ();
 }
 
 /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c5f7c775b..3fff54ff51 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -192,6 +192,14 @@ static void log_disconnections(int code, Datum arg);
 static void enable_statement_timeout(void);
 static void disable_statement_timeout(void);
 
+/*
+ * Hook for plugins to get control at end/start of
+ * start_xact_command/finish_xact_command. The hook
+ * on start_xact_command might be useless as it happens
+ * before UtilityProccess and the Executor* hooks.
+ */
+XactCommandStart_hook_type start_xact_command_hook = NULL;
+XactCommandFinish_hook_type finish_xact_command_hook = NULL;
 
 /* 
  *		routines to obtain user input
@@ -2634,6 +2642,7 @@ exec_describe_portal_message(const char *portal_name)
 static void
 start_xact_command(void)
 {
+
 	if (!xact_started)
 	{
 		StartTransactionCommand();
@@ -2649,11 +2658,19 @@ start_xact_command(void)
 	 * not desired, the timeout has to be disabled explicitly.
 	 */
 	enable_statement_t

Re: pg_dump, ATTACH, and independently restorable child partitions

2020-11-20 Thread Justin Pryzby
On Fri, Nov 06, 2020 at 11:18:35PM -0300, Alvaro Herrera wrote:
> On 2020-Oct-24, Justin Pryzby wrote:
> 
> > On Fri, Oct 23, 2020 at 12:29:40AM -0500, Justin Pryzby wrote:
> 
> > > Now that I look, it seems like this is calling PQexec(), which sends a 
> > > single,
> > > "simple" libpq message with:
> > > |CREATE TABLE ..; ALTER TABLE .. ATTACH PARTITION;
> > > ..which is transactional, so when the 2nd command fails, the CREATE is 
> > > rolled back.
> > > https://www.postgresql.org/docs/9.5/libpq-exec.html#LIBPQ-EXEC-MAIN
> > 
> > The easy fix is to add an explicit begin/commit.
> 
> Hmm, I think this throws a warning when used with "pg_restore -1",
> right?  I don't think that's sufficient reason to discard the idea, but
> it be better to find some other way.

Worse, right ?  It'd commit in the middle and then continue outside of a txn.
I guess there's no test case for this :(

> I have no ideas ATM :-(

1. Maybe pg_restore ExecuteSqlCommandBuf() should (always?) call
ExecuteSimpleCommands() instead of ExecuteSqlCommand().  It doesn't seem to
break anything (although that surprised me).

2. Otherwise, the createStmt would need to be split into a createStmt2 or a
char *createStmt[], which I think would then require changing the output
format.  It seems clearly better to keep the sql commands split up initially
than to reverse engineer them during restore.

I tried using \x01 to separate commands, and strtok to split them to run them
individually.  But that breaks the pg_dumpall tests.  As an experiment, I used
\x00, which is somewhat invasive but actually works.

Obviously patching pg_dump will affect only future backups, and the pg_restore
patch allows independently restoring parent tables in existing dumps.
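
For readers joining here, a hedged illustration (object names invented) of the
failure mode: pg_dump emits both statements in one buffer, so when the parent
is missing or incompatible, the error from the ATTACH also rolls back the
CREATE, and the child cannot be restored on its own.

-- What the dump effectively replays for one child partition:
CREATE TABLE measurements_2020 (
    ts  timestamptz NOT NULL,
    val double precision
);
ALTER TABLE measurements ATTACH PARTITION measurements_2020
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
-- If "measurements" does not exist, the ATTACH fails; sent as a single
-- simple-protocol message, that failure also undoes the CREATE TABLE.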

-- 
Justin
>From f1ce24cb79dd4cdcd93d602ddede711efff8227e Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 19 Nov 2020 19:10:02 -0600
Subject: [PATCH v2 1/2] pg_restore: parse and run separately SQL commands

If pg_dump emits multiple DDL commands, run them in separate transactions, to
avoid failures in the 2ndary (ALTER) commands from rolling back the first,
primary (CREATE) command.

XXX: is there any performance benefit possible if can we call
ExecuteSqlCommand() in some cases?  I don't think it matters much for DDL.
---
 src/bin/pg_dump/pg_backup_db.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index 5ba43441f5..1565155e20 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -477,21 +477,20 @@ ExecuteSqlCommandBuf(Archive *AHX, const char *buf, size_t bufLen)
 	else
 	{
 		/*
-		 * General SQL commands; we assume that commands will not be split
-		 * across calls.
+		 * General SQL commands
 		 *
 		 * In most cases the data passed to us will be a null-terminated
 		 * string, but if it's not, we have to add a trailing null.
 		 */
 		if (buf[bufLen] == '\0')
-			ExecuteSqlCommand(AH, buf, "could not execute query");
+			ExecuteSimpleCommands(AH, buf, bufLen);
 		else
 		{
 			char	   *str = (char *) pg_malloc(bufLen + 1);
 
 			memcpy(str, buf, bufLen);
 			str[bufLen] = '\0';
-			ExecuteSqlCommand(AH, str, "could not execute query");
+			ExecuteSimpleCommands(AH, str, bufLen);
 			free(str);
 		}
 	}
-- 
2.17.0

>From c6ec806b245a08c47f6e1d7c4909a532f49e69e1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 24 Oct 2020 14:51:18 -0500
Subject: [PATCH v2 2/2] pg_dump: Allow child partitions to be independently
 restored

..even if the parent doesn't exist, or has missing/incompatible columns

This seems to have been intended by commit 33a53130a
---
 src/bin/pg_dump/pg_backup_archiver.c | 50 +--
 src/bin/pg_dump/pg_dump.c| 72 ++--
 src/bin/pg_dump/t/002_pg_dump.pl |  2 +-
 3 files changed, 117 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index b961a24b36..becf9ec34d 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -143,6 +143,8 @@ static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
 static void StrictNamesCheck(RestoreOptions *ropt);
 
+static size_t WriteStrs(ArchiveHandle *AH, const char *c);
+
 
 /*
  * Allocate a new DumpOptions block containing all default values.
@@ -1081,9 +1083,20 @@ ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
 	newToc->tableam = opts->tableam ? pg_strdup(opts->tableam) : NULL;
 	newToc->owner = opts->owner ? pg_strdup(opts->owner) : NULL;
 	newToc->desc = pg_strdup(opts->description);
-	newToc->defn = opts->createStmt ? pg_strdup(opts->createStmt) : NULL;
 	newToc->dropStmt = opts->dropStmt ? pg_strdup(opts->dropStmt) : NULL;
 	newToc->copyStmt = opts->copyStmt ? pg_strdup(opts->copyStmt) : NULL;
+	if (opts->createStmt == NULL)
+		newToc->defn = NULL;
+	else
+	{
+		const char *p = opts->createStmt;
+		size_t	len

Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Alvaro Herrera
Note on heap_prepare_freeze_tuple()'s fifth parameter, it's not valid to
pass OldestXmin; you need a multixact limit there, not an Xid limit.  I
think the return value of GetOldestMultiXactId is a good choice.  AFAICS
this means that you'll need to add a new output argument to
vacuum_set_xid_limits (You *could* call GetOldestMultiXactId again, but
it seems a waste).

Maybe it's time for vacuum_set_xid_limits to have a caller's-stack-
allocated struct for input and output values, rather than so many
arguments and output pointers to fill.
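
Something along these lines, perhaps (field names are purely illustrative, not
a proposal of the actual API; it assumes PostgreSQL's usual internal typedefs):

/* Sketch of a caller-allocated in/out struct for vacuum_set_xid_limits(). */
typedef struct VacuumXidLimits
{
	/* inputs */
	int			freeze_min_age;
	int			freeze_table_age;
	int			multixact_freeze_min_age;
	int			multixact_freeze_table_age;

	/* outputs */
	TransactionId oldestXmin;
	TransactionId freezeLimit;
	TransactionId xidFullScanLimit;
	MultiXactId multiXactCutoff;
	MultiXactId mxactFullScanLimit;
	MultiXactId oldestMxact;	/* e.g. from GetOldestMultiXactId() */
} VacuumXidLimits;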




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Alvaro Herrera
On 2020-Nov-20, Simon Riggs wrote:

> On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada  wrote:

> > Since we use the term aggressive scan in the docs, I personally don't
> > feel unnatural about that. But since this option also disables index
> > cleanup when not enabled explicitly, I’m concerned a bit if user might
> > get confused. I came up with some names like FEEZE_FAST and
> > FREEZE_MINIMAL but I'm not sure these are better.
> 
> FREEZE_FAST seems good.

VACUUM (CHILL) ?





Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Robert Haas
On Fri, Nov 20, 2020 at 9:08 AM Alvaro Herrera  wrote:
> There are two costs associated with this processing.  One is dirtying
> the page (which means it needs to be written down when evicted), and the
> other is to write WAL records for each change.  The cost for the latter
> is going to be the same in both cases (with this change and without)
> because the same WAL will have to be written -- the only difference is
> *when* do you pay it.  The cost of the former is quite different; with
> Simon's patch we dirty the page once, and without the patch we may dirty
> it several times before it becomes "stable" and no more writes are done
> to it.
>
> (If you have tables whose pages change all the time, there would be no
> difference with or without the patch.)
>
> Dirtying the page less times means less full-page images to WAL, too,
> which can be significant.

Yeah, I think dirtying the page fewer times is a big win. However, a
page may have tuples that are not yet all-visible, and we can't freeze
those just because we are freezing others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-20 Thread Peter Eisentraut

On 2020-11-09 13:05, Magnus Hagander wrote:

PFA a rebased version of this patch on top of what has happened since,
and changing the pg_upgrade parameter to be --no-scripts.


It seems we are still finding out more nuances about pg_upgrade, but 
looking at initdb for a moment, I think the solution for wrapper scripts 
is to just run initdb with >/dev/null.  Or maybe, if that looks a bit too 
hackish, a --quiet option that turns everything on stdout off.
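
i.e., a wrapper script can already silence it today with something like (the
data directory path is a placeholder):

initdb -D "$PGDATA" >/dev/null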


I think initdb has gotten a bit too chatty over time.  I think if it 
printed nothing on stdout by default and the current output would be 
some kind of verbose or debug mode, we wouldn't really lose much.  With 
that in mind, I'm a bit concerned about adding options (and thus 
documentation surface area etc.) to select exactly which slice of the 
chattiness to omit.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: Improve handling of parameter differences in physical replication

2020-11-20 Thread Sergei Kornilov
Hello

> I think I like "unpaused" better here, because "resumed" would seem to
> imply that recovery can actually continue.

Good, I agree.

> One thing that has not been added to my patch is the equivalent of
> 496ee647ecd2917369ffcf1eaa0b2cdca07c8730, which allows promotion while
> recovery is paused. I'm not sure that would be necessary, and it
> doesn't look easy to add either.

Hmm... Good question. How about putting CheckForStandbyTrigger() in a wait 
loop, but reporting FATAL with an appropriate message, such as "promotion is 
not possible because of insufficient parameter settings"?
Also it suits me if we only document that we ignore promotion here. I don't think 
this is an important case. And yes, it's not easy to allow promotion, since we 
have already updated the control file.

Probably we need to pause only after we have reached consistency?

2020-11-20 18:10:23.617 MSK 19722 @ from  [vxid: txid:0] [] LOG:  entering 
standby mode
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] WARNING:  hot 
standby is not possible because of insufficient parameter settings
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] DETAIL:  
max_connections = 100 is a lower setting than on the primary server, where its 
value was 150.
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] LOG:  recovery has 
paused
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] DETAIL:  If 
recovery is unpaused, the server will shut down.
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] HINT:  You can then 
restart the server after making the necessary configuration changes.
2020-11-20 18:13:09.767 MSK 19755 melkij@postgres from [local] [vxid: txid:0] 
[] FATAL:  the database system is starting up

regards, Sergei




Re: Should we document IS [NOT] OF?

2020-11-20 Thread John Naylor
On Thu, Nov 19, 2020 at 6:43 PM Tom Lane  wrote:

> After digging a bit more I noticed that we'd discussed removing
> IS OF in the 2007 thread, but forebore because there wasn't an easy
> replacement.  pg_typeof() was added a year later (b8fab2411), so we
> could have done this at any point since then.
>
> Pushed.
>

Documenting or improving IS OF was a TODO, so I've removed that entry.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: jit and explain nontext

2020-11-20 Thread Peter Eisentraut

On 2020-10-17 21:21, Justin Pryzby wrote:

Added at:https://commitfest.postgresql.org/30/2766/

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 41317f1837..7345971507 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -839,7 +839,8 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, 
JitInstrumentation *ji)
instr_time  total_time;
  
  	/* don't print information if no JITing happened */

-   if (!ji || ji->created_functions == 0)
+   if (!ji || (ji->created_functions == 0 &&
+   es->format == EXPLAIN_FORMAT_TEXT))
return;
  
  	/* calculate total time */


Can you show an output example of where this patch makes a difference? 
Just from reading the description, I would expect some kind of 
additional JIT-related output from something like


EXPLAIN (FORMAT YAML) SELECT 1;

but I don't see anything.





Re: xid wraparound danger due to INDEX_CLEANUP false

2020-11-20 Thread Robert Haas
On Thu, Nov 19, 2020 at 8:58 PM Masahiko Sawada  wrote:
> For HEAD, there was a discussion that we change lazy vacuum and
> bulkdelete and vacuumcleanup APIs so that it calls these APIs even
> when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is
> specified, it collects dead tuple TIDs into maintenance_work_mem space
> and passes the flag indicating INDEX_CLEANUP is specified or not to
> index AMs. Index AM decides whether doing bulkdelete/vacuumcleanup. A
> downside of this idea would be that we will end up using
> maintenance_work_mem even if all index AMs of the table don't do
> bulkdelete/vacuumcleanup at all.
>
> The second idea I came up with is to add an index AM API  (say,
> amcanskipindexcleanup = true/false) telling index cleanup is skippable
> or not. Lazy vacuum checks this flag for each index on the table
> before starting. If index cleanup is skippable in all indexes, it can
> choose one-pass vacuum, meaning no need to collect dead tuple TIDs in
> maintenance_work_mem. All in-core index AM will set to true. Perhaps
> it’s true (skippable) by default for backward compatibility.
>
> The in-core AMs including btree indexes will work same as before. This
> fix is to make it more desirable behavior and possibly to help other
> AMs that require to call vacuumcleanup in all cases. Once we fix it I
> wonder if we can disable index cleanup when autovacuum’s
> anti-wraparound vacuum.

It (still) doesn't seem very sane to me to have an index that requires
cleanup in all cases. I mean, VACUUM could error or be killed just
before the index cleanup phase happens anyway, so it's not like an
index AM can licitly depend on getting called just because we visited
the heap. It could, of course, depend on getting called before
relfrozenxid is advanced, or before the heap's dead line pointers are
marked unused, or something like that, but it can't just be like, hey,
you have to call me.

I think this whole discussion is to some extent a product of the
contract between the index AM and the table AM being more than
slightly unclear. Maybe we need to clear up the definitional problems
first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Alvaro Herrera
On 2020-Nov-20, Robert Haas wrote:

> Yeah, I think dirtying the page fewer times is a big win. However, a
> page may have tuples that are not yet all-visible, and we can't freeze
> those just because we are freezing others.

Of course!  We should only freeze tuples that are freezable.  I thought
that was obvious :-)




Re: Disable WAL logging to speed up data loading

2020-11-20 Thread Stephen Frost
Greetings,

* osumi.takami...@fujitsu.com (osumi.takami...@fujitsu.com) wrote:
> On Friday, Nov 20, 2020 9:33 AM Tsunakawa, Takayuki 
>  wrote:
> > From: Kyotaro Horiguchi 
> > > At Thu, 19 Nov 2020 11:04:17 -0500, Stephen Frost 
> > > > * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> > > > > I missed that this is only a warning when I looked at it before.
> > > > > Yes, it should be a fatal error.
> > > >
> > > > Yeah, the more that I think about it, the more that I tend to agree
> > > > with this.  Does anyone want to argue against changing this into a
> > FATAL..?
> > >
> > > I don't come up with a use case where someone needs to set
> > > wal_level=minimal for archive recovery. So +1 to change it to FATAL.
> It seems we reached the agreement for this fix.
> 
> > Thank you all.  I'd like to give Osumi-san an opportunity to write a patch 
> > and
> > showing the evidence of successful test if you don't mind.  He is very young
> > and newbie to Postgres, so even a small contribution would encourage him to
> > make further contributions.
> Sure. Let me do that, including the test.

That all sounds good to me.  I would recommend starting a new thread on
-hackers with the patch, once it's ready, since it's really independent
from the topic of this thread.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Online verification of checksums

2020-11-20 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 11/20/20 2:28 AM, Michael Paquier wrote:
> >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> >>I was referring to the latest patch on the thread. But as I said, I have
> >>not read up on all the different issues raised in the thread, so take it
> >>with a big grain of salt.
> >>
> >>And I would also echo the previous comment that this code was adapted from
> >>what the pgbackrest folks do. As such, it would be good to get a comment
> >>from for example David on that -- I don't see any of them having commented
> >>after that was mentioned?
> >
> >Agreed.  I am adding Stephen as well in CC.  From the code of
> >backrest, the same logic happens in src/command/backup/pageChecksum.c
> >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> >happen before verifying the checksum.  So, if the page header finishes
> >with random junk because of some kind of corruption, even corrupted
> >pages would be incorrectly considered as correct if the random data
> >passes the pd_upper and pg_lsn checks :/
> 
> Indeed, this is not good, as Andres pointed out some time ago. My apologies
> for not getting to this sooner.

Yeah, it's been on our backlog to improve this.

> Our current plan for pgBackRest:
> 
> 1) Remove the LSN check as you have done in your patch and when rechecking
> see if the page has become valid *or* the LSN is ascending.
> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> is valid.

Yup, that's my recollection also as to our plans for how to improve
things here.

> These do completely rule out any type of corruption, but they certainly
> narrows the possibility by a lot.

*don't :)

> In the future we would also like to scan the WAL to verify that the page is
> definitely being written to.

Yeah, that'd certainly be nice to do too.

> As for your patch, it mostly looks good but my objection is that a page may
> be reported as invalid after 5 retries when in fact it may just be very hot.

Yeah.. while unlikely that it'd actually get written out that much, it
does seem at least possible.

> Maybe checking for an ascending LSN is a good idea there as well? At least
> in that case we could issue a different warning, instead of "checksum
> verification failed" perhaps "checksum verification skipped due to
> concurrent modifications".

+1.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Robert Haas
On Fri, Nov 20, 2020 at 11:02 AM Alvaro Herrera  wrote:
> On 2020-Nov-20, Robert Haas wrote:
> > Yeah, I think dirtying the page fewer times is a big win. However, a
> > page may have tuples that are not yet all-visible, and we can't freeze
> > those just because we are freezing others.
>
> Of course!  We should only freeze tuples that are freezable.  I thought
> that was obvious :-)

I didn't mean to imply that anyone in particular thought the contrary.
It's just an easy mistake to make when thinking about these kinds of
topics. Ask me how I know.

It's also easy to forget that both xmin and xmax can require freezing,
and that the time at which you can do one can be different than the
time at which you can do the other.

Or at least, I have found it so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: jit and explain nontext

2020-11-20 Thread Justin Pryzby
On Fri, Nov 20, 2020 at 04:56:38PM +0100, Peter Eisentraut wrote:
> On 2020-10-17 21:21, Justin Pryzby wrote:
> > Added at:https://commitfest.postgresql.org/30/2766/
> > 
> > diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> > index 41317f1837..7345971507 100644
> > --- a/src/backend/commands/explain.c
> > +++ b/src/backend/commands/explain.c
> > @@ -839,7 +839,8 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, 
> > JitInstrumentation *ji)
> > instr_time  total_time;
> > /* don't print information if no JITing happened */
> > -   if (!ji || ji->created_functions == 0)
> > +   if (!ji || (ji->created_functions == 0 &&
> > +   es->format == EXPLAIN_FORMAT_TEXT))
> > return;
> > /* calculate total time */
> 
> Can you show an output example of where this patch makes a difference? Just
> from reading the description, I would expect some kind of additional
> JIT-related output from something like
> 
> EXPLAIN (FORMAT YAML) SELECT 1;

It matters if it was planned with jit but executed without jit.

postgres=# DEALLOCATE p; SET jit=on; SET jit_above_cost=0;
postgres=# prepare p as select from generate_series(1,9);
postgres=# explain(format yaml) execute p;
postgres=# SET jit=off;
postgres=# explain(format yaml) execute p;

Patched shows this for both explains:
   JIT:  +
 Functions: 3+

Unpatched shows only in the first case.

-- 
Justin




Re: [Patch] ALTER SYSTEM READ ONLY

2020-11-20 Thread Robert Haas
On Thu, Oct 8, 2020 at 6:23 AM Amul Sul  wrote:
> On a quick look at the latest 0001 patch, the following hunk to reset leftover
> flags seems to be unnecessary:
>
> + /*
> + * If some barrier types were not successfully absorbed, we will have
> + * to try again later.
> + */
> + if (!success)
> + {
> + ResetProcSignalBarrierBits(flags);
> + return;
> + }
>
> When the ProcessBarrierPlaceholder() function returns false without an error,
> that barrier flag gets reset within the while loop.  The case when it has an
> error, the rest of the flags get reset in the catch block.  Correct me if I am
> missing something here.

Good catch. I think you're right. Do you want to update accordingly?

Andres, do you like the new loop better?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] remove pg_archivecleanup and pg_standby

2020-11-20 Thread Peter Eisentraut

On 2020-10-29 03:44, Justin Pryzby wrote:

diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 4e833d79ef..be4292ec33 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -199,6 +199,5 @@ pages.
 part of the core PostgreSQL distribution.

  
- &pgstandby;

   
  


With this removal, that section becomes empty.  So you probably want to 
clean up or reorganize this a bit.


See https://www.postgresql.org/docs/devel/contrib-prog.html for the context.




Re: enable_incremental_sort changes query behavior

2020-11-20 Thread James Coleman
On Tue, Nov 17, 2020 at 11:20 AM Tomas Vondra
 wrote:
>
>
>
> On 11/17/20 3:03 PM, James Coleman wrote:
> > On Mon, Nov 16, 2020 at 11:23 PM Tomas Vondra
> >  wrote:
> >>
> >> Hmm, I missed that other thread. That indeed seems like a bug in the
> >> same area already tweaked by ebb7ae839d033d0f2 for similar cases.
> >
> > I meant to bring it up in this thread before we got the other patch
> > committed, but I just ended up not having time to look into it.
> >
> >> The attached patch fixes this simply by adding is_parallel_safe to
> >> get_useful_pathkeys_for_relation - that does fix the reproducer, but I'm
> >> not entirely sure that's the right place. Maybe it should be done in
> >> find_em_expr_usable_for_sorting_rel (which might make a difference if an
> >> EC clause can contain a mix of parallel safe and unsafe expressions). Or
> >> maybe we should do it in the caller (which would allow using
> >> get_useful_pathkeys_for_relation in contexts not requiring parallel
> >> safety in the future).
> >>
> >> Anyway, while this is not an "incremental sort" bug, it seems like the
> >> root cause is the same as for ebb7ae839d033d0f2 - one of the incremental
> >> sort patches started considering sorting below gather nodes, not
> >> realizing these possible consequences.
> >
> > Yeah. I'd like to investigate a bit if really we should also be adding
> > it to prepare_sort_from_pathkeys (which
> > find_em_expr_usable_for_sorting_rel parallels) or similar, since this
> > seems to be a broader concern that's been previously missed (even if
> > the bug was otherwise not yet exposed). In particular, as I understand
> > it, we did do sorts below gather nodes before, just in fewer places,
> > which meant that "in theory" the code was already broken (or at least
> > not complete) for parallel queries.
> >
>
> True. It's possible the bug was there before, but the affected plans
> were just very unlikely for some reason, and the new plans introduced by
> incremental sort just made it easier to trigger.

For additional patches on the broader topic (e.g., parallel safety) is
it preferable to keep them here or in the thread Luis started (or even
a new one besides that -- since that's on -bugs not -hackers)?

James




Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-11-20 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-09-03 12:14, Michael Paquier wrote:
>> Opinions are welcome about the arguments of upthread.

> It appears that currtid2() is still used, so we ought to keep it.

Yeah, if pgODBC were not using it at all then I think it'd be fine
to get rid of, but if it still contains calls then we cannot.
The suggestion upthread that those calls might be unreachable
is interesting, but it seems unproven.
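
For reference, the call in question looks roughly like this (illustrative
only; it assumes the pre-removal signature currtid2(text, tid), which is the
shape of the query psqlODBC can issue to re-locate a previously fetched row):

SELECT currtid2('mytable'::text, '(0,1)'::tid);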

regards, tom lane




Re: enable_incremental_sort changes query behavior

2020-11-20 Thread Robert Haas
On Wed, Oct 7, 2020 at 6:22 PM Tomas Vondra
 wrote:
> I'm still not entirely sure I understand what's happening, or what the
> exact rule is. Consider this query:
>
>explain (verbose) select distinct i, t, md5(t) from ref_0;
>
> which on PG12 (i.e. before incremental sort) is planned like this:
>
>   QUERY PLAN
> --
>   Unique  (cost=78120.92..83120.92 rows=50 width=65)
> Output: i, t, (md5(t))
> ->  Sort  (cost=78120.92..79370.92 rows=50 width=65)
>   Output: i, t, (md5(t))
>   Sort Key: ref_0.i, ref_0.t, (md5(ref_0.t))
>   ->  Seq Scan on public.ref_0  (cost=0.00..10282.00 rows=50 
> width=65)
> Output: i, t, md5(t)
> (7 rows)
>
> i.e. the (stable) function is pushed all the way to the scan node. And
> even if we replace it with a volatile expression it gets pushed down:
>
> explain (verbose) select distinct i, t, md5(random()::text || t) from ref_0;
>
>   QUERY PLAN
> --
>   Unique  (cost=83120.92..88120.92 rows=50 width=65)
> Output: i, t, (md5(((random())::text || t)))
> ->  Sort  (cost=83120.92..84370.92 rows=50 width=65)
>   Output: i, t, (md5(((random())::text || t)))
>   Sort Key: ref_0.i, ref_0.t, (md5(((random())::text || ref_0.t)))
>   ->  Seq Scan on public.ref_0  (cost=0.00..15282.00 rows=50 
> width=65)
> Output: i, t, md5(((random())::text || t))
> (7 rows)
>
>
> But perhaps I just don't understand the assumption correctly?

This isn't a counterexample, because there's no join tree here -- or,
well, there is, but it's trivial, because there's only one relation
involved. You can't have a non-Var expression computed before you
finish all the joins, because there are no joins.

What I said was: "target lists for any nodes below the top of the join
tree were previously always just Var nodes." The topmost join allowed
non-Var nodes before, but not lower levels.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Refactor pg_rewind code and make it work against a standby

2020-11-20 Thread Andres Freund
Hi,

On 2020-11-20 16:19:03 +0200, Heikki Linnakangas wrote:
> Pushed a fix similar to your patch, but I put the wait_for_catchup() before
> running pg_rewind. The point of inserting the 'in A, after C was promoted'
> row is that it's present in B when pg_rewind runs.

Hm - don't we possibly need *both*? Since post pg_rewind recovery starts
at the previous checkpoint, it's quite possible for C to get ready to
answer queries before that record has been replayed?

Thanks,

Andres Freund




Re: abstract Unix-domain sockets

2020-11-20 Thread David G. Johnston
On Friday, November 20, 2020, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-11-18 04:35, David G. Johnston wrote:
>
>>
>>
>> I agree that there isn't a useful hint for the abstract case as it
>> shouldn't happen unless there is indeed another running instance with the
>> same configuration.  Though a hint similar to the above, but without the
>> "remove and retry" bit, probably wouldn't hurt.
>>
>
> I think we are getting a bit sidetracked here with the message wording.
> The reason I looked at this was that "remove socket file and retry" is
> never an appropriate action with abstract sockets.  And on further
> analysis, it is never an appropriate action with any Unix-domain socket
> (because with file system namespace sockets, you never get an EADDRINUSE,
> so it's dead code).  So my proposal here is to just delete that line from
> the hint and leave the rest the same.  There could be value in further
> refining and rephrasing this, but it ought to be a separate thread.
>

If there is dead code, there is an underlying problem to address/discover,
not just dead code to remove.  In this case, are we saying that a new
server won't ever fail to start because the socket file exists, but instead
will just clobber the file with its own?  Because given that error, and a
server process that failed to clean up after itself, the corrective action
would indeed seem to be to manually remove the file, as the hint says.  IOW,
fix the code, not the message?

David J.


[patch] CLUSTER blocks scanned progress reporting

2020-11-20 Thread Matthias van de Meent
Hi,

The pg_stat_progress_cluster view can report incorrect
heap_blks_scanned values when synchronize_seqscans is enabled, because
it allows the sequential heap scan to not start at block 0. This can
result in wraparounds in the heap_blks_scanned column when the table
scan wraps around, and starting the next phase with heap_blks_scanned
!= heap_blks_total. This issue was introduced with the
pg_stat_progress_cluster view.

The attached patch fixes the issue by accounting for a non-0
heapScan->rs_startblock and calculating the correct number with a
non-0 heapScan->rs_startblock in mind.
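
For illustration (numbers invented): with heapScan->rs_nblocks = 100 and
heapScan->rs_startblock = 70, a scan that has wrapped around and currently
sits at rs_cblock = 9 has actually read blocks 70..99 and 0..9, i.e. 40
blocks.  The corrected expression reports (9 + 100 - 70) % 100 + 1 = 40,
whereas the old rs_cblock + 1 would report 10, after previously having climbed
to 100.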

The issue is reproducible starting from PG12 by stopping a
CLUSTER-command while it is sequential-scanning the table (seqscan
enforceable through enable_indexscan = off), and then re-starting the
seqscanning CLUSTER command (without other load/seq-scans on the
table).


Any thoughts?

Matthias van de Meent
From 12f7617f4999b7bf85cc2c9ef4c3b96aac3b26e8 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 20 Nov 2020 16:23:59 +0100
Subject: [PATCH v1] Fix CLUSTER progress reporting of number of blocks
 scanned.

The heapScan need not start at block 0, so heapScan->rs_cblock need not be the
correct value for amount of blocks scanned. A more correct value is
 ((heapScan->rs_cblock - heapScan->rs_startblock + heapScan->rs_nblocks) %
   heapScan->rs_nblocks), as it accounts for the wraparound and the initial
offset of the heapScan.

Signed-off-by: Matthias van de Meent 
---
 src/backend/access/heap/heapam_handler.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..3d0433633b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -798,9 +798,17 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			/*
 			 * In scan-and-sort mode and also VACUUM FULL, set heap blocks
 			 * scanned
+			 *
+			 * Note that heapScan may start at an offset and wrap around, i.e.
+			 * rs_startblock may be >0, and rs_cblock may end with a number
+			 * below rs_startblock. To prevent showing this wraparound to the
+			 * user, we offset rs_cblock by rs_startblock (modulo rs_nblocks).
 			 */
 			pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
-		 heapScan->rs_cblock + 1);
+		 (heapScan->rs_cblock +
+		 heapScan->rs_nblocks -
+		 heapScan->rs_startblock
+		 ) % heapScan->rs_nblocks + 1);
 		}
 
 		tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
-- 
2.20.1



Re: [Patch] ALTER SYSTEM READ ONLY

2020-11-20 Thread Amul Sul
On Fri, 20 Nov 2020 at 9:53 PM, Robert Haas  wrote:

> On Thu, Oct 8, 2020 at 6:23 AM Amul Sul  wrote:
> > On a quick look at the latest 0001 patch, the following hunk to reset
> leftover
> > flags seems to be unnecessary:
> >
> > + /*
> > + * If some barrier types were not successfully absorbed, we will have
> > + * to try again later.
> > + */
> > + if (!success)
> > + {
> > + ResetProcSignalBarrierBits(flags);
> > + return;
> > + }
> >
> > When the ProcessBarrierPlaceholder() function returns false without an
> error,
> > that barrier flag gets reset within the while loop.  The case when it
> has an
> > error, the rest of the flags get reset in the catch block.  Correct me
> if I am
> > missing something here.
>
> Good catch. I think you're right. Do you want to update accordingly?


Sure, I'll update that. Thanks for the confirmation.


> Andres, do you like the new loop better?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Probable documentation errors or improvements

2020-11-20 Thread Щекин Ярослав
Thank you for preparing the patches, and Heikki Linnakangas for pushing those!

Yaroslav, could you prepare a patch for your remaining suggestions?
I suggest including just the handful of items which are most important
(most wrong/ambiguous/confusing).

Let me write this again: I'm not a native speaker, so I'm not sure those are
actually incorrect, and can't offer non-trivial suggestions. Anyway, I've
tried to remove the already fixed things (also by other patches mentioned in
the thread) from my initial list, and attach it.

Also, I've looked over the pushed patches, and spotted these:

-- https://www.postgresql.org/docs/current/catalog-pg-authid.html
Table 51.8. pg_authid Columns
-- Some rows in the Description column end with a period, some don't
(seemingly at random). The applied patch fixed only one of those cases...
also seemingly at random. It would be nice if someone read the whole table
and made a decision about how it should be. If (just guessing here) the rule
is "there should be no period if it's just one sentence", then "Role bypasses
every row level security policy, see Section 5.8 for more information." is
still wrong.

Then, while ";" after "END" was added in doc/src/sgml/plpgsql.sgml, it still
was not in doc/src/sgml/func.sgml.

Finally, while the patches fix a lot of whitespace in function arguments, the
original complaint about div() and mod() is still not fixed.

--
WBR, Yaroslav Schekin.

General ones:
. "boolean" vs "Boolean" --- usage seems to be inconsistent, even in the same
  context.

. Transaction isolation levels are sometimes called "transaction isolation
  modes", is that correct?

. In https://www.postgresql.org/docs/current/tableam.html, links to source code
  are also hyperlinks into git, like (from SGML source):

  For details, please refer to the https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/tableam.h;hb=HEAD";>
  src/include/access/tableam.h file.

  Perhaps, other similar links in documentation should also be made into
  hyperlinks?


Specific ones:
-- 
https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS

4.1.2.4 Dollar-quoted String Constants

While the standard syntax for specifying string constants is usually
convenient, it can be difficult to understand when the desired string contains
many single quotes or backslashes, since each of those must be doubled.
  -- Not so for backslashes (perhaps, this sentence is from
 pre-standard_conforming_strings ages).

-  --

Notice that inside the dollar-quoted string, single quotes can be used without
needing to be escaped. Indeed, no characters inside a dollar-quoted string are
ever escaped: the string content is always written literally. Backslashes are
not special, and neither are dollar signs, unless they are part of a sequence
matching the opening tag.
  -- Backslashes, again. Though here in may be justified, not sure.

-  --

$function$
BEGIN
RETURN ($1 ~ $q$[\t\r\n\v\\]$q$);
END;
$function$
  -- While it's understandable what the example is trying to demonstrate,
 single-quoted string would work here, too (so, no actual advantage, in
 this case).

-  --

With single-quote syntax, each backslash in the above example would have to be
written as four backslashes, which would be reduced to two backslashes in
parsing the original string constant, and then to one when the inner string
constant is re-parsed during function execution.
  -- Nothing needs to be changed about backslashes, yet again.


-- https://www.postgresql.org/docs/current/ddl-basics.html
5.1. Table Basics

A table in a relational database is much like a table on paper: It consists of
rows and columns.
  -- Why "I" in It after ":" is capitalized?

-  --
Some of the frequently used data types are integer for whole numbers, numeric
for possibly fractional numbers, text for character strings, date for dates,
time for time-of-day values, and timestamp for values containing both date and
time.
  -- Perhaps, add (or replace with) timestamptz for storing moments in time (or
 something like that)?


-- 
https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-EXCLUSION

5.4.6. Exclusion Constraints

Exclusion constraints ensure that if any two rows are compared on the specified
columns or expressions using the specified operators, at least one of these
operator comparisons will return false or null. The syntax is:

CREATE TABLE circles (
c circle,
EXCLUDE USING gist (c WITH &&)
);
  -- Not only the definition is hard to grasp, but the example doesn't clarify
 a lot, as it's not explained what it actually achieves (especially given
 geometric data types and operators are described several chapter

Re: [Proposal] Global temporary tables

2020-11-20 Thread Pavel Stehule
Hi

pá 11. 9. 2020 v 17:00 odesílatel 曾文旌  napsal:

> I have written the README for the GTT, which contains the GTT requirements
> and design.
> I found that compared to my first email a year ago, many GTT Limitations
> are now gone.
> Now, I'm adding comments to some of the necessary functions.
>

There are problems with patching. Please, can you rebase your patch?

Regards

Pavel


>
> Wenjing
>
>
>
>
>
> > 2020年7月31日 上午4:57,Robert Haas  写道:
> >
> > On Thu, Jul 30, 2020 at 8:09 AM wenjing zeng 
> wrote:
> >> Please continue to review the code.
> >
> > This patch is pretty light on comments. Many of the new functions have
> > no header comments, for example. There are comments here and there in
> > the body of the new functions that are added, and in places where
> > existing code is changed there are comments here and there, but
> > overall it's not a whole lot. There's no documentation and no README,
> > either. Since this adds a new feature and a bunch of new SQL-callable
> > functions that interact with that feature, the feature itself should
> > be documented, along with its limitations and the new SQL-callable
> > functions that interact with it. I think there should be either a
> > lengthy comment in some suitable file, or maybe various comments in
> > various files, or else a README file, that clearly sets out the major
> > design principles behind the patch, and explaining also what that
> > means in terms of features and limitations. Without that, it's really
> > hard for anyone to jump into reviewing this code, and it will be hard
> > for people who have to maintain it in the future to understand it,
> > either. Or for users, for that matter.
> >
> > --
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
>
>


Re: enable_incremental_sort changes query behavior

2020-11-20 Thread James Coleman
On Fri, Nov 20, 2020 at 12:06 PM Robert Haas  wrote:
>
> On Wed, Oct 7, 2020 at 6:22 PM Tomas Vondra
>  wrote:
> > I'm still not entirely sure I understand what's happening, or what the
> > exact rule is. Consider this query:
> >
> >explain (verbose) select distinct i, t, md5(t) from ref_0;
> >
> > which on PG12 (i.e. before incremental sort) is planned like this:
> >
> >   QUERY PLAN
> > --
> >   Unique  (cost=78120.92..83120.92 rows=50 width=65)
> > Output: i, t, (md5(t))
> > ->  Sort  (cost=78120.92..79370.92 rows=50 width=65)
> >   Output: i, t, (md5(t))
> >   Sort Key: ref_0.i, ref_0.t, (md5(ref_0.t))
> >   ->  Seq Scan on public.ref_0  (cost=0.00..10282.00 rows=50 
> > width=65)
> > Output: i, t, md5(t)
> > (7 rows)
> >
> > i.e. the (stable) function is pushed all the way to the scan node. And
> > even if we replace it with a volatile expression it gets pushed down:
> >
> > explain (verbose) select distinct i, t, md5(random()::text || t) from ref_0;
> >
> >   QUERY PLAN
> > --
> >   Unique  (cost=83120.92..88120.92 rows=50 width=65)
> > Output: i, t, (md5(((random())::text || t)))
> > ->  Sort  (cost=83120.92..84370.92 rows=50 width=65)
> >   Output: i, t, (md5(((random())::text || t)))
> >   Sort Key: ref_0.i, ref_0.t, (md5(((random())::text || ref_0.t)))
> >   ->  Seq Scan on public.ref_0  (cost=0.00..15282.00 rows=50 
> > width=65)
> > Output: i, t, md5(((random())::text || t))
> > (7 rows)
> >
> >
> > But perhaps I just don't understand the assumption correctly?
>
> This isn't a counterexample, because there's no join tree here -- or,
> well, there is, but it's trivial, because there's only one relation
> involved. You can't have a non-Var expression computed before you
> finish all the joins, because there are no joins.
>
> What I said was: "target lists for any nodes below the top of the join
> tree were previously always just Var nodes." The topmost join allowed
> non-Var nodes before, but not lower levels.

As I understand what you're saying, the attached (from the repro case
in [1]'s discussion about parallel safety here) is a counterexample.

Specifically we have a plan like:

Merge Right Join
  -> Unique
-> Gather Merge
  -> Sort
-> Nested Loop

The pathtarget of the nested loop contains non-var expressions (in
this case a CASE expression).

Am I misunderstanding what you're saying?

I've attached verbose output (and the query).

James
test=# explain (verbose)
select m.id2, m.id, s.description
FROM main m
LEFT JOIN (
  SELECT DISTINCT
m.id,
CASE
WHEN m.id2 = 15 AND (SELECT name FROM secondary x WHERE x.id = s2.id AND 
x.id2 = 10) = md5(123::text) THEN 'description'
WHEN m.id2 = 15 THEN (SELECT name FROM secondary x WHERE x.id = s2.id AND 
x.id2 = 5)
END AS description
  FROM main m
  JOIN secondary s2 ON m.id = s2.id
  WHERE m.id2 = 15 and type = 0
) s ON s.id = m.id
WHERE m.id2 IN (15) and type = 0 ;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=126496.41..9205100.33 rows=34236 width=44)
   Output: m.id2, m.id, (CASE WHEN ((m_1.id2 = 15) AND ((SubPlan 1) = '202cb962ac59075b964b07152d234b70'::text)) THEN 'description'::text WHEN (m_1.id2 = 15) THEN (SubPlan 2) ELSE NULL::text END)
   Inner Unique: true
   Merge Cond: (m_1.id = m.id)
   ->  Unique  (cost=125495.95..9092259.53 rows=34236 width=40)
         Output: m_1.id, (CASE WHEN ((m_1.id2 = 15) AND ((SubPlan 1) = '202cb962ac59075b964b07152d234b70'::text)) THEN 'description'::text WHEN (m_1.id2 = 15) THEN (SubPlan 2) ELSE NULL::text END)
         ->  Gather Merge  (cost=125495.95..9089667.83 rows=518341 width=40)
               Output: m_1.id, (CASE WHEN ((m_1.id2 = 15) AND ((SubPlan 1) = '202cb962ac59075b964b07152d234b70'::text)) THEN 'description'::text WHEN (m_1.id2 = 15) THEN (SubPlan 2) ELSE NULL::text END)

Re: xid wraparound danger due to INDEX_CLEANUP false

2020-11-20 Thread Peter Geoghegan
On Fri, Nov 20, 2020 at 3:17 AM Masahiko Sawada  wrote:
> I had missed your bottom-up index deletion patch but it's a promising
> improvement. With that patch, the number of dead tuples in individual
> indexes may differ. So it's important that we make index vacuuming a
> per-index thing.

Right, but it's important to remember that even bottom-up index
deletion isn't really special in theory. Bottom-up index deletion is
"just" a reliable version of the existing LP_DEAD index deletion thing
(which has been around since Postgres 8.2). In theory it doesn't
change the fundamental nature of the problem. In practice it does,
because it makes it very obvious to pgsql-hackers that indexes on the
same table can have very different needs from VACUUM. And the actual
differences we see are much bigger now. Even still, the fact that you
had certain big differences across indexes on the same table is not a
new thing. (Actually, you can even see this on the master branch in
Victor's bottom-up deletion benchmark, where the primary key index
actually doesn't grow on the master branch, even after 8 hours.)

The bottom-up index deletion patch (and the enhancements we're talking
about here, for VACUUM itself) are based on "the generational
hypothesis" that underlies generational garbage collection. The
philosophy is the same. See:

https://plumbr.io/handbook/garbage-collection-in-java/generational-hypothesis

In theory, "most garbage comes from new objects" is "just" an
empirical observation, that may or may not be true with each
workload/Java program/whatever. In practice it is important enough to
be a big part of how every modern garbage collector works -- it's
almost always true, and even when it isn't true it doesn't actually
hurt to make the assumption that it is true and then be wrong. I
believe that we have to take a holistic view of the problem to make
real progress.

Andres said something similar in a recent blog post:

https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon

"In most workloads the majority of accesses are to live tuples, and
when encountering non-live tuple versions they are either very old, or
very new."

(This was just a coincidence, but it was good to see that he made the
same observation.)

> Given that patch, it seems to me that it would be better to ask
> individual index AM before calling to bulkdelete about the needs of
> bulkdelete. That is, passing VACUUM options and the number of
> collected dead tuples etc. to index AM, we ask index AM via a new
> index AM API whether it wants to do bulkdelete or not. We call
> bulkdelete for only indexes that answered 'yes'. If we got 'no' from
> any one of the indexes, we cannot have a second heap pass.
> INDEX_CLEANUP is not enforceable. When INDEX_CLEANUP is set to false,
> we expect index AMs to return 'no' unless they have a special reason
> for the needs of bulkdelete.

I don't have a very detailed idea of the interface or anything. There
are a few questions that naturally present themselves, that I don't
have good answers to right now. Obviously vacuumlazy.c will only treat
this feedback from each index as an advisory thing. So what happens
when 50% of the indexes say yes and 50% say no? This is a subproblem
that must be solved as part of this work. Ideally it will be solved by
you.  :-)

> One possible benefit of this idea even without bottom-up index
> deletion patch would be something like
> vacuum_index_cleanup_scale_factor for bulkdelete. For example, in the
> case where the amount of dead tuple is slightly larger than
> maitenance_work_mem the second time calling to bulkdelete will be
> called with a small amount of dead tuples, which is less efficient. If
> an index AM is able to determine not to do bulkdelete by comparing the
> number of dead tuples to a threshold, it can avoid such bulkdelete
> calling.

I agree. Actually, I thought the same thing myself, even before I
realized that bottom-up index deletion was possible.

> Also, as a future extension, once we have retail index deletion
> feature, we might be able to make that API return a ternary value:
> 'no', 'do_bulkdelete', 'do_indexscandelete', so that index AM can
> choose the appropriate method of index deletion based on the
> statistics.

I agree again!
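
Purely to illustrate the shape such an API could take (every name below
is invented for this mail -- it is not from any actual patch), the
ternary answer might look something like:

#include "utils/rel.h"          /* for Relation */

/* Hypothetical advisory answer from an index AM (illustrative only) */
typedef enum IndexVacuumAdvice
{
    INDEX_VACUUM_SKIP,          /* "no": bulkdelete not worth it right now */
    INDEX_VACUUM_BULKDELETE,    /* "do_bulkdelete": whole-index scan */
    INDEX_VACUUM_TARGETED       /* "do_indexscandelete": retail deletion */
} IndexVacuumAdvice;

/*
 * Hypothetical callback: the index AM looks at the statistics it cares
 * about and states a preference; vacuumlazy.c treats the answer as
 * advisory only.
 */
typedef IndexVacuumAdvice (*amvacuumadvice_function) (Relation indexRelation,
                                                      double numDeadTuples,
                                                      bool indexCleanupRequested);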

We may eventually be able to make autovacuum run very frequently
against each table in many important cases, with each VACUUM taking
very little wall-clock time. We don't have to change the fundamental
design to fix most of the current problems. I suspect that the "top
down" nature of VACUUM is sometimes helpful. We just need to
compensate when this design is insufficient. Getting the "best of both
worlds" is possible.

> But for making index vacuuming per-index thing, we need to deal with
> the problem that we cannot know which indexes of the table still has
> index tuples pointing to the collect

Re: enable_incremental_sort changes query behavior

2020-11-20 Thread Robert Haas
On Fri, Nov 20, 2020 at 1:51 PM James Coleman  wrote:
> > This isn't a counterexample, because there's no join tree here -- or,
> > well, there is, but it's trivial, because there's only one relation
> > involved. You can't have a non-Var expression computed before you
> > finish all the joins, because there are no joins.
> >
> > What I said was: "target lists for any nodes below the top of the join
> > tree were previously always just Var nodes." The topmost join allowed
> > non-Var nodes before, but not lower levels.
>
> As I understand what you're saying, the attached (from the repro case
> in [1]'s discussion about parallel safety here) is a counterexample.
>
> Specifically we have a plan like:
>
> Merge Right Join
>   -> Unique
> -> Gather Merge
>   -> Sort
> -> Nested Loop
>
> The pathtarget of the nested loop contains non-var expressions (in
> this case a CASE expression).

Well, in this case there are two join trees, one for the subquery and
one for the outer query. The expressions for the subquery are computed
at the top of the plan tree for that subquery.

I guess I didn't specify before that I meant "at the top of the join
tree for a subquery level," but I did mean that. On principle, the
planner can't very realistically postpone evaluating a subquery's
expressions until the top of the outer query's join tree, because, as
in your example here, the subquery might contain an ORDER BY or
DISTINCT clause. And anyway, if you look at the planner code, you'll
see that every subquery is basically planned separately, unless it
gets flattened into the parent query, so even if it were possible to
postpone expression evaluation to outer subquery levels in some case,
the current code structure definitely would fail to achieve that goal.

However, apart from incremental sort, you can postpone evaluating a
join tree's expressions until you reach the top of the join tree, and
I think that's what we have always done in the past. Each baserel in
the join tree gets a target list containing a list of vars which
corresponds to its column list, and the joinrels get lists of vars
which correspond to the columns from the inputs are needed at higher
levels of the plan tree. At the top of the join tree, we stick in the
expressions, replacing the target list of the top node in the plan
tree; the expressions have to be computable from the inputs because
the inputs contain all the columns that we figured out were needed at
this level at the beginning of planning. But, if there are scans or
joins below the top level of the plan tree, then do they bubble up to
higher levels of the plan? Should they? It's pretty complicated.
Theoretically if I've computed length(somelongvar) then perhaps I can
just bubble the computed value up the plan tree, rather than the
original column, which might be cheaper. But that only works if the
higher levels only need length(somelongvar) and not somelongvar
itself. I don't think there's any logic to work stuff like this out,
unless the incremental sort patch added some, because I don't think we
ever did it before. And it may be that it doesn't really matter, at
least when there aren't volatile functions: perhaps the worst case is
that we evaluate a non-volatile function twice, and maybe that's just
a performance consequence that we can live with. But on the other
hand, maybe it breaks stuff.

Or, on the third hand, maybe I'm wrong about what the rule was in the
first place. This is certainly a complicated topic. I don't think I'm
wrong, or I wouldn't be bothering to write emails about it, but that
doesn't mean I'm not wrong anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-11-20 Thread Robert Haas
On Fri, Nov 20, 2020 at 2:37 PM Peter Geoghegan  wrote:
> Right. Maybe we don't ask the index AMs for discrete yes/no answers.
> Maybe we can ask them for a continuous answer, such as a value between
> 0.0 and 1.0 that represents the urgency/bloat, or something like that.
> And so the final yes/no answer that really does have to be made for
> the table as a whole (does VACUUM do a second pass over the heap to
> make LP_DEAD items into LP_UNUSED items?) can at least consider the
> worst case for each index. And maybe the average case, too.

That's an interesting idea. We should think about the needs of brin
indexes when designing something better than the current system. They
have the interesting property that the heap deciding to change LP_DEAD
to LP_UNUSED doesn't break anything even if nothing's been done to the
index, because they don't store TIDs anyway. So that's an example of
an index AM that might want to do some work to keep performance up,
but it's not actually required. This might be orthogonal to the
0.0-1.0 scale you were thinking about, but it might be good to factor
it into the thinking somehow.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Strange behavior with polygon and NaN

2020-11-20 Thread Tom Lane
I spent some more time looking at this patch.

Some experimentation shows that the changes around bounding box
calculation (ie float8_min_nan() and its call sites) seem to be
completely pointless: removing them doesn't change any of the regression
results.  Nor does using float8_min_nan() in the other two bounding-box
calculations I'd asked about.  So I think we should just drop that set
of changes and stick with the rule that bounding box upper and lower
values are sorted as per float.h comparison rules.  This isn't that hard
to work with: if you want to know whether any NaNs are in the box, test
the upper limit (and *not* the lower limit) for isnan().  Moreover, even
if we wanted a different coding rule, we really can't have it because we
will still need to work with existing on-disk values that have bounding
boxes computed the old way.

I don't much like anything about float8_coef_mul().  In the first place,
FPzero() is an ugly, badly designed condition that we should be trying
to get rid of not add more dependencies on.  In the second place, it's
really unclear why allowing 0 times Inf to be something other than NaN
is a good idea, and it's even less clear why allowing small-but-not-zero
times Inf to be zero rather than Inf is a good idea.  In the third
place, the asymmetry between the two inputs looks more like a bug than
something we should actually want.

After some time spent staring at the specific case of line_closept_point
and its callers, I decided that the real problems there are twofold.
First, the API, or at least the callers' interpretation of this
undocumented point, is that whether the distance is undefined (NaN) is
equivalent to whether the closest point is undefined.  This is not so;
in some cases we know that the distance is infinite even though we can't
calculate a unique closest point.  Second, it's not making any attempt
to eliminate undefined cases up front.  We can do that pretty easily
by plugging the point's coordinates into the line's equation Ax+By+C
and seeing whether we get a NaN.  The attached 0002 is a subset patch
that just fixes these two issues, and I like the results it produces.

I wonder now whether the problems around line_interpt_line() and the
other intersection-ish functions wouldn't be better handled in a similar
way, by making changes to their API specs to be clearer about what
happens with NaNs and trying to eliminate ill-defined cases explicitly.
I've not tried to code that though.

Changing pg_hypot() the way you've done here is right out.  See the
comment for the function: what it is doing now is per all the relevant
standards, and your patch breaks that.  It's extremely unlikely that
doing it differently from IEEE and POSIX is a good idea.

Attached are the same old 0001 (adding the extra point_tbl entry)
and a small 0002 that fixes just line_closept_point.  I've not
tried to figure out just which of the rest of your diffs should be
dropped given that.  I did note though that the example you add
to func.sgml doesn't apply to this version of line_closept_point:

regression=# select point '(Infinity,Infinity)' <-> line '{-1,0,5}';
 ?column? 
--
  NaN
(1 row)

regression=# select point '(Infinity,Infinity)' <-> line '{0,-1,5}';
 ?column? 
--
  NaN
(1 row)

regards, tom lane

diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 93a8736a3f..76679bae8d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -157,7 +157,7 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
 SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
  count 
 ---
- 3
+ 4
 (1 row)
 
 SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
@@ -169,7 +169,7 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
 SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
  count 
 ---
- 4
+ 5
 (1 row)
 
 SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
@@ -188,10 +188,11 @@ SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
  (10,10)
  (-5,-12)
  (5.1,34.5)
+ (Infinity,1e+300)
  (1e+300,Infinity)
  (NaN,NaN)
  
-(10 rows)
+(11 rows)
 
 SELECT * FROM point_tbl WHERE f1 IS NULL;
  f1 
@@ -202,16 +203,17 @@ SELECT * FROM point_tbl WHERE f1 IS NULL;
 SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
 f1 
 ---
- (1e-300,-1e-300)
  (0,0)
+ (1e-300,-1e-300)
  (-3,4)
  (-10,0)
  (10,10)
  (-5,-12)
  (5.1,34.5)
  (1e+300,Infinity)
+ (Infinity,1e+300)
  (NaN,NaN)
-(9 rows)
+(10 rows)
 
 SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
 f1
@@ -464,7 +466,7 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
 SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
  count 
 ---
- 3
+ 4
 (1 row)
 
 EXPLAIN (COSTS OFF)
@@ -494,7 +496,7 @@ SELECT count(*) FROM

Re: Refactor pg_rewind code and make it work against a standby

2020-11-20 Thread Heikki Linnakangas

On 20/11/2020 19:14, Andres Freund wrote:

Hi,

On 2020-11-20 16:19:03 +0200, Heikki Linnakangas wrote:

Pushed a fix similar to your patch, but I put the wait_for_catchup() before
running pg_rewind. The point of inserting the 'in A, after C was promoted'
row is that it's present in B when pg_rewind runs.


Hm - don't we possibly need *both*? Since post pg_rewind recovery starts
at the previous checkpoint, it's quite possible for C to get ready to
answer queries before that record has been replayed?


No, C will not reach consistent state until all the WAL in the source 
system has been replayed. pg_rewind will set minRecoveryPoint to the 
minRecoveryPoint of the source system, after copying all the files. (Or 
its insert point, if it's not a standby server, but in this case it is). 
Same as when taking an online backup.


- Heikki




Default role -> Predefined role

2020-11-20 Thread Stephen Frost
Greetings,

Attached is a patch to move from 'default role' terminology to
'predefined role' in the documentation.  In the code, I figured it made
more sense to avoid saying either one and instead opted for just
'ROLE_NAME_OF_ROLE' as the #define, which we were very close to (one
removing the 'DEFAULT' prefix) but didn't actually entirely follow.

This would only apply to HEAD, and we'd use the doc redirect + doc alias
features of the website to prevent links to things like /current from
breaking and to provide a smooth path.  In the back-branches there would
be a smaller update that just notes that in v14 'default roles' were
renamed to 'predefined roles'.

Compiles and passes regressions, and I didn't find any other obvious
references to 'default roles'.  Where I did find such comments in the
code, I opted generally to just remove the 'default' qualification as it
wasn't really needed.

I had contemplated calling these 'capabilities' or similar but ended up
deciding not to; they really are roles, after all.

Thoughts?

Thanks,

Stephen
diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index d064b5a080..fe066e402c 100644
--- a/contrib/adminpack/adminpack.c
+++ b/contrib/adminpack/adminpack.c
@@ -79,10 +79,13 @@ convert_and_check_filename(text *arg)
 	 * files on the server as the PG user, so no need to do any further checks
 	 * here.
 	 */
-	if (is_member_of_role(GetUserId(), DEFAULT_ROLE_WRITE_SERVER_FILES))
+	if (is_member_of_role(GetUserId(), ROLE_PG_WRITE_SERVER_FILES))
 		return filename;
 
-	/* User isn't a member of the default role, so check if it's allowable */
+	/*
+	 * User isn't a member of the pg_write_server_files role, so check if it's
+	 * allowable
+	 */
 	if (is_absolute_path(filename))
 	{
 		/* Disallow '/a/b/data/..' */
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 072a6dc1c1..414cedec67 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -269,13 +269,13 @@ file_fdw_validator(PG_FUNCTION_ARGS)
 			 * otherwise there'd still be a security hole.
 			 */
 			if (strcmp(def->defname, "filename") == 0 &&
-!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+!is_member_of_role(GetUserId(), ROLE_PG_READ_SERVER_FILES))
 ereport(ERROR,
 		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 		 errmsg("only superuser or a member of the pg_read_server_files role may specify the filename option of a file_fdw foreign table")));
 
 			if (strcmp(def->defname, "program") == 0 &&
-!is_member_of_role(GetUserId(), DEFAULT_ROLE_EXECUTE_SERVER_PROGRAM))
+!is_member_of_role(GetUserId(), ROLE_PG_EXECUTE_SERVER_PROGRAM))
 ereport(ERROR,
 		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 		 errmsg("only superuser or a member of the pg_execute_server_program role may specify the program option of a file_fdw foreign table")));
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index dd963c4644..315e03cabd 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1566,7 +1566,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 	pgssEntry  *entry;
 
 	/* Superusers or members of pg_read_all_stats members are allowed */
-	is_allowed_role = is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS);
+	is_allowed_role = is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS);
 
 	/* hash table must exist already */
 	if (!pgss || !pgss_hash)
diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index 714398831b..669a7d7730 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -130,7 +130,7 @@ pgrowlocks(PG_FUNCTION_ARGS)
 	aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
   ACL_SELECT);
 	if (aclresult != ACLCHECK_OK)
-		aclresult = is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES) ? ACLCHECK_OK : ACLCHECK_NO_PRIV;
+		aclresult = is_member_of_role(GetUserId(), ROLE_PG_STAT_SCAN_TABLES) ? ACLCHECK_OK : ACLCHECK_NO_PRIV;
 
 	if (aclresult != ACLCHECK_OK)
 		aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind),
diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml
index 8831f5911f..2e21806f48 100644
--- a/doc/src/sgml/file-fdw.sgml
+++ b/doc/src/sgml/file-fdw.sgml
@@ -187,8 +187,8 @@
 
  
   Changing table-level options requires being a superuser or having the privileges
-  of the default role pg_read_server_files (to use a filename) or
-  the default role pg_execute_server_program (to use a program),
+  of the role pg_read_server_files (to use a filename) or
+  the role pg_execute_server_program (to use a program),
   for security reasons: only certain users should be able to control which file is
   read or which program is run.  In principle regular users could be allowed to
   change the other options, but that's not supported at present.
diff --gi

Re: xid wraparound danger due to INDEX_CLEANUP false

2020-11-20 Thread Peter Geoghegan
On Fri, Nov 20, 2020 at 12:04 PM Robert Haas  wrote:
> That's an interesting idea. We should think about the needs of brin
> indexes when designing something better than the current system. They
> have the interesting property that the heap deciding to change LP_DEAD
> to LP_UNUSED doesn't break anything even if nothing's been done to the
> index, because they don't store TIDs anyway. So that's an example of
> an index AM that might want to do some work to keep performance up,
> but it's not actually required. This might be orthogonal to the
> 0.0-1.0 scale you were thinking about, but it might be good to factor
> it into the thinking somehow.

I actually made this exact suggestion about BRIN myself, several years ago.

As I've said, it seems like it would be a good idea to ask the exact
same generic question of each index in turn (which is answered using
local smarts added to the index AM). Again, the question is: How
important is it that you get vacuumed now, from your own
narrow/selfish point of view? The way that BRIN answers this question
is not the novel thing about BRIN among other index access methods,
though. (Not that you claimed otherwise -- just framing the discussion
carefully.)

BRIN has no selfish reason to care if the table never gets to have its
LP_DEAD line pointers set to LP_UNUSED -- that's just not something
that it can be expected to understand directly. But all index access
methods should be thought of as not caring about this, because it's
just not their problem. (Especially with bottom-up index deletion, but
even without it.)

The interesting and novel thing about BRIN here is this: vacuumlazy.c
can be taught that a BRIN index alone is no reason to have to do a
second pass over the heap (to make the LP_DEAD/pruned-by-VACUUM line
pointers LP_UNUSED). A BRIN index never gets affected by the usual
considerations about the heapam invariant (the usual thing about TIDs
in an index not pointing to a line pointer that is at risk of being
recycled), which presents us with a unique-to-BRIN opportunity. Which
is exactly what you said.

(***Thinks some more***)

Actually, now I think that BRIN shouldn't be special to vacuumlazy.c
in any way. It doesn't make sense as part of this future world in
which index vacuuming can be skipped for individual indexes (which is
what I talked to Sawada-san about a little earlier in this thread).
Why should it be useful to exploit the "no-real-TIDs" property of BRIN
in this future world? It can only solve a problem that the main
enhancement is itself expected to solve without any special help from
BRIN (just the generic am callback that asks the same generic question
about index vacuuming urgency).

The only reason we press ahead with a second scan (the
LP_DEAD-to-LP_UNUSED thing) in this ideal world is a heap/table
problem. The bloat eventually gets out of hand *in the table*. We have
now conceptually decoupled the problems experienced in the table/heap
from the problems for each index (mostly), so this actually makes
sense. The theory behind AV scheduling becomes much closer to reality
-- by changing the reality! (The need to "prune the table to VACUUM
any one index" notwithstanding -- that's still necessary, of course,
but we still basically decouple table bloat from index bloat at the
conceptual level.)

Does that make sense?
-- 
Peter Geoghegan




Re: enable_incremental_sort changes query behavior

2020-11-20 Thread James Coleman
Thanks much for the detailed followup; this is super helpful.

On Fri, Nov 20, 2020 at 2:57 PM Robert Haas  wrote:
>
> On Fri, Nov 20, 2020 at 1:51 PM James Coleman  wrote:
> > > This isn't a counterexample, because there's no join tree here -- or,
> > > well, there is, but it's trivial, because there's only one relation
> > > involved. You can't have a non-Var expression computed before you
> > > finish all the joins, because there are no joins.
> > >
> > > What I said was: "target lists for any nodes below the top of the join
> > > tree were previously always just Var nodes." The topmost join allowed
> > > non-Var nodes before, but not lower levels.
> >
> > As I understand what you're saying, the attached (from the repro case
> > in [1]'s discussion about parallel safety here) is a counterexample.
> >
> > Specifically we have a plan like:
> >
> > Merge Right Join
> >   -> Unique
> > -> Gather Merge
> >   -> Sort
> > -> Nested Loop
> >
> > The pathtarget of the nested loop contains non-var expressions (in
> > this case a CASE expression).
>
> Well, in this case there are two join trees, one for the subquery and
> one for the outer query. The expressions for the subquery are computed
> at the top of the plan tree for that subquery.
>
> I guess I didn't specify before that I meant "at the top of the join
> tree for a subquery level," but I did mean that. On principle, the
> planner can't very realistically postpone evaluating a subquery's
> expressions until the top of the outer query's join tree, because, as
> in your example here, the subquery might contain an ORDER BY or
> DISTINCT clause. And anyway, if you look at the planner code, you'll
> see that every subquery is basically planned separately, unless it
> gets flattened into the parent query, so even if it were possible to
> postpone expression evaluation to outer subquery levels in some case,
> the current code structure definitely would fail to achieve that goal.

Ah, the "for a subquery level" clarification is helpful.

> However, apart from incremental sort, you can postpone evaluating a
> join tree's expressions until you reach the top of the join tree, and
> I think that's what we have always done in the past. Each baserel in
> the join tree gets a target list containing a list of vars which
> corresponds to its column list, and the joinrels get lists of vars
> which correspond to the columns from the inputs are needed at higher
> levels of the plan tree. At the top of the join tree, we stick in the
> expressions, replacing the target list of the top node in the plan
> tree; the expressions have to be computable from the inputs because
> the inputs contain all the columns that we figured out were needed at
> this level at the beginning of planning. But, if there are scans or
> joins below the top level of the plan tree, then do they bubble up to
> higher levels of the plan? Should they? It's pretty complicated.
> Theoretically if I've computed length(somelongvar) then perhaps I can
> just bubble the computed value up the plan tree, rather than the
> original column, which might be cheaper. But that only works if the
> higher levels only need length(somelongvar) and not somelongvar
> itself.

Apologies for some rambling/overlapping thoughts below; I'm trying to
absorb all of this.

A join rel (at the top of a subquery join tree) would have to somehow
know that it needs to pass up both length(somelongvar) and somelongvar
right? Since both might be needed by a higher level?

If I understand correctly you're saying that:
1. Target list entries for join rels contain vars and non-var expressions
2. Target list entries for base rels contain only vars
3. By implication the planner knows how to propagate non-var
expression results up from join rels, but not from base rels.

Point 3 is the one that would surprise me, or at least, I'd be
interested in knowing what makes the two different in that case (is it
an artificial restriction, a result of how things are structured in
code, something else?). I feel like I'm still missing context on why
this is inherently a problem.

Does this mean a simple "SELECT bar * 2 FROM foo" ends up with a join
rel at the top? Or that a projection handles it (maybe this is the
wrong question -- perhaps the project results in a join rel? I don't
know enough here to tell if this ends up being an odd question)? And
if a projection, then would inserting a projection above a base rel
but before the top of the join tree solve the problem while allowing
us to push expression evaluation down?

Perhaps the key (to what I'm missing) is just saying something like:
"the work hasn't been down to push expression evaluation down and
ensure it remains in the target lists at every level of the join tree
so as to be propagated up, and so adding it to the base rel will
almost certainly mean duplicate evaluation". That still leaves my
question about how propagating it up from a subquery join tree works
(maybe that's an already hand

Re: Additional Chapter for Tutorial - arch-dev.sgml

2020-11-20 Thread Erik Rijkers

On 2020-11-15 19:45, Jürgen Purtz wrote:




(smallish) Changes to arch-dev.sgml

Erik
--- ./doc/src/sgml/arch-dev.sgml.orig	2020-11-20 19:14:21.576775798 +0100
+++ ./doc/src/sgml/arch-dev.sgml	2020-11-20 21:43:10.435370787 +0100
@@ -7,7 +7,7 @@
Author

 This chapter originated as part of
-, Stefan Simkovics'
+ Stefan Simkovics'
 Master's Thesis prepared at Vienna University of Technology under the direction
 of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.

@@ -17,10 +17,7 @@
This chapter gives an overview of the internal structure of the
backend of PostgreSQL.  After having
read the following sections you should have an idea of how a query
-   is processed. This chapter does not aim to provide a detailed
-   description of the internal operation of
-   PostgreSQL, as such a document would be
-   very extensive. Rather, this chapter is intended to help the reader
+   is processed.  This chapter is intended to help the reader
understand the general sequence of operations that occur within the
backend from the point at which a query is received, to the point
at which the results are returned to the client.
@@ -30,8 +27,8 @@
The Path of a Query
 

-Here we give a short overview of the stages a query has to pass in
-order to obtain a result.
+Here we give a short overview of the stages a query has to pass 
+to obtain a result.

 

@@ -125,7 +122,7 @@
 use a supervisor process (also
 master process) that spawns a new
 server process every time a connection is requested. This supervisor
-process is called postgres and listens at a
+process is called postgres (formerly 'postmaster') and listens at a
 specified TCP/IP port for incoming connections. Whenever a request
 for a connection is detected the postgres
 process spawns a new server process. The server tasks
@@ -230,7 +227,7 @@
 
  A detailed description of bison or
  the grammar rules given in gram.y would be
- beyond the scope of this paper. There are many books and
+ beyond the scope of this manual. There are many books and
  documents dealing with flex and
  bison. You should be familiar with
  bison before you start to study the
@@ -343,8 +340,8 @@

 
  In some situations, examining each possible way in which a query
- can be executed would take an excessive amount of time and memory
- space. In particular, this occurs when executing queries
+ can be executed would take an excessive amount of time and memory.
+ In particular, this occurs when executing queries
  involving large numbers of join operations. In order to determine
  a reasonable (not necessarily optimal) query plan in a reasonable amount
  of time, PostgreSQL uses a Genetic
@@ -411,7 +408,7 @@
 merge join: Each relation is sorted on the join
 attributes before the join starts. Then the two relations are
 scanned in parallel, and matching rows are combined to form
-join rows. This kind of join is more
+join rows. This kind of join is
 attractive because each relation has to be scanned only once.
 The required sorting might be achieved either by an explicit sort
 step, or by scanning the relation in the proper order using an
@@ -442,7 +439,7 @@
  If the query uses fewer than 
  relations, a near-exhaustive search is conducted to find the best
  join sequence.  The planner preferentially considers joins between any
- two relations for which there exist a corresponding join clause in the
+ two relations for which there exists a corresponding join clause in the
  WHERE qualification (i.e., for
  which a restriction like where rel1.attr1=rel2.attr2
  exists). Join pairs with no join clause are considered only when there


Re: xid wraparound danger due to INDEX_CLEANUP false

2020-11-20 Thread Robert Haas
On Fri, Nov 20, 2020 at 4:21 PM Peter Geoghegan  wrote:
> Actually, now I think that BRIN shouldn't be special to vacuumlazy.c
> in any way. It doesn't make sense as part of this future world in
> which index vacuuming can be skipped for individual indexes (which is
> what I talked to Sawada-san about a little earlier in this thread).
> Why should it be useful to exploit the "no-real-TIDs" property of BRIN
> in this future world? It can only solve a problem that the main
> enhancement is itself expected to solve without any special help from
> BRIN (just the generic am callback that asks the same generic question
> about index vacuuming urgency).
>
> The only reason we press ahead with a second scan (the
> LP_DEAD-to-LP_UNUSED thing) in this ideal world is a heap/table
> problem. The bloat eventually gets out of hand *in the table*. We have
> now conceptually decoupled the problems experienced in the table/heap
> from the problems for each index (mostly), so this actually makes
> sense. The theory behind AV scheduling becomes much closer to reality
> -- by changing the reality! (The need to "prune the table to VACUUM
> any one index" notwithstanding -- that's still necessary, of course,
> but we still basically decouple table bloat from index bloat at the
> conceptual level.)
>
> Does that make sense?

I *think* so. For me the point is that the index never has a right to
insist on being vacuumed, but it can offer an opinion on how helpful
it would be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-11-20 Thread Daniel Gustafsson
> On 20 Nov 2020, at 01:33, Michael Paquier  wrote:

>> This seems like a complication which brings little benefit since the only 
>> real
>> errorcase is OOM in creating the context?  The built-in crypto support is
>> designed to never fail, and reading the OpenSSL code the only failure cases 
>> are
>> ENGINE initialization (which we don't do) and memory allocation.  Did you
>> consider using EVP_MD_CTX_init instead which place the memory allocation
>> responsibility with the caller?  Keeping this a void API and leaving the 
>> caller
>> to decide on memory allocation would make the API a lot simpler IMHO.
> 
> Yes.  That's actually the main take and why EVP callers *have* to let
> OpenSSL do the allocation: we cannot know the size of EVP_MD_CTX in
> advance in newer versions,

Yeah, there is that.

> knowing that we still need to deal with the OOM failure handling
> and pass down the error to the callers playing with SHA2, it feels
> like the most consistent API to me for the frontend and the backend.

For the backend I'd prefer an API where the allocation worked like palloc, if
not then it's a straight exit through the giftshop.  But if we want an API for
both the frontend and backend, I guess this is what we'll have to do.
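
For reference, the opaque-context pattern under discussion looks roughly
like this with the OpenSSL 1.1+ EVP API (a simplified sketch only; the
point is that EVP_MD_CTX_new() owns the allocation, so its NULL return
on OOM has to be propagated to the caller):

#include <stddef.h>
#include <openssl/evp.h>

/* out must have room for 32 bytes; returns 0 on success, -1 on failure */
static int
sha256_evp(const unsigned char *data, size_t len, unsigned char *out)
{
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();     /* opaque, sized by OpenSSL */
    int         ok;

    if (ctx == NULL)
        return -1;                          /* OOM: caller must handle it */

    ok = EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) &&
        EVP_DigestUpdate(ctx, data, len) &&
        EVP_DigestFinal_ex(ctx, out, NULL);

    EVP_MD_CTX_free(ctx);
    return ok ? 0 : -1;
}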

>> +#ifndef _PG_CRYPTOHASH_H_
>> +#define _PG_CRYPTOHASH_H_
>> This should probably be CRYPTOHASH_H to be consistent?
> 
> cryptohash.h sounds like a header file we could find somewhere else,
> hence the extra PG_.

Ok, then at least I think we should use PG_CRYPTOHASH_H to be more consistent
with the tree, and since leading underscores in C are problematic spec-wise.

>> +/* Include internals of each implementation here */
>> +#include "sha2.c"
>> Do we really want to implement this by including a .c file?  Are there no 
>> other
>> options you see?
> 
> That was the least intrusive option I could figure out.  Two other
> options I have thought about:
> - Compile those fallback implementations independently and publish the
> internals in a separate header, but nobody is going to need that if we
> have a generic entry point.
> - Include the internals of each implementation in cryptohash.c, but
> this bloats the file with more implementations added (HMAC and MD5
> still need to be added on top of SHA2), and it messes up with the
> existing copyright entries.
> So splitting and just including them sounds like the cleanest option
> of the set.

Personally I think the first option of using an internal header seems cleaner,
but MMV so I'll leave it to others to weigh in too.

cheers ./daniel



Re: [HACKERS] Custom compression methods

2020-11-20 Thread Robert Haas
On Wed, Nov 11, 2020 at 9:39 AM Dilip Kumar  wrote:
> There were a few problems in this rebased version, basically, the
> compression options were not passed while compressing values from the
> brin_form_tuple, so I have fixed this.

Since the authorship history of this patch is complicated, it would be
nice if you would include authorship information and relevant
"Discussion" links in the patches.

Design level considerations and overall notes:

configure is autogenerated from configure.in, so the patch shouldn't
include changes only to the former.

Looking over the changes to src/include:

+   PGLZ_COMPRESSION_ID,
+   LZ4_COMPRESSION_ID

I think that it would be good to assign values to these explicitly.

+/* compresion handler routines */

Spelling.

+   /* compression routine for the compression method */
+   cmcompress_function cmcompress;
+
+   /* decompression routine for the compression method */
+   cmcompress_function cmdecompress;

Don't reuse cmcompress_function; that's confusing. Just have a typedef
per structure member, even if they end up being the same.

 #define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
-   (((toast_compress_header *) (ptr))->rawsize = (len))
+do { \
+   Assert(len > 0 && len <= RAWSIZEMASK); \
+   ((toast_compress_header *) (ptr))->info = (len); \
+} while (0)

Indentation.

+#define TOAST_COMPRESS_SET_COMPRESSION_METHOD(ptr, cm_method) \
+   ((toast_compress_header *) (ptr))->info |= ((cm_method) << 30);

What about making TOAST_COMPRESS_SET_RAWSIZE() take another argument?
And possibly also rename it to TEST_COMPRESS_SET_SIZE_AND_METHOD() or
something? It seems not great to have separate functions each setting
part of a 4-byte quantity. Too much chance of failing to set both
parts. I guess you've got a function called
toast_set_compressed_datum_info() for that, but it's just a wrapper
around two macros that could just be combined, which would reduce
complexity overall.
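
For illustration, a combined setter along those lines might look
roughly like this (a sketch only, reusing the field and mask names
visible in the hunks quoted above):

/*
 * Hypothetical combined macro: set the raw size and the compression
 * method id in the single "info" word in one step, so a caller cannot
 * set one part and forget the other.
 */
#define TOAST_COMPRESS_SET_SIZE_AND_METHOD(ptr, len, cm_method) \
do { \
    Assert((len) > 0 && (len) <= RAWSIZEMASK); \
    ((toast_compress_header *) (ptr))->info = \
        ((uint32) (len)) | (((uint32) (cm_method)) << 30); \
} while (0)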

+   T_CompressionRoutine,   /* in access/compressionapi.h */

This looks misplaced. I guess it should go just after these:

T_FdwRoutine,   /* in foreign/fdwapi.h */
T_IndexAmRoutine,   /* in access/amapi.h */
T_TableAmRoutine,   /* in access/tableam.h */

Looking over the regression test changes:

The tests at the top of create_cm.out that just test that we can
create tables with various storage types seem unrelated to the purpose
of the patch. And the file doesn't test creating a compression method
either, as the file name would suggest, so either the file name needs
to be changed (compression, compression_method?) or the tests don't go
here.

+-- check data is okdd

I guess whoever is responsible for this comment prefers vi to emacs.

I don't quite understand the purpose of all of these tests, and there
are some things that I feel like ought to be tested that seemingly
aren't. Like, you seem to test using an UPDATE to move a datum from a
table to another table with the same compression method, but not one
with a different compression method. Testing the former is nice and
everything, but that's the easy case: I think we also need to test the
latter. I think it would be good to verify not only that the data is
readable but that it's compressed the way we expect. I think it would
be a great idea to add a pg_column_compression() function in a similar
spirit to pg_column_size(). Perhaps it could return NULL when
compression is not in use or the data type is not varlena, and the
name of the compression method otherwise. That would allow for better
testing of this feature, and it would also be useful to users who are
switching methods, to see what data they still have that's using the
old method. It could be useful for debugging problems on customer
systems, too.

I wonder if we need a test that moves data between tables through an
intermediary. For instance, suppose a plpgsql function or DO block
fetches some data and stores it in a plpgsql variable and then uses
the variable to insert into another table. Hmm, maybe that would force
de-TOASTing. But perhaps there are other cases. Maybe a more general
way to approach the problem is: have you tried running a coverage
report and checked which parts of your code are getting exercised by
the existing tests and which parts are not? The stuff that isn't, we
should try to add more tests. It's easy to get corner cases wrong with
this kind of thing.

I notice that LIKE INCLUDING COMPRESSION doesn't seem to be tested, at
least not by 0001, which reinforces my feeling that the tests here are
not as thorough as they could be.

+NOTICE:  pg_compression contains unpinned initdb-created object(s)

This seems wrong to me - why is it OK?

-   result = (struct varlena *)
-   palloc(TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
-   SET_VARSIZE(result, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
+   cmoid = GetCompressionOidFromCompressionId(TOAST_COMPRESS_METHOD(attr));

Why does create_gather_merge_plan need make_sort?

2020-11-20 Thread James Coleman
While looking at another issue I noticed that create_gather_merge_plan
calls make_sort if the subplan isn't sufficiently sorted. In all of
the cases I've seen where a gather merge path (not plan) is created
the input path is expected to be properly sorted, so I was wondering
if anyone happened to know what case is being handled by the make_sort
call. Removing it doesn't seem to break any tests.

Thanks,
James




Re: Strange behavior with polygon and NaN

2020-11-20 Thread Tom Lane
Further to this ...

I realized after looking at things some more that one of
line_closept_point's issues is really a bug in line_construct:
it fails to draw a horizontal line through a point with x = Inf,
though surely that's not particularly ill-defined.  The reason
is that somebody thought they could dispense with a special case
for m == 0, but then we end up doing

result->C = float8_mi(pt->y, float8_mul(m, pt->x));

and if m = 0 and pt->x = Inf, we get NaN.
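
(Standalone demonstration of that identity, not the PostgreSQL code
itself: under IEEE 754, 0 * Inf is NaN, so the whole C coefficient
comes out as NaN.)

#include <math.h>
#include <stdio.h>

int main(void)
{
    double  m = 0.0;            /* horizontal line */
    double  x = INFINITY;       /* point with x = Inf */
    double  y = 2.0;

    double  C = y - m * x;      /* 0 * Inf => NaN, and y - NaN => NaN */

    printf("%g\n", C);          /* prints nan */
    return 0;
}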

It also annoyed me that the code was still using DBL_MAX instead of a
true Inf to represent infinite slope.  That's sort of okay as long as
it's just a communication mechanism between line_construct and places
like line_sl, but it's not really okay, because in some places you can
get a true infinity from a slope calculation.  Thus in HEAD you get
different results from

regression=# select line(point(1,2),point(1,'inf'));
   line   
--
 {-1,0,1}
(1 row)

regression=# select line(point(1,2),point(4,'inf'));
  line   
-
 {Infinity,-1,-Infinity}
(1 row)

which is completely silly: we ought to "round off" that infinitesimal
slope to a true vertical, rather than producing a line representation
we can't do anything with.

So I fixed that too, but then I got a weird regression test diff:
the case of
lseg '[(-10,2),(-10,3)]' ?|| lseg '[(-10,2),(-10,3)]'
was no longer returning true.  The reason turned out to be that
lseg_parallel does

PG_RETURN_BOOL(FPeq(lseg_sl(l1), lseg_sl(l2)));

and now lseg_sl is returning true infinities for vertical lines, and
FPeq *gets the wrong answer* when asked to compare Inf to Inf.  It
should say equal, surely, but internally it computes a NaN and ends up
with false.
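
The failure is easy to reproduce in isolation (a standalone sketch,
modeled on the FPeq()/EPSILON definitions in geo_decls.h): Inf minus
Inf is NaN, and every ordered comparison against NaN is false.

#include <math.h>
#include <stdio.h>

#define EPSILON     1.0E-06
#define FPeq(A, B)  (fabs((A) - (B)) <= EPSILON)

int main(void)
{
    /* INFINITY - INFINITY is NaN, and (NaN <= EPSILON) is false */
    printf("FPeq(Inf, Inf) = %d\n", FPeq(INFINITY, INFINITY));  /* prints 0 */
    return 0;
}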

So the attached 0003 patch also fixes FPeq() and friends to give
sane answers for Inf-vs-Inf comparisons.  That part seems like
a fairly fundamental bug fix, and so I feel like we ought to
go ahead and apply it before we do too much more messing with
the logic in this area.

(Note that the apparently-large results diff in 0003 is mostly
a whitespace change: the first hunk just reflects slopes coming
out as Infinity not DBL_MAX.)

I'm reposting 0001 and 0002 just to keep the cfbot happy,
they're the same as in my previous message.

regards, tom lane

diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 93a8736a3f..76679bae8d 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -157,7 +157,7 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
 SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
  count 
 ---
- 3
+ 4
 (1 row)
 
 SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
@@ -169,7 +169,7 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
 SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
  count 
 ---
- 4
+ 5
 (1 row)
 
 SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
@@ -188,10 +188,11 @@ SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
  (10,10)
  (-5,-12)
  (5.1,34.5)
+ (Infinity,1e+300)
  (1e+300,Infinity)
  (NaN,NaN)
  
-(10 rows)
+(11 rows)
 
 SELECT * FROM point_tbl WHERE f1 IS NULL;
  f1 
@@ -202,16 +203,17 @@ SELECT * FROM point_tbl WHERE f1 IS NULL;
 SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
 f1 
 ---
- (1e-300,-1e-300)
  (0,0)
+ (1e-300,-1e-300)
  (-3,4)
  (-10,0)
  (10,10)
  (-5,-12)
  (5.1,34.5)
  (1e+300,Infinity)
+ (Infinity,1e+300)
  (NaN,NaN)
-(9 rows)
+(10 rows)
 
 SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
 f1
@@ -464,7 +466,7 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
 SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
  count 
 ---
- 3
+ 4
 (1 row)
 
 EXPLAIN (COSTS OFF)
@@ -494,7 +496,7 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
 SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
  count 
 ---
- 4
+ 5
 (1 row)
 
 EXPLAIN (COSTS OFF)
@@ -530,10 +532,11 @@ SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
  (10,10)
  (-5,-12)
  (5.1,34.5)
+ (Infinity,1e+300)
  (1e+300,Infinity)
  (NaN,NaN)
  
-(10 rows)
+(11 rows)
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM point_tbl WHERE f1 IS NULL;
@@ -568,9 +571,10 @@ SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
  (10,10)
  (-5,-12)
  (5.1,34.5)
+ (Infinity,1e+300)
  (1e+300,Infinity)
  (NaN,NaN)
-(9 rows)
+(10 rows)
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
diff --git a/src/test/regress/expected/geometry.out b/src/test/regress/expected/geometry.out
index 5b9d37030f..81202a4c33 100644
--- a/src/test/regress/expected/geometry.out
+++ b/src/test/regress/expected/geometry.out
@@ -120,6 +120,7 @@ SELECT p1.f1, p2.f1, slope(p1.f1, p2.f1)

Re: xid wraparound danger due to INDEX_CLEANUP false

2020-11-20 Thread Peter Geoghegan
On Fri, Nov 20, 2020 at 2:17 PM Robert Haas  wrote:
> > Does that make sense?
>
> I *think* so. For me the point is that the index never has a right to
> insist on being vacuumed, but it can offer an opinion on how helpful
> it would be.

Right, that might be the single most important point. It's a somewhat
more bottom-up direction for VACUUM, that is still fundamentally
top-down. Because that's still necessary.

Opportunistic heap pruning is usually very effective, so today we
realistically have these 4-byte line pointers accumulating in heap
pages. The corresponding "bloatum" in index pages is an index tuple +
line pointer (at least 16 bytes + 4 bytes). Meaning that we accumulate
*at least* 20 bytes in indexes for each 4 bytes in the table. And
indexes care about *where* items go, making the problem even worse. So
in the absence of index tuple LP_DEAD setting/deletion (or bottom-up
index deletion in Postgres 14), the problem in indexes is probably at
least 5x worse.

The precise extent to which this is true will vary. It's a mistake to
try to reason about it at a high level, because there is just too much
variation for that approach to work. We should just give index access
methods *some* say. Sometimes this allows index vacuuming to be very
lazy, other times it allows index vacuuming to be very eager. Often
this variation exists among indexes on the same table.

Of course, vacuumlazy.c is still responsible for not letting the
accumulation of LP_DEAD heap line pointers get out of hand (without
allowing index TIDs to point to the wrong thing due to dangerous TID
recycling issues/bugs). The accumulation of LP_DEAD heap line pointers
will often take a very long time to get out of hand. But when it does
finally get out of hand, index access methods don't get to veto being
vacuumed. Because this isn't actually about their needs anymore.

Actually, the index access methods never truly veto anything. They
merely give some abstract signal about how urgent it is to them (like
the 0.0 - 1.0 thing). This difference actually matters. One index
among many on a table saying "meh, I guess I could benefit from some
index vacuuming if it's no real trouble to you vacuumlazy.c" rather
than saying "it's absolutely unnecessary, don't waste CPU cycles
vacuumlazy.c" may actually shift how vacuumlazy.c processes the heap
(at least occasionally). Maybe the high level VACUUM operation decides
that it is worth taking care of everything all at once -- if all the
indexes together either say "meh" or "now would be a good time", and
vacuumlazy.c then notices that the accumulation of LP_DEAD line
pointers is *also* becoming a problem (it's also a "meh" situation),
then it can be *more* ambitious. It can do a traditional VACUUM early.
Which might still make sense.
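To make that concrete, here is a purely made-up sketch -- none of these
names or thresholds exist in vacuumlazy.c, they are only meant to show the
shape of the decision:

/*
 * Hypothetical only: each index AM reports an abstract 0.0 - 1.0 benefit
 * signal, and the heap side has its own measure of how badly LP_DEAD line
 * pointers have accumulated.  Indexes never veto anything; a heap that is
 * close to the line pointer limit forces index vacuuming regardless, since
 * TIDs cannot be recycled safely otherwise.
 */
static bool
lazy_should_vacuum_indexes(double *index_benefit, int nindexes,
						   double heap_lpdead_fraction)
{
	double		max_benefit = 0.0;

	for (int i = 0; i < nindexes; i++)
		max_benefit = Max(max_benefit, index_benefit[i]);

	if (heap_lpdead_fraction > 0.9)
		return true;			/* heap needs it; indexes don't get a vote */

	return max_benefit > 0.5;	/* "now would be a good time" beats "meh" */
}

The point of the shape is just that index signals can only make VACUUM more
eager, while heap-side LP_DEAD pressure can force index vacuuming on its own.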

This also means that vacuumlazy.c would ideally think about this as an
optimization problem. It may be lazy or eager for the whole table,
just as it may be lazy or eager for individual indexes. (Though the
eagerness/laziness dynamic is probably much more noticeable with
indexes in practice.)

-- 
Peter Geoghegan




Re: Different results between PostgreSQL and Oracle for "for update" statement

2020-11-20 Thread Andreas Karlsson
On 11/20/20 3:25 PM, Andy Fan wrote:
> On Fri, Nov 20, 2020 at 9:37 PM Andreas Karlsson  wrote:
>> On 11/20/20 9:57 AM, Andy Fan wrote:
>>> Thank you for your attention. Your suggestion would fix the issue.
>>> However The difference will cause some risks when users move their
>>> application from Oracle to PostgreSQL. So I'd like to think which
>>> behavior is more reasonable.
>>
>> I think PostgreSQL's behavior is more reasonable since it only locks
>> the rows it claims to lock and no extra rows. This makes the code easy
>> to reason about. And PostgreSQL does not re-evaluate sub queries after
>> grabbing the lock which while it might be surprising to some people is
>> also a quite nice consistent behavior in practice as long as you are
>> aware of it.
>
> I admit my way is bad after reading your below question, but I
> would not think *it might be surprising to some people* is a good signal
> for a design.  Would you think "re-evaluate the quals" after grabbing the
> lock should be a good idea? And do you know if any other database uses
> the postgres's way or Oracle's way?   I just heard Oracle might do the
> re-check just some minutes before reading your reply and I also found
> Oracle doesn't lock the extra rows per my test.


Re-evaluating the sub queries is probably a very bad idea in practice
since a sub query can have side effects, side effects which could really
mess up some poor developer's database if they are unaware of it. The
tradeoff PostgreSQL has made is not perfect, but off the top of my head I
cannot think of anything less bad.

I am sadly not familiar enough with Oracle, nor do I have access to any
Oracle license, so I cannot comment on how Oracle has implemented this
behavior or what tradeoffs they have made.


Andreas




Re: Different results between PostgreSQL and Oracle for "for update" statement

2020-11-20 Thread Peter Geoghegan
On Fri, Nov 20, 2020 at 3:04 PM Andreas Karlsson  wrote:
> I am sadly not familiar enough with Oracle or have access to any Oracle
> license so I cannot comment on how Oracle have implemented their behvior
> or what tradeoffs they have made.

I bet that Oracle does a statement-level rollback for READ COMMITTED
mode's conflict handling. I'm not sure if this means that it locks
multiple rows or not. I think that it only uses one snapshot, which
isn't quite what we do in the Postgres case. It's really complicated
in both systems.

Andy is right to say that it looks like Postgres is using 2 different
snapshots for the same query. That's *kind of* what happens here.
Technically the executor doesn't take a new snapshot, but it does the
moral equivalent. See the EvalPlanQual() section of the executor
README.

FWIW this area is something that isn't very well standardized, despite
what you may hear. For example, InnoDB's REPEATABLE READ doesn't even
use the transaction snapshot for UPDATEs and DELETEs at all:

https://dev.mysql.com/doc/refman/8.0/en/innodb-consistent-read.html

Worst of all, you can update rows that were not visible to the
transaction snapshot, thus rendering them visible (see the "Note" box
in the documentation for an example of this). InnoDB won't throw a
serialization error at any isolation level. So it could be worse!

--
Peter Geoghegan




Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-11-20 Thread Michael Paquier
On Fri, Nov 20, 2020 at 11:53:11AM -0500, Tom Lane wrote:
> Yeah, if pgODBC were not using it at all then I think it'd be fine
> to get rid of, but if it still contains calls then we cannot.
> The suggestion upthread that those calls might be unreachable
> is interesting, but it seems unproven.

Yeah, I am not 100% sure that there are no code paths calling
currtid2(), and the ODBC code is too obscure to me to reach a clear
conclusion.  currtid(), though, is a different deal thanks to
RETURNING.  What about cutting the cake in two and just removing
currtid(), then?
--
Michael




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-11-20 Thread Michael Paquier
On Fri, Nov 20, 2020 at 11:17:44PM +0100, Daniel Gustafsson wrote:
> On 20 Nov 2020, at 01:33, Michael Paquier  wrote:
>> knowing that we still need to deal with the OOM failure handling
>> and pass down the error to the callers playing with SHA2, it feels
>> like the most consistent API to me for the frontend and the backend.
> 
> For the backend I'd prefer an API where the allocation worked like palloc, if
> not then it's a straight exit through the giftshop.  But if we want an API for
> both the frontend and backend, I guess this is what we'll have to do.

The fallback implementations can somewhat achieve that as we know all
the internals of the structures used.
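To illustrate the calling convention being discussed (the names follow the
cryptohash proposal in this thread, but the final signatures may well
differ), a caller that works in both frontend and backend would look
roughly like this, with OOM and hash errors reported back rather than
exiting:

/* digest is assumed to point at PG_SHA256_DIGEST_LENGTH bytes */
static bool
compute_sha256(const uint8 *data, size_t len, uint8 *digest)
{
	pg_cryptohash_ctx *ctx;

	ctx = pg_cryptohash_create(PG_SHA256);
	if (ctx == NULL)
		return false;			/* OOM; frontend callers must check */
	if (pg_cryptohash_init(ctx) < 0 ||
		pg_cryptohash_update(ctx, data, len) < 0 ||
		pg_cryptohash_final(ctx, digest) < 0)
	{
		pg_cryptohash_free(ctx);
		return false;
	}
	pg_cryptohash_free(ctx);
	return true;
}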

> Ok, then at least I think we should use PG_CRYPTOHASH_H to be more consistent
> with the tree, and since leading underscores in C are problematic spec-wise.

Makes sense.  Will fix.

>> That was the least intrusive option I could figure out.  Two other
>> options I have thought about:
>> - Compile those fallback implementations independently and publish the
>> internals in a separate header, but nobody is going to need that if we
>> have a generic entry point.
>> - Include the internals of each implementation in cryptohash.c, but
>> this bloats the file with more implementations added (HMAC and MD5
>> still need to be added on top of SHA2), and it messes up with the
>> existing copyright entries.
>> So splitting and just including them sounds like the cleanest option
>> of the set.
> 
> Personally I think the first option of using an internal header seems cleaner,
> but MMV so I'll leave it to others to weigh in too.

What you meant and what I meant were slightly different here.  I meant
publishing a header in src/include/common/ that would get installed,
and I'd rather avoid that.  You meant having the header for local
consumption in src/common/.  I would be fine with your third option as
well.  Your suggestion is more consistent with what we do for the rest
of src/common/ and libpq, actually, so I don't mind switching to
that.
--
Michael




Re: Online verification of checksums

2020-11-20 Thread Michael Paquier
On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> David Steele (da...@pgmasters.net) wrote:
>> Our current plan for pgBackRest:
>> 
>> 1) Remove the LSN check as you have done in your patch and when rechecking
>> see if the page has become valid *or* the LSN is ascending.
>> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
>> is valid.
> 
> Yup, that's my recollection also as to our plans for how to improve
> things here.
>
>> These do completely rule out any type of corruption, but they certainly
>> narrows the possibility by a lot.
> 
> *don't :)

Have you considered the possibility of only using pd_checksum for the
validation?  It is the only source of truth in the page header that we
can rely on to validate the full contents of the page.  If the logic
relies on anything but the checksum, then you expose it to the risk of
reporting pages as corrupted when they were merely torn, or of missing
pages that really are corrupted, which is what we should avoid for such
things.  Both are bad.

>> As for your patch, it mostly looks good but my objection is that a page may
>> be reported as invalid after 5 retries when in fact it may just be very hot.
> 
> Yeah.. while unlikely that it'd actually get written out that much, it
> does seem at least possible.
> 
>> Maybe checking for an ascending LSN is a good idea there as well? At least
>> in that case we could issue a different warning, instead of "checksum
>> verification failed" perhaps "checksum verification skipped due to
>> concurrent modifications".
> 
> +1.

I don't quite understand how you can make sure that the page is not
corrupted here.  It is possible that only the last 4kB of an 8kB page
got corrupted, so the header still carries valid-looking data even
though the page fails checksum verification.  So if you are not
careful, you could end up dismissing a genuinely corrupted page just
because it failed the retry multiple times in a row.  The only method
I can think of as being really reliable is based on two things:
- Do the check only on pd_checksum, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in shared buffers.  This requires an API we don't have
yet.
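
For illustration only, here is a purely hypothetical sketch of the retry
loop under discussion.  read_raw_page(), the result enum, and the retry
policy are invented for the example; pg_checksum_page(), PGAlignedBlock and
the Page macros are the existing backend facilities.  The corruption
verdict comes only from the checksum, and the page LSN is used, at most, to
downgrade a persistent failure to "skipped due to concurrent
modifications", which is exactly the leap of faith questioned above:

typedef enum
{
	PAGE_OK,
	PAGE_NEW,					/* all-zero page, nothing to verify */
	PAGE_SKIPPED_CONCURRENT,	/* kept failing, but the LSN kept advancing */
	PAGE_CHECKSUM_FAILED
} PageVerifyResult;

#define PAGE_VERIFY_RETRIES 5

static PageVerifyResult
verify_one_page(const char *relpath, BlockNumber blkno)
{
	XLogRecPtr	prev_lsn = InvalidXLogRecPtr;

	for (int attempt = 0; attempt < PAGE_VERIFY_RETRIES; attempt++)
	{
		PGAlignedBlock buf;		/* properly aligned 8kB buffer */
		Page		page = (Page) buf.data;
		XLogRecPtr	cur_lsn;

		read_raw_page(relpath, blkno, buf.data);	/* invented helper */

		if (PageIsNew(page))
			return PAGE_NEW;

		/* The checksum covers the whole 8kB, unlike any header field. */
		if (pg_checksum_page(buf.data, blkno) ==
			((PageHeader) page)->pd_checksum)
			return PAGE_OK;

		/*
		 * Checksum failure.  A page being rewritten concurrently should
		 * show an advancing LSN across retries, but a page whose tail is
		 * corrupted while its header still looks sane might too, which is
		 * why this can only soften the report, never declare the page OK.
		 */
		cur_lsn = PageGetLSN(page);
		if (prev_lsn != InvalidXLogRecPtr && cur_lsn > prev_lsn)
			return PAGE_SKIPPED_CONCURRENT;
		prev_lsn = cur_lsn;
	}
	return PAGE_CHECKSUM_FAILED;
}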
--
Michael




Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-11-20 Thread James Coleman
Over on the -bugs list we had a report [1] of a seg fault with
incremental sort. The short of the investigation there was that a
subplan wasn't being serialized since it wasn't parallel safe, and
that attempting to initialize that subplan resulted in the seg fault.
Tom pushed a change to raise an ERROR when this occurs so that at
least we don't crash the server.

There was some small amount of discussion about this on a thread about
a somewhat similar issue [2] (where volatile expressions were the
issue), but that thread resulted in a committed patch, and beyond that
it seems that this issue deserves its own discussion.

I've been digging further into the problem, and my first question was
"why doesn't this result in a similar error but with a full sort when
incremental sort is disabled?" After all, we have code to consider a
gather merge + full sort on the cheapest partial path in
generate_useful_gather_paths(), and the plan that I get on Luis's test
case with incremental sort disabled has a full sort above a gather,
which presumably would be cheaper to push down into the parallel plan
(using gather merge instead of gather).

It turns out that there's a separate bug here: I'm guessing that in
the original commit of generate_useful_gather_paths() we had a
copy/paste thinko in this code:

/*
 * If the path has no ordering at all, then we can't use either
 * incremental sort or rely on implicit sorting with a gather
 * merge.
 */
if (subpath->pathkeys == NIL)
 continue;

The comment claims that we should skip subpaths that aren't sorted at
all since we can't possibly use a gather merge directly or with an
incremental sort applied first. It's true that we can't do either of
those things unless the subpath is already at least partially sorted.
But subsequently we attempt to construct a gather merge path on top of
a full sort (for the cheapest path), and there's no reason to limit
that to partially sorted paths (indeed in the presence of incremental
sort it seems unlikely that that would be the cheapest path).

We still don't want to build incremental sort paths if the subpath has
no pathkeys, but that will imply presorted_keys = 0, which we already
check.

So 0001 removes that branch entirely. And, as expected, we now get a
gather merge + full sort and the resulting error about the subplan not
being initialized.

With that oddity out of the way, I started looking at the actually
reported problem. Nominally the issue is that we have a correlated
subquery (in the final target list and the sort clause), and
correlated subqueries (not sure why exactly in this case; see [3])
aren't considered parallel-safe.

As to why that's happening:
1. find_em_expr_usable_for_sorting_rel doesn't exclude that
expression, and arguably correctly so in the general case since one
might (though we don't now) use this same kind of logic for
non-parallel plans.
2. generate_useful_pathkeys_for_relation notes that it'd be useful
to sort on that expression and sees it as safe, in part due to (1).
3. We create a sort node that includes that expression in its output.
That seems pretty much the same (in terms of safety given generalized
input paths/plans, not in terms of what Robert brought up in [4]) as
what would happen in the prepare_sort_from_pathkeys call in
create_gather_merge_plan.

So to fix this problem I've added the option to find_em_expr_for_rel
to restrict to parallel-safe expressions in 0002.
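
Concretely, the new check amounts to something like the following inside
the loop over equivalence members (a sketch of the shape only, not the
patch text; is_parallel_safe() is the existing optimizer helper, and
require_parallel_safe is the new flag threaded down from the caller):

	ListCell   *lc_em;

	foreach(lc_em, ec->ec_members)
	{
		EquivalenceMember *em = (EquivalenceMember *) lfirst(lc_em);

		/*
		 * Skip expressions that parallel workers cannot compute; callers
		 * that aren't building partial paths pass require_parallel_safe =
		 * false and keep the old behavior.
		 */
		if (require_parallel_safe &&
			!is_parallel_safe(root, (Node *) em->em_expr))
			continue;

		/* ... the existing relids and volatility checks stay as before ... */
	}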

Given point 3 above (and the earlier discussion with Robert), it
seems to me that there's a bigger problem here that just hasn't
happened to be exposed yet: in particular, the
prepare_sort_from_pathkeys call in create_gather_merge_plan seems
suspect to me. But it's also possible other usages of
prepare_sort_from_pathkeys could be suspect given other seemingly
unrelated changes to path generation, since nothing other than
implicit knowledge about the conventions here prevents us from
introducing paths generating sorts with expressions not in the target
list (in fact, part of the issue here IMO is that non-var expression
pathkeys are added to the target list only after path generation and
selection are completed).

Alternatively we could "fix" this by being even more conservative in
find_em_expr_usable_for_sorting_rel and disregard any expressions not
currently found in the target list. But that seems unsatisfactory to
me since, given we know that there are cases where
prepare_sort_from_pathkeys is explicitly intended to add pathkey
expressions to the target list, we'd be excluding possibly useful
cases (and indeed we already have, albeit contrived, test cases that
fail if that change is made).

There exists a very similar path in the postgres fdw code (in fact the
function name is the same: get_useful_pathkeys_for_relation) since
find_em_expr_for_rel is much simpler (it doesn't do many safety checks
at all like we do in find_em_expr_usable_for_sorting_rel), but I think
that's safe (though I haven't thought about it a ton) because I
be

Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-11-20 Thread Tom Lane
Michael Paquier  writes:
> What about cutting the cake in two and just remove
> currtid() then?

+1.  That'd still let us get rid of setLastTid() which is
the ugliest part of the thing, IMO.

regards, tom lane




Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-11-20 Thread Michael Paquier
On Fri, Nov 20, 2020 at 09:50:08PM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> What about cutting the cake in two and just remove
>> currtid() then?
> 
> +1.  That'd still let us get rid of setLastTid() which is
> the ugliest part of the thing, IMO.

Indeed, this could go.  There is a recursive call for views, but in
order to maintain compatibility with that we can just remove one
function and move the second to use a regclass as argument, like the
attached, while removing setLastTid().  Any thoughts?
--
Michael
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 92b19dba32..54b2eb7378 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -129,7 +129,6 @@ extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
    bool *all_dead, bool first_call);
 
 extern void heap_get_latest_tid(TableScanDesc scan, ItemPointer tid);
-extern void setLastTid(const ItemPointer tid);
 
 extern BulkInsertState GetBulkInsertState(void);
 extern void FreeBulkInsertState(BulkInsertState);
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index c6da0df868..7d1f6ec234 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*			yyyymmddN */
-#define CATALOG_VERSION_NO	202011191
+#define CATALOG_VERSION_NO	202011211
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 33dacfd340..63ddfe99d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -2549,13 +2549,10 @@
 { oid => '1292',
   proname => 'tideq', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'tid tid', prosrc => 'tideq' },
-{ oid => '1293', descr => 'latest tid of a tuple',
-  proname => 'currtid', provolatile => 'v', proparallel => 'u',
-  prorettype => 'tid', proargtypes => 'oid tid', prosrc => 'currtid_byreloid' },
 { oid => '1294', descr => 'latest tid of a tuple',
   proname => 'currtid2', provolatile => 'v', proparallel => 'u',
-  prorettype => 'tid', proargtypes => 'text tid',
-  prosrc => 'currtid_byrelname' },
+  prorettype => 'tid', proargtypes => 'regclass tid',
+  prosrc => 'currtid_byreloid' },
 { oid => '1265',
   proname => 'tidne', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'tid tid', prosrc => 'tidne' },
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29e07b7228..e0f24283b8 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -645,10 +645,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 
 	if (canSetTag)
-	{
 		(estate->es_processed)++;
-		setLastTid(&slot->tts_tid);
-	}
 
 	/*
 	 * If this insert is the result of a partition key update that moved the
diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c
index 509a0fdffc..4d191d91b6 100644
--- a/src/backend/utils/adt/tid.c
+++ b/src/backend/utils/adt/tid.c
@@ -275,14 +275,6 @@ hashtidextended(PG_FUNCTION_ARGS)
  *	Maybe these implementations should be moved to another place
  */
 
-static ItemPointerData Current_last_tid = {{0, 0}, 0};
-
-void
-setLastTid(const ItemPointer tid)
-{
-	Current_last_tid = *tid;
-}
-
 /*
  *	Handle CTIDs of views.
  *		CTID should be defined in the view and it must
@@ -367,11 +359,6 @@ currtid_byreloid(PG_FUNCTION_ARGS)
 	TableScanDesc scan;
 
 	result = (ItemPointer) palloc(sizeof(ItemPointerData));
-	if (!reloid)
-	{
-		*result = Current_last_tid;
-		PG_RETURN_ITEMPOINTER(result);
-	}
 
 	rel = table_open(reloid, AccessShareLock);
 
@@ -401,46 +388,3 @@ currtid_byreloid(PG_FUNCTION_ARGS)
 
 	PG_RETURN_ITEMPOINTER(result);
 }
-
-Datum
-currtid_byrelname(PG_FUNCTION_ARGS)
-{
-	text	   *relname = PG_GETARG_TEXT_PP(0);
-	ItemPointer tid = PG_GETARG_ITEMPOINTER(1);
-	ItemPointer result;
-	RangeVar   *relrv;
-	Relation	rel;
-	AclResult	aclresult;
-	Snapshot	snapshot;
-	TableScanDesc scan;
-
-	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
-	rel = table_openrv(relrv, AccessShareLock);
-
-	aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
-  ACL_SELECT);
-	if (aclresult != ACLCHECK_OK)
-		aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind),
-	   RelationGetRelationName(rel));
-
-	if (rel->rd_rel->relkind == RELKIND_VIEW)
-		return currtid_for_view(rel, tid);
-
-	if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-		elog(ERROR, "cannot look at latest visible tid for relation \"%s.%s\"",
-			 get_namespace_name(RelationGetNamespace(rel)),
-			 RelationGetRelationName(rel));
-
-	result = (ItemPointer) palloc(sizeof(ItemPointerData));
-	ItemPointerCopy(tid, result);
-
-	snapshot = RegisterSnapshot(GetLatestSnapshot());
-	scan = table_beginscan_tid(rel, snapshot);
-	table_tuple_get_latest_tid(scan, result);
-	table_endscan(scan);
-	UnregisterSnapshot(snapshot);
-
-	table_close(rel, AccessShareLock);
-
-	PG_RETURN_ITEMPOINTER(result);
-}
diff -

Re: [doc] improve tableoid description

2020-11-20 Thread Peter Eisentraut

On 2020-10-19 14:28, Ian Lawrence Barwick wrote:
> On further reflection, I think trying to explain all that is going to
> end up as a mini-tutorial which is beyond the scope of the explanation
> of a column, so the existing reference to pg_class should be enough.
>
> Revised patch attached just mentioning partitioned tables.

committed




Re: jit and explain nontext

2020-11-20 Thread Peter Eisentraut

On 2020-11-20 17:16, Justin Pryzby wrote:
> It matters if it was planned with jit but executed without jit.
>
> postgres=# DEALLOCATE p; SET jit=on; SET jit_above_cost=0; prepare p as select
> from generate_series(1,9); explain(format yaml) execute p; SET jit=off;
> explain(format yaml) execute p;
>
> Patched shows this for both explains:
> JIT:  +
>   Functions: 3+
>
> Unpatched shows only in the first case.


In this context, I don't see the point of this change.  If you set 
jit=off explicitly, then there is no need to clutter the EXPLAIN output 
with a bunch of zeroes about JIT.